Thank you for your answer, I will try it.

2009/3/4 yanky young <[email protected]>

> yes, good tips. Luke is the scalpel of nutch developers. you can try it.
>
> good luck
>
> yanky
>
>
> 2009/3/4 Jasper Kamperman <[email protected]>
>
> > This does seem strange. In cases like this I find the best approach is to
> > use Luke to
> > see what's in the index -- what do the fields in the Lucene Document look
> > like, is there
> > maybe a truncation or did the page not get parsed right?
> >
> >
> > On Mar 3, 2009, at 6:20 PM, yanky young wrote:
> >
> >> sorry, i have no idea about this question. i guess there must be some
> >> kind of index leakage in the nutch indexing process: some words must be
> >> getting ignored during indexing. but why? i don't know either. hope
> >> someone else can answer your question.
> >>
> >> good luck
> >>
> >> yanky
> >>
> >>
> >> 2009/3/4 Yves Yu <[email protected]>
> >>
> >>> Hi,
> >>>
> >>> And there is another question, if you don't mind ~~)
> >>> for example, in
> >>>
> >>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
> >>>
> >>> there is a phrase "The summit will provide a good opportunity". I can
> >>> find this page by the word "good", but if I search with other words,
> >>> e.g. "opportunity" or "good opportunity", I find nothing.
> >>>
> >>> why?
> >>>
> >>> Yves
> >>>
> >>>
> >>> 2009/3/4 yanky young <[email protected]>
> >>>
> >>>> Hi:
> >>>>
> >>>> because they are actually the same page, you can only find one. here
> >>>> is what i see when i use wget to fetch http://app02.laopdr.gov.la/:
> >>>>
> >>>> C:\Documents and Settings\yanky>wget http://app02.laopdr.gov.la
> >>>> --2009-03-03 23:41:19--  http://app02.laopdr.gov.la/
> >>>> Resolving app02.laopdr.gov.la... 203.110.66.105
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 302 Moved Temporarily
> >>>> Location: http://app02.laopdr.gov.la/ePortal [following]
> >>>> --2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 302 Moved Temporarily
> >>>> Location: http://app02.laopdr.gov.la/ePortal/ [following]
> >>>> --2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal/
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 302 Moved Temporarily
> >>>> Location: http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US [following]
> >>>> --2009-03-03 23:41:21--  http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 200 OK
> >>>> Length: unspecified [text/html]
> >>>> Saving to: `home.act...@request_locale=en_us'
> >>>>
> >>>> you can see that, through several 302 redirects,
> >>>> http://app02.laopdr.gov.la arrives at
> >>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
> >>>> so when nutch fetches http://app02.laopdr.gov.la, it actually fetches
> >>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
> >>>> so in the end only the page content of
> >>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>> is fetched and indexed.
> >>>>
> >>>> that doesn't have anything to do with dynamic pages. it is about how
> >>>> nutch processes 302 status.
> >>>>
> >>>> good luck
> >>>>
> >>>> yanky
> >>>>
> >>>> 2009/3/4 Yves Yu <[email protected]>
> >>>>
> >>>>> thank you for your answer.
> >>>>> It seems strange to me because http://app02.laopdr.gov.la/ is just
> >>>>> the same as
> >>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>> but I cannot find it.
> >>>>>
> >>>>> you can see a few frames such as "Hot Event", "Businees" in
> >>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>> when I copy a few words from these frames and search for them, I
> >>>>> cannot find this homepage.
> >>>>> but nutch can find the page behind "more>>" with the same words.
> >>>>>
> >>>>> I can see both http://app02.laopdr.gov.la/ and
> >>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>> in my fetch log, but I just cannot find the page.
> >>>>>
> >>>>> I suspect it has to do with dynamic pages... is that reasonable?
> >>>>>
> >>>>> 2009/3/3 yanky young <[email protected]>
> >>>>>
> >>>>>> Hi:
> >>>>>>
> >>>>>> Why do u think nutch can't find
> >>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>>>
> >>>>>> Actually http://app02.laopdr.gov.la/ is the same page as
> >>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>>>
> >>>>>> if you find http://app02.laopdr.gov.la in your log, the page you
> >>>>>> mentioned must have been downloaded.
> >>>>>>
> >>>>>> good luck
> >>>>>>
> >>>>>> yanky
> >>>>>>
> >>>>>> 2009/3/3 Yves Yu <[email protected]>
> >>>>>>
> >>>>>>> Hi, all,
> >>>>>>>
> >>>>>>> I ran into a problem and need help, thank you in advance.
> >>>>>>> I added
> >>>>>>> http://app02.laopdr.gov.la/
> >>>>>>> into urls.txt
> >>>>>>>
> >>>>>>> nutch can find
> >>>>>>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
> >>>>>>>
> >>>>>>> but nutch cannot find
> >>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>>>>
> >>>>>>> does anybody have any idea?
> >>>>>>>
> >>>>>>> Yves
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >
>
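
The chain of 302 redirects in the wget log quoted above can be sketched in a few lines of Python. This is only an illustration of why the seed URL and the home.action URL end up as one page; it is not how Nutch itself handles redirects. The names follow_redirects, HOPS and max_hops are assumptions made up for this example, and the lookup table stands in for real HTTP requests. The URLs are copied from the log.

```python
from typing import Callable, List, Optional


def follow_redirects(
    url: str,
    next_location: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> List[str]:
    """Return every URL visited, starting at `url` and ending at the final page.

    `next_location` returns the redirect target for a URL, or None when the
    URL answers with a page (200 OK) instead of a redirect.
    """
    chain = [url]
    while len(chain) <= max_hops:
        target = next_location(chain[-1])
        if target is None:
            return chain  # no further redirect: this is the page that gets fetched
        if target in chain:
            raise ValueError("redirect loop at " + target)
        chain.append(target)
    raise ValueError("more than %d redirects" % max_hops)


# Hypothetical lookup table standing in for live requests; each entry is one
# "302 Moved Temporarily" hop taken from the wget log.
HOPS = {
    "http://app02.laopdr.gov.la/": "http://app02.laopdr.gov.la/ePortal",
    "http://app02.laopdr.gov.la/ePortal": "http://app02.laopdr.gov.la/ePortal/",
    "http://app02.laopdr.gov.la/ePortal/":
        "http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US",
}

chain = follow_redirects("http://app02.laopdr.gov.la/", HOPS.get)
for hop in chain:
    print(hop)
```

The last URL printed is the only one that returns content, which matches the explanation in the thread: the seed URL and the home.action URL are one and the same page, so only one document can show up in the index.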
