Thank you for your answer, I will try it.

2009/3/4 yanky young <[email protected]>
> yes, good tips. Luke is the scalpel of nutch developers. you can try it.
>
> good luck
>
> yanky
>
> 2009/3/4 Jasper Kamperman <[email protected]>
>
> > This does seem strange. In cases like this I find the best approach is to
> > use Luke to see what's in the index -- what do the fields in the Lucene
> > Document look like, is there maybe a truncation or did the page not get
> > parsed right?
> >
> > On Mar 3, 2009, at 6:20 PM, yanky young wrote:
> >
> >> sorry, i have no idea about this question. i guess there must be some kind
> >> of index leakage in the nutch indexing process. some words must be ignored
> >> in the indexing process. but why? i don't know either. hope someone else
> >> can answer your question.
> >>
> >> good luck
> >>
> >> yanky
> >>
> >> 2009/3/4 Yves Yu <[email protected]>
> >>
> >>> Hi,
> >>>
> >>> And there is another question, if you don't find it boring ~~)
> >>> For example, in
> >>>
> >>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
> >>>
> >>> there is a phrase "The summit will provide a good opportunity". I can
> >>> find this page by the word "good", but if I add words to the search, e.g.
> >>> search "opportunity" or "good opportunity", I find nothing.
> >>>
> >>> why?
> >>>
> >>> Yves
> >>>
> >>> 2009/3/4 yanky young <[email protected]>
> >>>
> >>>> Hi:
> >>>>
> >>>> because they are actually the same page, you can only find one. here is
> >>>> what i see when i use wget to fetch http://app02.laopdr.gov.la/:
> >>>>
> >>>> C:\Documents and Settings\yanky>wget http://app02.laopdr.gov.la
> >>>> --2009-03-03 23:41:19--  http://app02.laopdr.gov.la/
> >>>> Resolving app02.laopdr.gov.la... 203.110.66.105
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 302 Moved Temporarily
> >>>> Location: http://app02.laopdr.gov.la/ePortal [following]
> >>>> --2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 302 Moved Temporarily
> >>>> Location: http://app02.laopdr.gov.la/ePortal/ [following]
> >>>> --2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal/
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 302 Moved Temporarily
> >>>> Location: http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US [following]
> >>>> --2009-03-03 23:41:21--  http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
> >>>> HTTP request sent, awaiting response... 200 OK
> >>>> Length: unspecified [text/html]
> >>>> Saving to: `home.act...@request_locale=en_us'
> >>>>
> >>>> you can see that, through several 302 redirects,
> >>>> http://app02.laopdr.gov.la arrives at
> >>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
> >>>> so when nutch fetches http://app02.laopdr.gov.la, it actually fetches
> >>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
> >>>> so in the end only the page content of
> >>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>> is fetched and indexed.
> >>>>
> >>>> that doesn't have anything to do with dynamic pages. it is about how
> >>>> nutch processes the 302 status.
> >>>>
> >>>> good luck
> >>>>
> >>>> yanky
> >>>>
> >>>> 2009/3/4 Yves Yu <[email protected]>
> >>>>
> >>>>> thank you for your answer.
> >>>>> I find it strange, because http://app02.laopdr.gov.la/ is just the same as
> >>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>> but I cannot find it.
> >>>>>
> >>>>> you can see a few frames such as "Hot Event", "Businees" in
> >>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>> when I copy a few words from these frames, I cannot find this homepage,
> >>>>> but nutch can find the page behind "more>>" by the same words.
> >>>>>
> >>>>> I can see both http://app02.laopdr.gov.la/ and
> >>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>> in my fetch log, but I just cannot find the page.
> >>>>>
> >>>>> I suspect the dynamic pages... is that reasonable?
> >>>>>
> >>>>> 2009/3/3 yanky young <[email protected]>
> >>>>>
> >>>>>> Hi:
> >>>>>>
> >>>>>> Why do you think nutch can't find
> >>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US ?
> >>>>>>
> >>>>>> Actually http://app02.laopdr.gov.la/ is the same page as
> >>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>>>
> >>>>>> if you find http://app02.laopdr.gov.la in your log, the page you
> >>>>>> mentioned must have been downloaded.
> >>>>>>
> >>>>>> good luck
> >>>>>>
> >>>>>> yanky
> >>>>>>
> >>>>>> 2009/3/3 Yves Yu <[email protected]>
> >>>>>>
> >>>>>>> Hi, all,
> >>>>>>>
> >>>>>>> I have run into a situation and need help, thank you in advance.
> >>>>>>> I added
> >>>>>>> http://app02.laopdr.gov.la/
> >>>>>>> into urls.txt
> >>>>>>>
> >>>>>>> nutch can find
> >>>>>>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
> >>>>>>>
> >>>>>>> but nutch cannot find
> >>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
> >>>>>>>
> >>>>>>> anybody has any idea?
> >>>>>>>
> >>>>>>> Yves
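
For reference, the redirect chain from the wget log quoted above can be replayed in a few lines of Python. This is only a rough sketch of the idea yanky describes (several URLs ending at one final page), not how nutch itself handles redirects. The REDIRECTS table simply copies the three 302 hops from the log, and the names resolve and final_urls are made up for illustration.

```python
# Redirect hops copied from the wget log in the quoted thread (each was a 302).
# This table is a stand-in for real HTTP requests; nothing is fetched here.
REDIRECTS = {
    "http://app02.laopdr.gov.la/":
        "http://app02.laopdr.gov.la/ePortal",
    "http://app02.laopdr.gov.la/ePortal":
        "http://app02.laopdr.gov.la/ePortal/",
    "http://app02.laopdr.gov.la/ePortal/":
        "http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US",
}


def resolve(url, redirects, max_hops=10):
    """Return the list of URLs visited, ending at one with no further redirect."""
    hops = [url]
    while url in redirects:
        if len(hops) > max_hops:
            raise RuntimeError("too many redirects")
        url = redirects[url]
        hops.append(url)
    return hops


def final_urls(seeds, redirects):
    """Map each seed URL to the URL it finally lands on."""
    return {seed: resolve(seed, redirects)[-1] for seed in seeds}


if __name__ == "__main__":
    seeds = [
        "http://app02.laopdr.gov.la/",
        "http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US",
    ]
    for seed, final in final_urls(seeds, REDIRECTS).items():
        print(seed, "->", final)
```

Both seeds land on the same final URL, which fits the explanation that only one page's content ends up fetched and indexed. Which URL that content is stored under in the index is not something this sketch can tell you; Luke is still the way to check that.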
