Yes, good tip. Luke is the scalpel of Nutch developers, so you can try it. Good luck.
yanky

2009/3/4 Jasper Kamperman <[email protected]>

> This does seem strange. In cases like this I find the best approach is to
> use Luke to see what's in the index -- what do the fields in the Lucene
> Document look like, is there maybe a truncation or did the page not get
> parsed right?
>
> On Mar 3, 2009, at 6:20 PM, yanky young wrote:
>
>> sorry, i have no idea about this question. i guess there must be some kind
>> of index leakage in the nutch indexing process. some words must be ignored
>> during indexing. but why? i don't know either. hope someone else can
>> answer your question.
>>
>> good luck
>>
>> yanky
>>
>> 2009/3/4 Yves Yu <[email protected]>
>>
>>> Hi,
>>>
>>> And there is another question, if you don't mind ~~)
>>> for example, in
>>>
>>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
>>>
>>> there is a phrase "The summit will provide a good opportunity". I can find
>>> this page by the word "good", but if I search for other words, e.g.
>>> "opportunity" or "good opportunity", I find nothing.
>>>
>>> why?
>>>
>>> Yves
>>>
>>> 2009/3/4 yanky young <[email protected]>
>>>
>>>> Hi:
>>>>
>>>> because they are actually the same page, you can only find one. here is
>>>> what i see when i use wget to fetch http://app02.laopdr.gov.la/:
>>>>
>>>> C:\Documents and Settings\yanky>wget http://app02.laopdr.gov.la
>>>> --2009-03-03 23:41:19-- http://app02.laopdr.gov.la/
>>>> Resolving app02.laopdr.gov.la... 203.110.66.105
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 302 Moved Temporarily
>>>> Location: http://app02.laopdr.gov.la/ePortal [following]
>>>> --2009-03-03 23:41:20-- http://app02.laopdr.gov.la/ePortal
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 302 Moved Temporarily
>>>> Location: http://app02.laopdr.gov.la/ePortal/ [following]
>>>> --2009-03-03 23:41:20-- http://app02.laopdr.gov.la/ePortal/
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 302 Moved Temporarily
>>>> Location: http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US [following]
>>>> --2009-03-03 23:41:21-- http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 200 OK
>>>> Length: unspecified [text/html]
>>>> Saving to: `home.act...@request_locale=en_us'
>>>>
>>>> you can see that, through several 302 redirects,
>>>> http://app02.laopdr.gov.la arrives at
>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
>>>> so when nutch fetches http://app02.laopdr.gov.la, it actually fetches
>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
>>>> so finally only the page content of
>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>> is fetched and indexed.
>>>>
>>>> that doesn't have anything to do with dynamic pages. it is about how
>>>> nutch processes 302 status.
>>>>
>>>> good luck
>>>>
>>>> yanky
>>>>
>>>> 2009/3/4 Yves Yu <[email protected]>
>>>>
>>>>> thank you for your answer.
>>>>> It seems strange to me because http://app02.laopdr.gov.la/ is the same as
>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>> but I cannot find it.
>>>>>
>>>>> you can see a few frames such as "Hot Event", "Businees" in
>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>> when I copy a few words from these frames, I cannot find this homepage.
>>>>> but nutch can find the page behind "more>>" with the same words.
>>>>>
>>>>> I can see both http://app02.laopdr.gov.la/ and
>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>> in my fetch log, but I just cannot find the page.
>>>>>
>>>>> I suspect dynamic pages are the cause... is that reasonable?
>>>>>
>>>>> 2009/3/3 yanky young <[email protected]>
>>>>>
>>>>>> Hi:
>>>>>>
>>>>>> Why do u think nutch can't find
>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US ?
>>>>>>
>>>>>> Actually http://app02.laopdr.gov.la/ is the same page as
>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>>>
>>>>>> if you find http://app02.laopdr.gov.la in your log, the page you
>>>>>> mentioned must have been downloaded.
>>>>>>
>>>>>> good luck
>>>>>>
>>>>>> yanky
>>>>>>
>>>>>> 2009/3/3 Yves Yu <[email protected]>
>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> I ran into a situation and need help, thank you in advance.
>>>>>>> I added
>>>>>>> http://app02.laopdr.gov.la/
>>>>>>> to urls.txt
>>>>>>>
>>>>>>> nutch can find
>>>>>>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
>>>>>>>
>>>>>>> but nutch cannot find
>>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>>>>
>>>>>>> does anybody have any idea?
>>>>>>>
>>>>>>> Yves
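
For anyone who wants to check a redirect chain like the one in the wget output quoted in this thread without wget, here is a minimal sketch in Python. It only illustrates the hop-by-hop 302 behaviour and is not how Nutch itself handles redirects, which depends on its configuration. The names `fetch_once`, `resolve_redirects` and `max_hops` are made up for this example, and `fetch_once` assumes plain HTTP.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_once(url):
    """Send one GET without following redirects.

    Hypothetical helper for illustration (plain HTTP only).
    Returns (status, value of the Location header or None).
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve_redirects(url, fetch=fetch_once, max_hops=10):
    """Follow redirects one hop at a time.

    Returns (final_url, hops), where hops is a list of (url, status)
    pairs in the order they were requested.
    """
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            return url, hops
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

Because `fetch` is a parameter, the chain can be replayed offline by passing a lookup over canned responses (for example a dict's `__getitem__`) instead of hitting the server. If the final URL differs from the seed URL, the seed and the final page are one document as far as the fetched content goes, which matches the explanation given in the thread.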
