yes, good tip. Luke is the scalpel of nutch developers. you should try it.
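
to make Jasper's truncation idea a bit more concrete, here is a toy sketch in Python. it is not nutch or lucene code: `build_index`, `search` and the `max_tokens` cutoff are made-up names for illustration only. it just shows that if a document is cut off before indexing, an early word can still match while a later word in the same sentence finds nothing. whether that is really what happens with your page, only a look at the index with Luke can tell.

```python
import re


def build_index(docs, max_tokens=None):
    """Map each lowercase token to the set of document ids containing it.

    max_tokens is a hypothetical per-document cutoff: tokens beyond it are
    dropped before indexing, which is what a truncation would look like.
    """
    index = {}
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        if max_tokens is not None:
            tokens = tokens[:max_tokens]
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, word):
    """Return the ids of documents whose indexed tokens include word."""
    return index.get(word.lower(), set())


# The phrase is the one quoted in the thread; the doc id "news" is arbitrary.
docs = {"news": "The summit will provide a good opportunity"}

full_index = build_index(docs)
cut_index = build_index(docs, max_tokens=6)  # assumed cutoff, for illustration

print(search(full_index, "opportunity"))
print(search(cut_index, "good"))
print(search(cut_index, "opportunity"))
```

with the cutoff in place, "good" is still found but "opportunity" is not, which is the same symptom Yves describes.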

good luck

yanky


2009/3/4 Jasper Kamperman <[email protected]>

> This does seem strange. In cases like this I find the best approach is to
> use Luke to see what's in the index -- what do the fields in the Lucene
> Document look like, is there maybe a truncation or did the page not get
> parsed right?
>
>
> On Mar 3, 2009, at 6:20 PM, yanky young wrote:
>
>  sorry, i have no idea about this question. i guess some words are being
>> dropped somewhere in the nutch indexing process. but why? i don't know
>> either. hope someone else can answer your question.
>>
>> good luck
>>
>> yanky
>>
>>
>> 2009/3/4 Yves Yu <[email protected]>
>>
>>  Hi,
>>>
>>> And there is another question, if you don't mind ~~)
>>> for example
>>>
>>> in
>>>
>>>
>>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
>>>
>>> there is a phrase "The summit will provide a good opportunity", I can find
>>> this page by the word "good", but if I add words to search, ex: search
>>> "opportunity" or "good opportunity", I found nothing.
>>>
>>> why?
>>>
>>> Yves
>>>
>>>
>>> 2009/3/4 yanky young <[email protected]>
>>>
>>>  Hi:
>>>>
>>>> because they are actually the same page, you can only find one. here is
>>>> what i see when i use wget to fetch http://app02.laopdr.gov.la/:
>>>>
>>>> C:\Documents and Settings\yanky>wget http://app02.laopdr.gov.la
>>>> --2009-03-03 23:41:19--  http://app02.laopdr.gov.la/
>>>> Resolving app02.laopdr.gov.la... 203.110.66.105
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 302 Moved Temporarily
>>>> Location: http://app02.laopdr.gov.la/ePortal [following]
>>>> --2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 302 Moved Temporarily
>>>> Location: http://app02.laopdr.gov.la/ePortal/ [following]
>>>> --2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal/
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 302 Moved Temporarily
>>>> Location: http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US [following]
>>>> --2009-03-03 23:41:21--  http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>> Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
>>>> HTTP request sent, awaiting response... 200 OK
>>>> Length: unspecified [text/html]
>>>> Saving to: `home.act...@request_locale=en_us'
>>>>
>>>> you can see that through several steps of 302 status,
>>>> http://app02.laopdr.gov.la arrives at
>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
>>>> so when nutch fetches http://app02.laopdr.gov.la, it actually fetches
>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
>>>> so finally only the page content of
>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>> is fetched and indexed.
>>>>
>>>> that doesn't have anything to do with dynamic pages. it is about how
>>>> nutch processes 302 status.
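
the redirect chain in the wget output above can be sketched as a small Python example. this is only a toy illustration, not how nutch is implemented: the `REDIRECTS` table is hard-coded from the wget log, and `resolve` and `max_hops` are made-up names. a real fetcher would read each hop from the `Location` header of the 302 response.

```python
# Hops copied from the wget log; a real crawler would take these from the
# "Location" header of each HTTP 302 response instead of a fixed table.
REDIRECTS = {
    "http://app02.laopdr.gov.la/":
        "http://app02.laopdr.gov.la/ePortal",
    "http://app02.laopdr.gov.la/ePortal":
        "http://app02.laopdr.gov.la/ePortal/",
    "http://app02.laopdr.gov.la/ePortal/":
        "http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US",
}


def resolve(url, redirects, max_hops=10):
    """Follow redirects from url and return every URL visited, in order."""
    chain = [url]
    while url in redirects:
        if len(chain) > max_hops:
            raise RuntimeError("too many redirects")
        url = redirects[url]
        chain.append(url)
    return chain


chain = resolve("http://app02.laopdr.gov.la/", REDIRECTS)
final_url = chain[-1]

# Only the last URL in the chain returns 200 OK and has page content.
print(" -> ".join(chain))
```

the seed URL and the home.action URL end up at the same content, so only one page is there to be indexed.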
>>>>
>>>> good luck
>>>>
>>>> yanky
>>>>
>>>> 2009/3/4 Yves Yu <[email protected]>
>>>>
>>>>  thank you for your answer.
>>>>> It seems strange to me, because http://app02.laopdr.gov.la/ is the same as
>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>> but I cannot find it.
>>>>>
>>>>> you can see a few frames such as "Hot Event", "Business" in
>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>> when I copy a few words from these frames and search, I cannot find this
>>>>> homepage, but nutch can find the page behind "more>>" with the same words.
>>>>>
>>>>> I can see both http://app02.laopdr.gov.la/ and
>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>> in my fetch log, but I just cannot find the page.
>>>>>
>>>>> I suspect it has to do with dynamic pages... is that reasonable?
>>>>>
>>>>> 2009/3/3 yanky young <[email protected]>
>>>>>
>>>>>  Hi:
>>>>>>
>>>>>> Why do you think nutch can't find
>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US ?
>>>>>>
>>>>>> Actually http://app02.laopdr.gov.la/ is the same page as
>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>>>
>>>>>> if you find http://app02.laopdr.gov.la in your log, the page you
>>>>>> mentioned must have been downloaded.
>>>>>>
>>>>>> good luck
>>>>>>
>>>>>> yanky
>>>>>>
>>>>>> 2009/3/3 Yves Yu <[email protected]>
>>>>>>
>>>>>>  Hi, all,
>>>>>>>
>>>>>>> I ran into a problem and need help, thank you in advance.
>>>>>>> I added
>>>>>>> http://app02.laopdr.gov.la/
>>>>>>> into urls.txt
>>>>>>>
>>>>>>> nutch can find
>>>>>>> http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome
>>>>>>>
>>>>>>> but nutch cannot find
>>>>>>> http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
>>>>>>>
>>>>>>> does anybody have any idea?
>>>>>>>
>>>>>>> Yves
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
