This does seem strange. In cases like this I find the best approach is to use Luke to see what's in the index: what do the fields in the Lucene Document look like? Was the
content maybe truncated, or did the page not get parsed right?
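
If truncation is the suspect, one quick check alongside Luke is to see whether the missing word sits past the point where the page content would have been cut off. Below is a minimal Python sketch of that check, assuming you have the raw page bytes at hand. The function name and the sample limit are made up for illustration and are not part of Nutch; as far as I recall the relevant Nutch setting is `http.content.limit`, so plug in whatever value your own configuration uses.

```python
def survives_truncation(content: bytes, phrase: str, limit: int) -> bool:
    """Return True if the first occurrence of `phrase` ends within the
    first `limit` bytes of `content`, i.e. it would survive a cut at `limit`.

    Returns False if the phrase is absent or extends past the limit.
    """
    needle = phrase.encode("utf-8")
    start = content.find(needle)
    if start < 0:
        return False  # phrase is not in the page at all
    return start + len(needle) <= limit


if __name__ == "__main__":
    # Hypothetical page: "good" appears early, "opportunity" appears late.
    page = b"x" * 10 + b"good" + b"y" * 10 + b"opportunity"
    hypothetical_limit = 20  # stand-in for the configured content limit
    for word in ("good", "opportunity"):
        print(word, survives_truncation(page, word, hypothetical_limit))
```

If a word that is visibly on the page comes back as not surviving, truncation becomes a plausible explanation; if everything survives, a parsing problem is the more likely culprit.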

On Mar 3, 2009, at 6:20 PM, yanky young wrote:

Sorry, I have no idea about this question. I guess some of the page content is being lost during the Nutch indexing process, so some words never make it into the index. But why? I don't know either. Hope someone else can answer
your question.

good luck

yanky


2009/3/4 Yves Yu <[email protected]>

Hi,

And here is another question, if you don't mind ~~)
For example,

in

http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome

there is the phrase "The summit will provide a good opportunity". I can find this page by searching for the word "good", but if I search for other words, e.g.
"opportunity" or "good opportunity", I find nothing.

Why?

Yves


2009/3/4 yanky young <[email protected]>

Hi:

Because they are actually the same page, you can only find one. Here is
what I see when I use wget to fetch http://app02.laopdr.gov.la/:

C:\Documents and Settings\yanky>wget http://app02.laopdr.gov.la
--2009-03-03 23:41:19--  http://app02.laopdr.gov.la/
Resolving app02.laopdr.gov.la... 203.110.66.105
Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://app02.laopdr.gov.la/ePortal [following]
--2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal
Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://app02.laopdr.gov.la/ePortal/ [following]
--2009-03-03 23:41:20--  http://app02.laopdr.gov.la/ePortal/
Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US [following]
--2009-03-03 23:41:21--  http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
Connecting to app02.laopdr.gov.la|203.110.66.105|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `home.act...@request_locale=en_us'

As you can see, after several 302 redirects
http://app02.laopdr.gov.la arrives at
http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
so when Nutch fetches http://app02.laopdr.gov.la, it actually fetches
http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US,
and in the end only the page content of
http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
is fetched and indexed.

That doesn't have anything to do with dynamic pages. It is about how
Nutch processes the 302 status.
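
To see the same redirect chain without reading a wget log, here is a rough Python sketch that follows the redirects one hop at a time and records each URL with its status. The names `fetch_once` and `redirect_chain` are made up for this example, and this is only how I would trace the hops by hand, not how Nutch handles redirects internally.

```python
import http.client
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

Response = Tuple[int, Optional[str]]  # (status code, Location header or None)
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url: str) -> Response:
    """Send one GET request without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def redirect_chain(
    url: str,
    fetch: Callable[[str], Response] = fetch_once,
    max_hops: int = 10,
) -> List[Tuple[str, int]]:
    """Return [(url, status), ...] for every hop until a non-redirect answer."""
    chain: List[Tuple[str, int]] = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status not in REDIRECT_CODES or not location:
            return chain
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    for hop_url, hop_status in redirect_chain("http://app02.laopdr.gov.la/"):
        print(hop_status, hop_url)
```

The last entry in the list is the URL whose content actually gets downloaded; every entry before it is just a redirect with no page content of its own. The `fetch` parameter is there so the chain logic can be tried with a fake lookup table instead of hitting the real server.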

good luck

yanky

2009/3/4 Yves Yu <[email protected]>

Thank you for your answer.
I find it strange because http://app02.laopdr.gov.la/ is just the same
as

http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
but I cannot find it.

You can see a few frames such as "Hot Event" and "Businees" in

http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
When I copy a few words from these frames and search for them, I cannot find this homepage,
but Nutch can find the page behind "more>>" using the same words.

I can see both http://app02.laopdr.gov.la/ and

http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US
in my fetch log, but I just cannot find the page when I search.

I suspect this has to do with dynamic pages... is that reasonable?

2009/3/3 yanky young <[email protected]>

Hi:

Why do you think Nutch can't find

http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US

Actually http://app02.laopdr.gov.la/ is the same page as

http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US

If you find http://app02.laopdr.gov.la in your log, the page you
mentioned must have been downloaded.

good luck

yanky

2009/3/3 Yves Yu <[email protected]>

Hi, all,

I've run into a problem and need help; thank you in advance.
I added
http://app02.laopdr.gov.la/
to urls.txt.

Nutch can find

http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10109&from=ePortal_NewsDetail_FromHome

but Nutch cannot find

http://app02.laopdr.gov.la/ePortal/home/home.action?request_locale=en_US

Does anybody have any idea?

Yves





