Potential Bug: Index documents with incorrect segment numbers

igor.k Thu, 07 Jan 2010 15:07:32 -0800

Hey Guys,

I've been running various crawls with Nutch and noticed some strange
behavior.
When examining an index with Luke, I noticed that for some documents the
segment field is incorrect.
This happens only rarely.


Example:
A document in the index has url: "www.sample.com", title: "Sample Title",
segment: "20100107210040425", etc., but the content for that document is
actually stored in a different segment.
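
One way to confirm a mismatch like this is to compare the segment value
stored in the index against the URLs each segment actually holds. Below is a
minimal sketch in Python. It assumes the (url, segment) pairs have already
been exported from the index (for example, copied by hand from Luke) and that
the per-segment URL lists have been dumped separately. The function name and
the sample data are hypothetical, and the second segment name is made up for
illustration.

```python
def find_segment_mismatches(indexed_docs, segment_urls):
    """Return (url, indexed_segment, actual_segments) for each document
    whose indexed segment does not actually contain its URL."""
    mismatches = []
    for url, indexed_segment in indexed_docs:
        if url in segment_urls.get(indexed_segment, set()):
            continue  # index and segment agree for this document
        # Look for the segments that really hold this URL.
        actual = sorted(seg for seg, urls in segment_urls.items() if url in urls)
        mismatches.append((url, indexed_segment, actual))
    return mismatches


# Illustrative data only: the second segment name is made up.
indexed_docs = [
    ("www.sample.com", "20100107210040425"),
    ("www.sample.com/about", "20100107210040425"),
]
segment_urls = {
    "20100107210040425": {"www.sample.com/about"},
    "20100107210512000": {"www.sample.com"},
}

for url, indexed_segment, actual in find_segment_mismatches(indexed_docs, segment_urls):
    print(url, "indexed as", indexed_segment, "but found in", actual)
```

An empty result would mean every indexed document points at a segment that
contains it; any tuple returned is a candidate for the behavior described
above.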

I noticed this when crawling http://www.cnn.com with the following
parameters, with crawl-urlfilter.txt set to allow all subdomains of
cnn.com:


./nutch crawl urls -dir crawled -depth 4 -topN 25 

Here are some URLs that demonstrate this behavior:

http://money.cnn.com/markets/nasdaq/
http://money.cnn.com/markets/sandp/ 
http://www.cnn.com/
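
All three URLs fall under the cnn.com subdomain rule. As a rough check, the
sketch below applies an assumed rule of the form
+^http://([a-z0-9]*\.)*cnn\.com/ rewritten as a Python regular expression.
The exact line used in crawl-urlfilter.txt is not shown in this message, so
the pattern and the helper name are assumptions.

```python
import re

# Assumed filter rule, not copied from the actual crawl-urlfilter.txt.
CNN_FILTER = re.compile(r"^http://([a-z0-9]*\.)*cnn\.com/")


def accepted(url):
    """Return True if the URL passes the assumed cnn.com subdomain rule."""
    return CNN_FILTER.search(url) is not None


for url in (
    "http://money.cnn.com/markets/nasdaq/",
    "http://money.cnn.com/markets/sandp/",
    "http://www.cnn.com/",
):
    print(url, accepted(url))
```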

I'm using Nutch 1.0 with a standard configuration.

Has anyone else noticed this?


Thanks,
-Igor
-- 
View this message in context: 
http://old.nabble.com/Potential-Bug%3A-Index-documents-with-incorrect-segment-numbers-tp27068662p27068662.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
