Hey guys,

I've been running various crawls with Nutch and noticed some strange behavior. When examining an index with Luke, I found that for some documents the segment number is incorrect. This seems to occur very rarely.
Example: a document in the index will have url: "www.sample.com", title: "Sample Title", segment: "20100107210040425", etc., but the content for that document is actually stored in a different segment.

I noticed this when crawling http://www.cnn.com with the following command, with crawl-urlfilter.txt set to allow all subdomains of cnn.com:

    ./nutch crawl urls -dir crawled -depth 4 -topN 25

Here are some URLs that demonstrate this behavior:

    http://money.cnn.com/markets/nasdaq/
    http://money.cnn.com/markets/sandp/
    http://www.cnn.com/

I'm using Nutch-1.0 with a standard configuration. Has anyone else noticed this?

Thanks,
-Igor

--
View this message in context: http://old.nabble.com/Potential-Bug%3A-Index-documents-with-incorrect-segment-numbers-tp27068662p27068662.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.