I crawled a website. Out of 1000 links, 100 failed to fetch because of invalid links, network errors, or some other type of error. When I searched for any of these 100 failed pages, I didn't find them in the search results.
So I think Nutch deletes these kinds of URLs from the index, but some page info about them still exists in the segments. Is this right? If so, will having these failed pages in the segments affect performance? I am assuming not, but any clarity on this is appreciated. Will the size of the segments affect performance too, given that storing info about the failed pages takes some space?

Also, I am curious about the inner details of how indexing works in Nutch: how does Nutch index a segment, what does it index, and what does it not index? I could use Luke to inspect the index. Are there any other tools or techniques?

Thanks.

--
View this message in context: http://www.nabble.com/Failed-Fetch-Pages---Index-Verification-and-Optimization-tf4564385.html#a13027939
Sent from the Nutch - Dev mailing list archive at Nabble.com.
