Thanks for the response, Andrzej.

> Mergesegs also performs dedup. If you compare the list of urls in the index
> based on the original input segments, but AFTER dedup, and in the index
> built from the merged segment, are they different?

I should have specified: I didn't run index after merging. I just ran bin/nutch mergesegs -dir mydb/segments (no -i or -ds options), then immediately ran segread on the new merged segment. The list of URLs is different: mostly missing URLs, but also some "new" URLs. I find the addition of new URLs in the merged segment especially puzzling. Where do they come from? Is segread lying to me about what's in the original segments?

I checked the segread output for the deleted URLs and didn't find anything strange in their status. I have a feeling that the mergesegs dedup is causing the problem, since when I commented out that code the list of URLs was the same before and after merging. It's possible that I have some sort of corruption in the original segments that is causing unpredictable behavior in the mergesegs dedup code.

> Could you perhaps provide a minimal fetchlist + exact steps you took, to
> illustrate and reproduce the problem?

I don't have a minimal fetchlist right now. I'll see if I can get one together. I wouldn't be surprised if the problem only occurred after getting a significant number of pages.

Thanks,
Howie
