I'm on Nutch 0.7, and I recently noticed that after merging segments, a lot of URLs that I expected to be there had disappeared. I ran segread -dumpsort on the original segments and on the merged segment and found that I had lost 30% of my URLs.
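For anyone who wants to run the same kind of before/after check, here is a minimal sketch in Python. It assumes the URLs from each dump have already been extracted into plain lists or text files with one URL per line. The function name compare_url_sets and that input format are assumptions made for illustration; they are not part of Nutch or of the segread output.

```python
def compare_url_sets(original_urls, merged_urls):
    """Compare URLs seen before and after a merge.

    Both arguments are iterables of URL strings (for example, open text
    files with one URL per line). This input format is an assumption;
    it is not the native segread dump format.
    """
    # Normalize: strip whitespace/newlines and ignore blank lines.
    original = {u.strip() for u in original_urls if u.strip()}
    merged = {u.strip() for u in merged_urls if u.strip()}

    lost = original - merged    # present before the merge, missing after
    added = merged - original   # absent before the merge, present after

    # Share of the original URLs that did not survive the merge.
    lost_pct = 100.0 * len(lost) / len(original) if original else 0.0
    return sorted(lost), sorted(added), lost_pct
```

Passing two open file objects (one per URL list) works as well as passing lists, since the function only iterates over lines. A non-empty second result would indicate URLs that appear only after the merge.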
Doing a diff on the URL files, I found that some URLs had even been resurrected (they didn't show up in the original segments, but showed up in the merged segment). I checked the logs and there was one small corrupted segment (not enough to account for the lost URLs), but mergesegs just seemed to ignore it and carry on. I commented out the code in SegmentMergeTool.java that deletes duplicates, and the problem went away: I now get the same set of URLs before and after merging.

My plan for now is to keep this deletion code commented out locally and run bin/nutch dedup on the merged index, but I was wondering whether anyone else has seen this problem in either 0.7 or 0.8. Any ideas on why it might be happening?

Thanks!
Howie

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
