Howie Wang wrote:
> I'm on Nutch 0.7, and I just noticed recently that after
> merging segments, a lot of URLs that I thought should be
> there disappeared. I did a segread -dumpsort on the
> original segments and on the merged segment and found
> that I had lost 30% of my URLs.
>
> Doing a diff on the url files, I found that some URLs were
> even resurrected (they didn't show up in the original
> segments, but showed up on the merged segment).
>
> I checked the logs and there was one small corrupted segment
> (not enough to account for the lost URLs), but mergesegs just
> seemed to ignore it and go on.
>
> I commented out the code in SegmentMergeTool.java that
> had to do with deleting duplicates, and the problem went
> away. I get the same set of URLs before and after merging.
>
> My plan for now is to locally comment out this deletion
> code, and use bin/nutch dedup on the merged index, but
> I was wondering if anyone else has seen this problem in either
> 0.7 or 0.8. Any ideas on why it might be happening?

Mergesegs also performs dedup. If you compare the list of URLs in the
index built from the original input segments, AFTER running dedup on it,
with the list of URLs in the index built from the merged segment, are
they different?

Could you perhaps provide a minimal fetchlist and the exact steps you
took, to illustrate and reproduce the problem?

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
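
The before/after comparison discussed in this thread could be sketched roughly as follows. This is only an illustration, not part of Nutch: it assumes each URL list has already been extracted to a plain-text file with one URL per line, and the function names `load_urls` and `compare_url_sets` are hypothetical.

```python
def load_urls(path):
    """Read a text file with one URL per line into a set.

    Assumption: the URL list was extracted to plain text beforehand;
    blank lines and surrounding whitespace are ignored.
    """
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def compare_url_sets(before, after):
    """Return (lost, resurrected) as sorted lists.

    lost        -- URLs present before merging but missing afterwards
    resurrected -- URLs absent before merging but present afterwards
    """
    lost = sorted(before - after)
    resurrected = sorted(after - before)
    return lost, resurrected
```

Calling `compare_url_sets(load_urls(a), load_urls(b))` on the two lists gives both directions of the difference at once, which makes it easier to tell URLs dropped by dedup from URLs that only appear in the merged segment.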
