Thanks Andrzej. I don't think my scenario would be applicable in real-life situations. However, it would be great to know where the root of the problem lies.
I have managed to dedup a larger index, and it is working perfectly. So your theory is correct. I guess it's a matter of digging a little deeper to eliminate this once and for all.

Thanks.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: 01 February 2007 17:59
To: [email protected]
Subject: Re: Dedup index error

Hetal Shah wrote:
> Another quick update:
>
> I ran Luke on the index, and part-00000 works fine, whereas part-00001
> comes up as corrupt or missing. Now seeing from the list of files in
> both these directories, we know that there is nothing in part-00001 -
> so why does it get generated? And if it does, why does dedup not
> handle it gracefully?
>
> I also ran a merge on the two indexes, and it worked fine.
>
> So that rests the case that both the indexes are corrupted. This
> brings me to understand that since I only had two pages indexed and
> the index was small, part-00001 came up with nothing, and dedup does
> not handle it????
>
> Any thoughts?
>

There seems to be an issue with the document partitioning. For larger numbers of documents the partitioning scheme generates at least one document per partition, but in your case there were too few documents to fill the second partition. I need to check where the problem originates. However, this should not happen if you index more documents than 2 * the number of reduce tasks.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
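For anyone following the thread, the partitioning effect described above can be illustrated with a small sketch. It is not Nutch or Hadoop code: the function names, the example URLs, and the use of CRC32 as the hash are all assumptions made for illustration. It only shows how assigning a handful of documents to partitions by hash can leave a partition such as part-00001 empty, and how a consumer could skip empty partitions instead of failing on them.

```python
import zlib


def partition(key: str, num_partitions: int) -> int:
    """Pick a partition for a key by hashing it.

    Illustrative only: CRC32 stands in for whatever hash the real
    partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_docs(urls, num_partitions):
    """Distribute document keys over a fixed number of partitions."""
    parts = {i: [] for i in range(num_partitions)}
    for url in urls:
        parts[partition(url, num_partitions)].append(url)
    return parts


def non_empty_parts(parts):
    """Return only partitions that contain documents, named part-NNNNN."""
    return {f"part-{i:05d}": docs for i, docs in parts.items() if docs}


if __name__ == "__main__":
    # Hypothetical URLs; with very few documents, one of the two
    # partitions may receive nothing at all.
    urls = ["http://example.com/a", "http://example.com/b"]
    parts = partition_docs(urls, num_partitions=2)
    for index, docs in parts.items():
        print(f"part-{index:05d}: {len(docs)} document(s)")
    print("partitions worth processing:", sorted(non_empty_parts(parts)))
```

With a single document and two partitions, one partition is always empty. With two documents the outcome depends on the hash values, which matches the observation that the problem only shows up on very small indexes.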
