Hetal Shah wrote:
Another quick update:

I ran Luke on the index, and part-00000 works fine, whereas part-00001 comes
up as corrupt or missing. From the list of files in both these directories,
we know that there is nothing in part-00001, so why does it get generated?
And if it does, why does dedup not handle it gracefully?

I also ran a merge on the two indexes, and it worked fine.
So that puts to rest the idea that both indexes are corrupted. This leads me
to believe that, since I only had two pages indexed and the index was
small, part-00001 came up empty, and dedup does not handle that case?

Any thoughts?

There seems to be an issue with the document partitioning. It seems that for larger numbers of documents the partitioning scheme puts at least one document in each partition, but in your case there were too few documents to fill the second partition. I need to check where the problem originates; however, this should not happen if you index more documents than 2 * the number of reduce tasks.
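
To illustrate how a partition can end up empty, here is a minimal sketch. It assumes documents are assigned to partitions by hashing a key modulo the number of reduce tasks; that is an assumption for illustration, not a confirmed description of what the indexer does. The function names, the CRC32 hash, and the example URLs are all hypothetical.

```python
import zlib


def partition(keys, num_partitions):
    """Assign each key to a bucket by hash modulo the partition count.

    Illustrative only: CRC32 stands in for whatever hash the real
    partitioner uses (an assumption, not the actual implementation).
    """
    buckets = [[] for _ in range(num_partitions)]
    for key in keys:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        buckets[index].append(key)
    return buckets


def empty_partitions(buckets):
    """Return the indexes of partitions that received no documents."""
    return [i for i, bucket in enumerate(buckets) if not bucket]


if __name__ == "__main__":
    # Hypothetical example: two pages, two reduce tasks.
    docs = ["http://example.com/a", "http://example.com/b"]
    buckets = partition(docs, 2)
    for i, bucket in enumerate(buckets):
        print("part-%05d: %d document(s)" % (i, len(bucket)))
    print("empty partitions:", empty_partitions(buckets))
```

With only a handful of keys nothing guarantees that every bucket receives one, so a step that opens each part-NNNNN output would need to tolerate an empty one. With many more keys than partitions, an empty bucket becomes unlikely.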

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

