Jesse Hires wrote:
What is "segments.gen" and "segments_2" ?
The warning I am getting happens when I dedup two indexes.

I create index1 and index2 through generate/fetch/index/...etc
index1 is an index of 1/2 the segments. index2 is an index of the other 1/2

The warning is happening on both datanodes.

The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"

If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?

They are created as files from the start
"bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
crawl/segments/YYY"

I don't see any errors or warnings about creating the index.

The command that you quote above produces multiple partial indexes, located in crawl/index1/part-NNNNN; the Lucene indexes can be found only in these subdirectories. However, the deduplication process doesn't accept partial indexes, so you need to specify each part-NNNNN directory as an input to dedup.
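
To illustrate, here is a minimal dry-run sketch that only prints the dedup command with each part directory spelled out. The helper name build_dedup_cmd is hypothetical, and the two part names (part-00000, part-00001) are an assumption; list the part-NNNNN directories that actually exist under your indexes and substitute those.

```shell
# Hypothetical helper: print (do not run) a dedup command that names
# every part directory explicitly instead of the parent index directory.
build_dedup_cmd() {
  cmd="bin/nutch dedup"
  for index in "$@"; do
    # ASSUMPTION: each index has exactly these two parts.
    # Replace with the part-NNNNN directories you really have.
    for part in part-00000 part-00001; do
      cmd="$cmd $index/$part"
    done
  done
  printf '%s\n' "$cmd"
}

# Dry run for the two indexes from the original question.
build_dedup_cmd crawl/index1 crawl/index2
```

Once the printed command lists the right directories, it can be run as-is in place of "bin/nutch dedup crawl/index1 crawl/index2".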


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
