Jesse Hires wrote:
What are "segments.gen" and "segments_2"?
The warning I am getting happens when I dedup two indexes.
I create index1 and index2 through generate/fetch/index, etc.
index1 is an index of half the segments. index2 is an index of the other half.
The warning is happening on both datanodes.
The command I am running is "bin/nutch dedup crawl/index1 crawl/index2"
If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?
They are created as files from the start
"bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
crawl/segments/YYY"
I don't see any errors or warnings about creating the index.
The index command that you quote above produces multiple partial indexes,
located in crawl/index1/part-NNNNN, and the Lucene indexes can be found
only in these subdirectories. However, the deduplication process doesn't
accept the parent directory of a set of partial indexes, so you need to
specify each part-NNNNN directory as an input to dedup.
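
As a rough sketch of the advice above, the part-NNNNN directories can be collected with a small shell helper and then passed to dedup. The helper name list_parts is hypothetical. It assumes the indexes sit on a filesystem the local shell can see; for indexes stored on HDFS the local shell cannot expand these paths, so the part directories would have to be listed another way or typed out by hand. It also assumes that dedup takes any number of index directories, as the two-argument command quoted above suggests.

```shell
# Hypothetical helper (not part of Nutch): print every part-* subdirectory
# found directly under each index directory given as an argument.
# Assumes the paths are visible to the local shell (not HDFS-only paths).
list_parts() {
  for index_dir in "$@"; do
    for part in "$index_dir"/part-*; do
      # When nothing matches, the pattern stays unexpanded; skip it.
      if [ -d "$part" ]; then
        printf '%s\n' "$part"
      fi
    done
  done
}

# Dry run: print the dedup command rather than executing it.
# The unquoted substitution relies on word splitting, so this assumes
# the paths contain no whitespace.
echo bin/nutch dedup $(list_parts crawl/index1 crawl/index2)
```

Once the printed command looks right, dropping the leading "echo" runs it. Files such as segments.gen that sit next to the part directories are ignored, because only directories matching part-* are listed.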
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com