# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus


The dedup command works only on one or more indexes, not on segments. The directory structure of an index looks like:
index/part-00000/SOME_LUCENE_FILES

Here is an example of the structure of a crawl:
crawl/segments/20060702232437
crawl/segments/20060702233040
crawl/linkdb
crawl/indexes //this is the index of the two segments

Now you can run dedup: bin/nutch dedup crawl/indexes
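
If you want to guard against passing the wrong folder, a small check along these lines might help. It is only a sketch: the function name looks_like_index_dir is made up for this example, and it assumes that an index folder has part-* subdirectories directly beneath it (as in the layout above), while a segments folder does not.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeeds if the given directory
# contains at least one part-* subdirectory directly beneath it, which is
# the index layout described above (index/part-00000/...).
looks_like_index_dir() {
    for part in "$1"/part-*; do
        # If the glob matched nothing, "$part" is the literal pattern
        # and the -d test fails, so we fall through to "return 1".
        if [ -d "$part" ]; then
            return 0
        fi
    done
    return 1
}

# Possible usage (paths taken from the example crawl above):
#   if looks_like_index_dir crawl/indexes; then
#       bin/nutch dedup crawl/indexes
#   else
#       echo "crawl/indexes does not look like an index folder" >&2
#   fi
```

The check only looks at directory names, so it cannot tell whether the part-* folders really hold Lucene files.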

If you run dedup on a folder that contains segments, an exception should be thrown. Check your log files to verify that the dedup process ran without exceptions.
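
One way to scan a log for exceptions is a plain grep. Again a sketch only: log_has_exceptions is a made-up name, and the log file location depends on your setup, so the path in the usage comment is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): prints every line of the given
# log file that mentions "Exception", with line numbers. Returns 0 if at
# least one such line was found, non-zero otherwise.
log_has_exceptions() {
    grep -n "Exception" "$1"
}

# Possible usage (the log path is an assumption, adjust to your install):
#   if log_has_exceptions logs/hadoop.log; then
#       echo "dedup may have failed, see the lines above" >&2
#   fi
```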

Marko
