# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus
The dedup command works only on a directory of indexes, not on one or
more segments. The directory structure of an index looks like:
index/part-00000/SOME_LUCENE_FILES
Here is an example of the directory structure of a crawl:
crawl/segments/20060702232437
crawl/segments/20060702233040
crawl/linkdb
crawl/indexes //this is the index of the two segments
Now you can run dedup:
bin/nutch dedup crawl/indexes
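As a rough sketch of how to guard against passing the wrong folder, the check below tests whether a directory has part-* subdirectories directly beneath it, following the index/part-00000 layout shown above. The function name looks_like_index_dir is hypothetical and not part of Nutch; the usage lines assume the crawl/indexes path from the example.

```shell
# Hypothetical helper (not part of Nutch): succeeds if "$1" has at least
# one part-* subdirectory directly beneath it, as in index/part-00000/...
looks_like_index_dir() {
    dir=$1
    for part in "$dir"/part-*; do
        # If the glob matches nothing it stays literal and -d fails.
        [ -d "$part" ] && return 0
    done
    return 1
}

# Usage (commented out so the snippet can be sourced without running Nutch):
# if looks_like_index_dir crawl/indexes; then
#     bin/nutch dedup crawl/indexes
# else
#     echo "crawl/indexes does not look like an index directory" >&2
# fi
```

A segments folder holds timestamped subdirectories rather than part-* directories at the top level, so this check should reject it before dedup is started.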
If you run dedup on a folder that contains segments, an exception
should be thrown. Check your log files and verify that the dedup
process runs without exceptions.
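One possible way to do that check from the shell is sketched below. The function name log_has_exceptions is hypothetical, and the log path logs/hadoop.log in the usage lines is an assumption; substitute wherever your installation writes its logs.

```shell
# Hypothetical helper: succeeds if the given log file contains the
# string "Exception" anywhere (for example in a Java stack trace).
log_has_exceptions() {
    grep -q 'Exception' "$1"
}

# Usage (the log path is an assumption, adjust it to your setup):
# if log_has_exceptions logs/hadoop.log; then
#     echo "dedup may have failed, inspect logs/hadoop.log" >&2
# fi
```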
Marko