Re: [Nutch-general] deleting URL duplicates - never actually deleted?

Marko Bauhardt Sun, 02 Jul 2006 14:37:17 -0700

>
> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup crawl/segments bogus
>


The dedup command works only on indexes (one or more), not on
segments. The directory structure of an index looks like:
index/part-00000/SOME_LUCENE_FILES

Here is an example of the structure of a crawl:
crawl/segments/20060702232437
crawl/segments/20060702233040
crawl/linkdb
crawl/indexes //this is the index of the two segments

Now you can run dedup: bin/nutch dedup crawl/indexes

If you run dedup on a folder that contains segments, an exception
should be thrown. Look at your log files and verify that the dedup
process runs without exceptions.
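
As a rough guard against pointing dedup at the wrong folder, something like the sketch below could check the layout first. It assumes that an indexes directory can be recognized by its part-* subdirectories, as in the structure shown above. The function names has_index_parts and run_dedup are made up for this example and are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeeds if "$1" contains at
# least one part-* subdirectory, i.e. the index layout described above
# (index/part-00000/SOME_LUCENE_FILES).
has_index_parts() {
    for d in "$1"/part-*; do
        # If the glob matches nothing it stays literal and -d fails.
        [ -d "$d" ] && return 0
    done
    return 1
}

# Hypothetical wrapper: run dedup only when the layout check passes.
run_dedup() {
    if has_index_parts "$1"; then
        bin/nutch dedup "$1"
    else
        echo "$1 does not look like an indexes directory; not running dedup" >&2
        return 1
    fi
}
```

With the example crawl above, run_dedup crawl/indexes would call bin/nutch dedup, while run_dedup crawl/segments would refuse, because the segment folders are named by timestamp and not part-*.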

Marko


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
