Re: [Nutch-general] deleting URL duplicates - never actually deleted?

Marko Bauhardt Fri, 30 Jun 2006 02:57:54 -0700

Do you delete the duplicates before you merge the index? Run the
merge command first, and then the dedup command.
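
As a rough sketch of that ordering (not a tested recipe): the merge line is the one from the original post, while the `NUTCH` variable, the `merge_then_dedup` function name, and the argument passed to dedup are assumptions for illustration. The dedup arguments differ between Nutch versions, so check the usage output of `bin/nutch dedup` for your release.

```shell
#!/bin/sh
# Sketch only: merge first, then dedup the merged index.
# NUTCH and merge_then_dedup are hypothetical names; the dedup
# argument form below is an assumption and varies by Nutch version.
NUTCH="${NUTCH:-bin/nutch}"

merge_then_dedup() {
    segments_dir="$1"
    index_dir="$2"

    # Step 1: merge the per-segment indexes into one index
    # (same command as in the original post).
    ls -d "$segments_dir"/* | xargs $NUTCH merge "$index_dir" || return 1

    # Step 2: only now remove duplicates, working on the merged index.
    $NUTCH dedup "$index_dir"
}
```

It would be called as `merge_then_dedup "$segments_dir" "$index_dir"` at the end of the nightly run, after fetching and indexing.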


A better way, though, is to create a single index of all segments with
the index command and then run the dedup command on that one index.

Hope this helps,
Marko


On 29.06.2006 at 23:07, Honda-Search Administrator wrote:

> Maybe someone can explain to me how this works.
>
> First, my setup.
>
> I create a fetchlist each night with FreeFetchlistTool and fetch
> those pages.  It often contains the same URLs that are already in
> the database, but this tool gets the newest copies of those URLs.
>
> I also run nutch dedup after everything is fetched, indexed, etc.   
> I then merge the segments using the following command:
>
> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>
> Every night the number of "duplicates" increases.  This is
> because the duplicates from the day before are not actually deleted
> (I assume).
>
> Is dedup removing them from some sort of master index and the  
> segments retain their original information?
>
> If so, is there a way to merge the segments into one (or whatever)  
> so that duplicate URLs do not exist?  Would mergesegs do this?
>
> Thanks for any help, and I hope my question is clear.
>
> Matt
>
>

