deleting URL duplicates - never actually deleted?

Honda-Search Administrator Thu, 29 Jun 2006 14:07:59 -0700

Maybe someone can explain to me how this works.

First, my setup.

I create a fetchlist each night with FreeFetchlistTool and fetch those pages. The fetchlist often contains URLs that are already in the database, but this tool gets the newest copies of those URLs.

I also run nutch dedup after everything is fetched, indexed, etc. I then merge the segments using the following command:


ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

Every night the number of "duplicates" increases. I assume this is because the duplicates from the day before are not actually deleted.

Is dedup removing them only from some sort of master index, while the segments retain their original information?

If so, is there a way to merge the segments into one (or a few) so that duplicate URLs no longer exist? Would mergesegs do this?
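
To be concrete about the result I am hoping for, here is a minimal sketch outside of Nutch. It assumes a made-up record format of "<fetch date> <URL>", one line per fetched page. The sample records and the newest_per_url helper are hypothetical and only illustrate "one copy per URL, newest wins"; they say nothing about how dedup or mergesegs actually work or how segments are stored.

```shell
# Hypothetical records, one per fetched page: "<fetch date> <URL>".
# These stand in for segment contents; this is NOT Nutch's on-disk format.
records='2006-06-27 http://example.com/a
2006-06-28 http://example.com/a
2006-06-28 http://example.com/b
2006-06-29 http://example.com/a'

# Hypothetical helper: sort newest first (ISO dates sort as text),
# then keep only the first record seen for each URL (field 2).
newest_per_url() {
  sort -r | awk '!seen[$2]++'
}

deduped=$(printf '%s\n' "$records" | newest_per_url)
printf '%s\n' "$deduped"
```

With these sample records, only the 2006-06-29 copy of /a and the single copy of /b remain. That is the state I would like the merged segments to end up in.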


Thanks for any help, and I hope my question is clear.

Matt
