Marko,
Currently the shell commands are as follows:
---
# index new segment
bin/nutch index $s1
# update the database
bin/nutch updatedb crawl/db $s1
# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus
# Merge indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
---
Should I actually switch the last two commands around?
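That is, the tail of the script would become the same two commands, just in the opposite order:
---
# Merge indexes first
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
# then de-duplicate ("bogus" argument still needed, as above)
bin/nutch dedup crawl/segments bogus
---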
Matt
----- Original Message -----
From: "Marko Bauhardt" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, June 30, 2006 2:57 AM
Subject: Re: deleting URL duplicates - never actually deleted?
Do you delete the duplicates before you merge the index? Run the
merge command first and then the dedup command.
An even better way is to create one index of all segments with the
index command and then run the dedup command on that one index.
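Roughly like this, using the paths from your script (assuming here
that dedup will accept the merged index directory in place of the
segments directory):
---
# build one index from all the per-segment indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
# de-duplicate that single index
# (assumption: dedup can be pointed at the merged index directory)
bin/nutch dedup crawl/index bogus
---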
Hope this helps,
Marko
On 29.06.2006 at 23:07, Honda-Search Administrator wrote:
Maybe someone can explain to me how this works.
First, my setup.
I create a fetchlist each night with FreeFetchlistTool and fetch
those pages. It often contains the same URLs that are already in
the database, but this tool gets the newest copies of those URLs.
I also run nutch dedup after everything is fetched, indexed, etc.
I then merge the segments using the following command:
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
Every night the number of "duplicates" increases. I assume this is
because the duplicates from the day before are not actually deleted.
Is dedup removing them from some sort of master index while the
segments retain their original information?
If so, is there a way to merge the segments into one (or whatever)
so that duplicate URLs do not exist? Would mergesegs do this?
Thanks for any help, and I hope my question is clear.
Matt