Marko,

Currently the shell commands are as follows:
---
# index new segment
bin/nutch index $s1

# update the database
bin/nutch updatedb crawl/db $s1

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus

# Merge indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
---

Should I actually switch the last two commands around?

Matt

----- Original Message -----
From: "Marko Bauhardt" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, June 30, 2006 2:57 AM
Subject: Re: deleting URL duplicates - never actually deleted?

> Do you delete the duplicates before you merge the index? Run the
> merge command first and then the dedup command.
>
> A better way is to create one index of all segments with the
> index command and then run the dedup command on that one index.
>
> Hope this helps,
> Marko
>
>
> On 29.06.2006 at 23:07, Honda-Search Administrator wrote:
>
>> Maybe someone can explain to me how this works.
>>
>> First, my setup.
>>
>> I create a fetchlist each night with FreeFetchlistTool and fetch
>> those pages. It often contains the same URLs that are already in
>> the database, but this tool gets the newest copies of those URLs.
>>
>> I also run nutch dedup after everything is fetched, indexed, etc.
>> I then merge the segments using the following command:
>>
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> Every night the number of "duplicates" increases. This is because
>> the duplicates from the day before are not actually deleted
>> (I assume).
>>
>> Is dedup removing them from some sort of master index and the
>> segments retain their original information?
>>
>> If so, is there a way to merge the segments into one (or whatever)
>> so that duplicate URLs do not exist? Would mergesegs do this?
>>
>> Thanks for any help, and I hope my question is clear.
>>
>> Matt
