Hi Piotr,

Thanks for replying. I understand the terminology better now! I was referring to URL duplicates. I've rerun the crawl and I'm still getting them:

050613 123307 * Optimizing index...
050613 123307 * Moving index to NFS if needed...
050613 123307 DONE indexing segment 20050613123024: total 3 records in 0.242 s (Infinity rec/s).
050613 123307 done indexing
050613 123307 indexing segment: /www/nutch-nightly/planetbp.tmp/segments/20050613123053
050613 123307 * Opening segment 20050613123053
050613 123307 * Indexing segment 20050613123053
050613 123307 * Optimizing index...
050613 123307 * Moving index to NFS if needed...
050613 123307 DONE indexing segment 20050613123053: total 0 records in 0.021 s (NaN rec/s).
050613 123307 done indexing
050613 123307 Reading url hashes...
050613 123308 Sorting url hashes...
050613 123308 Deleting url duplicates...
050613 123308 Deleted 0 url duplicates.
050613 123308 Reading content hashes...
050613 123309 Sorting content hashes...
050613 123309 Deleting content duplicates...
050613 123309 Deleted 267 content duplicates.
050613 123310 Duplicate deletion complete locally.  Now returning to NFS...
050613 123310 DeleteDuplicates complete
050613 123310 Merging segment indexes...



Hello,
It looks like the deduplication process removed some (309) duplicates. They were content duplicates - different URLs but identical page content. There were no URL duplicates (every URL was different). So what do you really mean by the "duplicate pages" that are returned by your search?
Do they have identical URLs or identical content?
One more thing to remember is that Nutch deduplication currently only removes pages with identical content - even the smallest difference in the page source (including URLs, comments, etc.) will be treated as a difference.
So please verify that the pages you see as duplicates really are identical.
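To make the "identical content" point concrete, here is a minimal Java sketch of content-hash deduplication in the spirit of what Nutch's DeleteDuplicates does (this is an illustration, not Nutch's actual code; the URLs and page bodies are made-up examples):

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DedupSketch {
    // Digest the raw page content; byte-identical pages hash to the
    // same value, so any difference at all (even whitespace) makes
    // two pages non-duplicates.
    static String md5(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(content.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical crawl results: URL -> page source.
        Map<String, String> pages = new LinkedHashMap<String, String>();
        pages.put("http://a.example/1", "<html>same</html>");
        pages.put("http://b.example/2", "<html>same</html>");  // identical content, different URL
        pages.put("http://c.example/3", "<html>same </html>"); // one extra space: NOT a duplicate

        Set<String> seen = new HashSet<String>();
        for (Map.Entry<String, String> e : pages.entrySet()) {
            if (!seen.add(md5(e.getValue()))) {
                // Only the second URL is flagged; the third survives
                // because its content differs by a single byte.
                System.out.println("content duplicate: " + e.getKey());
            }
        }
    }
}
```

So two pages that merely *look* the same in a browser can still both stay in the index if their source differs anywhere.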
Regards
Piotr



J S wrote:
Hi,

Some of my searches return duplicate pages, so I wanted to remove these. I'm not exactly sure how to do this but tried the command below, and restarted Tomcat, but still got the same results.

I'm using the Nutch-Nightly from about 2 weeks ago. Just wondered if I'm doing something wrong here?

Thanks.

$ nutch dedup -local -workingdir /www/nutch/planetbp /www/nutch/planetbp/segments

run java in /usr/j2sdk1.4.2_03
050612 092326 Clearing old deletions in /www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index)
050612 092326 Clearing old deletions in /www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index)
050612 092326 Clearing old deletions in /www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index)
050612 092326 Clearing old deletions in /www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index)
050612 092326 Clearing old deletions in /www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index)
050612 092327 Clearing old deletions in /www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index)
050612 092327 Clearing old deletions in /www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index)
050612 092327 Clearing old deletions in /www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index)

050612 092327 Reading url hashes...
050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
050612 092330 Sorting url hashes...
050612 092331 Deleting url duplicates...
050612 092331 Deleted 0 url duplicates.
050612 092331 Reading content hashes...
050612 092331 Sorting content hashes...
050612 092331 Deleting content duplicates...
050612 092331 Deleted 309 content duplicates.
050612 092332 Duplicate deletion complete locally. Now returning to NFS...
050612 092332 DeleteDuplicates complete
$
