[Nutch Wiki] Update of "bin/nutch solrdedup" by LewisJohnMcgibbney

Apache Wiki Sat, 02 Jul 2011 20:41:47 -0700

The "bin/nutch solrdedup" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup

Comment:
Update to reflect Nutch 1.3 API

New page:
Solrdedup is an alias for org.apache.nutch.indexer.solr.SolrDeleteDuplicates

As the name suggests, this is a utility class for deleting duplicate documents 
from a Solr index. 

The algorithm works as follows:

'''Preparation''':
Query the Solr server for the number of documents (say, N), then partition N 
among M map tasks. For example, with two map tasks, the first map task deals 
with Solr documents 0 to (N / 2 - 1) and the second deals with documents 
(N / 2) to (N - 1). This can be thought of as a linearly executing 
divide-and-conquer algorithm.
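
The partitioning above can be sketched as follows. This is a minimal illustration, not the code in SolrDeleteDuplicates: the class name `SolrPartitionSketch` and the method `partition` are hypothetical, and the handling of a remainder (when N is not divisible by M) is an assumption, since the page only describes the evenly divisible case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split N Solr documents into M contiguous index ranges. */
class SolrPartitionSketch {

    /** Returns one {start, endInclusive} pair per map task; empty ranges are skipped. */
    static List<long[]> partition(long numDocs, int numSplits) {
        if (numDocs < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("numDocs must be >= 0 and numSplits > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = numDocs / numSplits;       // minimum number of documents per task
        long remainder = numDocs % numSplits;  // assumption: first tasks take one extra each
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // more map tasks than documents
            }
            ranges.add(new long[] {start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // N = 10, M = 2 gives the two ranges 0..(N/2 - 1) and (N/2)..(N - 1).
        for (long[] range : partition(10, 2)) {
            System.out.println(range[0] + " - " + range[1]);
        }
    }
}
```

Each range would then be fetched by one map task, for example as a start offset and row count in a Solr query.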

'''MapReduce''':
 * Map: Identity map where keys are digests and values are SolrRecord 
instances (which contain id, boost and timestamp).

 * Reduce: After the map phase, SolrRecords with the same digest are grouped 
together. Of the documents sharing a digest, all are deleted except the one 
with the highest score (boost field). If two or more documents have the same 
score, the document with the latest timestamp is kept and every other 
document is deleted from the Solr index.
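
The selection rule in the reduce step can be sketched as below. This is a simplified illustration under stated assumptions, not the actual reducer: `DedupReduceSketch`, its nested `Record` class (a stand-in for SolrRecord) and `idsToDelete` are hypothetical names, and keeping the first record seen when both boost and timestamp are equal is an assumption the page does not cover.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the reduce-side choice for one group of equal digests. */
class DedupReduceSketch {

    /** Simplified stand-in for SolrRecord: id, boost and timestamp only. */
    static final class Record {
        final String id;
        final float boost;
        final long tstamp;

        Record(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /** Returns the ids to delete: everything except the best record in the group. */
    static List<String> idsToDelete(List<Record> sameDigest) {
        Record keep = null;
        for (Record r : sameDigest) {
            boolean higherBoost = keep != null && r.boost > keep.boost;
            boolean sameBoostNewer =
                keep != null && r.boost == keep.boost && r.tstamp > keep.tstamp;
            // Highest boost wins; on equal boost the latest timestamp wins.
            if (keep == null || higherBoost || sameBoostNewer) {
                keep = r;
            }
        }
        List<String> toDelete = new ArrayList<>();
        for (Record r : sameDigest) {
            if (r != keep) {
                toDelete.add(r.id);
            }
        }
        return toDelete;
    }
}
```

In a real job the returned ids would be sent to Solr as delete requests; that step is left out here.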

Note that, unlike DeleteDuplicates, we assume that two documents in a Solr 
index will never have the same URL. This class therefore only deals with 
documents that have '''different''' URLs but the same digest.

Usage:
{{{
bin/nutch solrdedup <solr url>
}}}

'''<solr url>''': All of the hard work is encapsulated within the class, so 
the only parameter we pass is the Solr URL, e.g. 
''http://localhost:8983/solr/''


CommandLineOptions
