[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488738#comment-13488738 ]

Julien Nioche commented on NUTCH-1480:
--------------------------------------

OK, thanks. What about having a mechanism for specifying how the docs are 
distributed, with replicate-to-all being one of the options? Maybe consistent 
hashing? I expect that most people would want to shard.
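To make the consistent-hashing option concrete, here is a minimal sketch of a hash ring that maps each document to one shard; the server URLs, the virtual-node count, and the idea of hashing on the document id are all illustrative assumptions, not part of any Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring sketch (not Nutch code): each server is
// placed on the ring at several "virtual node" positions, and a document
// goes to the first server clockwise from its own hash position.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // Pick the first server at or after the doc's hash position,
    // wrapping around to the start of the ring if necessary.
    public String serverFor(String docId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(docId));
        return tail.isEmpty() ? ring.get(ring.firstKey())
                              : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            // Fold the first 8 bytes of the MD5 digest into a long.
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addServer("http://solr1:8983/solr");
        ring.addServer("http://solr2:8983/solr");
        // The same document id always maps to the same shard.
        System.out.println(ring.serverFor("http://example.com/page1")
                .equals(ring.serverFor("http://example.com/page1")));
    }
}
```

The payoff over a plain `hash(docId) % numServers` scheme is that adding or removing a server only remaps the documents whose ring positions fall near the changed server, instead of reshuffling everything.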

Off topic, re: deduplication: I think we've hit the limits of the current 
mechanism, which I assume was based on the one we had when Nutch was managing 
its own Lucene indices. It's not reasonable to pump ALL the docs from SOLR into 
Hadoop to dedup; I'd rather have MapReduce jobs that find the duplicates 
based on the crawldb and send the deletion commands to SOLR. This would also 
work for ElasticSearch. I'm pretty sure there is a JIRA for this somewhere.
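The reduce side of such a crawldb-driven dedup job might look roughly like the sketch below: group records by content signature, keep one per group, and emit delete commands for the rest. The record fields, the keep-the-highest-score rule, and the method names are assumptions for illustration, not the actual Nutch implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Nutch code): given crawldb records carrying a
// content signature, keep the highest-scored record per signature and
// collect the URLs of the rest as deletion candidates for the index.
public class DedupSketch {

    static class Record {
        final String url;
        final String signature;
        final float score;

        Record(String url, String signature, float score) {
            this.url = url;
            this.signature = signature;
            this.score = score;
        }
    }

    // Returns the URLs whose index documents should be deleted.
    static List<String> duplicatesToDelete(List<Record> records) {
        Map<String, Record> best = new HashMap<>();
        List<String> deletes = new ArrayList<>();
        for (Record r : records) {
            Record kept = best.get(r.signature);
            if (kept == null) {
                best.put(r.signature, r);
            } else if (r.score > kept.score) {
                // New record wins: schedule the old winner for deletion.
                deletes.add(kept.url);
                best.put(r.signature, r);
            } else {
                deletes.add(r.url);
            }
        }
        return deletes;
    }

    public static void main(String[] args) {
        List<String> del = duplicatesToDelete(Arrays.asList(
                new Record("http://a/1", "sigA", 1.0f),
                new Record("http://a/2", "sigA", 2.0f),
                new Record("http://b/1", "sigB", 1.0f)));
        System.out.println(del);
    }
}
```

In an actual MapReduce job the signature would be the map output key, so each reduce call sees exactly one signature group; the point is that only crawldb records, never the indexed documents themselves, have to flow through Hadoop.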
                
> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
>                 Key: NUTCH-1480
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1480
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma-delimited list of URLs using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers when no 
> replication is available, or if you want to send documents to multiple NOCs.
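A minimal sketch of the URL-list parsing step described above, using only the standard library: splitting the configured value on commas and trimming each entry. The property value shown is made up, and in the real patch each URL would be wrapped in a SolrJ server object rather than kept as a String.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch (not the actual patch): turn a comma-delimited
// configuration value into a list of Solr URLs, one per target server.
public class SolrUrlList {

    static List<String> parseSolrUrls(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example value such as a multi-NOC setup might use.
        System.out.println(parseSolrUrls(
                "http://noc1:8983/solr, http://noc2:8983/solr"));
    }
}
```

The writer would then loop over the resulting list and send each document to every server, which is what makes the no-replication and multi-NOC cases in the description work.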

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
