[
https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280280#comment-13280280
]
Mark Miller commented on SOLR-3473:
-----------------------------------
bq. To work around the problem of having the digest field as ID, could it not
simply issue a deleteByQuery for the digest prior to adding it? Would that
cause significant overhead for very large systems with many updates?
Yeah, that might be an option. I don't know that it would be great
performance-wise, or airtight against races, but it may be a viable option.
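To make the delete-then-add idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Solr or SolrJ code: ToyIndex, delete_by_query, add, and add_deduplicated are hypothetical names invented for illustration, and the example URLs and digests are made up. It assumes the URL remains the unique id and the digest is an ordinary field.

```python
class ToyIndex:
    """In-memory stand-in for a single index keyed by a unique id (the URL)."""

    def __init__(self):
        self.docs = {}  # id -> document dict

    def delete_by_query(self, field, value):
        """Remove every document whose `field` equals `value`; return how many were removed."""
        doomed = [doc_id for doc_id, doc in self.docs.items() if doc.get(field) == value]
        for doc_id in doomed:
            del self.docs[doc_id]
        return len(doomed)

    def add(self, doc):
        """Add a document, overwriting any existing document with the same id."""
        self.docs[doc["id"]] = doc


def add_deduplicated(index, doc):
    """Delete anything sharing the digest, then add the new document.

    The two steps are not atomic: two concurrent callers with the same
    digest could both finish the delete before either add, leaving duplicates.
    """
    index.delete_by_query("digest", doc["digest"])
    index.add(doc)


index = ToyIndex()
add_deduplicated(index, {"id": "http://example.com/a", "digest": "abc123"})
add_deduplicated(index, {"id": "http://example.com/b", "digest": "abc123"})  # same content, new URL
add_deduplicated(index, {"id": "http://example.com/c", "digest": "def456"})
remaining_ids = sorted(index.docs)  # the first document was displaced by the second
```

The extra delete on every add is where the performance concern comes from, and the gap between the delete and the add is where the race concern comes from.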
bq. We would, from Nutch's point of view, certainly want to avoid changing the
ID from URL to digest.
Ah, interesting. If you are enforcing uniqueness by digest though, is this
really a problem? It would only have to be in the Solr world that the id was
the digest - and you could even call it something else and have an id:url field
as well. Just thinking out loud.
Or, perhaps we could make it so you could pick the hash field? Then hash on
digest. If you are using overwrite=true, this should work, right?
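A rough sketch of why picking the hash field could help, again as a toy Python model and not Solr's routing code. NUM_SHARDS, shard_for, and index_docs are hypothetical names, MD5 is only a stand-in for whatever hash the cluster really uses, and each shard is assumed to enforce uniqueness on the digest with overwrite semantics.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(value, num_shards=NUM_SHARDS):
    """Map a field value to a shard number with a stable hash (MD5 as a stand-in)."""
    raw = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(raw[:4], "big") % num_shards


def index_docs(docs, route_field):
    """Route each document by `route_field`; each shard overwrites on equal digest."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for doc in docs:
        shard = shards[shard_for(doc[route_field])]
        shard[doc["digest"]] = doc  # later add replaces an earlier one on the same shard
    return shards


def total_docs(shards):
    """Count documents across all shards."""
    return sum(len(shard) for shard in shards)


docs = [
    {"url": "http://example.com/a", "digest": "abc123"},
    {"url": "http://example.com/mirror/a", "digest": "abc123"},  # same content, other URL
]

# Routing on the digest always sends both copies to the same shard,
# so the per-shard overwrite collapses them into one document.
routed_by_digest = total_docs(index_docs(docs, "digest"))

# Routing on the URL gives no such guarantee: the copies may land on
# different shards, where neither shard can see the other's duplicate.
routed_by_url = total_docs(index_docs(docs, "url"))
```

With digest routing the duplicate check only ever needs to look at one shard, which is the property the distributed case is currently missing.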
Or perhaps someone else has some ideas...
> Distributed deduplication broken
> --------------------------------
>
> Key: SOLR-3473
> URL: https://issues.apache.org/jira/browse/SOLR-3473
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud, update
> Affects Versions: 4.0
> Reporter: Markus Jelsma
> Fix For: 4.0
>
>
> Solr's deduplication via the SignatureUpdateProcessor is broken for
> distributed updates on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this
> won't currently work with distrib updates. Could you file a JIRA issue for
> that? The problem is that we convert update commands into solr documents -
> and that can cause a loss of info if an update proc modifies the update
> command.
> I think the reason that you see a multiple values error when you try the
> other order is because of the lack of a document clone (the other issue I
> mentioned a few emails back). Addressing that won't solve your issue though -
> we have to come up with a way to propagate the currently lost info on the
> update command.
> {quote}
> Please see the ML thread for the full discussion:
> http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html