Joel Nothman created SOLR-4017:
----------------------------------
Summary: Signatures for deduplication should be Analyzers
Key: SOLR-4017
URL: https://issues.apache.org/jira/browse/SOLR-4017
Project: Solr
Issue Type: Improvement
Components: update
Affects Versions: 4.0
Environment: N/A
Reporter: Joel Nothman
At present, signatures for deduplication are constructed from the raw text of a
specified set of fields. This means they cannot take advantage of the
normalization provided by Analyzers: stripping of HTML, tokenization, diacritic
normalization, stemming, stop-word removal, etc. Using Analyzers would also
allow a token-based signature like the TextProfileSignature to consider
character or token ngrams where appropriate.
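To illustrate the motivation, here is a minimal sketch in plain JDK Java. It is not Solr or Lucene code: the class name AnalyzedSignatureSketch, the tiny stop list, and the regex-based HTML stripping are all assumptions standing in for a real Analyzer chain, and MD5 is used only as an example hash. It shows two fields that differ in raw text but normalize to the same token sequence, and so receive the same signature.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for "analyze first, then sign"; not a Solr API.
final class AnalyzedSignatureSketch {

    // Assumed stop list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // Crude imitation of an analysis chain: strip tags, fold diacritics,
    // lowercase, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String raw) {
        String noHtml = raw.replaceAll("<[^>]*>", " ");
        String decomposed = Normalizer.normalize(noHtml, Normalizer.Form.NFD);
        String folded = decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (String token : folded.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Hash the token sequence; a zero byte separates tokens so that
    // ["ab", "c"] and ["a", "bc"] do not collide trivially.
    static String signature(List<String> tokens) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String token : tokens) {
                md5.update(token.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String first = "<p>The <b>Caf\u00e9</b> of Paris</p>";
        String second = "cafe PARIS";
        // Both inputs analyze to the same tokens, so the signatures match.
        System.out.println(signature(analyze(first)).equals(signature(analyze(second))));
    }
}
```

Hashing the raw strings instead would yield two different signatures for these inputs, which is the limitation described above.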
Instead of handling this task with a special SignatureUpdateProcessorFactory,
it seems one could do (almost) the same with CloneFieldUpdateProcessorFactory
and an appropriate *SignatureAnalyzer that outputs a single (or indeed,
multiple!) Term: a hash. (I am not familiar enough with the code to know
whether the {{overwriteDupes}} option would require a further UpdateProcessor.)
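As a rough sketch of the "multiple Terms" variant, the following JDK-only Java turns an already analyzed token list into one hash term per token ngram. The class ShingleHashTermsSketch and its method are hypothetical, String.hashCode() is only a placeholder for a real hash function, and nothing here uses the Lucene TokenStream API that an actual *SignatureAnalyzer would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an analyzer-like step that emits hash terms.
final class ShingleHashTermsSketch {

    // Emit one hex-encoded hash per run of n consecutive tokens.
    // Returns an empty list when there are fewer than n tokens.
    static List<String> hashTerms(List<String> tokens, int n) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String shingle = String.join(" ", tokens.subList(i, i + n));
            // Placeholder hash; a real implementation would pick a stronger one.
            terms.add(Integer.toHexString(shingle.hashCode()));
        }
        return terms;
    }
}
```

Two documents that share a token ngram would then share the corresponding hash term, so near-duplicates could be found by term overlap rather than by exact equality of one signature.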
The current approach may be more efficient for most cases, so could be retained
for efficiency and compatibility.