Joel Nothman created SOLR-4017:
----------------------------------

             Summary: Signatures for deduplication should be Analyzers
                 Key: SOLR-4017
                 URL: https://issues.apache.org/jira/browse/SOLR-4017
             Project: Solr
          Issue Type: Improvement
          Components: update
    Affects Versions: 4.0
         Environment: N/A
            Reporter: Joel Nothman


At present, signatures for deduplication are constructed from the raw text of a 
specified set of fields. This means they cannot take advantage of the 
normalization provided by Analyzers: stripping of HTML, tokenization, diacritic 
normalization, stemming, stop-word removal, etc. Building signatures from 
analyzed text would also allow a token-based signature like 
TextProfileSignature to consider character or token ngrams where appropriate.
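
To illustrate the difference, here is a minimal sketch in plain Java. It does 
not use the Lucene Analyzer API; the analyze() method is a hand-written stand-in 
for an analysis chain, and the class name, stop list and choice of MD5 are 
assumptions made only for this example. It shows two inputs that differ in 
markup, case and diacritics receiving the same signature once the hash is 
computed over normalized tokens rather than raw text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzedSignatureSketch {
    // Hypothetical stop list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    /** Stand-in for an analysis chain: strip tags, fold diacritics,
     *  lowercase, tokenize on non-alphanumerics, drop stop words. */
    static List<String> analyze(String raw) {
        String noTags = raw.replaceAll("<[^>]*>", " ");
        String folded = Normalizer.normalize(noTags, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        List<String> tokens = new ArrayList<>();
        for (String t : folded.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Hashes the token sequence (joined by single spaces) to a hex MD5 string. */
    static String signature(List<String> tokens) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    String.join(" ", tokens).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        List<String> a = analyze("<p>The Caf\u00e9 of Paris</p>");
        List<String> b = analyze("cafe PARIS");
        System.out.println(a + " " + b);
        System.out.println(signature(a).equals(signature(b)));
    }
}
```

A signature over the raw strings would differ for these two inputs, which is 
the limitation described above.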

Instead of handling this task with a special SignatureUpdateProcessorFactory, 
it seems one could do (almost) the same with CloneFieldUpdateProcessorFactory 
and an appropriate *SignatureAnalyzer that outputs a single (or indeed, 
multiple!) Term: a hash. (I am not familiar enough with the code to know 
whether the {{overwriteDupes}} option would require a further UpdateProcessor.)
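
As a rough sketch of the "multiple Terms" case, the following plain Java emits 
one hash term per token ngram. It is not a Lucene TokenFilter; the class and 
method names are hypothetical, and String.hashCode() is used only as a cheap 
placeholder for a real hash function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NgramHashTermsSketch {
    /** Returns one hex hash term for every run of n consecutive tokens.
     *  Yields an empty list when there are fewer than n tokens. */
    static List<String> hashTerms(List<String> tokens, int n) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String ngram = String.join(" ", tokens.subList(i, i + n));
            // Placeholder hash; a real signature would use a stronger function.
            terms.add(Integer.toHexString(ngram.hashCode()));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(hashTerms(Arrays.asList("quick", "brown", "fox"), 2));
        System.out.println(hashTerms(Arrays.asList("quick", "brown", "dog"), 2));
    }
}
```

Two documents that share most of their ngrams then share most of their hash 
terms, so near-duplicates could in principle be found by ordinary term 
matching on the signature field.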

The current approach may be more efficient for most cases, so could be retained 
for efficiency and compatibility.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
