[jira] [Commented] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

Adrien Grand (JIRA) Wed, 19 Aug 2015 09:43:10 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703316#comment-14703316 ]


Adrien Grand commented on LUCENE-6747:
--------------------------------------

If you can tolerate these fingerprints not being reliable identifiers of 
your input, I wonder whether we could make this more efficient by just using 
a hash function that doesn't depend on the order of its inputs.
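
To illustrate the idea of an order-independent hash, here is a minimal sketch 
in plain Java (no Lucene classes). It is not taken from the patch: the class 
name OrderIndependentHash, the choice of FNV-1a as the per-token hash, and 
the use of addition as the combiner are all assumptions made for 
illustration. Any commutative combiner would give the same order 
independence. Unlike the sorted token string, the result can collide, so it 
is not a reliable identifier of the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the patch or of Lucene.
final class OrderIndependentHash {

    // Per-token hash: 64-bit FNV-1a over the UTF-8 bytes (an arbitrary choice).
    static long hashToken(String token) {
        long h = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Combines the hashes of the distinct tokens with addition, which is
    // commutative, so the order in which tokens arrive does not matter and
    // no sort is needed. The set is only there to ignore duplicates.
    static long fingerprint(Iterable<String> tokens) {
        Set<String> seen = new HashSet<>();
        long combined = 0L;
        for (String token : tokens) {
            if (seen.add(token)) {
                combined += hashToken(token);
            }
        }
        return combined;
    }
}
```

With this sketch, the inputs ("b", "a", "b") and ("a", "b") produce the same 
value, since both reduce to the set {a, b}.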

Otherwise this looks rather good to me. Instead of taking the min offset and 
the max offset as offsets for the final token, I wonder whether it would make 
more sense to use 0 and the final offset (the one returned after end() has 
been called), so that we don't treat chars differently depending on whether 
they appear before/after the tokens or in the middle. By the way, even with 
the current approach we don't need to call Math.min/max: as tokens are 
supposed to be emitted in order, the start offset would be the start offset 
of the first token and the end offset would be the end offset of the last 
token.
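
The two offset strategies can be compared with a small sketch in plain Java. 
It does not use Lucene's attribute classes: Token, firstToLast and wholeInput 
are hypothetical names introduced here, and the sketch assumes tokens arrive 
in offset order with non-overlapping spans.

```java
import java.util.List;

// Hypothetical illustration, not code from the patch.
final class FingerprintOffsets {

    // Minimal stand-in for a token's start/end character offsets.
    static final class Token {
        final int start;
        final int end;

        Token(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Current approach without Math.min/max: with tokens in offset order,
    // the span runs from the first token's start to the last token's end.
    static int[] firstToLast(List<Token> tokens) {
        if (tokens.isEmpty()) {
            return new int[] {0, 0};
        }
        Token first = tokens.get(0);
        Token last = tokens.get(tokens.size() - 1);
        return new int[] {first.start, last.end};
    }

    // Suggested alternative: cover the whole input, from 0 to the final
    // offset reported once the upstream has been fully consumed.
    static int[] wholeInput(int finalOffset) {
        return new int[] {0, finalOffset};
    }
}
```

For an input such as "  foo bar baz  " (15 chars) with tokens at [2,5], [6,9] 
and [10,13], the first strategy yields [2,13] and drops the surrounding 
whitespace, while the second yields [0,15] and treats leading, trailing and 
inner non-token chars alike.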

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -----------------------------------------------------------------
>
>                 Key: LUCENE-6747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6747
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>         Attachments: fingerprintv1.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set 
> of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and 
> elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the 
> combined token which is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
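
The fingerprint described in the quoted issue (a sorted, de-duplicated set of 
tokens combined into one token, with a size limit) can be sketched in plain 
Java as follows. This is not the attached patch: the class name 
FingerprintSketch, the separator parameter, and returning null when the 
limit is exceeded are assumptions made for illustration, and the patch may 
handle an oversized result differently.

```java
import java.util.TreeSet;

// Hypothetical illustration of the fingerprint idea, not the patch itself.
final class FingerprintSketch {

    // Returns the distinct tokens in sorted order, joined by the separator,
    // or null if the combined token would be longer than maxOutputSize
    // (the null return is an assumption of this sketch).
    static String fingerprint(Iterable<String> tokens, char separator, int maxOutputSize) {
        // TreeSet both de-duplicates and keeps its elements sorted.
        TreeSet<String> unique = new TreeSet<>();
        for (String token : tokens) {
            unique.add(token);
        }
        StringBuilder sb = new StringBuilder();
        for (String token : unique) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(token);
            if (sb.length() > maxOutputSize) {
                return null;
            }
        }
        return sb.toString();
    }
}
```

For example, the tokens "the", "quick", "the", "brown" with a space separator 
and a generous limit give "brown quick the", so two texts that differ only in 
word order or repetition end up with the same fingerprint.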



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
