[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

Uwe Schindler (JIRA) Wed, 13 Feb 2013 05:32:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577556#comment-13577556
 ]


Uwe Schindler commented on LUCENE-4766:
---------------------------------------

bq. Clinton, I think you can trash the offset attribute reference in there 
entirely just don't mess with them at all.

That's part of a bigger problem in the current code. The idea of this filter is 
to produce multiple output tokens from one input token. To make this work 
correctly, the *new* output tokens must be produced based on the original token 
(meaning the filter must reset each newly produced token to a clean state; 
otherwise unrelated and unknown attributes may stay alive with wrong values, 
especially if later TokenFilters change attributes, e.g. a SynonymFilter 
inserting more synonyms). The problem Clinton had was that he had to re-set the 
offset attribute (although he does not change it), but he missed possible other 
attributes on the stream that he does not know about.

If you look at other filters doing similar things, like SynonymFilter or WDF 
(WordDelimiterFilter), the way it has to work is this:
- The first token emitted is the "original" one, maybe modified.
- All "inserted" tokens are cloned from the original (first) token; use 
captureState/restoreState to do that. This initializes the attribute source 
to exactly the same token as the original (unmodified) one. After you have 
called restoreState, you can *modify* the attributes (like the term text) and 
call setPositionIncrement(0). You can then leave the offsets (and other unknown 
attributes that may be on the token stream) unchanged - don't reference them at 
all.
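
To make the two steps above concrete, here is a toy model in plain Java. It 
deliberately does not use Lucene: the attribute map, the captureState helper 
and the expand method are hypothetical stand-ins for the attribute source and 
for captureState/restoreState, and the attribute names are made up for 
illustration. It only shows why cloning the saved state keeps attributes the 
filter never looks at.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: a token is a map of attribute name -> value.
// None of these names are Lucene API; they are assumptions for illustration.
class CaptureRestoreSketch {

    // Stand-in for captureState(): snapshot ALL attributes, known or not.
    static Map<String, Object> captureState(Map<String, Object> attributes) {
        return new LinkedHashMap<>(attributes);
    }

    // Emits the original token first, then one inserted token per non-empty
    // capturing group. Each inserted token starts from a fresh copy of the
    // saved state (the "restoreState" step) and changes only the term text
    // and the position increment.
    static List<Map<String, Object>> expand(Map<String, Object> original, Pattern pattern) {
        Map<String, Object> saved = captureState(original);
        String term = (String) saved.get("term");

        List<Map<String, Object>> out = new ArrayList<>();
        out.add(captureState(saved)); // first token: the original one

        Matcher m = pattern.matcher(term);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);
                if (text == null || text.isEmpty() || text.equals(term)) {
                    continue; // skip unmatched/empty groups and duplicates of the original
                }
                Map<String, Object> token = captureState(saved); // clean clone of the original
                token.put("term", text);              // modify the term text
                token.put("positionIncrement", 0);    // same position as the original
                out.add(token);                       // offsets and unknown attributes untouched
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> token = new LinkedHashMap<>();
        token.put("term", "abc123def456");
        token.put("positionIncrement", 1);
        token.put("startOffset", 0);
        token.put("endOffset", 12);
        token.put("someUnknownAttribute", "set by another component");

        for (Map<String, Object> t : expand(token, Pattern.compile("(([a-z]+)(\\d*))"))) {
            System.out.println(t);
        }
    }
}
```

In a real TokenFilter the same shape would be expressed with captureState() on 
the original token and restoreState(state) before each inserted token, followed 
by changes to the term and position increment attributes only.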
                
> Pattern token filter which emits a token for every capturing group
> ------------------------------------------------------------------
>
>                 Key: LUCENE-4766
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4766
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Clinton Gormley
>            Assignee: Simon Willnauer
>            Priority: Minor
>              Labels: analysis, feature, lucene
>             Fix For: 4.2
>
>         Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch, 
> LUCENE-4766.patch
>
>
> The PatternTokenizer either functions by splitting on matches, or allows you 
> to specify a single capture group.  This is insufficient for my needs. Quite 
> often I want to capture multiple overlapping tokens in the same position.
> I've written a pattern token filter which accepts multiple patterns and emits 
> tokens for every capturing group that is matched in any pattern.
> Patterns are not anchored to the beginning and end of the string, so each 
> pattern can produce multiple matches.
> For instance a pattern like :
> {code}
>     "(([a-z]+)(\d*))"
> {code}
> when matched against: 
> {code}
>     "abc123def456"
> {code}
> would produce the tokens:
> {code}
>     abc123, abc, 123, def456, def, 456
> {code}
> Multiple patterns can be applied, eg these patterns could be used for 
> camelCase analysis:
> {code}
>     "([A-Z]{2,})",
>     "(?<![A-Z])([A-Z][a-z]+)",
>     "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
>     "([0-9]+)"
> {code}
> When matched against the string "letsPartyLIKEits1999_dude", they would 
> produce the tokens:
> {code}
>     lets, Party, LIKE, its, 1999, dude
> {code}
> If no token is emitted, the original token is preserved. 
> If the preserveOriginal flag is true, it will output the full original token 
> (ie "letsPartyLIKEits1999_dude") in addition to any matching tokens (but in 
> this case, if a matching token is identical to the original, it will only 
> emit one copy of the full token).
> Multiple patterns are required to allow overlapping captures, but also means 
> that patterns are less dense and easier to understand.
> This is my first Java code, so apologies if I'm doing something stupid.
