[jira] [Closed] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms

2014-07-07 Thread Clinton Gormley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clinton Gormley closed LUCENE-5784.
---

Resolution: Not a Problem

 CommonTermsQuery HighFreq MUST not applied if lowFreq terms
 ---

 Key: LUCENE-5784
 URL: https://issues.apache.org/jira/browse/LUCENE-5784
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.8.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
 Attachments: common_terms.patch


 When a CommonTermsQuery has both high- and low-frequency terms, the highFreq 
 terms Boolean query is always added as a SHOULD clause, even if highFreqOccur 
 is set to MUST:
 new CommonTermsQuery(Occur.MUST, Occur.MUST, 0.1f);
 My patch sets the top-level Boolean query's minimum-should-match to 1 to 
 ensure that the SHOULD clause must match. Is this the correct approach, or 
 should it just add the highFreq query as a MUST clause instead?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms

2014-07-07 Thread Clinton Gormley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053735#comment-14053735
 ] 

Clinton Gormley commented on LUCENE-5784:
-

Having talked to Simon offline, it appears I misunderstood the intent of the 
high freq occur/minShouldMatch. These are not intended to control matching 
(bool query already does a good job here) but only to control when the high 
freq terms should be used for scoring, i.e. as a tie breaker.

So if the high freq occur is MUST, then a document is matched based on the low 
freq terms, but its score is affected only if all the high freq terms are 
present.
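
To make the distinction concrete, here is a toy model in plain Java. It is not 
Lucene code: the class and method names are made up for illustration, and the 
"score" is a simple term count rather than anything Lucene's similarity would 
compute. It assumes both occur settings are MUST and that the query has at 
least one low freq term, and it only mirrors the behaviour described above: 
low freq terms decide whether a document matches, high freq terms add to the 
score only when all of them are present.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.OptionalDouble;
import java.util.Set;

// Hypothetical illustration class; nothing here is part of the Lucene API.
final class CommonTermsSketch {

    // Returns an empty result when the document does not match at all.
    static OptionalDouble score(Set<String> docTerms, Set<String> lowFreq, Set<String> highFreq) {
        // Matching is decided by the low freq terms only (lowFreqOccur = MUST).
        if (!docTerms.containsAll(lowFreq)) {
            return OptionalDouble.empty();
        }
        double score = lowFreq.size();
        // highFreqOccur = MUST: the high freq terms contribute to the score only
        // when every one of them is present; they never reject a document.
        if (docTerms.containsAll(highFreq)) {
            score += highFreq.size();
        }
        return OptionalDouble.of(score);
    }

    public static void main(String[] args) {
        Set<String> low = new HashSet<>(Arrays.asList("lucene"));
        Set<String> high = new HashSet<>(Arrays.asList("the", "of"));
        Set<String> partial = new HashSet<>(Arrays.asList("lucene", "the"));
        Set<String> full = new HashSet<>(Arrays.asList("lucene", "the", "of"));
        System.out.println(score(partial, low, high)); // matches, no high freq contribution
        System.out.println(score(full, low, high));    // matches, with high freq contribution
    }
}
```

A document containing only "the" and "of" does not match in this model, 
because matching never looks at the high freq terms.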





[jira] [Updated] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms

2014-06-22 Thread Clinton Gormley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clinton Gormley updated LUCENE-5784:


Attachment: common_terms.patch

 CommonTermsQuery HighFreq MUST not applied if lowFreq terms
 ---

 Key: LUCENE-5784
 URL: https://issues.apache.org/jira/browse/LUCENE-5784
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.8.1
Reporter: Clinton Gormley
Priority: Minor
 Attachments: common_terms.patch





[jira] [Created] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms

2014-06-22 Thread Clinton Gormley (JIRA)
Clinton Gormley created LUCENE-5784:
---

 Summary: CommonTermsQuery HighFreq MUST not applied if lowFreq terms
 Key: LUCENE-5784
 URL: https://issues.apache.org/jira/browse/LUCENE-5784
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 4.8.1
Reporter: Clinton Gormley
Priority: Minor
 Attachments: common_terms.patch

When a CommonTermsQuery has both high- and low-frequency terms, the highFreq 
terms Boolean query is always added as a SHOULD clause, even if highFreqOccur 
is set to MUST:

new CommonTermsQuery(Occur.MUST, Occur.MUST, 0.1f);

My patch sets the top-level Boolean query's minimum-should-match to 1 to ensure 
that the SHOULD clause must match. Is this the correct approach, or should it 
just add the highFreq query as a MUST clause instead?






[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-13 Thread Clinton Gormley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clinton Gormley updated LUCENE-4766:


Attachment: LUCENE-4766.patch

New patch which preserves the offsets of the original token. Includes Simon's 
patch to create the filter factory.

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch


 The PatternTokenizer either functions by splitting on matches, or allows you 
 to specify a single capture group.  This is insufficient for my needs. Quite 
 often I want to capture multiple overlapping tokens in the same position.
 I've written a pattern token filter which accepts multiple patterns and emits 
 tokens for every capturing group that is matched in any pattern.
 Patterns are not anchored to the beginning and end of the string, so each 
 pattern can produce multiple matches.
 For instance, a pattern like:
 {code}
 (([a-z]+)(\d*))
 {code}
 when matched against: 
 {code}
 abc123def456
 {code}
 would produce the tokens:
 {code}
 abc123, abc, 123, def456, def, 456
 {code}
 Multiple patterns can be applied, e.g. these patterns could be used for 
 camelCase analysis:
 {code}
 ([A-Z]{2,}),
 (?<![A-Z])([A-Z][a-z]+),
 (?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
 ([0-9]+)
 {code}
 When matched against the string letsPartyLIKEits1999_dude, they would 
 produce the tokens:
 {code}
 lets, Party, LIKE, its, 1999, dude
 {code}
 If no token is emitted, the original token is preserved. 
 If the preserveOriginal flag is true, it will output the full original token 
 (i.e. letsPartyLIKEits1999_dude) in addition to any matching tokens (but in 
 this case, if a matching token is identical to the original, it will only 
 emit one copy of the full token).
 Multiple patterns are required to allow overlapping captures, and they also 
 mean that each pattern is less dense and easier to understand.
 This is my first Java code, so apologies if I'm doing something stupid.
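
 The capture-group behaviour described above can be sketched with plain 
 java.util.regex. This is not the attached patch: the class and method names 
 are hypothetical, it works on a bare string instead of a TokenStream, and it 
 ignores preserveOriginal, positions and offsets. It also assumes that the 
 camelCase patterns use lookbehinds, (?<!...) and (?<=...). Ordering captures 
 by start offset is likewise a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not the filter from the patch.
final class CaptureGroupSketch {

    // One captured substring plus the offset where it starts in the input.
    private static final class Capture {
        final int start;
        final String text;
        Capture(int start, String text) { this.start = start; this.text = text; }
    }

    static List<String> captureGroupTokens(List<String> regexes, String input) {
        List<Capture> captures = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {                       // unanchored: a pattern can match repeatedly
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text != null && !text.isEmpty()) {   // skip groups that captured nothing
                        captures.add(new Capture(m.start(g), text));
                    }
                }
            }
        }
        // List.sort is stable, so captures with the same start keep discovery order.
        captures.sort(Comparator.comparingInt((Capture c) -> c.start));
        List<String> tokens = new ArrayList<>();
        for (Capture c : captures) {
            tokens.add(c.text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(captureGroupTokens(
            Arrays.asList("(([a-z]+)(\\d*))"), "abc123def456"));
        // prints [abc123, abc, 123, def456, def, 456]

        System.out.println(captureGroupTokens(Arrays.asList(
            "([A-Z]{2,})",
            "(?<![A-Z])([A-Z][a-z]+)",
            "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
            "([0-9]+)"), "letsPartyLIKEits1999_dude"));
        // prints [lets, Party, LIKE, its, 1999, dude]
    }
}
```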

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-13 Thread Clinton Gormley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clinton Gormley updated LUCENE-4766:


Attachment: LUCENE-4766.patch

The charOffsetEnd now uses the correct offsetAttr.endOffset(), and I have added 
a test using checkRandomData().

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch, 
 LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Clinton Gormley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575774#comment-13575774
 ] 

Clinton Gormley commented on LUCENE-4766:
-

Is it OK for a tokenizer to create multiple tokens in the same position but 
with different offsets? E.g.

{code}
foobar -> [foobar,foo,bar] with positions [1,1,1] and startOffsets [0,0,3]?
{code}

Looking at the wordDelimiter token filter with preserveOriginal set to true, 
it increments the position for each new offset, e.g.:

{code}
fooBar -> [fooBar,foo,Bar] with positions [1,1,2] and startOffsets [0,0,3].
{code}

I'm asking because I'm not sure exactly how positions and offsets get used 
elsewhere, and so what the correct behaviour is. From my naive understanding, 
the wordDelimiter filter can produce spurious results with phrase searches, 
e.g. matching "fooBar Bar".
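
For reference, a Lucene token stream records positions as increments relative 
to the previous token (PositionIncrementAttribute), and an increment of 0 
stacks a token on the same position as the one before it. The plain-Java 
sketch below uses a hypothetical helper (not a Lucene API) to turn a list of 
increments into absolute positions, numbered from 1 to match the examples 
above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Lucene.
final class PositionSketch {

    // Converts per-token position increments into absolute positions.
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = 0;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];   // 0 keeps the token on the previous position
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // All three tokens stacked on one position.
        System.out.println(Arrays.toString(positions(new int[] {1, 0, 0}))); // [1, 1, 1]
        // Third token moved to the next position.
        System.out.println(Arrays.toString(positions(new int[] {1, 0, 1}))); // [1, 1, 2]
    }
}
```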

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Clinton Gormley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575811#comment-13575811
 ] 

Clinton Gormley commented on LUCENE-4766:
-

OK, so I should redo this as a tokenizer, and set positionLengths correctly.

One issue is that, because there are multiple patterns, the emitted tokens can 
overlap, e.g.:

{code}
   foobarbaz -> foo, foobar, oba, bar, baz
{code}

in which case I think I would need to emit:

{code}
positions: 1, 1, 2, 3, 5
position lengths:  2, 4, 2, 2, 1
start offsets: 0, 0, 0, 0, 0
end offsets:   3, 6, 3, 3, 3
{code}

Is this correct? It's starting to look quite complex...




[jira] [Created] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-10 Thread Clinton Gormley (JIRA)
Clinton Gormley created LUCENE-4766:
---

 Summary: Pattern token filter which emits a token for every capturing group
 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Priority: Minor
 Fix For: 4.2


The PatternTokenizer either functions by splitting on matches, or allows you to 
specify a single capture group.  This is insufficient for my needs. Quite often 
I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits 
tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each 
pattern can produce multiple matches.

For instance a pattern like (([a-z]+)(\d*)) when matched against 
abc123def456 would produce the tokens:

abc123, abc, 123, def456, def, 456

Multiple patterns can be applied, eg these patterns could be used for camelCase 
analysis:

([A-Z]{2,}),
(?<![A-Z])([A-Z][a-z]+),
(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
([0-9]+)

When matched against the string letsPartyLIKEits1999_dude, they would produce 
the tokens:

lets, Party, LIKE, its, 1999, dude

If no token is emitted, the original token is preserved. 
If the preserveOriginal flag is true, it will output the full original token 
(ie letsPartyLIKEits1999_dude) in addition to any matching tokens (but in 
this case, if a matching token is identical to the original, it will only emit 
one copy of the full token).

Multiple patterns are required to allow overlapping captures, but also means 
that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.




[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-10 Thread Clinton Gormley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clinton Gormley updated LUCENE-4766:


Attachment: LUCENE-4766.patch

Patch implementing 
org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter and 
org.apache.lucene.analysis.pattern.TestPatternCaptureGroupTokenFilter

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch





[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-10 Thread Clinton Gormley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clinton Gormley updated LUCENE-4766:


Description: 
The PatternTokenizer either functions by splitting on matches, or allows you to 
specify a single capture group.  This is insufficient for my needs. Quite often 
I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits 
tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each 
pattern can produce multiple matches.

For instance a pattern like :

{code}
(([a-z]+)(\d*))
{code}

when matched against: 

{code}
abc123def456
{code}

would produce the tokens:

{code}
abc123, abc, 123, def456, def, 456
{code}

Multiple patterns can be applied, eg these patterns could be used for camelCase 
analysis:

{code}
([A-Z]{2,}),
(?<![A-Z])([A-Z][a-z]+),
(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
([0-9]+)
{code}

When matched against the string letsPartyLIKEits1999_dude, they would produce 
the tokens:

{code}
lets, Party, LIKE, its, 1999, dude
{code}

If no token is emitted, the original token is preserved. 
If the preserveOriginal flag is true, it will output the full original token 
(ie letsPartyLIKEits1999_dude) in addition to any matching tokens (but in 
this case, if a matching token is identical to the original, it will only emit 
one copy of the full token).

Multiple patterns are required to allow overlapping captures, but also means 
that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.

  was:
The PatternTokenizer either functions by splitting on matches, or allows you to 
specify a single capture group.  This is insufficient for my needs. Quite often 
I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits 
tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each 
pattern can produce multiple matches.

For instance a pattern like (([a-z]+)(\d*)) when matched against 
abc123def456 would produce the tokens:

abc123, abc, 123, def456, def, 456

Multiple patterns can be applied, eg these patterns could be used for camelCase 
analysis:

([A-Z]{2,}),
(?<![A-Z])([A-Z][a-z]+),
(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
([0-9]+)

When matched against the string letsPartyLIKEits1999_dude, they would produce 
the tokens:

lets, Party, LIKE, its, 1999, dude

If no token is emitted, the original token is preserved. 
If the preserveOriginal flag is true, it will output the full original token 
(ie letsPartyLIKEits1999_dude) in addition to any matching tokens (but in 
this case, if a matching token is identical to the original, it will only emit 
one copy of the full token).

Multiple patterns are required to allow overlapping captures, but also means 
that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.


 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch


 The PatternTokenizer either functions by splitting on matches, or allows you 
 to specify a single capture group.  This is insufficient for my needs. Quite 
 often I want to capture multiple overlapping tokens in the same position.
 I've written a pattern token filter which accepts multiple patterns and emits 
 tokens for every capturing group that is matched in any pattern.
 Patterns are not anchored to the beginning and end of the string, so each 
 pattern can produce multiple matches.
 For instance a pattern like :
 {code}
 (([a-z]+)(\d*))
 {code}
 when matched against: 
 {code}
 abc123def456
 {code}
 would produce the tokens:
 {code}
 abc123, abc, 123, def456, def, 456
 {code}
 Multiple patterns can be applied, eg these patterns could be used for 
 camelCase analysis:
 {code}
 ([A-Z]{2,}),
 (?<![A-Z])([A-Z][a-z]+),
 (?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
 ([0-9]+)
 {code}
 When matched against the string letsPartyLIKEits1999_dude, they would 
 produce the tokens:
 {code}
 lets, Party, LIKE, its, 1999, dude
 {code}
 If no token is emitted, the original token is preserved. 
 If