[jira] [Closed] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms
[ https://issues.apache.org/jira/browse/LUCENE-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley closed LUCENE-5784.
---
Resolution: Not a Problem

CommonTermsQuery HighFreq MUST not applied if lowFreq terms
---
Key: LUCENE-5784
URL: https://issues.apache.org/jira/browse/LUCENE-5784
Project: Lucene - Core
Issue Type: Bug
Components: core/query/scoring
Affects Versions: 4.8.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
Attachments: common_terms.patch

When a CommonTermsQuery has high and low frequency terms, the highFreq terms Boolean query is always added as a SHOULD clause, even if highFreqOccur is set to MUST:

    new CommonTermsQuery(Occur.MUST, Occur.MUST, 0.1f);

My patch sets the top level Boolean query's minimum should match to 1 to ensure that the SHOULD clause must match. Not sure if this is the correct approach, or if it should just add the highFreq query as a MUST clause instead?

--
This message was sent by Atlassian JIRA (v6.2#6252)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms
[ https://issues.apache.org/jira/browse/LUCENE-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053735#comment-14053735 ]

Clinton Gormley commented on LUCENE-5784:

Having talked to Simon offline, it appears I misunderstood the intent of the high freq occur/minShouldMatch. These are not intended to control matching (the bool query already does a good job here) but only to control when the high freq terms should be used for scoring, i.e. as a tie breaker.

So if the high freq occur is MUST, then a document is matched based on the low freq terms, but its score is affected only if all the high freq terms are present.
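The behaviour described in this comment can be illustrated with a toy model. This is only a sketch of the stated semantics, not Lucene's implementation: the class `CommonTermsToyModel`, its `score` method and the integer scoring are all hypothetical, and real scoring is of course not a simple term count.

```java
import java.util.List;
import java.util.Set;

/** Toy model of the semantics described above; NOT Lucene's CommonTermsQuery. */
final class CommonTermsToyModel {

    /**
     * Returns a toy score, or 0 when the document does not match.
     * Matching is decided by the low frequency terms only (all of them are
     * required here, mirroring lowFreqOccur = MUST). High frequency terms
     * never exclude a document; mirroring highFreqOccur = MUST, they add to
     * the score only when every one of them is present.
     */
    static int score(Set<String> docTerms, List<String> lowFreq, List<String> highFreq) {
        if (!docTerms.containsAll(lowFreq)) {
            return 0; // low frequency terms missing: no match
        }
        int score = lowFreq.size();
        if (docTerms.containsAll(highFreq)) {
            score += highFreq.size(); // tie breaker: all high frequency terms present
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> low = List.of("lucene");
        List<String> high = List.of("the", "of");
        System.out.println(score(Set.of("lucene", "the", "of"), low, high)); // prints 3
        System.out.println(score(Set.of("lucene", "the"), low, high));       // prints 1
        System.out.println(score(Set.of("the", "of"), low, high));           // prints 0
    }
}
```

The second document still matches even though one high frequency term is missing; it simply scores lower than the first.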
[jira] [Updated] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms
[ https://issues.apache.org/jira/browse/LUCENE-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley updated LUCENE-5784:

Attachment: common_terms.patch
[jira] [Created] (LUCENE-5784) CommonTermsQuery HighFreq MUST not applied if lowFreq terms
Clinton Gormley created LUCENE-5784:
---
Summary: CommonTermsQuery HighFreq MUST not applied if lowFreq terms
Key: LUCENE-5784
URL: https://issues.apache.org/jira/browse/LUCENE-5784
Project: Lucene - Core
Issue Type: Bug
Components: core/query/scoring
Affects Versions: 4.8.1
Reporter: Clinton Gormley
Priority: Minor
Attachments: common_terms.patch

When a CommonTermsQuery has high and low frequency terms, the highFreq terms Boolean query is always added as a SHOULD clause, even if highFreqOccur is set to MUST:

    new CommonTermsQuery(Occur.MUST, Occur.MUST, 0.1f);

My patch sets the top level Boolean query's minimum should match to 1 to ensure that the SHOULD clause must match. Not sure if this is the correct approach, or if it should just add the highFreq query as a MUST clause instead?
[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
[ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley updated LUCENE-4766:

Attachment: LUCENE-4766.patch

New patch which preserves the offsets of the original token. Includes Simon's patch to create the filter factory.

Pattern token filter which emits a token for every capturing group
--
Key: LUCENE-4766
URL: https://issues.apache.org/jira/browse/LUCENE-4766
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
Labels: analysis, feature, lucene
Fix For: 4.2
Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch

The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern. Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance a pattern like:
{code}
(([a-z]+)(\d*))
{code}
when matched against:
{code}
abc123def456
{code}
would produce the tokens:
{code}
abc123, abc, 123, def456, def, 456
{code}
Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:
{code}
([A-Z]{2,}),
(?<![A-Z])([A-Z][a-z]+),
(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
([0-9]+)
{code}
When matched against the string letsPartyLIKEits1999_dude, they would produce the tokens:
{code}
lets, Party, LIKE, its, 1999, dude
{code}
If no token is emitted, the original token is preserved.
If the preserveOriginal flag is true, it will output the full original token (i.e. letsPartyLIKEits1999_dude) in addition to any matching tokens (but in this case, if a matching token is identical to the original, it will only emit one copy of the full token).

Multiple patterns are required to allow overlapping captures, but this also means that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
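The capture-group behaviour described in this issue can be sketched with plain `java.util.regex`. This is a minimal illustration, not the attached patch: the class `CaptureGroupTokens` and its `tokens` method are hypothetical, it returns strings rather than Lucene token attributes, it assumes captures are emitted in order of their start offset, and it assumes the camelCase patterns use lookbehinds (`(?<!...)`, `(?<=...)`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain java.util.regex sketch of the idea; NOT the patch's TokenFilter. */
final class CaptureGroupTokens {

    /** A captured substring and its start offset within the input token. */
    private static final class Capture {
        final int start;
        final String text;
        Capture(int start, String text) { this.start = start; this.text = text; }
    }

    static List<String> tokens(String input, boolean preserveOriginal, String... regexes) {
        List<Capture> captures = new ArrayList<>();
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) { // unanchored, so one pattern can match several times
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text != null && !text.isEmpty()) { // skip unmatched/empty groups
                        captures.add(new Capture(m.start(g), text));
                    }
                }
            }
        }
        // Stable sort: captures with the same start offset keep their group order.
        captures.sort((a, b) -> Integer.compare(a.start, b.start));

        List<String> out = new ArrayList<>();
        // The original is kept when requested, or when nothing was captured.
        boolean emitOriginal = preserveOriginal || captures.isEmpty();
        if (emitOriginal) {
            out.add(input);
        }
        for (Capture c : captures) {
            if (emitOriginal && c.text.equals(input)) {
                continue; // emit only one copy of the full token
            }
            out.add(c.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("abc123def456", false, "(([a-z]+)(\\d*))"));
        // prints [abc123, abc, 123, def456, def, 456]
        System.out.println(tokens("letsPartyLIKEits1999_dude", false,
                "([A-Z]{2,})",
                "(?<![A-Z])([A-Z][a-z]+)",
                "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
                "([0-9]+)"));
        // prints [lets, Party, LIKE, its, 1999, dude]
    }
}
```

Splitting the camelCase rules into four small patterns, rather than one large alternation, is what lets captures from different patterns overlap.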
[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
[ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley updated LUCENE-4766:

Attachment: LUCENE-4766.patch

The charOffsetEnd now uses the correct offsetAttr.endOffset(), and I've added a test using checkRandomData().
[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
[ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575774#comment-13575774 ]

Clinton Gormley commented on LUCENE-4766:

Is it OK for a tokenizer to create multiple tokens in the same position but with different offsets? e.g.
{code}
foobar -> [foobar,foo,bar] with positions [1,1,1] and startOffsets [0,0,3]?
{code}
Having a look at the wordDelimiter token filter, with preserveOriginal set to true, it increments the position for each new offset, e.g.:
{code}
fooBar -> [fooBar,foo,Bar] with positions [1,1,2] and startOffsets [0,0,3].
{code}
I'm asking because I'm not sure exactly how positions and offsets get used elsewhere, and so what the correct behaviour is. From my naive understanding, the wordDelimiter filter can produce spurious results with phrase searches, e.g. matching "fooBar Bar".
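The phrase-search concern raised in this comment can be made concrete with a toy positional index. This is a sketch only: `PhrasePositions`, `index` and `matchesPhrase` are hypothetical names, the matching is a naive exact-phrase check rather than Lucene's PhraseQuery, and query-side analysis is ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy positional index for one field; an illustration, NOT Lucene's PhraseQuery. */
final class PhrasePositions {

    /** Builds term -> positions from tokens and their position increments. */
    static Map<String, List<Integer>> index(String[] tokens, int[] posIncrements) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (int i = 0; i < tokens.length; i++) {
            pos += posIncrements[i]; // an increment of 0 stacks the token on the previous one
            positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(pos);
        }
        return positions;
    }

    /** True if the terms occur at consecutive positions (exact phrase, no slop). */
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String... phrase) {
        for (int start : positions.getOrDefault(phrase[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                ok = positions.getOrDefault(phrase[i], List.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = {"fooBar", "foo", "Bar"};
        // Positions [1,1,2], as in the wordDelimiter example.
        Map<String, List<Integer>> incremented = index(tokens, new int[] {1, 0, 1});
        // Positions [1,1,1], all tokens stacked on one position.
        Map<String, List<Integer>> stacked = index(tokens, new int[] {1, 0, 0});

        System.out.println(matchesPhrase(incremented, "fooBar", "Bar")); // prints true
        System.out.println(matchesPhrase(stacked, "fooBar", "Bar"));     // prints false
    }
}
```

With positions [1,1,2] the original token and its last part look like adjacent words, which is why the phrase "fooBar Bar" matches text that contains only fooBar.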
[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
[ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575811#comment-13575811 ]

Clinton Gormley commented on LUCENE-4766:

OK, so I should redo this as a tokenizer, and set positionLengths correctly. One issue is that, because there are multiple patterns, the emitted tokens can overlap, e.g.:
{code}
foobarbaz -> foo, foobar, oba, bar, baz
{code}
in which case I think I would need to emit:
{code}
positions:        1, 1, 2, 3, 5
position lengths: 2, 4, 2, 2, 1
start offsets:    0, 0, 0, 0, 0
end offsets:      3, 6, 3, 3, 3
{code}
Is this correct? It's starting to look quite complex...
[jira] [Created] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
Clinton Gormley created LUCENE-4766:
---
Summary: Pattern token filter which emits a token for every capturing group
Key: LUCENE-4766
URL: https://issues.apache.org/jira/browse/LUCENE-4766
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Priority: Minor
Fix For: 4.2

The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern. Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance a pattern like
    (([a-z]+)(\d*))
when matched against
    abc123def456
would produce the tokens:
    abc123, abc, 123, def456, def, 456

Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:
    ([A-Z]{2,}),
    (?<![A-Z])([A-Z][a-z]+),
    (?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
    ([0-9]+)
When matched against the string letsPartyLIKEits1999_dude, they would produce the tokens:
    lets, Party, LIKE, its, 1999, dude

If no token is emitted, the original token is preserved.
If the preserveOriginal flag is true, it will output the full original token (i.e. letsPartyLIKEits1999_dude) in addition to any matching tokens (but in this case, if a matching token is identical to the original, it will only emit one copy of the full token).

Multiple patterns are required to allow overlapping captures, but this also means that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.
[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
[ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley updated LUCENE-4766:

Attachment: LUCENE-4766.patch

Patch implementing org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter and org.apache.lucene.analysis.pattern.TestPatternCaptureGroupTokenFilter
[jira] [Updated] (LUCENE-4766) Pattern token filter which emits a token for every capturing group
[ https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clinton Gormley updated LUCENE-4766:

Description:

The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern. Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance a pattern like:
{code}
(([a-z]+)(\d*))
{code}
when matched against:
{code}
abc123def456
{code}
would produce the tokens:
{code}
abc123, abc, 123, def456, def, 456
{code}
Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:
{code}
([A-Z]{2,}),
(?<![A-Z])([A-Z][a-z]+),
(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
([0-9]+)
{code}
When matched against the string letsPartyLIKEits1999_dude, they would produce the tokens:
{code}
lets, Party, LIKE, its, 1999, dude
{code}
If no token is emitted, the original token is preserved.
If the preserveOriginal flag is true, it will output the full original token (i.e. letsPartyLIKEits1999_dude) in addition to any matching tokens (but in this case, if a matching token is identical to the original, it will only emit one copy of the full token).

Multiple patterns are required to allow overlapping captures, but this also means that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.

was:

The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern. Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance a pattern like
    (([a-z]+)(\d*))
when matched against
    abc123def456
would produce the tokens:
    abc123, abc, 123, def456, def, 456

Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:
    ([A-Z]{2,}),
    (?<![A-Z])([A-Z][a-z]+),
    (?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
    ([0-9]+)
When matched against the string letsPartyLIKEits1999_dude, they would produce the tokens:
    lets, Party, LIKE, its, 1999, dude

If no token is emitted, the original token is preserved.
If the preserveOriginal flag is true, it will output the full original token (i.e. letsPartyLIKEits1999_dude) in addition to any matching tokens (but in this case, if a matching token is identical to the original, it will only emit one copy of the full token).

Multiple patterns are required to allow overlapping captures, but this also means that patterns are less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.

Pattern token filter which emits a token for every capturing group
--
Key: LUCENE-4766
URL: https://issues.apache.org/jira/browse/LUCENE-4766
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Priority: Minor
Labels: analysis, feature, lucene
Fix For: 4.2
Attachments: LUCENE-4766.patch

The PatternTokenizer either functions by splitting on matches, or allows you to specify a single capture group. This is insufficient for my needs. Quite often I want to capture multiple overlapping tokens in the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance a pattern like:
{code}
(([a-z]+)(\d*))
{code}
when matched against:
{code}
abc123def456
{code}
would produce the tokens:
{code}
abc123, abc, 123, def456, def, 456
{code}
Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:
{code}
([A-Z]{2,}),
(?<![A-Z])([A-Z][a-z]+),
(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
([0-9]+)
{code}
When matched against the string letsPartyLIKEits1999_dude, they would produce the tokens:
{code}
lets, Party, LIKE, its, 1999, dude
{code}
If no token is emitted, the original token is preserved.
If