[
https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214100#comment-17214100
]
Uwe Schindler commented on LUCENE-9575:
---------------------------------------
Once you have a pull request, I'm happy to review the token filter!
> Add PatternTypingFilter
> -----------------------
>
> Key: LUCENE-9575
> URL: https://issues.apache.org/jira/browse/LUCENE-9575
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Gus Heck
> Assignee: Gus Heck
> Priority: Major
>
> One of the key asks from the Library of Congress when they had me develop the
> Advanced Query Parser was the ability to recognize arbitrary patterns that
> include punctuation, such as POW/MIA, 401(k) or C++. Additionally, they
> wanted 401k and 401(k) to match documents with either style of reference, and
> NOT match documents that happen to have isolated 401 or k tokens (i.e. not
> documents about the HTTP status code). And of course we wanted to give up as
> few of the text analysis features they were already using as possible.
> This filter, together with the filters from LUCENE-9572 and LUCENE-9574 and
> one Solr-specific filter in SOLR-14597 that re-analyzes tokens with an
> arbitrary analyzer defined for a type in the Solr schema, achieves this.
> This filter has the job of spotting the patterns and adding the intended
> synonym as a type to the token (from which minimal punctuation has been
> removed). It also sets flags on the token, which are retained through the
> analysis chain. At the very end, the type is converted to a synonym and
> the original token(s) for that type are dropped, avoiding the match on 401
> (for example).
> The pattern matching is specified in a file that looks like:
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k and
> 501(c)3, as well as C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement so the first
> line above would create synonyms such as:
> {code}
> 401k --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}
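
To make the rule format quoted above concrete, here is a minimal sketch in plain Java (java.util.regex only, no Lucene classes) of how a rule line could be parsed and applied to a single token. It is not the proposed filter's implementation: the names PatternTypingSketch, Rule, parseRules and typeFor are hypothetical, and whole-token matching with "first matching rule wins" is an assumption, not something the issue text states. The flags value is only parsed and stored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the "<flagsInt> <pattern> ::: <replacement>" format.
class PatternTypingSketch {

    /** One parsed rule line. */
    static final class Rule {
        final int flags;
        final Pattern pattern;
        final String replacement;

        Rule(int flags, Pattern pattern, String replacement) {
            this.flags = flags;
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    /** Splits a line into flags (before the first space), pattern, and replacement (after " ::: "). */
    static Rule parseRule(String line) {
        int firstSpace = line.indexOf(' ');
        int sep = line.indexOf(" ::: ");
        if (firstSpace < 0 || sep <= firstSpace) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        int flags = Integer.parseInt(line.substring(0, firstSpace));
        String regex = line.substring(firstSpace + 1, sep);
        String replacement = line.substring(sep + " ::: ".length());
        return new Rule(flags, Pattern.compile(regex), replacement);
    }

    static List<Rule> parseRules(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(parseRule(line));
        }
        return rules;
    }

    /**
     * Returns the type for the first rule whose pattern matches the whole token
     * (an assumption of this sketch), or null if no rule matches.
     */
    static String typeFor(List<Rule> rules, String token) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(token);
            if (m.matches()) {
                // Expand $1, $2, ... in the replacement against the whole-token match.
                StringBuffer sb = new StringBuffer();
                m.appendReplacement(sb, rule.replacement);
                return sb.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = parseRules(
            "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2",
            "2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3",
            "2 C\\+\\+ ::: c_plus_plus");
        for (String token : new String[] {"401k", "401(k)", "503(c)", "501(c)3", "C++", "401"}) {
            System.out.println(token + " --> " + typeFor(rules, token));
        }
    }
}
```

Under these assumptions, 401k and 401(k) both map to legal2_401_k, while a bare 401 matches no rule and gets no type. If the patterns were instead applied with a partial (find-style) match, 501(c)3 would be caught by the first rule rather than the second, so rule order and match mode would matter.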
--
This message was sent by Atlassian Jira
(v8.3.4#803005)