[ 
https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271293#comment-17271293
 ] 

ASF subversion and git services commented on LUCENE-9575:
---------------------------------------------------------

Commit f942b2dd8a484879d806fcc4fa95c7393f348d9e in lucene-solr's branch 
refs/heads/master from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f942b2d ]

@gus-asf LUCENE-9575 Provide a producer for PatternTypingRule in 
TestRandomChains (#2241)

LUCENE-9575 Provide a producer for PatternTypingRule in TestRandomChains to fix 
failure on seed 65EA739C95F40313


> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 4h
>  Remaining Estimate: 0h
>
> One of the key asks from the Library of Congress, when they asked me to 
> develop the Advanced Query Parser, was to be able to recognize arbitrary 
> patterns that include punctuation, such as POW/MIA, 401(k) or C++. 
> Additionally, they wanted 401k and 401(k) to match documents with either 
> style of reference, and NOT match documents that happen to have isolated 401 
> or k tokens (i.e. not documents about the HTTP status code). And of course we 
> wanted to give up as few as possible of the text analysis features they were 
> already using.
> This filter, in conjunction with the filters from LUCENE-9572 and LUCENE-9574 
> and one Solr-specific filter in SOLR-14597 (which re-analyzes tokens with an 
> arbitrary analyzer defined for a type in the Solr schema), achieves this.
> This filter has the job of spotting the patterns and adding the intended 
> synonym as a type to the token (from which minimal punctuation has been 
> removed). It also sets flags on the token, which are retained through the 
> analysis chain. At the very end, the type is converted to a synonym and 
> the original token(s) for that type are dropped, avoiding the match on 401 
> (for example).
> The pattern matching is specified in a file that looks like: 
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k and 
> 501(c)3, as well as C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement so the first 
> line above would create synonyms such as:
> {code}
> 401k   --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}
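
The rule format described above can be sketched in plain Java using only 
java.util.regex. This is an illustration under stated assumptions, not the 
code from the commit: the class name PatternTypingSketch and the helpers 
parseRule, parseRules, exampleRules and typeFor are hypothetical, the flags 
are parsed but not applied to anything, and the sketch assumes that a rule 
must match the whole token and that the first matching rule wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Lucene PatternTypingFilter code.
class PatternTypingSketch {

    /** One parsed rule line of the form "<flagsInt> <pattern> ::: <replacement>". */
    static final class Rule {
        final int flags;
        final Pattern pattern;
        final String template;

        Rule(int flags, Pattern pattern, String template) {
            this.flags = flags;
            this.pattern = pattern;
            this.template = template;
        }
    }

    /** Splits a rule line into its flags, regex pattern and replacement template. */
    static Rule parseRule(String line) {
        String separator = " ::: ";
        int sep = line.indexOf(separator);
        if (sep < 0) {
            throw new IllegalArgumentException("Missing ' ::: ' separator: " + line);
        }
        String left = line.substring(0, sep);
        String template = line.substring(sep + separator.length()).trim();
        int space = left.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Missing flags or pattern: " + line);
        }
        int flags = Integer.parseInt(left.substring(0, space));
        Pattern pattern = Pattern.compile(left.substring(space + 1).trim());
        return new Rule(flags, pattern, template);
    }

    static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(parseRule(line));
        }
        return rules;
    }

    /** The three example lines from the issue description, escaped for Java. */
    static List<Rule> exampleRules() {
        return parseRules(Arrays.asList(
            "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2",
            "2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3",
            "2 C\\+\\+ ::: c_plus_plus"));
    }

    /**
     * Returns the type produced by the first rule whose pattern matches the
     * whole token (an assumption of this sketch), or null if no rule matches.
     */
    static String typeFor(List<Rule> rules, String token) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(token);
            if (m.matches()) {
                // Expand $1, $2, ... in the template from the captured groups.
                StringBuffer type = new StringBuffer();
                m.appendReplacement(type, rule.template);
                return type.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        for (String token : Arrays.asList("401k", "401(k)", "503(c)", "501(c)3", "C++", "401")) {
            System.out.println(token + " --> " + typeFor(rules, token));
        }
    }
}
```

Under the whole-token assumption, 401k and 401(k) both yield legal2_401_k, 
501(c)3 falls through to the second rule and yields legal3_501_c_3, and a 
bare 401 matches no rule, so typeFor returns null for it.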



