Michael Gibney created LUCENE-8972:
--------------------------------------

             Summary: CharFilter version of ICUTransformFilter, to better support 
dictionary-based tokenization
                 Key: LUCENE-8972
                 URL: https://issues.apache.org/jira/browse/LUCENE-8972
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 8.2, master (9.0)
            Reporter: Michael Gibney


The ICU Transliteration API is currently exposed through Lucene only 
post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly 
dictionary-based ones) may assume pre-normalized input (e.g., for Chinese 
characters, there may be an assumption of traditional-only or simplified-only 
input characters, either across all input or within each 
dictionary-defined token).

The potential usefulness of a CharFilter that exposes the ICU Transliteration 
API was suggested in a [thread on the Solr mailing 
list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
 and my hope is that this issue can facilitate more detailed discussion of the 
proposed addition.

Concrete examples of simplified-only, traditional-only, and mixed character 
sequences that are currently tokenized differently by ICUTokenizer:
 * 红楼梦 (SSS)
 * 紅樓夢 (TTT)
 * 紅楼夢 (TST)

The first two tokens (simplified-only and traditional-only, respectively) are 
included in the [CJ dictionary that backs 
ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
 but the last (a mixture of traditional and simplified characters) is not, and 
is not recognized as a token. Even _if_ we assume this to be an intentional 
omission from the dictionary that results in behavior that could be desirable 
for some use cases, there are surely some use cases that would benefit from a 
more permissive dictionary-based tokenization strategy (such as could be 
supported by pre-tokenizer transliteration).
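
To make the idea concrete, here is a minimal, self-contained sketch of why normalizing before a dictionary lookup helps. It deliberately does not use ICU or Lucene: the three-entry mapping table and the two-entry dictionary are hypothetical stand-ins for an ICU transform (such as Traditional-Simplified) and for the CJ dictionary, and the class and method names are made up for illustration. A real CharFilter would also need to handle offset correction, which this sketch ignores.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PreTokenizerNormalizationSketch {

    // Hypothetical stand-in for an ICU transliterator: maps only the three
    // traditional characters used in the example to their simplified forms.
    static final Map<Character, Character> TRAD_TO_SIMP =
            Map.of('紅', '红', '樓', '楼', '夢', '梦');

    // Hypothetical stand-in for the tokenizer's dictionary: it contains the
    // simplified-only and traditional-only forms, but not the mixed form.
    static final Set<String> DICTIONARY = Set.of("红楼梦", "紅樓夢");

    // Replaces each mapped character; unmapped characters pass through unchanged.
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            char mapped = TRAD_TO_SIMP.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    // A dictionary-based tokenizer can only emit a sequence as one token
    // if the sequence is present in its dictionary.
    static boolean isDictionaryToken(String s) {
        return DICTIONARY.contains(s);
    }

    public static void main(String[] args) {
        for (String s : List.of("红楼梦", "紅樓夢", "紅楼夢")) {
            System.out.println(s
                    + " raw=" + isDictionaryToken(s)
                    + " normalized=" + isDictionaryToken(normalize(s)));
        }
    }
}
```

In this toy setup the mixed form 紅楼夢 misses the dictionary as-is but matches once it has been normalized, which is the effect a pre-tokenizer CharFilter would aim for.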


