[ https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Gibney updated LUCENE-8972:
-----------------------------------
Summary: CharFilter version of ICUTransformFilter, to better support
dictionary-based tokenization (was: CharFilter version ICUTransformFilter, to
better support dictionary-based tokenization)
> CharFilter version of ICUTransformFilter, to better support dictionary-based
> tokenization
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-8972
> URL: https://issues.apache.org/jira/browse/LUCENE-8972
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: master (9.0), 8.2
> Reporter: Michael Gibney
> Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only
> post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly
> dictionary-based ones) may assume pre-normalized input; e.g., for Chinese
> characters, there may be an assumption of traditional-only or simplified-only
> input characters, either across all input or per dictionary-defined token.
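> For context, here is a minimal sketch of the current post-tokenizer
> arrangement, using the existing ICUTokenizer and ICUTransformFilter APIs
> (the wrapper class and the choice of the Traditional-Simplified transform
> are mine, for illustration). The transform only sees tokens whose
> boundaries the dictionary-based segmentation has already decided:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.icu.ICUTransformFilter;
> import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
>
> import com.ibm.icu.text.Transliterator;
>
> public class PostTokenizerTransformExample {
>   public static Analyzer analyzer() {
>     return new Analyzer() {
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName) {
>         // Dictionary-based CJ segmentation runs here, on raw, un-normalized input
>         Tokenizer tokenizer = new ICUTokenizer();
>         // Transliteration is applied only after token boundaries are fixed
>         TokenStream stream = new ICUTransformFilter(
>             tokenizer, Transliterator.getInstance("Traditional-Simplified"));
>         return new TokenStreamComponents(tokenizer, stream);
>       }
>     };
>   }
> }
> {code}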
> The potential usefulness of a CharFilter that exposes the ICU Transliteration
> API was suggested in a [thread on the Solr mailing
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
> and my hope is that this issue can facilitate more detailed discussion of
> the proposed addition.
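> To make the proposal concrete, the following is a rough sketch of what such
> a CharFilter might look like. The class name and the buffer-everything
> design are assumptions for illustration only; a real implementation would
> need to transform incrementally and record offset corrections (via
> BaseCharFilter's addOffCorrectMap) for transforms that change string
> length, which is the hard part here:
> {code:java}
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.charfilter.BaseCharFilter;
>
> import com.ibm.icu.text.Transliterator;
>
> /** Hypothetical sketch: buffers the whole input, transliterates, replays. */
> public class TransliterationCharFilter extends BaseCharFilter {
>   private final Transliterator transform;
>   private String transformed; // lazily computed transliterated input
>   private int pos;
>
>   public TransliterationCharFilter(Reader in, Transliterator transform) {
>     super(in);
>     this.transform = transform;
>   }
>
>   @Override
>   public int read(char[] cbuf, int off, int len) throws IOException {
>     if (transformed == null) {
>       // Read the entire underlying input before transliterating; offsets
>       // are left uncorrected, which is only safe for length-preserving
>       // transforms
>       StringBuilder sb = new StringBuilder();
>       char[] buf = new char[1024];
>       int n;
>       while ((n = input.read(buf)) != -1) {
>         sb.append(buf, 0, n);
>       }
>       transformed = transform.transliterate(sb.toString());
>     }
>     if (pos >= transformed.length()) {
>       return -1; // EOF
>     }
>     int n = Math.min(len, transformed.length() - pos);
>     transformed.getChars(pos, pos + n, cbuf, off);
>     pos += n;
>     return n;
>   }
> }
> {code}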
> Concrete examples of simplified, traditional, and mixed character strings
> that are currently tokenized differently by the ICUTokenizer:
> * 红楼梦 (SSS)
> * 紅樓夢 (TTT)
> * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are
> included in the [CJ dictionary that backs
> ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
> but the last (a mixture of traditional and simplified characters) is not,
> and is not recognized as a token. Even _if_ we assume this to be an
> intentional omission from the dictionary that results in behavior that could
> be desirable for some use cases, there are surely some use cases that would
> benefit from a more permissive dictionary-based tokenization strategy (such
> as could be supported by pre-tokenizer transliteration).
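> For illustration (assuming the stock ICU Traditional-Simplified transform),
> pre-normalizing the mixed string yields the simplified form that the
> dictionary does contain, which is exactly what a pre-tokenizer CharFilter
> would enable:
> {code:java}
> import com.ibm.icu.text.Transliterator;
>
> public class MixedScriptDemo {
>   public static void main(String[] args) {
>     Transliterator t2s = Transliterator.getInstance("Traditional-Simplified");
>     // The mixed (TST) string normalizes to the simplified (SSS) dictionary entry
>     System.out.println(t2s.transliterate("紅楼夢")); // prints 红楼梦
>   }
> }
> {code}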