[
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926589#comment-16926589
]
Robert Muir commented on LUCENE-8972:
-------------------------------------
I agree it's a good idea. A couple of thoughts about the impl you linked to:
* it's not clear to me that the incremental conversion works for all the
cases. I think this is easily solved with tests (especially test helpers
like checkRandomData, which should "spoon-feed" reader data in various
amounts). It also seems like it eventually just reads/transforms the entire
document in RAM, which is important to avoid for large documents. Maybe use
of APIs such as finishTransliteration/getMaximumContextLength is helpful
there (see the first sketch below).
* the tokenfilter has a hack to give better performance on common inputs,
particularly by avoiding a lot of CPU when the input doesn't match the
filter anyway (e.g. latin-1 in your example). Otherwise it's painfully slow.
See the code where it says "this is cheating" (the second sketch below shows
the general idea).
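For what it's worth, here is a minimal sketch of the incremental ICU4J calls
mentioned above (Transliterator.transliterate(Replaceable, Position),
finishTransliteration, getMaximumContextLength). It is not the linked
implementation and not Lucene API; the class name, chunk size, and trimming
scheme are made up for illustration:

{code:java}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;

/**
 * Feed the reader in small chunks, transliterate whatever is unambiguous,
 * emit the finished prefix, and keep only getMaximumContextLength() chars
 * of context plus the pending tail in RAM. finishTransliteration() resolves
 * whatever is still pending at end-of-stream.
 */
public class IncrementalTransformSketch {
  public static void main(String[] args) throws IOException {
    Transliterator transform = Transliterator.getInstance("Traditional-Simplified");
    Reader input = new StringReader("紅樓夢紅楼夢"); // stand-in for a large document

    ReplaceableString buffer = new ReplaceableString();
    Transliterator.Position pos = new Transliterator.Position();
    StringBuilder output = new StringBuilder();
    char[] chunk = new char[4]; // tiny on purpose, to exercise the incremental path

    int read;
    while ((read = input.read(chunk)) != -1) {
      // Append new input (real code must also avoid splitting surrogate pairs).
      buffer.replace(buffer.length(), buffer.length(), new String(chunk, 0, read));
      pos.limit = pos.contextLimit = buffer.length();
      // Converts only what is unambiguous; pos.start stops at the first
      // character the transliterator is still waiting on.
      transform.transliterate(buffer, pos);
      // Everything before pos.start is final; emit it, retaining enough
      // ante-context for the pending region.
      int emit = pos.start - transform.getMaximumContextLength();
      if (emit > 0) {
        output.append(buffer.substring(0, emit));
        buffer.replace(0, emit, "");
        pos.contextStart = Math.max(0, pos.contextStart - emit);
        pos.start -= emit;
        pos.limit -= emit;
        pos.contextLimit -= emit;
      }
    }
    transform.finishTransliteration(buffer, pos); // resolve the pending tail
    output.append(buffer.toString());
    System.out.println(output); // expected: 红楼梦红楼梦
  }
}
{code}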
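And a sketch of the kind of fast path the "this is cheating" comment is
about: before paying for transliteration, scan the input against the
transform's source set and skip the call entirely when nothing can match.
This is an illustration only, not the actual ICUTransformFilter code:

{code:java}
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UnicodeSet;

/**
 * Skip Transliterator.transliterate() entirely when no character in the
 * input can be affected (e.g. pure latin-1 through a Traditional-Simplified
 * transform), since the transliterator itself is comparatively slow.
 */
public class TransformFastPathSketch {

  static boolean mayTransform(UnicodeSet sourceSet, String text) {
    // Walk code points; bail out at the first one the transform could touch.
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (sourceSet.contains(cp)) {
        return true;
      }
      i += Character.charCount(cp);
    }
    return false;
  }

  public static void main(String[] args) {
    Transliterator transform = Transliterator.getInstance("Traditional-Simplified");
    // The set of all characters this transform may modify.
    UnicodeSet sourceSet = transform.getSourceSet();

    for (String input : new String[] {"plain latin text", "紅樓夢"}) {
      if (mayTransform(sourceSet, input)) {
        System.out.println(input + " -> " + transform.transliterate(input));
      } else {
        System.out.println(input + " -> unchanged (transform skipped)");
      }
    }
  }
}
{code}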
> CharFilter version of ICUTransformFilter, to better support dictionary-based
> tokenization
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8972
> URL: https://issues.apache.org/jira/browse/LUCENE-8972
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: master (9.0), 8.2
> Reporter: Michael Gibney
> Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only
> post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly
> dictionary-based) may assume pre-normalized input (e.g., for Chinese
> characters, there may be an assumption of traditional-only or simplified-only
> input characters, at the level of either all input, or
> per-dictionary-defined-token).
> The potential usefulness of a CharFilter that exposes the ICU Transliteration
> API was suggested in a [thread on the Solr mailing
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
> and my hope is that this issue can facilitate more detailed discussion of
> the proposed addition.
> Concrete examples of mixed traditional/simplified characters that are
> currently tokenized differently by the ICUTokenizer:
> * 红楼梦 (SSS)
> * 紅樓夢 (TTT)
> * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are
> included in the [CJ dictionary that backs
> ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
> but the last (a mixture of traditional and simplified characters) is not,
> and is not recognized as a token. Even _if_ we assume this to be an
> intentional omission from the dictionary that results in behavior that could
> be desirable for some use cases, there are surely some use cases that would
> benefit from a more permissive dictionary-based tokenization strategy (such
> as could be supported by pre-tokenizer transliteration).
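For illustration, here is a minimal sketch of the pre-tokenizer
transliteration idea described above, assuming ICU4J's Traditional-Simplified
transform (the class name is made up for the example). All three variants
normalize to the simplified-only form, the one that is present in cjdict.txt,
so a dictionary-based tokenizer downstream would see a single consistent
surface form:

{code:java}
import com.ibm.icu.text.Transliterator;

/** Normalize traditional/mixed variants to simplified before tokenization. */
public class PreTokenizerNormalizeSketch {
  public static void main(String[] args) {
    Transliterator toSimplified = Transliterator.getInstance("Traditional-Simplified");
    for (String variant : new String[] {"红楼梦", "紅樓夢", "紅楼夢"}) {
      // Each line should print "... -> 红楼梦".
      System.out.println(variant + " -> " + toSimplified.transliterate(variant));
    }
  }
}
{code}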