[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926589#comment-16926589
 ] 

Robert Muir commented on LUCENE-8972:
--------------------------------------

I agree it's a good idea. A couple of thoughts about the impl you linked to:
* It's not clear to me that the incremental conversion works for all cases. I 
think this is easily solved with tests (test helpers like checkRandomData 
should "spoon-feed" reader data in various amounts). It also seems like it 
eventually just reads and transforms the entire document in RAM, which is 
important to avoid for large documents. Maybe use of APIs such as 
finishTransliteration/getMaximumContextLength is helpful there (see the 
sketch after this list).
* The TokenFilter has a hack to give better performance on common inputs, 
particularly by avoiding a lot of CPU when the input doesn't match the filter 
anyway (e.g. latin-1 in your example). Otherwise it's painfully slow. See 
the code where it says "this is cheating".
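
A minimal sketch of both points, assuming ICU4J's keyboard-style incremental 
transliteration API (the "Traditional-Simplified" transform and the chunk 
sizes are illustrative, not necessarily what the CharFilter would use):

{code:java}
import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;

public class IncrementalTransliterateSketch {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");

    // The "this is cheating" trick from ICUTransformFilter: constrain the
    // transliterator to its own source set, so input that cannot match
    // (e.g. latin-1) is skipped cheaply instead of burning CPU.
    if (t.getFilter() == null && !t.getSourceSet().isEmpty()) {
      t.setFilter(t.getSourceSet());
    }

    ReplaceableString buffer = new ReplaceableString();
    Transliterator.Position pos = new Transliterator.Position();

    // Feed input in chunks, as a CharFilter draining a Reader would.
    // transliterate() advances pos.start only past text whose conversion
    // is already unambiguous, so the whole document never has to sit in
    // RAM at once; getMaximumContextLength() tells the caller how much
    // preceding context a rule may look at, i.e. how much must be kept.
    for (String chunk : new String[] {"abc ", "紅樓", "夢"}) {
      t.transliterate(buffer, pos, chunk);
    }

    // End of input: force any still-pending context to be transliterated.
    t.finishTransliteration(buffer, pos);
    System.out.println(buffer); // abc 红楼梦
  }
}
{code}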

> CharFilter version of ICUTransformFilter, to better support dictionary-based 
> tokenization
> ---------------------------------------------------------------------------
>
> Key: LUCENE-8972
> URL: https://issues.apache.org/jira/browse/LUCENE-8972
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>  Affects Versions: master (9.0), 8.2
>  Reporter: Michael Gibney
>  Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only 
> post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly 
> dictionary-based) may assume pre-normalized input (e.g., for Chinese 
> characters, there may be an assumption of traditional-only or simplified-only 
> input characters, at the level of either all input, or 
> per-dictionary-defined-token).
> The potential usefulness of a CharFilter that exposes the ICU Transliteration 
> API was suggested in a [thread on the Solr mailing 
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
>  and my hope is that this issue can facilitate more detailed discussion of 
> the proposed addition.
> Concrete examples of mixed traditional/simplified characters that the 
> ICUTokenizer currently tokenizes differently:
>  * 红楼梦 (SSS)
>  * 紅樓夢 (TTT)
>  * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are 
> included in the [CJ dictionary that backs 
> ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
>  but the last (a mixture of traditional and simplified characters) is not, 
> and is not recognized as a token. Even _if_ we assume this to be an 
> intentional omission from the dictionary that results in behavior that could 
> be desirable for some use cases, there are surely some use cases that would 
> benefit from a more permissive dictionary-based tokenization strategy (such 
> as could be supported by pre-tokenizer transliteration; see the sketch below).
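
To make the example concrete: a pre-tokenizer transliteration pass folds the 
mixed TST form onto the simplified-only (SSS) form that cjdict.txt does 
contain. A minimal sketch, assuming ICU4J's stock "Traditional-Simplified" 
transform (which transform the proposed CharFilter would actually apply is of 
course a configuration choice):

{code:java}
import com.ibm.icu.text.Transliterator;

public class NormalizeBeforeTokenizing {
  public static void main(String[] args) {
    Transliterator toSimplified = Transliterator.getInstance("Traditional-Simplified");
    System.out.println(toSimplified.transliterate("紅楼夢")); // 红楼梦 (TST -> SSS)
    System.out.println(toSimplified.transliterate("紅樓夢")); // 红楼梦 (TTT -> SSS)
  }
}
{code}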



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-09 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925841#comment-16925841
 ] 

Michael Gibney commented on LUCENE-8972:
----------------------------------------

For consideration, I believe this issue has already been tackled by [~cbeer] 
and [~mejackreed]; the resulting implementation can be found 
[here|https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java].
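
A sketch of how such a CharFilter would slot in ahead of the dictionary-based 
tokenizer (assuming the linked class's constructor has the shape 
ICUTransformCharFilter(Reader, Transliterator); the analyzer wiring below is 
illustrative, not a definitive integration):

{code:java}
import java.io.Reader;

import com.ibm.icu.text.Transliterator;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

import edu.stanford.lucene.analysis.ICUTransformCharFilter;

public class TransformThenTokenizeAnalyzer extends Analyzer {
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Normalize before the tokenizer sees the text, so the dictionary
    // lookup operates on consistently simplified input.
    return new ICUTransformCharFilter(reader,
        Transliterator.getInstance("Traditional-Simplified"));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new ICUTokenizer();
    return new TokenStreamComponents(tokenizer);
  }
}
{code}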
