[ https://issues.apache.org/jira/browse/LUCENE-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-4381:
--------------------------------
Attachment: LUCENE-4381.patch
A hacked-up patch for testing.
I think it would be nice to offer the dictionary-based CJK segmentation as an
option. I'm not sure yet how good the results will be on average (maybe I can
enlist Christian to help investigate).
So as a test I just added a boolean option which, if enabled, keeps all
Han/Hiragana/Katakana text marked as "Chinese/Japanese" (it uses the ISO 15924
Japanese script code, but I overrode toString to try to prevent confusion).
Seems to work ok: some trivial snippets from smartcn and kuromoji are analyzed
fine, and testRandomStrings is happy :)
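To illustrate the idea behind the option (this is not the code in the patch),
here is a minimal sketch that uses only the JDK's Character.UnicodeScript rather
than ICU. The class name, the method names and the keepCjkTogether parameter are
all hypothetical. The sketch assumes the option's effect can be modelled as
"report one shared label for Han, Hiragana and Katakana code points".

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch, not the patch: models the boolean option with JDK classes only.
final class CjkScriptSketch {

    // Assumed label, mirroring the overridden toString described in the comment.
    static final String CJK_LABEL = "Chinese/Japanese";

    // True for the three scripts that the option groups together.
    static boolean isCjk(UnicodeScript script) {
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA;
    }

    // Returns the shared label for Han/Hiragana/Katakana when the (hypothetical)
    // option is enabled; otherwise returns the JDK's own script name.
    static String scriptLabel(int codePoint, boolean keepCjkTogether) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        if (keepCjkTogether && isCjk(script)) {
            return CJK_LABEL;
        }
        return script.toString();
    }

    public static void main(String[] args) {
        // U+6F22 (Han), U+304B (Hiragana), U+30AB (Katakana), 'a' (Latin)
        int[] samples = {0x6F22, 0x304B, 0x30AB, 'a'};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> " + scriptLabel(cp, true));
        }
    }
}
```

With the flag off, each code point keeps its individual script, so a run of
mixed Han and Kana would be split at every script boundary. With it on, the
whole run carries one label and can be handed to a dictionary-based segmenter
as a single unit.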
> support unicode 6.2
> -------------------
>
> Key: LUCENE-4381
> URL: https://issues.apache.org/jira/browse/LUCENE-4381
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/analysis
> Reporter: Robert Muir
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4381.patch
>
>
> ICU will release a new version in about a month.
> They have a version for testing
> (http://site.icu-project.org/download/milestone) already out with some
> interesting features, e.g. dictionary-based CJK segmentation.
> This issue is just to test it out/integrate the new stuff/etc. We should try
> out the automation Steve did as well.