[ 
https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741098#action_12741098
 ] 

Uwe Schindler commented on LUCENE-1793:
---------------------------------------

I would also strongly suggest removing these custom charsets. They do not 
conform to Unicode, because they use codepoint mappings that simply define a 
US-ASCII char for some of the input chars. The problems begin with 
mixed-language texts.
This strange (and wrong) mapping can also be seen in the tests: they load a 
KOI-8 file with encoding ISO-8859-1 (to get the native bytes as chars) and then 
map it. This is very bad!
The analyzers should really work only on Unicode codepoints and nothing more. 
For backwards compatibility with old indexes (which are encoded using this 
strange mapping), we have to preserve the charsets for a while, but we should 
deprecate all of them and leave only UTF-16 (Java chars) as input.
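
To illustrate the decoding problem described above, here is a minimal sketch 
in plain Java (it does not use any Lucene API). The class name, method names 
and sample word are made up for illustration, and it assumes the JRE ships the 
KOI8-R charset. It shows what happens when KOI8-R bytes are read as 
ISO-8859-1, compared with decoding them properly before analysis:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Koi8DecodeSketch {
    // The Russian word "mir" (U+043C U+0438 U+0440) encoded as KOI8-R bytes.
    static final byte[] KOI8_BYTES = { (byte) 0xCD, (byte) 0xC9, (byte) 0xD2 };

    // What the tests effectively do: every byte becomes the char with the
    // same numeric value, so the result is Latin-1 letters, not Cyrillic.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Proper decoding: the bytes become real Unicode Cyrillic codepoints.
    // Assumes the runtime provides a charset named "KOI8-R".
    static String decodeAsKoi8(byte[] raw) {
        return new String(raw, Charset.forName("KOI8-R"));
    }

    public static void main(String[] args) {
        for (char c : decodeAsLatin1(KOI8_BYTES).toCharArray()) {
            System.out.printf("latin1: U+%04X%n", (int) c);
        }
        for (char c : decodeAsKoi8(KOI8_BYTES).toCharArray()) {
            System.out.printf("koi8-r: U+%04X%n", (int) c);
        }
    }
}
```

With the second variant the conversion happens once, at the reader, and an 
analyzer only ever sees Unicode chars.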

You are right: to reduce index size, it would be good to also support other 
encodings in addition to UTF-8 for storage of term text.
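
As a rough illustration of the size argument, the following sketch compares 
the encoded length of a short Cyrillic term in UTF-8 and in KOI8-R. It is 
plain Java, not Lucene's term storage code; the class and method names are 
hypothetical, and it assumes the JRE provides the KOI8-R charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TermSizeSketch {
    // Number of bytes needed to store the term in the given charset.
    static int encodedLength(String term, Charset charset) {
        return term.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Three Cyrillic letters, written as escapes so the source file
        // encoding does not matter.
        String term = "\u043c\u0438\u0440";
        System.out.println("UTF-8 bytes:  "
                + encodedLength(term, StandardCharsets.UTF_8));
        System.out.println("KOI8-R bytes: "
                + encodedLength(term, Charset.forName("KOI8-R")));
    }
}
```

Cyrillic letters take two bytes each in UTF-8 but one byte each in a 
single-byte encoding such as KOI8-R, so the term text would be about half the 
size for purely Cyrillic content.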

> remove custom encoding support in Greek/Russian Analyzers
> ---------------------------------------------------------
>
>                 Key: LUCENE-1793
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1793
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>
> The Greek and Russian analyzers support custom encodings such as KOI-8; they 
> define things like lowercasing and tokenization for these.
> I think that analyzers should support unicode and that conversion/handling of 
> other charsets belongs somewhere else. 
> I would like to deprecate/remove the support for these other encodings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.