[jira] Commented: (LUCENE-1689) supplementary character handling

Robert Muir (JIRA) Mon, 16 Nov 2009 10:04:03 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778421#action_12778421
 ]


Robert Muir commented on LUCENE-1689:
-------------------------------------

Yonik, or anyone else, please let me know your thoughts on the following:

bq. I don't see a real back compat issue... I can't imagine anyone relying on 
the fact that >BMP chars wouldn't be lowercased. To rely on that would also be 
relying on undocumented behavior.
bq. Ah, OK. Actually it just occurred to me that this would also require 
reindexing, otherwise queries that hit documents in the past would mysteriously 
start missing them (for text outside the BMP).

What should our approach be here with respect to index back compat?
For the issues mentioned here, I can't see >BMP text working correctly for 
anyone today, but you are right that it will change results.
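
To make the "it will change results" point concrete, here is a minimal sketch in plain Java (no Lucene classes). The class and method names are made up for illustration: lowerByChar stands in for the existing char-at-a-time behavior, and lowerByCodePoint for a supplementary-aware replacement. A term indexed with the first no longer equals the same text analyzed at query time with the second.

```java
public class SupplementaryLowercaseDemo {
    // Illustrative stand-in for char-at-a-time lowercasing: each UTF-16 code
    // unit is mapped on its own, so surrogate halves pass through unchanged.
    static String lowerByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Illustrative stand-in for code-point-aware lowercasing: a surrogate
    // pair is decoded, mapped, and re-encoded as a single unit.
    static String lowerByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 (a Deseret capital letter, outside the BMP) followed by 'A'.
        String text = new String(Character.toChars(0x10400)) + "A";
        String indexedTerm = lowerByChar(text);      // what an old index would hold
        String queryTerm = lowerByCodePoint(text);   // what a fixed analyzer would produce
        System.out.println(indexedTerm.equals(queryTerm)); // false
    }
}
```

Only the supplementary character differs between the two results; the BMP letter is lowercased the same way by both.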

I don't want to break index back compat. I just wanted to mention that 
introducing Unicode 4 support while keeping API back compat and avoiding any 
performance degradation is already going to be somewhat challenging.
If we want to keep supporting the "broken" analysis components for index back 
compat, then we also have to keep a broken implementation available alongside 
the correct one (probably using Version to select between them).
In my opinion this would introduce a lot of complexity, but I will help do it 
if we must.
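
A rough sketch of what keeping both implementations behind a version switch might look like. This is only an illustration under assumptions: CompatVersion is a hypothetical enum standing in for a Version-style setting, and VersionedLowercaser is not an existing Lucene class or the attached patch.

```java
public class VersionedLowercaser {
    // Hypothetical compatibility switch; a stand-in, not Lucene's Version class.
    enum CompatVersion { LEGACY, UNICODE_4 }

    private final CompatVersion matchVersion;

    VersionedLowercaser(CompatVersion matchVersion) {
        this.matchVersion = matchVersion;
    }

    // Lowercases buffer[0..length) in place.
    void lowercase(char[] buffer, int length) {
        if (matchVersion == CompatVersion.LEGACY) {
            // "Broken" path kept for old indexes: one UTF-16 code unit at a
            // time, so supplementary characters are left untouched.
            for (int i = 0; i < length; i++) {
                buffer[i] = Character.toLowerCase(buffer[i]);
            }
            return;
        }
        // Correct path: decode a full code point, map it, write it back.
        // Assumes the lowercase mapping occupies the same number of chars
        // as the original code point.
        for (int i = 0; i < length; ) {
            int cp = Character.codePointAt(buffer, i, length);
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }
}
```

The extra complexity is visible even in this small case: every char-based component would carry two code paths plus tests for each, and the BMP-only loop has to stay around for as long as old indexes are supported.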

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, 
> LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

