Re: which unicode version is supported with lucene

Bernd Fehling Fri, 25 Feb 2011 05:49:12 -0800

So Solr trunk should already handle Unicode above the BMP for the field type "string"?
Strange...


Regards,
Bernd

On 25.02.2011 14:40, Uwe Schindler wrote:
> Solr trunk has been using Lucene trunk since Lucene and Solr were merged.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
>> -----Original Message-----
>> From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
>> Sent: Friday, February 25, 2011 2:19 PM
>> To: simon.willna...@gmail.com
>> Cc: java-user@lucene.apache.org
>> Subject: Re: which unicode version is supported with lucene
>>
>> Hi Simon,
>>
>> actually I'm working with Solr from trunk but followed the problem all the
>> way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
>>
>> My field is:
>> <field name="dcdescription" type="string" indexed="false" stored="true" />
>>
>> No analysis is done at all; the content is just stored for result display.
>> But the result is unpredictable and can end up as invalid UTF-8.
>>
>> Regards,
>> Bernd
>>
>>
>> On 25.02.2011 13:43, Simon Willnauer wrote:
>>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> Hi Simon,
>>>>
>>>> thanks for the details.
>>>>
>>>> My platform supports and uses code points above the BMP (0x10000 and up).
>>>> So the limit is Lucene.
>>>> I don't know how to handle this problem.
>>>> Maybe by deleting all code points above the BMP...???
>>>
>>> The code will work fine even if they are in your text. It will just not
>>> respect them, and may throw them away during tokenization etc., so it
>>> really depends on what you are using on the analyzer side. Maybe you can
>>> give us a little more detail on what you use for analysis. One option
>>> would be to build 3.1 from source and use the analyzers from
>>> there?!
>>>
>>>>
>>>> Good to hear that Lucene 3.1 will come soon.
>>>> Any rough estimate of when Lucene 3.1 will be available?
>>>
>>> I hope it will happen within the next 4 weeks
>>>
>>> simon
>>>
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>>> On 25.02.2011 12:04, Simon Willnauer wrote:
>>>>> Hey Bernd,
>>>>>
>>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>> Dear list,
>>>>>>
>>>>>> a very basic question about Lucene: which version of Unicode can be
>>>>>> handled (indexed and searched) with Lucene?
>>>>>
>>>>> If you are asking what the indexer / query can handle, then it is really
>>>>> what UTF-8 can handle. Strings passed to the writer / reader are
>>>>> converted to UTF-8 internally (rough picture). On trunk we are
>>>>> indexing bytes only (UTF-8 bytes by default). So the question is
>>>>> really what your platform supports in terms of utilities / operations
>>>>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>>>>> have the possibility to respect code points which are above the BMP.
>>>>> Lucene 2.9 still has Java 1.4 system requirements, which prevented us
>>>>> from moving forward to Unicode 4.0. If you look at Character.java
>>>>> in Java 1.5, its methods gained variants that operate on code points
>>>>> (int values, effectively UTF-32) in addition to the UTF-16 code units
>>>>> (char values) that Java 1.4 was limited to.
>>>>>
>>>>> Since 3.0 is purely a Java generics / move-to-Java-1.5 release, these
>>>>> APIs are not in use yet in the latest released version. Lucene 3.1
>>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
>>>>> codebase (I think there are one or two which still have problems, I
>>>>> should check... Robert, did we fix all the NGram stuff?).
>>>>>
>>>>> So, long story short: the 3.0 / 2.9 Tokenizers and TokenFilters will only
>>>>> support characters within the BMP (<= 0xFFFF). 3.1 (to be released
>>>>> soon, I hope) will fix most of the problems and includes ICU-based
>>>>> analysis for full Unicode 5 support.
>>>>>
>>>>> hope that helps
>>>>>
>>>>> simon
>>>>>>
>>>>>> It looks like Lucene can only handle the very old Unicode 2.0 but
>>>>>> not the newer 3.1 version (4-byte UTF-8 Unicode).
>>>>>>
>>>>>> Is that true?
>>>>>>
>>>>>> Regards,
>>>>>> Bernd
>>>>>>
>>>>
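
Note on the code unit / code point distinction discussed above: the following is a minimal sketch in plain Java (no Lucene classes), not the code of any actual Lucene tokenizer. The class and method names (CodePointDemo, lettersByChar, lettersByCodePoint) are made up for illustration. It shows why logic that inspects one char at a time tends to drop characters above the BMP: each half of a surrogate pair is not a letter on its own, while the combined code point is.

```java
public class CodePointDemo {

    // Counts letters by looking at each UTF-16 code unit (char) in isolation,
    // the way pre-Java-1.5 style code has to work.
    static int lettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Counts letters by decoding full code points, using the int-based
    // Character methods available since Java 1.5.
    static int lettersByCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                count++;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' followed by U+10400 (DESERET CAPITAL LETTER LONG I),
        // which Java stores as the surrogate pair D801 DC00.
        String s = "a\uD801\uDC00";
        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(lettersByChar(s));                // 1
        System.out.println(lettersByCodePoint(s));           // 2
    }
}
```

The char-based loop sees only one letter because the two surrogate halves are classified as surrogates, not letters; the code point loop sees both.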

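Note on the "invalid UTF-8" symptom mentioned above: the thread does not establish its cause. One way such output can arise in general (an assumption here, not a diagnosis of the Solr case) is a surrogate pair being split somewhere, leaving a lone surrogate that has no valid UTF-8 encoding. The sketch below uses only the JDK; Utf8Demo and its method names are hypothetical, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if every surrogate in the string is properly paired,
    // i.e. the string can be encoded as valid UTF-8.
    static boolean isWellFormed(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) as a surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(utf8Length(clef));     // 4 (a 4-byte UTF-8 sequence)
        System.out.println(isWellFormed(clef));   // true

        // Cutting the string in the middle of the pair leaves a lone
        // high surrogate, which cannot be encoded as UTF-8.
        String broken = clef.substring(0, 1);
        System.out.println(isWellFormed(broken)); // false
    }
}
```

A check like isWellFormed can help narrow down at which stage of a pipeline the text stops being valid.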