So Solr trunk should already handle Unicode above BMP for field type string? Strange...
Regards,
Bernd

On 25.02.2011 14:40, Uwe Schindler wrote:
> Solr trunk is using Lucene trunk since Lucene and Solr are merged.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
>> Sent: Friday, February 25, 2011 2:19 PM
>> To: simon.willna...@gmail.com
>> Cc: java-user@lucene.apache.org
>> Subject: Re: which unicode version is supported with lucene
>>
>> Hi Simon,
>>
>> actually I'm working with Solr from trunk but followed the problem all the
>> way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
>>
>> My field is:
>> <field name="dcdescription" type="string" indexed="false" stored="true" />
>>
>> No analysis done at all, just stored the content for result display.
>> But the result is unpredictable and can end in invalid utf-8 code.
>>
>> Regards,
>> Bernd
>>
>>
>> On 25.02.2011 13:43, Simon Willnauer wrote:
>>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> Hi Simon,
>>>>
>>>> thanks for the details.
>>>>
>>>> My platform supports and uses code above BMP (0x10000 and up).
>>>> So the limit is Lucene.
>>>> Don't know how to handle this problem.
>>>> Maybe deleting all code above BMP...???
>>>
>>> the code will work fine even if they are in your text. It will just not
>>> respect them, maybe throw them away during tokenization etc., so it
>>> really depends on what you are using on the analyzer side. Maybe you can
>>> give us a little more detail on what you use for analysis. One option
>>> would be to build 3.1 from the source and use the analyzers from
>>> there?!
>>>
>>>>
>>>> Good to hear that Lucene 3.1 will come soon.
>>>> Any rough estimation when Lucene 3.1 will be available?
>>>
>>> I hope it will happen within the next 4 weeks
>>>
>>> simon
>>>
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>>> On 25.02.2011 12:04, Simon Willnauer wrote:
>>>>> Hey Bernd,
>>>>>
>>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>> Dear list,
>>>>>>
>>>>>> a very basic question about lucene, which version of unicode can be
>>>>>> handled (indexed and searched) with lucene?
>>>>>
>>>>> if you ask for what the indexer / query can handle then it is really
>>>>> what UTF-8 can handle. Strings passed to the writer / reader are
>>>>> converted to UTF-8 internally (rough picture). On Trunk we are
>>>>> indexing bytes only (UTF-8 bytes by default). So the question is
>>>>> really what your platform supports in terms of utilities / operations
>>>>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>>>>> have the possibility to respect code points which are above the BMP.
>>>>> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
>>>>> from moving forward to Unicode 4.0. If you look at Character.java
>>>>> all methods have been converted to operate on UTF-32 code points
>>>>> instead of UTF-16 code units as in Java 1.4.
>>>>>
>>>>> Since 3.0 is a Java Generics / move to Java 1.5 only release, these
>>>>> APIs are not in use yet in the latest released version. Lucene 3.1
>>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
>>>>> codebase (I think there are one or two which still have problems, I
>>>>> should check... Robert did we fix all NGram stuff?).
>>>>>
>>>>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
>>>>> support characters within the BMP <= 0xFFFF. 3.1 (to be released
>>>>> soon I hope) will fix most of the problems and includes ICU based
>>>>> analysis for full Unicode 5 support.
>>>>>
>>>>> hope that helps
>>>>>
>>>>> simon
>>>>>>
>>>>>> It looks like lucene can only handle the very old Unicode 2.0 but
>>>>>> not the newer 3.1 version (4 byte utf-8 unicode).
>>>>>>
>>>>>> Is that true?
>>>>>>
>>>>>> Regards,
>>>>>> Bernd
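
The difference between UTF-16 code units and code points that runs through this thread can be checked with plain JDK calls. The sketch below is not Lucene or Solr code, and the class and method names are made up for illustration. It assumes a JDK that ships java.nio.charset.StandardCharsets (Java 7 or later, which is newer than the Java 1.5 baseline discussed in the thread). It shows that one character above the BMP takes two UTF-16 code units and four UTF-8 bytes, and that cutting a string between the two surrogates gives text that no longer survives a UTF-8 round trip. Whether that is what damages the stored field described in the thread is not established there; it is only one way such damage can arise.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene or Solr.
final class SupplementaryCodePoints {

    /** UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Size of the string when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if encoding to UTF-8 and decoding again returns the same string. */
    static boolean roundTripsThroughUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        // U+1D50A lies above the BMP (code point >= 0x10000).
        String above = new String(Character.toChars(0x1D50A));

        System.out.println("UTF-16 code units: " + utf16Units(above));
        System.out.println("code points:       " + codePoints(above));
        System.out.println("UTF-8 bytes:       " + utf8Bytes(above));
        System.out.println("round trip ok:     " + roundTripsThroughUtf8(above));

        // Cutting by char index splits the surrogate pair and leaves a
        // lone high surrogate, which cannot be encoded as valid UTF-8.
        String broken = above.substring(0, 1);
        System.out.println("round trip after cut: " + roundTripsThroughUtf8(broken));
    }
}
```

Code that walks a string char by char, or truncates it at an arbitrary char index, is where this usually goes wrong; stepping with String.codePointAt and Character.charCount keeps surrogate pairs together.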