If I understand you correctly you are asking if lucene can deal with encodings that use more than 16 bit. Well yes and no but mainly no. The support for unicode 4.0 was introduced in Java 1.5 and lucene core has still back-compat requirements for java 1.4. Lucene's analyzers make use of char[] all over the place which is a sequence of UTF-16 code unit not a code point. As I said the support for codepoints was introduced in 1.5 and I can remember that there is an issue which aims to implement support for upplementary characters (those above FFFF). Such a character is represented as 2 chars and the most of the analysis code will simply remove those characters. Have a look at this issue: https://issues.apache.org/jira/browse/LUCENE-1689 ( @ Robert are you working on this?)
I'm sure there will be support for that in lucene 3.1. Simon On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<mikerthom...@gmail.com> wrote: > Is Lucene capable of handling UCS4 data natively? > > Thanks, > > Mike > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org