Re: Quick question about Lucene and UCS4

Simon Willnauer Fri, 31 Jul 2009 07:26:31 -0700

If I understand you correctly, you are asking whether Lucene can deal with
encodings that use more than 16 bits. Well, yes and no, but mainly no.
Support for Unicode 4.0 was introduced in Java 1.5, and Lucene core
still has back-compat requirements for Java 1.4. Lucene's analyzers
use char[] all over the place, which is a sequence of UTF-16
code units, not code points. As I said, code point support was
introduced in 1.5, and I remember there is an issue that aims
to implement support for supplementary characters (those above U+FFFF).
Such a character is represented as 2 chars (a surrogate pair), and most of
the analysis code will simply remove those characters.
Have a look at this issue:
https://issues.apache.org/jira/browse/LUCENE-1689 (@Robert, are you
working on this?)
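
To illustrate the char vs. code point difference, here is a minimal sketch in
plain Java 1.5. It is not Lucene code: the class and method names are made up
for illustration, and the per-char filter only approximates what char[]-based
analysis code tends to do with supplementary characters.

```java
public class SupplementaryDemo {

    // Looks at one UTF-16 code unit at a time, as char[]-based code does.
    // Each half of a surrogate pair fails Character.isLetter(char),
    // so a supplementary letter is silently dropped.
    static String filterByChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Uses the code point API added in Java 1.5, so a surrogate pair
    // is read and tested as one character.
    static String filterByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', U+10400 (DESERET CAPITAL LETTER LONG I) as a surrogate pair, 'b'
        String text = "a\uD801\uDC00b";
        System.out.println(text.length());                         // 4 code units
        System.out.println(text.codePointCount(0, text.length())); // 3 code points
        System.out.println(filterByChar(text).length());           // 2, letter lost
        System.out.println(filterByCodePoint(text).length());      // 4, letter kept
    }
}
```

The second variant needs Java 1.5, which is exactly why it cannot be used in
core while the 1.4 requirement stands.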


I'm sure there will be support for that in Lucene 3.1.

Simon
On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<[email protected]> wrote:
> Is Lucene capable of handling UCS4 data natively?
>
> Thanks,
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

