Fwd: Lucene does NOT use UTF-8.

Marvin Humphrey Sat, 27 Aug 2005 07:06:04 -0700

Greets,

Discussion moved from the users list as per suggestion...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org

Greets,

[crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]]

I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered.

Lucene should not be advertising that it uses "standard UTF-8" -- or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8. The two distinguishing characteristics of "Modified UTF-8" are the treatment of codepoints above the BMP (which are written as surrogate pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather than 0000 0000. Both of these became illegal as of Unicode 3.1 (IIRC), because they are not shortest-form and non-shortest-form UTF-8 presents a security risk.
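
To make the two differences concrete, here is a minimal Java sketch that compares the JDK's own "Modified UTF-8" writer (DataOutputStream.writeUTF) with standard UTF-8 (String.getBytes). The class name ModifiedUtf8Demo and its helpers are made up for illustration, and I am assuming the JDK writer is representative of the format described above; this does not call Lucene's own code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encode with the JDK's Modified UTF-8 writer and drop the
    // 2-byte length prefix that writeUTF puts in front of the data.
    public static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated uppercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0";                                       // U+0000
        String clef = new String(Character.toChars(0x1D11E));    // a codepoint above the BMP

        System.out.println(hex(modifiedUtf8(nul)));                        // C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));     // 00
        System.out.println(hex(modifiedUtf8(clef)));                       // ED A0 B4 ED B4 9E
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8)));    // F0 9D 84 9E
    }
}
```

The null comes out as two bytes instead of one, and the codepoint above the BMP comes out as two 3-byte surrogate halves instead of one 4-byte sequence.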

The documentation should really state that Lucene stores strings in a Java-only adulteration of UTF-8, unsuitable for interchange. Since Perl uses true shortest-form UTF-8 as its native encoding, Plucene would have to jump through two efficiency-killing hoops in order to write files that would not choke Lucene: instead of writing out its true, legal UTF-8 directly, it would be necessary to first translate to UTF-16, then duplicate the Lucene encoding algorithm from OutputStream. In theory.
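
For a sense of what "duplicating the encoding algorithm" would involve, here is a rough sketch of the second hoop, written in Java for brevity. It walks a string one UTF-16 code unit at a time and applies the Modified UTF-8 rules as documented for java.io.DataInput. ModifiedUtf8Encoder and encode are hypothetical names, and this is not copied from Lucene's OutputStream; I am assuming Lucene follows the same per-code-unit rules.

```java
import java.io.ByteArrayOutputStream;

public class ModifiedUtf8Encoder {
    // Hypothetical helper: encodes each UTF-16 code unit on its own,
    // so surrogate halves are never combined into one 4-byte sequence.
    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // Plain ASCII (except null): one byte.
                out.write(c);
            } else if (c <= 0x07FF) {
                // Two bytes. This branch also catches U+0000,
                // which therefore becomes C0 80.
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Three bytes, including each half of a surrogate pair.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(encode("foo").length);  // 3
        System.out.println(encode("\0").length);   // 2
        System.out.println(encode(clef).length);   // 6
    }
}
```

A Perl port would first need to turn its native UTF-8 into UTF-16 code units just to feed a loop like this one, which is the first hoop.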

Below you will find a simple Perl script which illustrates what happens when Perl encounters malformed UTF-8. Run it (you need Perl 5.8 or higher) and you will see why even if I thought it was a good idea to emulate the Java hack for encoding "Modified UTF-8", trying to make it work in practice would be a nightmare.

If Plucene were to write legal UTF-8 strings to its index files, Java Lucene would misbehave and possibly blow up any time a string contained either a 4-byte character or a null byte. On the flip side, Perl will spew warnings like crazy and possibly blow up whenever it encounters a Lucene-encoded null or surrogate pair. The potential blowups are due to the fact that Lucene and Plucene will not agree on how many characters a string contains, resulting in overruns or underruns.
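
The counting disagreement is easy to see from the Java side alone. In this small sketch (CharCountDemo and its two helpers are made-up names), a string holding one codepoint above the BMP is counted once in UTF-16 code units, which is what Java's String.length() reports, and once in codepoints, which is how a reader working in true UTF-8 would count. That a length stored in an index is counted in code units is my assumption here, not something this sketch verifies.

```java
public class CharCountDemo {
    // Count the way Java's String does: UTF-16 code units.
    public static int utf16Units(String s) {
        return s.length();
    }

    // Count Unicode codepoints, as a reader of legal UTF-8 would.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a", one codepoint above the BMP (U+1D11E), then "b".
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
        System.out.println(utf16Units(s));  // 4
        System.out.println(codePoints(s));  // 3
    }
}
```

A reader expecting 4 characters but finding 3 (or the reverse) will run past the end of the string or stop short of it.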

I am hoping that the answer to this will be a fix to the encoding mechanism in Lucene so that it really does use legal UTF-8. The most efficient way to go about this has not yet presented itself.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#----------------------------------------

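#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

# "foo" followed by the two-byte "Modified UTF-8" null (C0 80).
my $data = "foo\xC0\x80\n";

# Open an in-memory filehandle on the raw bytes with the :utf8 layer.
open(my $virtual_filehandle, "+<:utf8", \$data)
    or die "Can't open in-memory filehandle: $!";
print <$virtual_filehandle>;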



