Ken Krugler sent a reply to the user list. In an effort to keep all
the developers informed, I'm sending my reply to the developer list
and including his entire original post below my sig.
Ken writes...
> Since a null in the
> middle of a string is rare, as is a character outside of the BMP, a
> quick scan of the text should be sufficient to determine if it can be
> written as-is.
Let's see. I think we are looking at either two scans (one index(), one
regex) or a single regex that uses alternation. I strongly suspect two
scans are faster:
    if (   (index($string, "\xC0\x80") != -1)
        or ($string =~ /[\xF0-\xF7]/)  # lead bytes occur only in 4-byte UTF-8
    ) {
        # Process string...
    }
That would tell us whether the string needed to be specially encoded for
Java's sake on output. Yes, I suspect that's considerably more
efficient than always converting first to UTF-16 and then to "Modified
UTF-8".
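(For reference, Java really does emit that two-byte sequence for a null.
The sketch below -- class and method names are my own invention -- runs
DataOutputStream.writeUTF, the routine behind Java's serialized strings,
over U+0000 and dumps the bytes.)

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Null {
    // Capture what writeUTF produces for an arbitrary string.
    static byte[] writeUtfBytes(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.toByteArray();
        } catch (IOException e) {  // can't happen with an in-memory stream
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Two-byte length prefix (00 02), then U+0000 encoded in the
        // illegal two-byte form: C0 80.
        for (byte b : writeUtfBytes("\u0000")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();  // "00 02 C0 80"
    }
}
```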
It's also completely unnecessary, as you'll see from the patch below,
so I'm going to press ahead and make these XS ports of InputStream and
OutputStream work with legal UTF-8.
It would actually make a lot more sense for Plucene if the integer at
the head of a string measured *bytes* instead of either Unicode code
points or Java chars. Then it's just a straight up copy! No scanning
OR decoding required.
(Hmm... I wonder if there's a way to make Lucene work quickly if the
VInt were redefined to be "length in bytes"...)
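To illustrate what that would buy us -- a sketch only, where readString
and the in-memory "file" are my stand-ins, not actual Lucene API -- with a
byte count up front, the Java side could slurp the bytes in one go and
hand them to the standard decoder, no per-character loop required:

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthString {
    // Hypothetical read path if the VInt counted bytes: no scanning,
    // no hand-rolled decode -- a bulk copy plus one library call.
    static String readString(byte[] file, int offset, int byteLen) {
        return new String(file, offset, byteLen, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Fake "index file": 5 bytes of legal UTF-8 ("caf" + 2-byte e-acute).
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(readString(data, 0, data.length));  // prints "café"
    }
}
```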
Speaking of which, the Lucene file formats document also says this:

    "Lucene writes strings as a VInt representing the length, followed
    by the character data."
The ambiguity of the word "length" in this sentence left me scratching
my head. Length in bytes or length in UTF-8 characters? Of course
the real answer is... neither. :\
It's length in Java chars, or, if you prefer to further Sun's
disinformation campaign, ;) "Modified UTF-8 characters". If the Lucene
docs had stated "Java chars" explicitly, I would have had a better idea
about why the value of that VInt is what it is -- a Java-specific
quirk at odds with a widely-accepted standard -- and about what
it was going to take to adhere to the spec.
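The difference shows up as soon as a string leaves the BMP. A small
demonstration (the class name is mine) using U+1D50A, which Java stores
as a surrogate pair and UTF-8 encodes in four bytes:

```java
import java.nio.charset.StandardCharsets;

public class LengthSemantics {
    public static void main(String[] args) {
        // "a" followed by U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G),
        // written here as the surrogate pair D835 DD0A.
        String s = "a\uD835\uDD0A";
        int javaChars  = s.length();                                // 3
        int codePoints = s.codePointCount(0, s.length());           // 2
        int utf8Bytes  = s.getBytes(StandardCharsets.UTF_8).length; // 5
        // Lucene's VInt records the first figure; the other two are
        // what "length" might plausibly mean to a non-Java reader.
        System.out.println(javaChars + " " + codePoints + " " + utf8Bytes);
    }
}
```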
> I'd need to look at the code more, but using something other than the
> Java serialized format would probably incur a performance penalty for
> the Java implementation. Or at least make it harder to handle the
> strings using the standard Java serialization support.
I believe that the following true-UTF-8 replacement for the
readChars function is at least as fast as the current implementation,
unless your text contains characters outside the BMP. It's incomplete,
because my Java expertise is quite limited, but it should be
conceptually sound. The algo is adapted from C code supplied by the
Unicode consortium.
http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
static final byte[] TRAILING_BYTES_FOR_UTF8 = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
public final void readChars(char[] buffer, int start, int length)
    throws IOException {
  int end = start + length;                // No longer a final int.
  for (int i = start; i < end; i++) {
    int b = readByte();                    // NOTE: b changed from byte to int.
    switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
      case 0:                              // single byte: ASCII
        buffer[i] = (char)(b & 0x7F);
        break;
      case 1:                              // 2-byte sequence
        buffer[i] = (char)(((b & 0x1F) << 6)
                          | (readByte() & 0x3F));
        break;
      case 2:                              // 3-byte sequence
        buffer[i] = (char)(((b & 0x0F) << 12)
                          | ((readByte() & 0x3F) << 6)
                          | (readByte() & 0x3F));
        break;
      case 3:                              // 4-byte sequence: above the BMP
        int utf32 = (((b & 0x07) << 18)    // mask is 0x07: lead is 11110xxx
                    | ((readByte() & 0x3F) << 12)
                    | ((readByte() & 0x3F) << 6)
                    | (readByte() & 0x3F));
        // These are just for illustration.
        int firstSurrogate = (utf32 >> 10) + 0xD7C0;
        int secondSurrogate = (utf32 & 0x03FF) + 0xDC00;
        // If the current buffer isn't long enough, create a new buffer
        // with length one greater than the current buffer, copy the
        // entire contents, enter the first surrogate, increment both i
        // and end, then enter the second surrogate.  This is extremely
        // inefficient, but also likely to be invoked extremely rarely.
        // Problem: In Perl I'd do this with references, and in C I'd
        // do it with pointers.  Not sure how to make it work in Java.
        break;
    }
  }
}
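As a sanity check on the case-3 arithmetic (those two magic offsets,
0xD7C0 and 0xDC00), the computed pair can be compared against Java's own
Character.toChars for any supplementary code point:

```java
public class SurrogateCheck {
    public static void main(String[] args) {
        int utf32 = 0x1D50A;  // any code point above the BMP will do
        // The shortcut arithmetic from readChars...
        char hi = (char)((utf32 >> 10) + 0xD7C0);
        char lo = (char)((utf32 & 0x03FF) + 0xDC00);
        // ...versus the library's answer.
        char[] expected = Character.toChars(utf32);
        System.out.printf("%04X %04X vs %04X %04X%n",
                (int) hi, (int) lo, (int) expected[0], (int) expected[1]);
        // prints "D835 DD0A vs D835 DD0A"
    }
}
```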
Initial benchmarking experiments appear to indicate negligible impact
on performance.
> So I doubt
> this would be a slam-dunk in the Lucene community.
I appreciate your willingness to at least weigh the matter, and I
understand the potential reluctance. Hopefully the comparable
performance of the standards-compliant code above will render the issue
moot, and the next release of Lucene will use legal UTF-8.
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
================================================================
From: Ken Krugler <[EMAIL PROTECTED]>
Date: August 27, 2005 2:11:34 PM PDT
To: java-user@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org
> I've delved into the matter of Lucene and UTF-8 a little further,
> and I am discouraged by what I believe I've uncovered.
> Lucene should not be advertising that it uses "standard UTF-8" --
> or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
Unfortunately this is how Sun documents the format they use for
serialized strings.
> The two distinguishing characteristics of "Modified UTF-8" are the
> treatment of codepoints above the BMP (which are written as
> surrogate pairs), and the encoding of null bytes as 1100 0000 1000
> 0000 rather than 0000 0000. Both of these became illegal as of
> Unicode 3.1 (IIRC), because they are not shortest-form and
> non-shortest-form UTF-8 presents a security risk.
For UTF-8 these were always invalid, but the standard wasn't very
clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs
encouraged some sloppy implementations.
> The documentation should really state that Lucene stores strings in
> a Java-only adulteration of UTF-8,
Yes, good point. I don't know who's in charge of that page, but it
should be fixed.
> unsuitable for interchange.
Other than as an internal representation for Java serialization.
> Since Perl uses true shortest-form UTF-8 as its native encoding,
> Plucene would have to jump through two efficiency-killing hoops in
> order to write files that would not choke Lucene: instead of
> writing out its true, legal UTF-8 directly, it would be necessary
> to first translate to UTF-16, then duplicate the Lucene encoding
> algorithm from OutputStream. In theory.
Actually I don't think it would be all that bad. Since a null in the
middle of a string is rare, as is a character outside of the BMP, a
quick scan of the text should be sufficient to determine if it can be
written as-is.
The ICU project has C code that can be used to quickly walk a string.
I believe these would find/report such invalid code points, if you
use the safe (versus faster unsafe) versions.
> Below you will find a simple Perl script which illustrates what
> happens when Perl encounters malformed UTF-8. Run it (you need
> Perl 5.8 or higher) and you will see why, even if I thought it was a
> good idea to emulate the Java hack for encoding "Modified UTF-8",
> trying to make it work in practice would be a nightmare.
> If Plucene were to write legal UTF-8 strings to its index files,
> Java Lucene would misbehave and possibly blow up any time a string
> contained either a 4-byte character or a null byte. On the flip
> side, Perl will spew warnings like crazy and possibly blow up
> whenever it encounters a Lucene-encoded null or surrogate pair.
> The potential blowups are due to the fact that Lucene and Plucene
> will not agree on how many characters a string contains, resulting
> in overruns or underruns.
> I am hoping that the answer to this will be a fix to the encoding
> mechanism in Lucene so that it really does use legal UTF-8. The
> most efficient way to go about this has not yet presented itself.
I'd need to look at the code more, but using something other than the
Java serialized format would probably incur a performance penalty for
the Java implementation. Or at least make it harder to handle the
strings using the standard Java serialization support. So I doubt
this would be a slam-dunk in the Lucene community.
-- Ken
#----------------------------------------
#!/usr/bin/perl
use strict;
use warnings;
# illegal_null.plx -- Perl complains about non-shortest-form null.
my $data = "foo\xC0\x80\n";
open (my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]