I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue.... anyone raising a hand?
Otis --- Marvin Humphrey <[EMAIL PROTECTED]> wrote: > Ken Krugler sent a reply to the user list. In an effort to keep all > the developers informed, I'm sending my reply to the developer list > and including his entire original post below my sig. > > Ken writes... > > > Since a null in the > > middle of a string is rare, as is a character outside of the BMP, > a > > quick scan of the text should be sufficient to determine if it can > be > > written as-is. > > Let's see. I think we are looking at two scans, (one index(), one > regex), or a regex that uses alternation. I strongly suspect two > scans > are faster. > > if ( (index($string, "\xC0\x80") != -1) > or ($string =~ /[\F0-\xF7]/ ) # only exists in 4-byte UTF-8 > ) { > # Process string... > } > > That would tell us whether the string needed to be specially encoded > for > Java's sake on output. Yes, I suspect that's considerably more > efficient than always converting first to UTF-16 and then to > "Modified > UTF-8". > > It's also completely unnecessary, as you'll see from the patch below, > so I'm going to press ahead and make these XS ports of InputStream > and > OutputStream work with legal UTF-8. > > It would actually make a lot more sense for Plucene if the integer at > the head of a string measured *bytes* instead of either Unicode code > points or Java chars. Then it's just a straight up copy! No > scanning > OR decoding required. > > (Hmm... I wonder if there's a way to make Lucene work quickly if the > VInt were redefined to be "length in bytes"...) > > Speaking of which, the Lucene file formats document also says this... > > "Lucene writes strings as a VInt representing the length, > followed by > the character data." > > The ambiguity of the word "length" in this sentence left me > scratching > my head. Length in bytes or length in UTF-8 characters? Of course > the real answer is... neither. :\ > > It's length in Java chars, or, if you prefer to further Sun's > disinformation campaign, ;) "Modified UTF-8 characters". If the > Lucene > docs had stated "Java chars" explicitly, I would have had a better > idea > about why the value of that VInt is what it is -- a Java-specific > quirk at odds with a widely-accepted standard -- and about what > it was going to take to adhere to the spec. > > > I'd need to look at the code more, but using something other than > the > > Java serialized format would probably incur a performance penalty > for > > the Java implementation. Or at least make it harder to handle the > > strings using the standard Java serialization support. > > I believe that the following true-UTF-8 replacement for the > readChars function is at least as fast as the current implementation, > unless your text contains characters outside the BMP. It's > incomplete, > because my Java expertise is quite limited, but it should be > conceptually sound. The algo is adapted from C code supplied by the > Unicode consortium. > > http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c > > static final byte[] TRAILING_BYTES_FOR_UTF8 = { > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, > 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, > 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, > 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, > 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 > }; > > public final void readChars(char[] buffer, int start, int length) > throws IOException { > int end = start + length; // No longer a final int. > for (int i = start; i < end; i++) { > int b = readByte(); // NOTE: b changed from byte to int. > switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) { > case 0: > buffer[i] = (char)(b & 0x7F); > break; > case 1: > buffer[i] = (char)(((b & 0x1F) << 6) > | (readByte() & 0x3F)); > //buffer[i] = (char)(((b & 0x1F) << 6) > // | (readByte() & 0x3F)); > break; > case 2: > buffer[i] = (char)(((b & 0x0F) << 12) > | ((readByte() & 0x3F) << 6) > | (readByte() & 0x3F)); > break; > case 3: > int utf32 = (((b & 0x0F) << 18) > | ((readByte() & 0x3F) << 12) > | ((readByte() & 0x3F) << 6) > | (readByte() & 0x3F)); > // These are just for illustration. > int firstSurrogate = (utf32 >> 10) + 0xD7C0; > int secondSurrogate = (utf32 & 0x03FF) + 0xDC00; > // If the current buffer isn't long enough, > // create a new buffer with length one greater than > // the current buffer, copy the entire contents, > // enter the first surrogate, increment both i and end, > // enter the second surrogate. > // This is extremely inefficient, but also > // likely to be invoked extremely rarely. > // Problem: In Perl I'd do this with references, and > // in C I'd do it with pointers. Not sure how to > // make it work in Java. > break; > } > } > } > > > Initial benchmarking experiments appear to indicate negligible impact > on performance. > > > So I doubt > > this would be a slam-dunk in the Lucene community. > > I appreciate your willingness to at least weigth the matter, and I > understand the potential reluctance. Hopefully the comparable > performance of the standards-compliant code above will render the > issue > moot, and the next release of Lucene will use legal UTF-8. > > Best, > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > ================================================================ > > From: Ken Krugler <[EMAIL PROTECTED]> > Date: August 27, 2005 2:11:34 PM PDT > To: java-user@lucene.apache.org > Subject: Re: Lucene does NOT use UTF-8. > Reply-To: java-user@lucene.apache.org > > > > I've delved into the matter of Lucene and UTF-8 a little further, > > and I am discouraged by what I believe I've uncovered. > > > > Lucene should not be advertising that it uses "standard UTF-8" -- > > or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8. > > > > Unfortunately this is how Sun documents the format they use for > serialized strings. > > > > The two distinguishing characteristics of "Modified UTF-8" are the > > > treatment of codepoints above the BMP (which are written as > > surrogate pairs), and the encoding of null bytes as 1100 0000 1000 > > > 0000 rather than 0000 0000. Both of these became illegal as of > > Unicode 3.1 (IIRC), because they are not shortest-form and non- > > shortest-form UTF-8 presents a security risk. > > > > For UTF-8 these were always invalid, but the standard wasn't very > clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs > encouraged some sloppy implementations. > > > > The documentation should really state that Lucene stores strings in > > > a Java-only adulteration of UTF-8, > > > > Yes, good point. I don't know who's in charge of that page, but it > should be fixed. > > > > unsuitable for interchange. > > > > Other than as an internal representation for Java serialization. > > > > Since Perl uses true shortest-form UTF-8 as its native encoding, > > Plucene would have to jump through two efficiency-killing hoops in > > > order to write files that would not choke Lucene: instead of > > writing out its true, legal UTF-8 directly, it would be necessary > > to first translate to UTF-16, then duplicate the Lucene encoding > > algorithm from OutputStream. In theory. > > > > Actually I don't think it would be all that bad. Since a null in the > > middle of a string is rare, as is a character outside of the BMP, a > quick scan of the text should be sufficient to determine if it can be > > written as-is. > > The ICU project has C code that can be used to quickly walk a string. > > I believe these would find/report such invalid code points, if you > use the safe (versus faster unsafe) versions. > > > > Below you will find a simple Perl script which illustrates what > > happens when Perl encounters malformed UTF-8. Run it (you need > > Perl 5.8 or higher) and you will see why even if I thought it was a > > > good idea to emulate the Java hack for encoding "Modified UTF-8", > > trying to make it work in practice would be a nightmare. > > > > If Plucene were to write legal UTF-8 strings to its index files, > > Java Lucene would misbehave and possibly blow up any time a string > > > contained either a 4-byte character or a null byte. On the flip > > side, Perl will spew warnings like crazy and possibly blow up > > whenever it encounters a Lucene-encoded null or surrogate pair. > > The potential blowups are due to the fact that Lucene and Plucene > > will not agree on how many characters a string contains, resulting > > > in overruns or underruns. > > > > I am hoping that the answer to this will be a fix to the encoding > > mechanism in Lucene so that it really does use legal UTF-8. The > > most efficient way to go about this has not yet presented itself. > > > > I'd need to look at the code more, but using something other than the > > Java serialized format would probably incur a performance penalty for > > the Java implementation. Or at least make it harder to handle the > strings using the standard Java serialization support. So I doubt > this would be a slam-dunk in the Lucene community. > > -- Ken > > > > > #---------------------------------------- > > > > #!/usr/bin/perl > > use strict; > > use warnings; > > > > # illegal_null.plx -- Perl complains about non-shortest-form null. > > > > my $data = "foo\xC0\x80\n"; > > > > open (my $virtual_filehandle, "+<:utf8", \$data); > > print <$virtual_filehandle>; > > > > -- > Ken Krugler > TransPac Software, Inc. > <http://www.transpac.com> > +1 530-470-9200 > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]