Fwd: Standard or Modified UTF-8?

Marvin Humphrey Sat, 27 Aug 2005 07:04:35 -0700

Greets,

It was suggested that I move this to the developers list from the users list...


-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: August 26, 2005 4:51:27 PM PDT
To: [email protected]
Subject: Standard or Modified UTF-8?
Reply-To: [email protected]


Greets,

As part of my attempt to speed up Plucene and establish index compatibility between Plucene and Java Lucene, I'm porting InputStream and OutputStream to XS (the C API for accessing Perl's guts), and I believe I have found a documentation bug in the file-format spec at...


http://lucene.apache.org/java/docs/fileformats.html

"Lucene writes unicode character sequences using the standard UTF-8 encoding."

Snooping the code in OutputStream, it looks like you are writing modified UTF-8 -- NOT standard -- because a null byte is written using the two-byte form.


    else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
        writeByte((byte)(0xC0 | (code >> 6)));
        writeByte((byte)(0x80 | (code & 0x3F)));

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
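
To make the difference concrete, here is a minimal sketch that encodes U+0000 both ways using only the JDK. It is not the Lucene code: the class name Utf8NullDemo and its helper methods are hypothetical, and it relies on DataOutputStream.writeUTF, which is documented to emit modified UTF-8 after a two-byte length prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Lucene.
public class Utf8NullDemo {

    // Standard UTF-8: U+0000 is encoded as the single byte 0x00.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as produced by DataOutputStream.writeUTF.
    // writeUTF prepends a two-byte length, which is stripped here.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0"; // a one-character string holding U+0000
        System.out.println("standard: " + hex(standardUtf8(nul))); // standard: 00
        System.out.println("modified: " + hex(modifiedUtf8(nul))); // modified: C0 80
    }
}
```

The two-byte result C0 80 is what the branch quoted above yields for code == 0: 0xC0 | (0 >> 6) is 0xC0 and 0x80 | (0 & 0x3F) is 0x80. A strict standard UTF-8 decoder treats that sequence as an overlong encoding and rejects it.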

Can someone please confirm that the intention is to write modified UTF-8?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

