Re: Lucene does NOT use UTF-8.

Doug Cutting Tue, 30 Aug 2005 09:50:46 -0700

[EMAIL PROTECTED] wrote:

How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact.

I spoke a bit too soon. I should have looked at the code first. You're right, I don't think it would require more allocations.

When considering this byte-count versus character-count issue, please note that it also arises elsewhere. The PrefixLength in the Term Dictionary section of the file format document is currently defined as a number of characters, not bytes.


http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary

Implementing this in terms of bytes may have performance implications, since, at first glance, the entire byte sequence would need to be converted from UTF-8 into the internal string representation for each term, rather than just the suffix. Does anyone see a way around that?
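
To make the character-count versus byte-count distinction concrete, here is a minimal sketch in Java. It is not Lucene code: the class and method names are hypothetical, and it assumes standard UTF-8 via java.nio.charset.StandardCharsets rather than Lucene's own reading and writing routines. It only shows that the two prefix lengths can differ for the same pair of terms, and that a byte-level prefix can end in the middle of a multi-byte character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
public class PrefixLengthSketch {

    // Number of leading Java chars (UTF-16 code units) shared by two terms.
    public static int sharedPrefixChars(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Number of leading bytes shared by the standard UTF-8 encodings of two terms.
    public static int sharedPrefixBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        int i = 0;
        while (i < n && x[i] == y[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // "naïve" and "naïf": the shared prefix "naï" is 3 chars but 4 UTF-8 bytes.
        System.out.println(sharedPrefixChars("na\u00EFve", "na\u00EFf")); // 3
        System.out.println(sharedPrefixBytes("na\u00EFve", "na\u00EFf")); // 4

        // "é" (C3 A9) and "ï" (C3 AF) share no chars but do share their first byte,
        // so a byte-level prefix can split a character in two.
        System.out.println(sharedPrefixChars("\u00E9", "\u00EF")); // 0
        System.out.println(sharedPrefixBytes("\u00E9", "\u00EF")); // 1
    }
}
```

The second pair shows why a suffix stored after a byte-level prefix is not always decodable on its own: it may begin with a continuation byte.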

As for how we got to this point: I wrote Lucene's UTF-8 reading and writing code in 1998, back when Unicode still had fewer than 2^16 characters. It's surprising that it has lasted this long without anyone noticing!


Doug

