>Unicode is 16 bits. UTF-8 needs 1 byte for a 7-bit character (ASCII),
>2 bytes for an 11-bit character (including ISO-8859-1), and 3 bytes for
>a 16-bit character.
This is partly true. Unicode itself is encoding-independent. I believe the code space is currently defined as having up to 2^31 positions, although the current plan is for somewhere between 2^20 and 2^21 characters. (2^16 characters was the old Unicode limit - dropped when someone pointed out that Chinese alone has more than 2^16 characters.)

Unicode needs to be encoded somehow as a sequence of words. UTF-8 encodes it as sequences of 8-bit words - 1, 2, 3, or 4 per character, depending on the character. UTF-16 encodes it as a sequence of 16-bit words: 1 or 2 per character. UTF-32 encodes it as a sequence of 32-bit words, always 1 per character. UTF-8 is the most common encoding. It handles ASCII in a single byte and the rest of ISO-Latin-1 in two.

Unicode is cool - if you want to learn more, see

http://www.unicode.org/
http://www.unicode.org/unicode/faq/utf_bom.html

I'm a bit confused about this discussion, though: Java does a great job of hiding character encodings from you. Is Lucene turning byte arrays into character arrays somewhere?

[EMAIL PROTECTED] . . . . . . . . http://www.media.mit.edu/~nelson/
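Since the question is about Java and byte/char conversions, here is a minimal sketch of what I mean - plain JDK, nothing Lucene-specific, and the class name EncodingDemo is just for illustration. It shows how the same three characters take different numbers of bytes under different encodings, and why turning bytes back into chars needs an explicit charset:

    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            // 'A' is ASCII, '\u00e9' is ISO-Latin-1, '\u4e2d' is a CJK ideograph.
            String text = "A\u00e9\u4e2d";

            // Same three characters, different byte counts per encoding:
            System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 6  (1 + 2 + 3 bytes)
            System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length);   // 6  (2 + 2 + 2 bytes)
            System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 3  (the CJK char is unmappable, becomes '?')

            // Going from bytes back to chars only works if you name the encoding;
            // otherwise Java silently uses the platform default charset.
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            String decoded = new String(utf8, StandardCharsets.UTF_8);
            System.out.println(decoded.equals(text));                              // true
        }
    }

If Lucene is doing that byte-array-to-char-array step anywhere without naming the charset, that would be the place to look.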