I asked my colleague your question about Unicode and bytes; this was his reply:
Unicode is 16 bits. UTF-8 needs 1 byte for a 7-bit character (ASCII), 2 bytes for an 11-bit character (including ISO-8859-1), and 3 bytes for a 16-bit character.

DaveS

Joanne

Dmitry Serebrennikov (11/10/2001 18:44):
>I figured that I might as well be adding comments as I am reading and
>figuring out the code.
>One thing I was not clear on - characters are stored with 1 to 3 bytes.
>Is that sufficient to represent all Unicode characters? I thought
>Unicode was four bytes.
>
>Index: InputStream.java
>===================================================================
>RCS file:
>/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/store/InputStream.java,v
>retrieving revision 1.1.1.1
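As a rough check on the byte counts in the reply, here is a minimal Java sketch. The class name Utf8Length and the method utf8Length are made up for illustration and are not taken from the Lucene sources. It only covers a single 16-bit Java char, which is the range the reply describes; code points above U+FFFF are held in Java as two chars (a surrogate pair) and are outside this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
class Utf8Length {

    // Bytes needed for one 16-bit char, using the thresholds from the reply:
    // up to 7 bits -> 1 byte, up to 11 bits -> 2 bytes, up to 16 bits -> 3 bytes.
    static int utf8Length(char c) {
        if (c < 0x80) {
            return 1;
        }
        if (c < 0x800) {
            return 2;
        }
        return 3;
    }

    public static void main(String[] args) {
        // 'A' is ASCII, U+00E9 is in ISO-8859-1, U+20AC (euro sign) needs more than 11 bits.
        char[] samples = { 'A', '\u00E9', '\u20AC' };
        for (char c : samples) {
            // Compare the predicted length with what the JDK's UTF-8 encoder produces.
            int encoded = String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: predicted %d, encoded %d%n",
                    (int) c, utf8Length(c), encoded);
        }
    }
}
```

For these three sample characters the predicted and encoded lengths agree (1, 2 and 3 bytes respectively).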