From: "Dan Sugalski" <[EMAIL PROTECTED]>
> >Agreed. I'll probably have the encoding structure provide the terminating
> >bytes. As a side note don't we also have to split UTF-16 into UTF-16BE and
> >UTF-16LE (big endian and little endian)?
>
> I think UTF-16 can be a single encoding. The little/big endian issue can be
> dealt with by an I/O filter.

Will an I/O filter have an opportunity to inject itself when we mmap a file? It was because you said you wanted this capability that I thought we were maintaining the serialized forms of the Unicode encodings. Otherwise, I would be highly tempted to convert the internal representation of all Unicode strings into an array of 4-byte ints (this allows for much faster processing, since every code point then has a fixed width and can be indexed directly).

David
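
To make the trade-off concrete, here is a minimal sketch in Python of what the filter step might do: resolve the UTF-16 byte order (from a byte-order mark if present, otherwise from a caller-supplied default) and expand the serialized bytes into an array of 4-byte code points. The function name `utf16_to_code_points` and the `default_order` parameter are hypothetical, chosen only for illustration; this is not the proposed Perl implementation.

```python
import array
import codecs

# Pick an unsigned array typecode that is exactly 4 bytes wide on this platform.
_UINT32 = next(tc for tc in "IL" if array.array(tc).itemsize == 4)


def utf16_to_code_points(raw: bytes, default_order: str = "big") -> array.array:
    """Decode serialized UTF-16 bytes into fixed-width 4-byte code points.

    Hypothetical helper: byte order comes from a leading BOM when present,
    otherwise from `default_order` ("big" or "little").
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        codec, raw = "utf-16-be", raw[2:]
    elif raw.startswith(codecs.BOM_UTF16_LE):
        codec, raw = "utf-16-le", raw[2:]
    else:
        codec = "utf-16-be" if default_order == "big" else "utf-16-le"

    # Decoding combines surrogate pairs, so each element is one code point.
    text = raw.decode(codec)
    return array.array(_UINT32, (ord(ch) for ch in text))


# The same two code points serialized in both byte orders.
be = codecs.BOM_UTF16_BE + "A\U0001F600".encode("utf-16-be")
le = codecs.BOM_UTF16_LE + "A\U0001F600".encode("utf-16-le")
print(list(utf16_to_code_points(be)) == list(utf16_to_code_points(le)))
```

Note that this conversion copies the data. An mmap'd file exposes the raw serialized bytes, so a fixed-width internal form would need a pass like this before the string could be used, which is the cost being weighed against keeping the serialized form.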