On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano <st...@pearwood.info> writes:
> >> Maybe there's a use-case for a microcontroller that works in
> >> ISO-8859-5 natively, thus using only eight bits per character,
> > That won't even make the Russians happy, since in Russia there
> > are multiple incompatible legacy encodings.
> I've never understood why not use UTF-8 for everything.
If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation. Some of us slice
strings for a living. ;-) I understand that using UTF-32 would allow
us to maintain O(1) indexing at the cost of every string occupying 4
bytes per character. The FSR (again, as I understand it) allows
strings that fit in one-byte-per-character to use that, scaling up to
use wider characters internally as they're actually needed/used.
At the cost of complexity and non-constant memory space, an O(N)
algorithm could be tweaked down to O(log N) by using an internal
balanced tree of offsets-to-chunks (where the chunk-size was the size
of a block where it was faster to scan linearly than to navigate the
tree). One might even endow the algorithm with FSR smarts, so each
chunk/fragment could be a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded