On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Michael Torrie <torr...@gmail.com>:
>
>> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>>> valid index is the whole UTF-8 character at that point, and an
>>> invalid index raises an exception.
>>
>> This seems to me to be a leaky abstraction.
>
> It may be that Python's Unicode abstraction is an untenable illusion
> because the underlying reality is 8-bit and there's no way to hide it
> completely.
The underlying reality is 1-bit. Or maybe the underlying reality is
actually electrical signals that don't even have a clear definition of
"bits" and bounce between two states for a few fractions of a second
before settling. And maybe someone's implementing Python on the George
Banks Kite CPU, which consists of two cents' worth of paper and string,
on which text is actually represented by glyphs. They're all equally
valid notions of "underlying reality".

Text is an abstract concept, just as numbers are. You fundamentally
cannot represent the notion of "three" in a computer; what you'll
generally do is encode it in some way. C does this by encoding it in a
machine word, then storing the machine word in memory, either least
significant byte lowest in memory or the other way around.
Congratulations, C! You've already made two conflicting encodings for
integers, and you still have to predeclare a maximum representable
value. (There's a two-line struct demo of those byte orders in the
postscript.)

If you go for arbitrary-precision integers, there are a whole lot more
ways to encode them. GMP has a bunch of tweakables like "number of nail
bits"; or you can go for a simple variable-length integer that has
seven bits of payload per byte and sets the high bit if there are more
bytes to read (and again, you have to figure out whether that's
little-endian or big-endian); or you can go for a more complex scheme.
(The simple varint is sketched in the postscript too.)

Python's Unicode abstraction *never* leaks information about how text
is stored in memory [1] [2]; a Unicode string in Python consists of a
series of codepoints in a well-defined order. This is exactly what you
would expect of a system in which codepoints are fundamental objects
that can truly be represented directly; if you can prove, from within
Python, that the interpreter uses bytes to represent text, I'd be
extremely surprised.

ChrisA

[1] Not since 3.3, at least. 2.7 narrow builds (e.g. on Windows) can
leak the UTF-16 level, but not further than that.

[2] Well, you might be able to figure stuff out based on timings. Only
in cryptography have I ever heard performance treated as a leak.
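PS. For anyone who wants to play along at home, here are a few
illustrative sketches (Python 3, standard library only; all the names
are mine, nothing official). First, C's two conflicting byte orders
for the same integer, via the struct module:

import struct

n = 258  # 0x0102
little = struct.pack('<i', n)  # b'\x02\x01\x00\x00' - LSB lowest in memory
big = struct.pack('>i', n)     # b'\x00\x00\x01\x02' - MSB lowest in memory
assert little != big  # two encodings of the same abstract number
assert struct.unpack('<i', little)[0] == struct.unpack('>i', big)[0] == n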
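Second, the simple varint scheme described above: seven payload bits
per byte, high bit set on every byte except the last, least significant
group first. (encode_varint/decode_varint are just names I made up; as
it happens, this is the same encoding protobuf uses.)

def encode_varint(n):
    """Encode a non-negative integer, seven payload bits per byte."""
    if n < 0:
        raise ValueError("non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes to read
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(data):
    """Inverse of encode_varint; raises on truncated input."""
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            return n
    raise ValueError("truncated varint")

assert encode_varint(300) == b'\xac\x02'
assert decode_varint(b'\xac\x02') == 300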
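And finally, the codepoint abstraction itself. On CPython 3.3+ this
behaves identically however the interpreter stores the string
internally (PEP 393 picks 1-, 2- or 4-byte storage units, and you
can't tell which from the language), whereas a 2.7 narrow build, per
footnote [1], would report a length of 4 here:

s = 'a\U0001F600b'  # 'a', U+1F600 GRINNING FACE (outside the BMP), 'b'
assert len(s) == 3           # three codepoints, whatever the storage
assert s[1] == '\U0001F600'  # indexing yields a whole codepoint, never
                             # half a surrogate pair or a UTF-8 byte
assert s[2] == 'b'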