Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing.
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.face...@gmail.com> wrote:

> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type. Instead, for encoded Unicode, the string could
> be truncated so that the encoding fits. Of course this is not completely
> trivial for variable-length encodings, but it should be doable, and it
> would allow UTF-8 to be used just the way it usually is - as an encoding
> that's almost 8-bit.

I agree with Anne here. Variable-length encoding would be great to have, but even fixed-length UTF-8 (fixed in memory usage, not in number of characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string with a fixed number of bytes, not a fixed number of characters. In fact, we already have this sort of distinction between string length and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

The only reason I see for supporting encodings other than UTF-8 is memory-mapping arrays already stored in those encodings, but that seems like a lot of extra trouble for little gain.
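To make Anne's truncation idea concrete, here is a minimal sketch in plain Python (the utf8_truncate helper is hypothetical, not an existing NumPy function): truncate the UTF-8 encoding at a codepoint boundary so it fits the element's byte width, then null-pad, just as np.string_ pads shorter bytestrings today.

    import numpy as np

    def utf8_truncate(s, nbytes):
        """Encode s as UTF-8 so it fits in nbytes, never splitting a codepoint.

        Hypothetical helper illustrating the proposed dtype's assignment rule.
        """
        raw = s.encode('utf-8')[:nbytes]
        # The byte-level slice may end mid-codepoint; decoding with
        # errors='ignore' drops any trailing partial sequence (the input
        # was valid UTF-8, so that is the only thing it can drop).
        fixed = raw.decode('utf-8', errors='ignore').encode('utf-8')
        return fixed.ljust(nbytes, b'\x00')  # null padding, as 'S' does

    # The element-size/string-length distinction already exists for 'S':
    a = np.array([b'ab'], dtype='S5')
    print(a.itemsize)  # 5 -- every element occupies 5 bytes
    print(a[0])        # b'ab' -- trailing nulls are stripped on read

    # 'na\u00efve' is 6 bytes in UTF-8 (\u00ef takes 2); slicing to 3 bytes
    # would cut that codepoint in half, so the helper drops it entirely:
    print(utf8_truncate('na\u00efve', 3))  # b'na\x00'
    print(utf8_truncate('na\u00efve', 4))  # b'na\xc3\xaf' -- fits exactly

The point of the sketch is that the fixed quantity is the element's byte width; how many characters fit in it varies per string, exactly as in files and network protocols that use UTF-8 today.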