Travis E. Oliphant wrote:
> Numpy supports arrays of arbitrary fixed-length "records". It is
> much more than numeric-only data now. One of the fields that a
> record can contain is a string. If strings are supported, it makes
> sense to support unicode strings as well.
Hmm. How do you support strings in fixed-length records? Strings are
variable-sized, after all.

One common application is that you have a C struct in some API with a
fixed-size array for string data (either with a length field, or
null-terminated); in this case, it is moderately useful to be able to
model such a struct in Python. However, transferring this to Unicode
is pointless - there aren't any similar Unicode structs that need
support.

> This allows NumPy to memory-map arbitrary data-files on disk.

Ok, so this is the "C struct" case. Then why do you need Unicode
support there? Which common file format has embedded fixed-size
Unicode data?

> Perhaps you should explain why you think NumPy "shouldn't support
> Unicode"

I think I said "Unicode arrays", not Unicode. Unicode arrays are a
pointless data type, IMO. Unicode always comes in strings (i.e.
variable-sized, either null-terminated or with a preceding length).
On disk and on the wire, Unicode comes as UTF-8 more often than not.
Using UCS-2/UCS-4 as an on-disk representation is also questionable
practice (although admittedly Microsoft uses it a lot).

> That is currently what is done. The current unicode data-type is
> exactly what Python uses.

Then I wonder how this goes along with the "memory-map arbitrary
data-files" use case.

> The chararray subclass gives to unicode and string arrays all the
> methods of unicode and strings (operating on an element-by-element
> basis).

For strings, I can see use cases (although I wonder how you deal with
data formats that also support variable-sized strings, as most data
formats supporting strings do).

> Please explain why having zero of them is *sufficient*.

Because I (still) cannot imagine any specific application that might
need such a feature (IOW: YAGNI).

>> If the purpose is to support arbitrary Unicode characters, it
>> should use 4 bytes (as two bytes are insufficient to represent
>> arbitrary Unicode characters).
>
> And Python does not support arbitrary Unicode characters on narrow
> builds? Then how is \U0010FFFF represented?

It's represented using UTF-16, as a surrogate pair. Try this for
yourself on a narrow build:

py> len(u"\U0010FFFF")
2
py> u"\U0010FFFF"[0]
u'\udbff'
py> u"\U0010FFFF"[1]
u'\udfff'

This has all kinds of non-obvious implications.

> The purpose is to represent bytes as they might exist in a file or
> data-stream according to the user's specification.

See, and this is precisely the statement that I challenge. Sure, they
"might" exist - but I'd rather expect that they don't. If they do
exist, the "Unicode" will come as variable-sized UTF-8, UTF-16, or
UTF-32. In any of those cases, NumPy should already support it by
mapping a string object onto the encoded bytes, to which you can then
apply .decode() should you need to process the actual Unicode data.

> The purpose is whatever the user wants them for. It's the same
> purpose as having an unsigned 64-bit data-type --- because users
> may need it to represent data as it exists in a file.

No. I would expect you have 64-bit longs because users *do* need
them, and because there wouldn't be an easy work-around if they
didn't have them. For Unicode, it's different: users don't directly
need Unicode arrays (at least, not many users do), and if they do,
there is an easy work-around for their absence.

Say I want to process NTFS run lists. In NTFS run lists, there are
24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles).
Can I represent them all in NumPy? Can I have NumPy transparently map
a sequence of run list records (which are variable-sized) as an array
of run list records?
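To make that concrete, here is roughly what decoding a run list
involves (plain Python rather than NumPy; the field layout follows
the usual NTFS documentation, but the function and its names are my
own sketch):

def parse_run_list(data):
    # Each entry: one header byte whose low nibble gives the byte
    # width of the run-length field and whose high nibble gives the
    # byte width of the run-offset field; then the unsigned
    # little-endian run length; then the *signed* little-endian run
    # offset, relative to the previous run.  A zero header byte
    # terminates the list.
    runs = []
    pos = 0
    lcn = 0  # current logical cluster number
    while pos < len(data):
        header = ord(data[pos])
        pos += 1
        if header == 0:
            break
        len_size = header & 0x0F
        off_size = header >> 4
        # unsigned little-endian integer, len_size bytes wide
        run_len = 0
        for i in range(len_size):
            run_len |= ord(data[pos + i]) << (8 * i)
        pos += len_size
        # signed little-endian integer, off_size bytes wide
        run_off = 0
        for i in range(off_size):
            run_off |= ord(data[pos + i]) << (8 * i)
        if off_size and run_off >= 1 << (8 * off_size - 1):
            run_off -= 1 << (8 * off_size)  # sign-extend
        pos += off_size
        if off_size:
            lcn += run_off
            runs.append((lcn, run_len))
        else:
            runs.append((None, run_len))  # sparse run: no offset field
    return runs

py> parse_run_list("\x21\x18\x34\x56\x00")
[(22068, 24)]

The field widths are themselves part of the data, so no fixed-length
record dtype can describe this; you have to fall back to byte strings
and decode by hand - which is exactly the kind of work-around I'd
suggest for Unicode as well.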
Regards,
Martin