Martin v. Löwis wrote: > Travis E. Oliphant wrote: > >>Currently that means that they are "unicode" strings of basic size UCS2 >>or UCS4 depending on the platform. It is this duality that has some >>people concerned. For all other data-types, NumPy allows the user to >>explicitly request a bit-width for the data-type. > > > Why is that a desirable property? Also: Why does have NumPy support for > Unicode arrays in the first place? >
Numpy supports arrays of arbitrary fixed-length "records". It is much more than numeric-only data now. One of the fields that a record can contain is a string. If strings are supported, it makes sense to support unicode strings as well. This allows NumPy to memory-map arbitrary data-files on disk. Perhaps you should explain why you think NumPy "shouldn't support Unicode" > > My initial reaction is: use whatever Python uses in "NumPy Unicode". > Upon closer inspection, it is not all that clear what operations > are supported on a Unicode array, and how these operations relate > to the Python Unicode type. That is currently what is done. The current unicode data-type is exactly what Python uses. The chararray subclass gives to unicode and string arrays all the methods of unicode and strings (operating on an element-by-element basis). When you extract an element from the unicode data-type you get a Python unicode object (every NumPy data-type has a corresponding "type-object" that determines what is returned when an element is extracted). All of these types are in a hierarchy of data-types which inherit from the basic Python types when available. > > In any case, I think NumPy should have only a single "Unicode array" > type (please do explain why having zero of them is insufficient). > Please explain why having zero of them is *sufficient*. > If the purpose of the type is to interoperate with a Python > unicode object, it should use the same width (as this will > allow for mempcy). > > If the purpose is to support arbitrary Unicode characters, it should > use 4 bytes (as two bytes are insufficient to represent arbitrary > Unicode characters). And Python does not support arbitrary Unicode characters on narrow builds? Then how is \U0010FFFF represented? > > If the purpose is something else, please explain what the purpose > is. The purpose is to represent bytes as they might exist in a file or data-stream according to the users specification. The purpose is whatever the user wants them for. It's the same purpose as having an unsigned 64-bit data-type --- because users may need it to represent data as it exists in a file. _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com