jorisvandenbossche commented on issue #41388: URL: https://github.com/apache/arrow/issues/41388#issuecomment-2088180177
While certainly confusing, this is actually somewhat intentional behaviour AFAIK. NumPy explicitly describes their `S` dtype as "zero-terminated bytes" ([docs](https://numpy.org/doc/stable/reference/arrays.dtypes.html#specifying-and-constructing-data-types)). Although those dtypes are fixed-width, numpy effectively supports variable-length bytes/strings by padding shorter values with null bytes. Their more recent documentation explains this better (https://numpy.org/devdocs/user/basics.types.html#data-types-for-strings-and-bytes):

> If we specify a shorter or longer data type, the string is either truncated or zero-padded to fit in the specified width

As an example, consider this simple array:

```
>>> import numpy as np
>>> import pyarrow as pa
>>> np_arr = np.array([b"short", b"longer one"])
>>> np_arr
array([b'short', b'longer one'], dtype='|S10')
>>> arr = pa.array(np_arr)
>>> arr
<pyarrow.lib.BinaryArray object at 0x7fba6640a8c0>
[
  73686F7274,
  6C6F6E676572206F6E65
]
>>> arr.to_pylist()
[b'short', b'longer one']
```

If we did not discard null bytes, the padding would show up as actual null bytes in the data:

```
>>> arr = pa.array(np_arr, pa.binary(10))
>>> arr.to_pylist()
[b'short\x00\x00\x00\x00\x00', b'longer one']
```

As far as I understand, that is also the reason why we still default to our variable-sized string/binary types when converting from numpy's fixed-width dtypes. And as you noted, a current workaround is indeed to explicitly convert the numpy array to a fixed-size binary type.
Now, the way we drop those padded null bytes is by using `strnlen`, which I think essentially means we search for the first null byte to determine the length of the value: https://github.com/apache/arrow/blob/22f88fa4a8f5ac7250f1845aace5a78d20006ef2/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L562-L566

So I assume we could change this to only drop actual _trailing_ null bytes, instead of truncating the value at the first null byte we encounter (I assume numpy does this, since it prints the null byte in the middle just fine?)
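
To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. It only models a fixed-width, zero-padded slot of width 10 (like `S10`); it does not call NumPy or pyarrow, and the helper names `truncate_at_first_null` and `strip_trailing_nulls` are hypothetical, chosen for illustration rather than taken from the Arrow code base.

```python
WIDTH = 10  # assumed slot width, matching the S10 dtype in the example above


def pad_to_width(value: bytes, width: int = WIDTH) -> bytes:
    """Zero-pad a value on the right to model one fixed-width slot."""
    return value.ljust(width, b"\x00")


def truncate_at_first_null(slot: bytes) -> bytes:
    """Mimic a strnlen-style length: stop at the first null byte."""
    idx = slot.find(b"\x00")
    return slot if idx == -1 else slot[:idx]


def strip_trailing_nulls(slot: bytes) -> bytes:
    """Proposed alternative: drop only the null bytes at the end."""
    return slot.rstrip(b"\x00")


values = [b"short", b"ab\x00cd", b"longer one"]
slots = [pad_to_width(v) for v in values]

truncated = [truncate_at_first_null(s) for s in slots]
stripped = [strip_trailing_nulls(s) for s in slots]

print(truncated)  # the embedded null in the second value cuts it short
print(stripped)   # the embedded null survives, only padding is removed
```

In this model, stripping trailing nulls recovers all three original values, while truncating at the first null loses everything after the embedded null byte. Note that neither strategy can distinguish padding from a value that genuinely ends in null bytes; that information is not recoverable from a zero-padded slot.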
