jorisvandenbossche commented on issue #41388:
URL: https://github.com/apache/arrow/issues/41388#issuecomment-2088180177

   While certainly confusing, this is actually somewhat intentional behaviour 
AFAIK.
   
   NumPy explicitly describes their `S` dtype as "zero-terminated bytes" 
([docs](https://numpy.org/doc/stable/reference/arrays.dtypes.html#specifying-and-constructing-data-types)).
 Although those dtypes are fixed-width, NumPy effectively supports 
variable-length bytes/strings by padding shorter values with null bytes. The 
more recent documentation explains this better 
(https://numpy.org/devdocs/user/basics.types.html#data-types-for-strings-and-bytes):
   
   > If we specify a shorter or longer data type, the string is either 
truncated or zero-padded to fit in the specified width
   
   As an example, consider this simple array:
   
   ```
   >>> import numpy as np
   >>> import pyarrow as pa
   >>> np_arr = np.array([b"short", b"longer one"])
   >>> np_arr
   array([b'short', b'longer one'], dtype='|S10')
   
   >>> arr = pa.array(np_arr)
   >>> arr
   <pyarrow.lib.BinaryArray object at 0x7fba6640a8c0>
   [
     73686F7274,
     6C6F6E676572206F6E65
   ]
   >>> arr.to_pylist()
   [b'short', b'longer one']
   ```
   
   If we did not discard the null bytes, the padding would end up as actual 
null bytes in the data:
   
   ```
   >>> arr = pa.array(np_arr, pa.binary(10))
   >>> arr.to_pylist()
   [b'short\x00\x00\x00\x00\x00', b'longer one']
   ```
   
   As far as I understand, that is also the reason why we still default to our 
variable-sized string/binary types when converting from NumPy's fixed-width 
dtypes.  
   And as you noted, a current workaround is to explicitly request a 
fixed-size binary type when converting the NumPy array.
   
   Now, the way we drop those padding null bytes is by using `strnlen`, which 
I think essentially means we search for the first null byte to determine the 
length of the value:
   
   
https://github.com/apache/arrow/blob/22f88fa4a8f5ac7250f1845aace5a78d20006ef2/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L562-L566
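
To make the effect of a first-null-byte search concrete, here is a minimal pure-Python sketch. The helper `strnlen_like` is hypothetical and only mimics the semantics of C's `strnlen`; it is not the actual Arrow code at the link above.

```python
def strnlen_like(buf: bytes, maxlen: int) -> int:
    """Mimic C strnlen: offset of the first null byte, capped at maxlen."""
    idx = buf.find(b"\x00", 0, maxlen)
    return maxlen if idx == -1 else idx


itemsize = 10  # width of the example `S10` dtype

# A short value, zero-padded to the fixed width.
padded = b"short".ljust(itemsize, b"\x00")
# A value with a null byte in the middle, also zero-padded.
embedded = b"ab\x00cd".ljust(itemsize, b"\x00")

print(padded[:strnlen_like(padded, itemsize)])      # b'short'
print(embedded[:strnlen_like(embedded, itemsize)])  # b'ab'
```

The padding is removed as intended in the first case, but in the second case everything after the embedded null byte is lost as well.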
   
   So I assume we could change this to only drop actual _trailing_ null 
bytes, instead of truncating the value at the first null byte we encounter (I 
assume numpy does this, since it prints the null byte in the middle just fine?)
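
As a rough illustration of that alternative, the sketch below strips only trailing null bytes. The helper name `strip_trailing_nulls` is made up for this example, and the logic is plain Python on `bytes`, not a proposed patch to the C++ conversion code.

```python
def strip_trailing_nulls(buf: bytes) -> bytes:
    """Drop zero padding at the end, keeping any embedded null bytes."""
    return buf.rstrip(b"\x00")


itemsize = 10  # width of the example `S10` dtype

padded = b"short".ljust(itemsize, b"\x00")
embedded = b"ab\x00cd".ljust(itemsize, b"\x00")

print(strip_trailing_nulls(padded))    # b'short'
print(strip_trailing_nulls(embedded))  # b'ab\x00cd'
```

One limitation remains with this approach: a value that genuinely ends in null bytes cannot be distinguished from padding, so only a fixed-size binary type preserves such values exactly.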
   

