[ 
https://issues.apache.org/jira/browse/ARROW-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671612#comment-16671612
 ] 

Antoine Pitrou commented on ARROW-3685:
---------------------------------------

This is because NumPy string arrays behave semantically as variable-sized: each 
value sits in a fixed-width slot, and an optional zero byte denotes the end of 
the string:
{code:python}
>>> a = np.array([b"foo", b"barz"], dtype='S')
>>> a
array([b'foo', b'barz'], dtype='|S4')
>>> pa.array(a)
<pyarrow.lib.BinaryArray object at 0x7f14f684c278>
[
  666F6F,
  6261727A
]
{code}
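
The padding can be seen from NumPy alone. The following is a minimal sketch 
(the variable names are illustrative only) showing that the slot is 4 bytes 
wide, the raw buffer holds a zero byte after "foo", and the values read back 
have different lengths:

```python
import numpy as np

a = np.array([b"foo", b"barz"], dtype="S")

# Every element occupies the same number of bytes in memory.
width = a.itemsize

# The raw buffer shows b"foo" padded with a zero byte up to the slot width.
raw = a.tobytes()

# On element access the padding is dropped, so the values differ in length.
lengths = [len(value) for value in a.tolist()]

print(width, raw, lengths)
```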

If you want a fixed-size binary array, you can pass the type explicitly:
{code:python}
>>> pa.array(a, type=pa.binary(4))
<pyarrow.lib.FixedSizeBinaryArray object at 0x7f153ec75318>
[
  666F6F00,
  6261727A
]
{code}

The real issue here is that even a fixed-size binary array cannot currently be 
converted back to a NumPy string array; the result is an object array that 
keeps the padding byte:
{code:python}
>>> pa.array(a, type=pa.binary(4)).to_pandas()
array([b'foo\x00', b'barz'], dtype=object)
{code}
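
Until such a conversion exists, one possible workaround is to cast the object 
array back on the NumPy side. This is only a sketch: it starts from an object 
array built by hand to match the output above, and it assumes the width (4) 
is known to the caller:

```python
import numpy as np

# Object array equivalent to the to_pandas() result shown above.
obj = np.array([b"foo\x00", b"barz"], dtype=object)

# Cast to a fixed-width bytes dtype. NumPy treats trailing zero bytes as
# padding, so b"foo\x00" reads back as b"foo".
restored = obj.astype("S4")

print(restored.dtype, restored.tolist())
```

Note that this drops trailing zero bytes, so it is only suitable when those 
bytes are padding rather than data.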


> [Python] Use fixed size binary for NumPy fixed-size string dtypes
> -----------------------------------------------------------------
>
>                 Key: ARROW-3685
>                 URL: https://issues.apache.org/jira/browse/ARROW-3685
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.11.1
>            Reporter: Maarten Breddels
>            Priority: Major
>
> I'm working on getting support for arrow in vaex (out of core dataframe 
> library for Python) in this PR:
> [https://github.com/maartenbreddels/vaex/pull/116]
> And I found that fixed-length binary arrays from numpy (say dtype='S42') are 
> converted to a non-fixed-length array. Trying to convert that back to numpy 
> fails, since there is no such conversion.
> It makes more sense to convert dtype='S42' to an arrow array with 
> pyarrow.binary(42) type. As I do in:
> https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
