[ 
https://issues.apache.org/jira/browse/ARROW-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556196#comment-16556196
 ] 

Wes McKinney commented on ARROW-2913:
-------------------------------------

From Arrow's perspective, these buffers are "just memory"; they're given an 
interpretation in the context of the columnar data structure.

We could pin some extra metadata on the buffer on the Python side without much 
work, which would make the exported memory view look like int32, double, etc. 
That wouldn't intrude on the C++ API or any implementation details.
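
As a rough illustration of the idea (not Arrow code), the sketch below uses only 
the Python standard library to reinterpret an untyped byte buffer as a typed 
memory view. The helper name typed_view is hypothetical, and the struct format 
code 'q' (native 8-byte signed integer) is an assumed stand-in for int64; a real 
implementation would derive the format from the array's type.

```python
import struct


def typed_view(raw, fmt):
    """Reinterpret a contiguous byte buffer as items of struct format `fmt`.

    Hypothetical helper for illustration only; `fmt` is a single native
    struct format character such as 'i', 'q' or 'd'.
    """
    view = memoryview(raw)
    # memoryview.cast() requires either the source or the target to be a
    # byte format, so normalise to unsigned bytes first if needed.
    if view.format not in ("B", "b", "c"):
        view = view.cast("B")
    itemsize = struct.calcsize(fmt)
    if view.nbytes % itemsize != 0:
        raise ValueError("buffer size is not a multiple of the item size")
    return view.cast(fmt)


# Stand-in for an exported data buffer: ten native 8-byte integers (80 bytes).
raw = struct.pack("10q", *range(10))

untyped = memoryview(raw)
typed = typed_view(raw, "q")

print(untyped.format, untyped.shape)
print(typed.format, typed.shape)
```

The cast does not copy the data; it only changes the format and shape that 
consumers of the buffer protocol see, which is the kind of information a 
type-aware serializer would look at.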

> [Python] Exported buffers don't expose type information
> -------------------------------------------------------
>
>                 Key: ARROW-2913
>                 URL: https://issues.apache.org/jira/browse/ARROW-2913
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 0.10.0
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> Using the {{buffers()}} method on an array gives you a list of buffers backing 
> the array, but those buffers lose typing information:
> {code:python}
> >>> a = pa.array(range(10))
> >>> a.type
> DataType(int64)
> >>> buffers = a.buffers()
> >>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers]
> [('b', (2,)), ('b', (80,))]
> {code}
> Conversely, NumPy exposes type information in the Python buffer protocol:
> {code:python}
> >>> a = pa.array(range(10))
> >>> memoryview(a.to_numpy()).format
> 'l'
> >>> memoryview(a.to_numpy()).shape
> (10,)
> {code}
> Exposing type information on buffers could be important for third-party 
> systems, such as Dask/distributed, for type-based data compression when 
> serializing.
> Since our C++ buffers are not typed, it's not obvious how to solve this. 
> Should we return tensors instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
