jorisvandenbossche commented on issue #33986:
URL: https://github.com/apache/arrow/issues/33986#issuecomment-1415832284

   I have recently been thinking about this as well, triggered by our work to support the DataFrame Interchange protocol in pyarrow (https://github.com/apache/arrow/pull/14804, https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html).
   For some context, this "interchange protocol" is a Python-specific one for interchanging dataframe data, typically between different dataframe libraries (e.g. converting a dataframe of library1 into a dataframe of library2 without either library having to know about the other). Essentially, this is very similar to our C Data Interface (it specifies how to share the memory, and in the end also hands out pointers to buffers of a certain size, just like the C struct), but because it is a Python API, it gives you some functionality on top of those raw buffers (or rather, before you get to the buffers).
   
   As an illustration, using the latest pyarrow (with a pyarrow.Table, so all data is already in memory, but we _could_ add the same interface to something backed by a not-yet-materialized stream or data on disk):
   
   ```python
   >>> import pyarrow as pa
   >>> table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})
   # accessing the interchange object
   >>> interchange_object = table.__dataframe__()
   >>> interchange_object
   <pyarrow.interchange.dataframe._PyArrowDataFrame at 0x7ffb64f45420>
   # this doesn't yet need to materialize all the buffers, but you can inspect metadata
   >>> interchange_object.num_columns()
   2
   >>> interchange_object.column_names()
   ['a', 'b']
   # you can select a subset (i.e. a simple projection)
   >>> subset = interchange_object.select_columns_by_name(['a'])
   >>> subset.num_columns()
   1
   # only when actually asking for the buffers of one chunk of a column does the data
   # need to be in memory (to pass a pointer to the buffers)
   >>> subset.get_column_by_name("a").get_buffers()
   {'data': (PyArrowBuffer({'bufsize': 24, 'ptr': 140717840175104, 'device': 'CPU'}),
     (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': None,
    'offsets': None}
   ```
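
   On the consumer side, those raw pointers can be wrapped back into Arrow data zero-copy. A minimal sketch of that step (roughly what a converter like `pyarrow.interchange.from_dataframe` has to do internally; the `base=` argument keeps the interchange buffer alive):

   ```python
   import pyarrow as pa

   table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})
   col = table.__dataframe__().get_column_by_name("a")

   # get_buffers() hands out (buffer, dtype) tuples as shown above
   data_buf, dtype = col.get_buffers()["data"]

   # zero-copy: wrap the raw pointer, keeping `data_buf` alive as the owner
   arrow_buf = pa.foreign_buffer(data_buf.ptr, data_buf.bufsize, base=data_buf)

   # an int64 array takes [validity, data] buffers; validity is None here
   arr = pa.Array.from_buffers(pa.int64(), col.size(), [None, arrow_buf])
   print(arr)  # -> [1, 2, 3]
   ```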
   
   To be clear, I'm not proposing we do something exactly like that (and personally, I think it's a missed opportunity that the DataFrame Interchange protocol does not use Arrow for its memory layout specification, and does not give access to the data as Arrow data or through the Arrow C Interface).
   But it has some nice properties compared to our "raw" C Data Interface. It lets you inspect the data (column names, data types) before you actually start materializing/consuming it chunk by chunk, and it lets you get the data for only a subset of the columns. In contrast, when using the C Stream interface for a Table (or generically a RecordBatchReader), you directly get the struct with the buffers for all columns. So if you don't need all columns, you have to subset the data before accessing it through the C interface, which means you need to know the API of the object you are getting the data from (partly negating the benefit of the generality of the C Interface).
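
   For concreteness, a minimal sketch of that producer-side projection with today's pyarrow API (`Table.select` and `RecordBatchReader.from_batches` are existing pyarrow calls; the point is that the projection requires knowing pyarrow's own API before the generic handoff happens):

   ```python
   import pyarrow as pa

   table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})

   # the projection has to happen here, through pyarrow's own API,
   # *before* exporting: the C stream carries buffers for every column it contains
   subset = table.select(['a'])

   # a consumer importing this stream via the C Stream interface
   # only ever sees (and materializes) column 'a'
   reader = pa.RecordBatchReader.from_batches(subset.schema, subset.to_batches())
   for batch in reader:
       print(batch.num_columns)  # -> 1
   ```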
   
   So having some standard Python interface that gives you delayed and queryable (filter / simple projection) access to Arrow C Interface data sounds really interesting, or having this as an additional C ABI (on top of which we can still provide such a Python interface), like David sketched above.
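
   To make that a bit more tangible, a purely hypothetical sketch of what such a Python-level interface could look like (all names here are made up for illustration; none of this exists today):

   ```python
   from typing import List, Protocol


   class ArrowDataProvider(Protocol):
       # hypothetical protocol: inspect and project lazily, materialize late

       def schema(self):
           """Return the column names/types without touching any buffers."""
           ...

       def select_columns(self, names: List[str]) -> "ArrowDataProvider":
           """Return a new provider restricted to these columns (lazy projection)."""
           ...

       def to_c_stream(self, out_ptr: int) -> None:
           """Export an ArrowArrayStream to the given address; only at this
           point do the buffers of the (possibly projected) columns materialize."""
           ...
   ```

   A consumer could then do the inspection and projection through this generic interface, and only hand off the C stream at the end, without having to know anything else about the producing library.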

