Joris Van den Bossche created ARROW-10882:
---------------------------------------------

             Summary: [Python][Dataset] Writing dataset from python iterator of 
record batches
                 Key: ARROW-10882
                 URL: https://issues.apache.org/jira/browse/ARROW-10882
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Joris Van den Bossche


At the moment, from python you can write a dataset with {{ds.write_dataset}} 
for example starting from a *list* of record batches. 

But this currently needs to be an actual list (or gets converted to a list), so 
an iterator or generated gets fully consumed (potentially bringing the record 
batches in memory), before starting to write. 

We should also be able to use the python iterator itself to back a 
{{RecordBatchIterator}}-like object, that can be consumed while writing the 
batches.

We already have a {{arrow::py::PyRecordBatchReader}} that might be useful here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to