[ https://issues.apache.org/jira/browse/ARROW-10882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10882:
-----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [Python][Dataset] Writing dataset from python iterator of record batches
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10882
>                 URL: https://issues.apache.org/jira/browse/ARROW-10882
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: David Li
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> At the moment, from python you can write a dataset with {{ds.write_dataset}} 
> for example starting from a *list* of record batches. 
> But this currently needs to be an actual list (or gets converted to a list), 
> so an iterator or generator gets fully consumed (potentially bringing all the 
> record batches into memory) before writing starts. 
> We should also be able to use the python iterator itself to back a 
> {{RecordBatchIterator}}-like object, that can be consumed while writing the 
> batches.
> We already have a {{arrow::py::PyRecordBatchReader}} that might be useful 
> here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)