[jira] [Comment Edited] (ARROW-4283) [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?

Weston Pace (Jira) Fri, 09 Dec 2022 08:46:10 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645386#comment-17645386
 ]


Weston Pace edited comment on ARROW-4283 at 12/9/22 4:44 PM:
-------------------------------------------------------------

Things have changed a bit since 2019.  The {{RecordBatchFileReader}} has an 
asynchronous API now.  It's currently exposed as a whole-file 
{{AsyncGenerator}} (an iterator function that returns a promise each time you 
call it) via {{RecordBatchFileReader::OpenAsync}} and 
{{RecordBatchFileReader::GetRecordBatchGenerator}}.  Under the hood, there are 
also {{ReadFooterAsync}} and {{ReadRecordBatchAsync}} methods that could be 
exposed should more direct control be desired.
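
As a rough illustration of the {{AsyncGenerator}} shape described above (a 
callable that hands back a promise on every call), here is a minimal Python 
sketch.  It does not use Arrow at all: {{make_batch_generator}} is a 
hypothetical helper, the strings stand in for record batches, and using 
{{None}} as the end-of-stream marker is an assumption of this sketch.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def make_batch_generator(batches, pool):
    """Return a callable that produces one future per call (hypothetical)."""
    remaining = iter(batches)

    def next_batch() -> Future:
        # Each call schedules one "read" on the pool and returns immediately.
        # None marks the end of the stream in this sketch.
        return pool.submit(next, remaining, None)

    return next_batch


with ThreadPoolExecutor(max_workers=1) as pool:
    generator = make_batch_generator(["batch-0", "batch-1"], pool)
    results = []
    # Pull futures one at a time until the end marker arrives.
    while (batch := generator().result()) is not None:
        results.append(batch)
```

The consumer decides when to ask for the next item, so nothing is read until 
the generator is called again.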

Adapting this pattern to the streaming reader should be pretty straightforward. 
 These methods all return {{arrow::Future}}.  As far as I know, no one has done 
the necessary work to plumb {{arrow::Future}} into a Python async API (e.g. 
{{asyncio}}).

Asynchronous methods in Arrow typically work by offloading the blocking I/O 
calls to a global I/O thread pool (which can have more threads than there are 
cores and should generally be sized appropriately for the I/O device).  This 
keeps the CPU threads free of blocking calls.  To hook this into {{asyncio}} 
you would probably want to call {{arrow::Future::AddCallback}} and then, in 
that callback, schedule a task on some kind of Python executor.  In that Python 
executor task you will want to mark some kind of {{asyncio}} future complete, 
which will presumably run any needed callbacks.
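
The hand-off described above could look roughly like the following sketch.  It 
is not Arrow code: a {{ThreadPoolExecutor}} stands in for the I/O thread pool, 
{{read_batch_with_callback}} is a hypothetical callback-style read playing the 
role of {{arrow::Future::AddCallback}}, and the {{asyncio}} event loop itself 
is assumed to serve as the Python executor.  Error propagation is left out.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the global I/O thread pool (an assumption of this sketch).
io_pool = ThreadPoolExecutor(max_workers=4)


def read_batch_with_callback(index, callback):
    """Hypothetical callback-style read: do the I/O on a pool thread, then call back."""

    def blocking_read():
        return f"batch-{index}"  # placeholder for a blocking I/O call

    native_future = io_pool.submit(blocking_read)
    # Plays the role of AddCallback on the native future.
    native_future.add_done_callback(lambda f: callback(f.result()))


async def read_batch(index):
    loop = asyncio.get_running_loop()
    aio_future = loop.create_future()

    def on_done(value):
        # Usually runs on an I/O thread.  asyncio futures are not thread-safe,
        # so schedule the completion on the event loop instead of calling
        # set_result directly from here.
        loop.call_soon_threadsafe(aio_future.set_result, value)

    read_batch_with_callback(index, on_done)
    return await aio_future


async def main():
    return [await read_batch(i) for i in range(3)]


batches = asyncio.run(main())
io_pool.shutdown()
```

Completing the {{asyncio}} future through {{loop.call_soon_threadsafe}} is 
what wakes up the awaiting coroutine on the event loop thread.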


> [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
> ----------------------------------------------------------------
>
>                 Key: ARROW-4283
>                 URL: https://issues.apache.org/jira/browse/ARROW-4283
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Paul Taylor
>            Priority: Minor
>
> Filing this issue after a discussion today with [~xhochy] about how to 
> implement streaming pyarrow http services. I had attempted to use both Flask 
> and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s 
> streaming interfaces because they seemed familiar, but no dice. I have no 
> idea how hard this would be to add -- supporting all the asynciterable 
> primitives in JS was non-trivial.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
