[ https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645401#comment-17645401 ]
Weston Pace commented on ARROW-4283:
------------------------------------
Also, a note on {{RecordBatchStreamWriter}}. In most cases you can probably
get away with not making this async. If you are writing to disk then the write
is typically implicitly async. The write call merely does a memcpy from
user space to kernel space (into the page cache), marks the pages dirty, and
then returns immediately, without waiting for the data to be persisted to
disk. The only time this blocks is if you are out of memory and swapping,
in which case you might have to wait for some physical memory to become
available. We do run into this on large datasets, though, and are currently
investigating a direct I/O alternative which would, unfortunately, require
async. So that would be the exception to "in most cases".
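To make the page-cache point concrete, here is a minimal sketch using only the Python standard library (not pyarrow). The helper name {{timed_write}} and the 8 MiB payload are made up for illustration. It times a plain buffered write against the same write followed by {{os.fsync}}, which is what actually waits for the disk. The timings depend entirely on the machine and filesystem, so treat it as a way to observe the behavior rather than as a benchmark.

```python
import os
import tempfile
import time


def timed_write(path, payload, sync=False):
    """Hypothetical helper: write payload, return (bytes_written, seconds)."""
    # O_BINARY only exists on Windows; it is 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(payload)
        written = 0
        start = time.perf_counter()
        while written < len(view):
            # Normally returns once the bytes are copied into the page cache.
            written += os.write(fd, view[written:])
        if sync:
            # Explicitly wait until the kernel has pushed the data to disk.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return written, elapsed


if __name__ == "__main__":
    payload = os.urandom(8 * 1024 * 1024)  # 8 MiB, arbitrary size
    with tempfile.TemporaryDirectory() as tmp:
        _, buffered = timed_write(os.path.join(tmp, "buffered.bin"), payload)
        _, synced = timed_write(os.path.join(tmp, "synced.bin"), payload, sync=True)
    print(f"buffered write: {buffered:.4f}s, write + fsync: {synced:.4f}s")
```

On most systems the buffered call returns well before the data is durable, which is why a synchronous write method is usually good enough here.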
Cloud filesystems behave similarly. I'm quite certain Arrow's S3 writer is
implicitly async, and others should be able to be as well: we create an S3
request, submit it to an I/O thread under the hood, and simply expose a
non-blocking write method.
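The general shape of that pattern can be sketched in a few lines of Python. This is not Arrow's S3 writer: {{BackgroundWriter}} is a hypothetical class, and an in-memory {{io.BytesIO}} stands in for the remote store. The idea is only that {{write()}} hands the chunk to an I/O thread and returns, while {{close()}} is the one place that waits and surfaces errors.

```python
import io
from concurrent.futures import ThreadPoolExecutor


class BackgroundWriter:
    """Hypothetical wrapper: write() queues the chunk and returns at once."""

    def __init__(self, sink):
        self._sink = sink
        # A single worker thread keeps chunks in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def write(self, chunk):
        # Copy the bytes so the caller may reuse its buffer right away.
        future = self._pool.submit(self._sink.write, bytes(chunk))
        self._pending.append(future)

    def close(self):
        # Wait for queued writes; result() re-raises any I/O error.
        try:
            for future in self._pending:
                future.result()
        finally:
            self._pool.shutdown(wait=True)


if __name__ == "__main__":
    sink = io.BytesIO()  # stand-in for a slow remote store
    writer = BackgroundWriter(sink)
    for part in (b"ab", b"cd", b"ef"):
        writer.write(part)  # returns without waiting for the sink
    writer.close()
    print(sink.getvalue())
```

A real implementation would also want to bound the queue so that a fast producer cannot buffer an unlimited amount of data; that is left out here to keep the sketch short.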
> [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
> ----------------------------------------------------------------
>
> Key: ARROW-4283
> URL: https://issues.apache.org/jira/browse/ARROW-4283
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Paul Taylor
> Priority: Minor
>
> Filing this issue after a discussion today with [~xhochy] about how to
> implement streaming pyarrow http services. I had attempted to use both Flask
> and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s
> streaming interfaces because they seemed familiar, but no dice. I have no
> idea how hard this would be to add -- supporting all the asynciterable
> primitives in JS was non-trivial.