[
https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747321#comment-16747321
]
Paul Taylor commented on ARROW-4283:
------------------------------------
[~pitrou] Thanks for the feedback.
I want to clarify: my Python skills aren't sharp, and I'm not familiar with the
pyarrow API or Python's asyncio/async-iterable primitives, so filter my
comments through the lens of a beginner.
The little experience I do have is using the RecordBatchStreamReader to read
from stdin (via {{sys.stdin.buffer}}) and named file descriptors (via
{{os.fdopen()}}). Since Python's so friendly (and I have no idea how the Python
IO primitives work), I thought maybe I could pass aiohttp's {{Request.stream}}
to the RecordBatchStreamReader constructor, and quickly learned that no, I
can't ;).
In the JS implementation we have two main entry points for reading RecordBatch
streams:
# a static
[{{RecordBatchReader.from(source)}}|https://github.com/apache/arrow/blob/cc1ce6194b905768b1a6d9f0e209270f62dc558a/js/src/ipc/reader.ts#L142],
which accepts heterogeneous source types and returns a RecordBatchReader for
the underlying Arrow type (file, stream, or JSON) and conforms to sync/async
semantics of the source input type
# methods that create [through/transform
streams|https://github.com/apache/arrow/blob/cc1ce6194b905768b1a6d9f0e209270f62dc558a/js/bin/file-to-stream.js#L33]
from the RecordBatchReader and RecordBatchWriter, for use with node's native
stream primitives
Each link in the streaming pipeline is a sort of transform stream, and a
significant amount of effort went into supporting all the different
node/browser IO primitives, so I understand if that's too much to ask at this
point.
As an alternative, would it be possible to add a method that accepts a Python
byte stream, and returns a zero-copy AsyncIterable of RecordBatches? Or maybe
add an example in the
[python/ipc|https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams]
docs page of how to do that?
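
For illustration, one possible shape for such a helper is sketched below. It is not an existing pyarrow API: {{aiter_blocking}} is a hypothetical name, and the sketch assumes only that the reader is an ordinary blocking iterable (as RecordBatchStreamReader is when iterated). Each blocking {{next()}} call is pushed onto the default thread pool so the event loop stays free. It does not make the underlying reads non-blocking and makes no claim about zero-copy behaviour.

```python
import asyncio
from typing import AsyncIterator, Iterable, TypeVar

T = TypeVar("T")

# Sentinel returned by next() when the iterator is exhausted. A default is
# used because StopIteration cannot be propagated cleanly through a Future.
_DONE = object()


async def aiter_blocking(iterable: Iterable[T]) -> AsyncIterator[T]:
    """Yield items from a blocking iterable without stalling the event loop.

    Hypothetical helper: each next() call runs in the loop's default
    thread-pool executor, and the coroutine awaits its result.
    """
    loop = asyncio.get_running_loop()
    iterator = iter(iterable)
    while True:
        item = await loop.run_in_executor(None, next, iterator, _DONE)
        if item is _DONE:
            return
        yield item


# Assumed usage with pyarrow (not executed here; requires pyarrow installed):
#
#     import sys
#     import pyarrow as pa
#
#     async def main():
#         reader = pa.ipc.open_stream(sys.stdin.buffer)
#         async for batch in aiter_blocking(reader):
#             print(batch.num_rows)
#
#     asyncio.run(main())
```

This only covers sources that already look like blocking Python file objects. Feeding it from a truly asynchronous source, such as an aiohttp request body, would still need the bytes to be collected or bridged into a file-like object first.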
> Should RecordBatchStreamReader/Writer be AsyncIterable?
> -------------------------------------------------------
>
> Key: ARROW-4283
> URL: https://issues.apache.org/jira/browse/ARROW-4283
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Paul Taylor
> Priority: Minor
> Fix For: 0.13.0
>
>
> Filing this issue after a discussion today with [~xhochy] about how to
> implement streaming pyarrow http services. I had attempted to use both Flask
> and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s
> streaming interfaces because they seemed familiar, but no dice. I have no
> idea how hard this would be to add -- supporting all the asynciterable
> primitives in JS was non-trivial.