[
https://issues.apache.org/jira/browse/ARROW-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509268#comment-17509268
]
Lubo Slivka commented on ARROW-15969:
-------------------------------------
Hi [~lidavidm],
I took a stab at this. Since I'm very new to Arrow, I'm unsure whether this is
the right way to do it, so I'd like to run it by you before creating a PR, if
you don't mind.
After some poking around in the C++ code, it seemed like a good idea to add a
GetRecordBatchReader() method to RecordBatchFileReader. There is already a
GetRecordBatchGenerator(), so getting the batch reader in a similar way looked
like a good fit to me:
[https://github.com/lupko/arrow/commit/ff56b44f881fd6069b73d9a432f6528d7ff11bb2]
As a disclaimer: since I have not followed or used C++ for more than a decade,
I pieced the implementation together from other things I've seen in reader.cc :)
---
The Python part is then straightforward: I added
RecordBatchFileReader.to_reader(), which calls this new method and wraps the
result in a RecordBatchReader.
[https://github.com/lupko/arrow/commit/72240813ef4fff337c0ecb44c715febf7bc3d2fc]
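For comparison, here is a rough sketch of what can be done today without the
new method, assuming only the existing pyarrow calls
RecordBatchFileReader.num_record_batches, get_batch(), schema and
RecordBatchReader.from_batches(). The helper names iter_batches and
file_reader_to_reader are made up for illustration and are not part of the
commits above. Unlike the proposed to_reader(), every batch here is pulled
through a Python generator rather than staying in C++.

```python
from typing import Any, Iterator


def iter_batches(file_reader: Any) -> Iterator[Any]:
    """Yield the batches of a random-access IPC file reader in file order."""
    # Relies only on num_record_batches and get_batch(i), which
    # pyarrow's RecordBatchFileReader provides.
    for i in range(file_reader.num_record_batches):
        yield file_reader.get_batch(i)


def file_reader_to_reader(file_reader: Any) -> Any:
    """Wrap a RecordBatchFileReader in a streaming RecordBatchReader.

    Hypothetical helper, not the proposed to_reader(): each batch is fetched
    from Python, one at a time, when the consumer asks for the next one.
    """
    import pyarrow as pa  # lazy import; iter_batches itself does not need it

    return pa.RecordBatchReader.from_batches(
        file_reader.schema, iter_batches(file_reader)
    )
```

With a reader obtained from pyarrow.ipc.open_file(), the result of
file_reader_to_reader() should be usable wherever a RecordBatchReader is
expected, for example as the data source of pyarrow.flight.RecordBatchStream.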
---
Does this make sense? Shall I create a PR?
Regards,
Lubo
> [Python] Add conversion from RecordBatchFileReader to RecordBatchReader
> -----------------------------------------------------------------------
>
> Key: ARROW-15969
> URL: https://issues.apache.org/jira/browse/ARROW-15969
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Lubo Slivka
> Priority: Major
>
> The suggested improvement is to introduce a conversion/adapter so that all
> batches from RecordBatchFileReader can be read one-by-one using
> RecordBatchReader.
> Perhaps a new instance method RecordBatchFileReader.to_reader()? This would
> follow suit with, for instance, pyarrow.flight.MetadataRecordBatchReader,
> which also has to_reader().
> *Motivation*
> Record Batches serialized into IPC file format can be read using
> RecordBatchFileReader. The interface of this reader is incompatible with
> RecordBatchReader.
> This impacts, for instance, the Flight RPC DoGet, where it is not possible to
> efficiently (e.g. fully in C++) send out all data by using
> pyarrow.flight.RecordBatchStream. However, there may be other use cases where
> client code wants to read data batch-by-batch transparently, without caring
> about the serialization format.
> Further background is here:
> [https://lists.apache.org/thread/b9jwk103fgxfo4kct12t00ymdft7bklb]
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)