[ https://issues.apache.org/jira/browse/ARROW-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511786#comment-17511786 ]
Lubo Slivka commented on ARROW-15969:
-------------------------------------
Sorry for the confusion - I did not express myself properly - I meant exactly
what you drew :)
The reason I see for not doing the class hierarchy as you propose is repeated
streaming.
As I understand it (please correct me if I'm wrong; I'm very new to Arrow and
may be basing my argument on a wrong assumption), one can keep a
RecordBatchFileReader open for as long as is feasible and read from it at will.
Keeping the file open cuts down on IO overhead, so it is a good idea to reuse
it.
Having RecordBatchFileReader extend RecordBatchReader and implement the
necessary methods means a client can stream the file only once. To stream
again, a new instance of RecordBatchFileReader has to be created, or some kind
of Reset() function has to be added to allow streaming the whole file again.
IMHO an adapter on top of RecordBatchFileReader is a cleaner way that
'naturally' allows for repeated streaming.
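
To make the adapter idea concrete, here is a rough Python sketch. It is not a
proposed implementation, just an illustration. It relies only on the file
reader's num_record_batches, get_batch(i) and schema, plus
RecordBatchReader.from_batches from pyarrow; the helper names iter_batches and
to_reader are made up for this example.

```python
from typing import Any, Iterator


def iter_batches(file_reader: Any) -> Iterator[Any]:
    """Yield each batch of a random-access file reader, in file order.

    Only `num_record_batches` and `get_batch(i)` are used, so the open
    file reader keeps no streaming position and can be streamed again.
    """
    for i in range(file_reader.num_record_batches):
        yield file_reader.get_batch(i)


def to_reader(file_reader: Any) -> Any:
    """Hypothetical helper: wrap the batches in a fresh RecordBatchReader."""
    import pyarrow as pa  # imported lazily; needed only for this wrapper

    return pa.RecordBatchReader.from_batches(
        file_reader.schema, iter_batches(file_reader)
    )
```

Each call to to_reader(file_reader) would hand out a new one-shot stream over
the same open file, so repeated streaming needs neither a second
RecordBatchFileReader nor a Reset().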
> [C++][Python] Add conversion from RecordBatchFileReader to RecordBatchReader
> ----------------------------------------------------------------------------
>
> Key: ARROW-15969
> URL: https://issues.apache.org/jira/browse/ARROW-15969
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Lubo Slivka
> Priority: Major
>
> The suggested improvement is to introduce a conversion/adapter so that all
> batches from RecordBatchFileReader can be read one-by-one using
> RecordBatchReader.
> Perhaps a new instance method RecordBatchFileReader.to_reader()? This would
> follow the example of, for instance, pyarrow.flight.MetadataRecordBatchReader,
> which also has to_reader().
> *Motivation*
> Record Batches serialized into IPC file format can be read using
> RecordBatchFileReader. The interface of this reader is incompatible with
> RecordBatchReader.
> This affects, for instance, the Flight RPC DoGet, where it is not possible to
> efficiently (e.g. fully in C++) send out all data by using
> pyarrow.flight.RecordBatchStream. However, there may be other use cases where
> client code wants to read data batch-by-batch transparently, without caring
> about the serialization format.
> Further background is here:
> [https://lists.apache.org/thread/b9jwk103fgxfo4kct12t00ymdft7bklb]
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)