[ 
https://issues.apache.org/jira/browse/ARROW-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509792#comment-17509792
 ] 

Lubo Slivka commented on ARROW-15969:
-------------------------------------

Hi [~amol-] ,

thanks for the input. I suspect the key difference here is that:
 * RecordBatchStreamReader allows going through all batches in the stream, 
one by one, once
 * RecordBatchFileReader allows for repeated, random access; one can keep the 
file reader open and read from it as needed, even reading all the 
batches multiple times

The RecordBatchStreamReader naturally "IS A" RecordBatchReader (its docs even 
say it 'reads a stream of batches'). The file reader is not. IMHO, because of 
this difference, implementing RecordBatchReader (RBR) directly on the file 
reader would lead to awkwardness down the road: what if a user wants to consume 
the contents of the file through RBR multiple times? 

Anyway, that's how I understand it with my limited knowledge of the codebase 
:]. Please let me know if I'm missing something.

Thanks,

--L

> [C++][Python] Add conversion from RecordBatchFileReader to RecordBatchReader
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-15969
>                 URL: https://issues.apache.org/jira/browse/ARROW-15969
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Lubo Slivka
>            Priority: Major
>
> The suggested improvement is to introduce a conversion/adapter so that all 
> batches from RecordBatchFileReader can be read one-by-one using 
> RecordBatchReader.
> Perhaps a new instance method RecordBatchFileReader.to_reader()? This would 
> follow the example of, for instance, pyarrow.flight.MetadataRecordBatchReader, 
> which also has to_reader().
> *Motivation*
> Record Batches serialized into IPC file format can be read using 
> RecordBatchFileReader. The interface of this reader is incompatible with 
> RecordBatchReader.
> This affects, for instance, the Flight RPC DoGet, where it is not possible to 
> send out all the data efficiently (e.g. fully in C++) using 
> pyarrow.flight.RecordBatchStream. However, there may be other use cases where 
> client code wants to read data batch-by-batch transparently, without caring 
> about the serialization format.
> Further background is here: 
> [https://lists.apache.org/thread/b9jwk103fgxfo4kct12t00ymdft7bklb]
>  



