Weston Pace created ARROW-14026:
-----------------------------------

             Summary: [C++] Batch readahead not working correctly in Parquet scanner
                 Key: ARROW-14026
                 URL: https://issues.apache.org/jira/browse/ARROW-14026
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Weston Pace


The Parquet scanner implements batch readahead by applying a readahead 
generator to the generator returned by 
parquet::arrow::FileReader::GetRecordBatchGenerator.  However, that generator 
is constructed with MakeConcatenatedGenerator, which, regrettably, has this 
comment:

> This generator is async-reentrant but will never pull from source reentrantly 
> and will never pull from any subscription reentrantly.

This effectively prevents any batch readahead from happening: the readahead 
generator may call the concatenated generator reentrantly, but those calls are 
never forwarded reentrantly to the source, so the file is always read one 
batch at a time.  Part of the problem seems to be that ReadOneRowGroup in 
reader.cc returns a RecordBatchGenerator when it seems it should be able to 
return a RecordBatch.  For the testing I am doing, I changed it to return a 
single record batch, which allowed me to get rid of the concatenated 
generator.  Batch readahead then appeared to work properly, but I have not 
fully confirmed the correctness of this change.
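
To illustrate the effect described above, here is a minimal, self-contained 
sketch in plain C++.  It is a toy model and does not use Arrow's actual 
generator or Future types: ToySource, SerializedGenerator and 
MaxInFlightWithReadahead are hypothetical names invented for this example.  
SerializedGenerator stands in for a generator that accepts reentrant calls but 
never pulls from its source reentrantly, and the "readahead" is simply a loop 
that issues several pulls before any of them completes.

```cpp
#include <algorithm>
#include <deque>
#include <functional>
#include <utility>

// Toy model only (an assumption for illustration, not Arrow's API): a
// "generator" takes a completion callback, and a request finishes later,
// when the event loop completes it.
using Callback = std::function<void(int)>;
using Generator = std::function<void(Callback)>;

// Hypothetical source that records how many requests were in flight at once.
struct ToySource {
  std::deque<std::function<void()>> pending;  // requests not yet completed
  int next_value = 0;
  int max_in_flight = 0;

  void Pull(Callback cb) {
    int value = next_value++;
    pending.push_back([cb, value] { cb(value); });
    max_in_flight = std::max(max_in_flight, static_cast<int>(pending.size()));
  }

  // Completes the oldest in-flight request; returns false if there is none.
  bool CompleteOne() {
    if (pending.empty()) return false;
    std::function<void()> task = std::move(pending.front());
    pending.pop_front();
    task();
    return true;
  }
};

// Hypothetical wrapper that accepts reentrant calls but forwards only one
// pull at a time to its source, queueing the rest.
struct SerializedGenerator {
  Generator source;
  std::deque<Callback> waiting;
  bool busy = false;

  explicit SerializedGenerator(Generator src) : source(std::move(src)) {}

  void Pull(Callback cb) {
    waiting.push_back(std::move(cb));
    if (!busy) Forward();
  }

  void Forward() {
    if (waiting.empty()) {
      busy = false;
      return;
    }
    busy = true;
    Callback cb = std::move(waiting.front());
    waiting.pop_front();
    // The next pull is issued only after the previous one has completed.
    source([this, cb](int value) {
      cb(value);
      Forward();
    });
  }
};

// Issues `depth` pulls up front (the readahead) and reports the largest
// number of requests the source ever had in flight at the same time.
int MaxInFlightWithReadahead(bool serialize, int depth) {
  ToySource src;
  Generator direct = [&src](Callback cb) { src.Pull(std::move(cb)); };
  SerializedGenerator serial(direct);
  Generator upstream = direct;
  if (serialize) {
    upstream = [&serial](Callback cb) { serial.Pull(std::move(cb)); };
  }

  int received = 0;
  for (int i = 0; i < depth; ++i) {
    upstream([&received](int) { ++received; });
  }
  while (src.CompleteOne()) {
  }
  return src.max_in_flight;
}
```

In this model, MaxInFlightWithReadahead(false, 4) returns 4, while 
MaxInFlightWithReadahead(true, 4) returns 1: with the serializing wrapper in 
between, the source never sees more than one outstanding request, no matter 
how many pulls the readahead issues.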



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
