Weston Pace created ARROW-14026:
-----------------------------------
Summary: [C++] Batch readahead not working correctly in Parquet scanner
Key: ARROW-14026
URL: https://issues.apache.org/jira/browse/ARROW-14026
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace
The Parquet scanner implements batch readahead by applying a readahead
generator to the generator returned by
parquet::arrow::FileReader::GetRecordBatchGenerator. However, that generator
is constructed with MakeConcatenatedGenerator which, regrettably, carries this
comment:
> This generator is async-reentrant but will never pull from source reentrantly
> and will never pull from any subscription reentrantly.
This effectively prevents any batch readahead from happening, so the file is
always read one batch at a time. Part of the problem seems to be that
ReadOneRowGroup in reader.cc returns a RecordBatchGenerator when it seems it
should be able to return a single RecordBatch. For the testing I am doing, I
changed it to return a single record batch, which allowed me to get rid of the
concatenated generator. Batch readahead then appeared to work properly, but I
have not fully confirmed the correctness of this change.
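
To illustrate the effect, here is a minimal self-contained sketch. It is a toy
model and not Arrow's code: Pending, Generator, MakeSource, MakeNonReentrant,
MakeReadahead and MaxInFlight are hypothetical stand-ins invented for this
example, and the "reads" are lazy thunks rather than real asynchronous I/O. It
assumes only what the comment above states, namely that the wrapping generator
never pulls from its source while an earlier pull is still outstanding.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Toy stand-in for a future: the work runs lazily on the first Get().
struct Pending {
  std::function<int()> run;
  bool done = false;
  int value = 0;
  int Get() {
    if (!done) { value = run(); done = true; }
    return value;
  }
};
using PendingPtr = std::shared_ptr<Pending>;
using Generator = std::function<PendingPtr()>;

// Tracks how many source reads are outstanding at once.
struct Stats { int in_flight = 0; int max_in_flight = 0; };

PendingPtr MakePending(std::function<int()> run) {
  auto p = std::make_shared<Pending>();
  p->run = std::move(run);
  return p;
}

// Source: each call starts a "read" that finishes when its result is taken.
Generator MakeSource(Stats* stats) {
  auto next = std::make_shared<int>(0);
  return [stats, next]() {
    ++stats->in_flight;
    stats->max_in_flight = std::max(stats->max_in_flight, stats->in_flight);
    int value = (*next)++;
    return MakePending([stats, value] { --stats->in_flight; return value; });
  };
}

// Wrapper that may be called reentrantly but never pulls from the source
// until the previous pull has finished (the behaviour the comment describes).
Generator MakeNonReentrant(Generator source) {
  auto last = std::make_shared<PendingPtr>();
  return [source, last]() {
    PendingPtr prev = *last;
    PendingPtr current = MakePending([source, prev] {
      if (prev) prev->Get();   // finish the previous pull first
      return source()->Get();  // only then pull from the source
    });
    *last = current;
    return current;
  };
}

// Readahead: keeps up to `depth` results requested ahead of the consumer.
Generator MakeReadahead(Generator source, int depth) {
  auto queue = std::make_shared<std::deque<PendingPtr>>();
  return [source, queue, depth]() {
    while (static_cast<int>(queue->size()) < depth) queue->push_back(source());
    PendingPtr front = queue->front();
    queue->pop_front();
    return front;
  };
}

// Consumes 8 items and reports the peak number of concurrent source reads.
int MaxInFlight(bool non_reentrant, int depth) {
  Stats stats;
  Generator gen = MakeSource(&stats);
  if (non_reentrant) gen = MakeNonReentrant(gen);
  gen = MakeReadahead(gen, depth);
  for (int i = 0; i < 8; ++i) {
    int v = gen()->Get();
    assert(v == i);  // items still arrive in order either way
    (void)v;
  }
  return stats.max_in_flight;
}
```

In this model, MaxInFlight(false, 4) reports 4 concurrent source reads, while
MaxInFlight(true, 4) reports 1: the readahead layer still queues four requests,
but the non-reentrant wrapper only forwards them to the source one at a time.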