Joris Van den Bossche created ARROW-11142:
---------------------------------------------
Summary: [C++][Parquet] Inconsistent batch_size usage in parquet
GetRecordBatchReader
Key: ARROW-11142
URL: https://issues.apache.org/jira/browse/ARROW-11142
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
The RecordBatchReader returned from
{{parquet::arrow::FileReader::GetRecordBatchReader}}, which was originally
introduced in ARROW-1012 and now exposed in Python (ARROW-7800), shows some
inconsistent behaviour in how the {{batch_size}} is followed across parquet
file row groups.
See also comments at
https://github.com/apache/arrow/pull/6979#issuecomment-754672429
Small example with a parquet file of 300 rows consisting of 3 row groups of 100
rows:
{code}
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'a': range(300)})
pq.write_table(table, "test.parquet", row_group_size=100)
f = pq.ParquetFile("test.parquet")
{code}
When reading this with a batch_size that doesn't align with the size of the row
groups, by default batches that cross the row group boundaries are returned:
{code}
In [5]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
Out[5]: [80, 80, 80, 60]
{code}
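The default output above can be reproduced with a small pure-Python model. This is only a sketch of the observed batch sizes, not the Arrow C++ implementation, and the function name {{batch_sizes_ignoring_row_groups}} is hypothetical:

```python
def batch_sizes_ignoring_row_groups(row_group_sizes, batch_size):
    """Model of the default case: row group boundaries play no role,
    only the total number of rows matters."""
    remaining = sum(row_group_sizes)
    sizes = []
    while remaining > 0:
        n = min(batch_size, remaining)
        sizes.append(n)
        remaining -= n
    return sizes


print(batch_sizes_ignoring_row_groups([100, 100, 100], 80))
# [80, 80, 80, 60]
```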
However, when the file contains a dictionary-typed column with string values
(integer dictionary values don't trigger it), the batches follow row group
boundaries:
{code}
import pandas as pd
table = pa.table({'a': pd.Categorical([str(x) for x in range(300)])})
pq.write_table(table, "test.parquet", row_group_size=100)
f = pq.ParquetFile("test.parquet")
In [13]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
Out[13]: [80, 20, 60, 40, 40, 60]
{code}
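But the {{batch_size}} count doesn't restart at the beginning of a row group:
a batch that would cross a row group boundary is only split in two, so the
pieces on either side of each boundary still add up to 80 (20 + 60, 40 + 40).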
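The dictionary-column output can be modelled the same way. The sketch below is again a pure-Python approximation of the observed sizes, with a hypothetical function name, and makes no claim about how the C++ reader arrives at them:

```python
def batch_sizes_split_at_row_groups(row_group_sizes, batch_size):
    """Model of the dictionary-column case: the batch counter keeps running
    across row groups, but a batch is cut whenever a row group ends."""
    sizes = []
    left_in_batch = batch_size
    for group_rows in row_group_sizes:
        left_in_group = group_rows
        while left_in_group > 0:
            n = min(left_in_batch, left_in_group)
            sizes.append(n)
            left_in_group -= n
            left_in_batch -= n
            if left_in_batch == 0:
                # Only a completed batch resets the counter,
                # not the start of a new row group.
                left_in_batch = batch_size
    return sizes


print(batch_sizes_split_at_row_groups([100, 100, 100], 80))
# [80, 20, 60, 40, 40, 60]
```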
Additionally, when reading batches with no columns (empty column selection),
the row group boundaries are followed as well, but differently (the batching
is done independently for each row group):
{code}
In [14]: [batch.num_rows for batch in f.iter_batches(batch_size=80, columns=[])]
Out[14]: [80, 20, 80, 20, 80, 20]
{code}
(this is explicitly coded here:
https://github.com/apache/arrow/blob/e05f032c1e5d590ac56372d13ec637bd28b47a96/cpp/src/parquet/arrow/reader.cc#L899-L921)
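For comparison, the empty-column-selection output corresponds to batching each row group on its own. This is a pure-Python model with a hypothetical function name, not a transcription of the linked C++ code:

```python
def batch_sizes_per_row_group(row_group_sizes, batch_size):
    """Model of the empty-column case: every row group is batched
    independently, so the counter restarts at each row group."""
    sizes = []
    for group_rows in row_group_sizes:
        full_batches, remainder = divmod(group_rows, batch_size)
        sizes.extend([batch_size] * full_batches)
        if remainder:
            sizes.append(remainder)
    return sizes


print(batch_sizes_per_row_group([100, 100, 100], 80))
# [80, 20, 80, 20, 80, 20]
```

With the same input of three row groups of 100 rows and a batch size of 80, the three code paths thus give three different results.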
---
I don't know what the expected behaviour should be, but I would at least expect
it to be consistent?