Joris Van den Bossche created ARROW-11142:
---------------------------------------------

             Summary: [C++][Parquet] Inconsistent batch_size usage in parquet GetRecordBatchReader
                 Key: ARROW-11142
                 URL: https://issues.apache.org/jira/browse/ARROW-11142
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


The RecordBatchReader returned from 
{{parquet::arrow::FileReader::GetRecordBatchReader}}, which was originally 
introduced in ARROW-1012 and is now exposed in Python (ARROW-7800), behaves 
inconsistently in how {{batch_size}} interacts with the parquet file's row 
group boundaries. 

See also comments at 
https://github.com/apache/arrow/pull/6979#issuecomment-754672429

Small example with a parquet file of 300 rows consisting of 3 row groups of 100 
rows:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(300)})
pq.write_table(table, "test.parquet", row_group_size=100)
f = pq.ParquetFile("test.parquet")
{code}

When reading this with a batch_size that doesn't align with the size of the row 
groups, by default batches that cross the row group boundaries are returned:

{code}
In [5]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
Out[5]: [80, 80, 80, 60]
{code}

However, when the file contains a dictionary typed column with string values 
(integer dictionary values don't trigger it), the batches follow row group 
boundaries:

{code}
import pandas as pd

table = pa.table({'a': pd.Categorical([str(x) for x in range(300)])})
pq.write_table(table, "test.parquet", row_group_size=100)
f = pq.ParquetFile("test.parquet")

In [13]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
Out[13]: [80, 20, 60, 40, 40, 60]
{code}

However, the batch_size counter does not restart at the beginning of each row 
group; the row group boundaries only split the batches that the plain batching 
would otherwise produce.
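The observed sizes can be reproduced with a small pure-Python sketch (not Arrow code, just an illustration of the apparent logic): take the plain batch_size=80 cut points over 300 rows and additionally cut at the row group boundaries (100, 200), without ever resetting the counter.

{code}
# Hypothetical reconstruction of the observed batching, not the actual
# C++ implementation.
row_groups = [100, 100, 100]   # row group sizes of the example file
batch_size = 80
total = sum(row_groups)

# Cut points from plain batching: 80, 160, 240
cuts = set(range(batch_size, total, batch_size))

# Additionally cut at row group boundaries: 100, 200
boundary = 0
for n in row_groups[:-1]:
    boundary += n
    cuts.add(boundary)

points = sorted(cuts) + [total]
sizes = [b - a for a, b in zip([0] + points, points)]
print(sizes)  # [80, 20, 60, 40, 40, 60] -- matches the output above
{code}
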

Additionally, when reading empty batches (an empty column selection), the row 
group boundaries are also followed, but differently: the batching is done 
independently for each row group:

{code}
In [14]: [batch.num_rows for batch in f.iter_batches(batch_size=80, columns=[])]
Out[14]: [80, 20, 80, 20, 80, 20]
{code}

(this is explicitly coded here: 
https://github.com/apache/arrow/blob/e05f032c1e5d590ac56372d13ec637bd28b47a96/cpp/src/parquet/arrow/reader.cc#L899-L921)
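In this case the apparent logic is again easy to reproduce in a sketch (an illustration of the observed behaviour, not the actual implementation): the batch_size counter restarts at each row group, so every 100-row group is chopped into 80 + 20.

{code}
# Hypothetical reconstruction: batching restarted per row group.
row_groups = [100, 100, 100]
batch_size = 80

sizes = []
for n in row_groups:
    while n > 0:
        sizes.append(min(batch_size, n))
        n -= batch_size
print(sizes)  # [80, 20, 80, 20, 80, 20] -- matches the output above
{code}
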

---

I don't know what the expected behaviour should be, but I would at least expect 
it to be consistent across these cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)