jorisvandenbossche commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-754672429
Trying this out locally, I see the following strange behaviour:
```python
In [87]: import pandas as pd
In [88]: import pyarrow as pa
In [89]: import pyarrow.parquet as pq
In [90]: size = 300
In [91]: table = pa.table({'str': [str(x) for x in range(size)]})
In [92]: pq.write_table(table, "test.parquet", row_group_size=100)
In [93]: f = pq.ParquetFile("test.parquet")
In [94]: [b.num_rows for b in f.iter_batches(batch_size=80)]
Out[94]: [80, 80, 80, 60]
In [95]: table = pa.table({'str': pd.Categorical([str(x) for x in range(size)])})
In [96]: pq.write_table(table, "test.parquet", row_group_size=100)
In [97]: f = pq.ParquetFile("test.parquet")
In [98]: [b.num_rows for b in f.iter_batches(batch_size=80)]
Out[98]: [80, 20, 60, 40, 40, 60]
```
So with a dictionary type, the batch sizes come out rather strange.

It was already discussed above that categorical data behaves differently (which is also why the tests don't include categoricals for now). Since this PR is mostly exposing an existing feature in Python rather than implementing it, I suppose this is an existing bug in the batch_size handling of the Parquet C++ record batch reader, and it can be reported/fixed separately?
> I can confirm that once I merge in the latest changes from apache master, I am getting the batches to be the expected size (and spanning across the parquet chunks). I have updated the test_iter_batches_columns_reader unit test accordingly.
That's again tied to the C++ implementation, but I would actually not expect the batches to cross row group boundaries. For example, with a row group size of 1000, a total size of 2000, and a batch size of 300, I would expect batches of [300, 300, 300, 100, 300, 300, 300, 100] instead of [300, 300, 300, 300, 300, 300, 200].
But that may be something to open a JIRA for as a follow-up.
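To illustrate what I mean, here is a small sketch (a hypothetical helper, not code from this PR) that computes the batch sizes I would expect when batches are cut at every row group boundary:

```python
def expected_batches(total_rows, row_group_size, batch_size):
    # Batches are cut within each row group, so a partial batch is
    # emitted at every row group boundary instead of spanning it.
    sizes = []
    for start in range(0, total_rows, row_group_size):
        remaining = min(row_group_size, total_rows - start)
        while remaining > 0:
            take = min(batch_size, remaining)
            sizes.append(take)
            remaining -= take
    return sizes

print(expected_batches(2000, 1000, 300))
# -> [300, 300, 300, 100, 300, 300, 300, 100]
```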
> Looking at the code, I no longer think this `batch_size` parameter actually affects those other read methods.
>
> There are a few different "batch_size" parameters floating around `reader.cc`, but there's only one reference to the one in `reader_properties_` (`ArrowReaderProperties`):
> ...
> As far as I can tell, that's exclusively on the code path for the RecordBatchReader, and not the other readers. So I don't think we need to add the parameter to those other methods.
That indeed seems to be the case.
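For reference, a minimal sketch of the distinction in the Python API (assuming the `pyarrow.parquet` methods discussed in this PR; batch size only matters on the record batch path):

```python
import pyarrow.parquet as pq

f = pq.ParquetFile("test.parquet")

# batch_size is honoured here: iter_batches goes through the C++
# RecordBatchReader code path.
sizes = [b.num_rows for b in f.iter_batches(batch_size=80)]

# The table-level readers don't take a batch size; each call returns
# a single Table, independent of any batch size setting.
full = f.read()
first_group = f.read_row_group(0)
```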