jorisvandenbossche commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-754672429
Trying this out locally, I see the following strange behaviour:
```python
In [87]: import pandas as pd
In [88]: import pyarrow as pa
In [89]: import pyarrow.parquet as pq
In [90]: size = 300
In [91]: table = pa.table({'str': [str(x) for x in range(size)]})
In [92]: pq.write_table(table, "test.parquet", row_group_size=100)
In [93]: f = pq.ParquetFile("test.parquet")
In [94]: [b.num_rows for b in f.iter_batches(batch_size=80)]
Out[94]: [80, 80, 80, 60]
In [95]: table = pa.table({'str': pd.Categorical([str(x) for x in range(size)])})
In [96]: pq.write_table(table, "test.parquet", row_group_size=100)
In [97]: f = pq.ParquetFile("test.parquet")
In [98]: [b.num_rows for b in f.iter_batches(batch_size=80)]
Out[98]: [80, 20, 60, 40, 40, 60]
```
So with a dictionary type, the batch sizes come out rather strange.

It was already discussed above that categorical data behaves differently (which is also why the tests don't include categoricals for now). Since this PR is mostly exposing an existing feature in Python rather than implementing it, I suppose this is an existing bug in the batch_size handling of the Parquet C++ record batch reader, and it can be reported/fixed separately?
> I can confirm that once I merge in the latest changes from apache master, I am getting the batches to be the expected size (and spanning across the parquet chunks). I have updated the test_iter_batches_columns_reader unit test accordingly.
That's again tied to the C++ implementation, but I would actually not expect the batches to cross row group boundaries. For example, with a row group size of 1000, a total size of 2000, and a batch size of 300, I would expect batches of [300, 300, 300, 100, 300, 300, 300, 100] instead of [300, 300, 300, 300, 300, 300, 200].
But that may be something to open a JIRA for as a follow-up.
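To illustrate what I mean, here is a small sketch (a hypothetical helper, not code from this PR) that computes the batch sizes I would expect when batches are cut at every row group boundary:

```python
def expected_batches(total_rows, row_group_size, batch_size):
    # Batches are cut within each row group, so a partial batch is
    # emitted at every row group boundary instead of spanning it.
    sizes = []
    for start in range(0, total_rows, row_group_size):
        remaining = min(row_group_size, total_rows - start)
        while remaining > 0:
            take = min(batch_size, remaining)
            sizes.append(take)
            remaining -= take
    return sizes

print(expected_batches(2000, 1000, 300))
# -> [300, 300, 300, 100, 300, 300, 300, 100]
```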
> Looking at the code, I no longer think this `batch_size` parameter actually affects those other read methods.
>
> There are a few different "batch_size" parameters floating around `reader.cc`, but there's only one reference to the one in `reader_properties_` (`ArrowReaderProperties`):
> ...
> As far as I can tell, that's exclusively on the code path for the RecordBatchReader, and not the other readers. So I don't think we need to add the parameter to those other methods.
That indeed seems to be the case.
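For reference, a minimal sketch of the distinction in the Python API (assuming the `pyarrow.parquet` methods discussed in this PR; batch size only matters on the record batch path):

```python
import pyarrow.parquet as pq

f = pq.ParquetFile("test.parquet")

# batch_size is honoured here: iter_batches goes through the C++
# RecordBatchReader code path.
sizes = [b.num_rows for b in f.iter_batches(batch_size=80)]

# The table-level readers don't take a batch size; each call returns
# a single Table, independent of any batch size setting.
full = f.read()
first_group = f.read_row_group(0)
```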