westonpace commented on pull request #11911:
URL: https://github.com/apache/arrow/pull/11911#issuecomment-996972468
> I think there might be a more direct way to count the number of row groups created by inspecting the parquet files, rather than inferring based on the batches that `dataset.to_batches()` returns
For a parquet file you can do:
```
import pyarrow.parquet as pq

# Either works; both read only the footer metadata
pq.ParquetFile('/tmp/foo.parquet').metadata.num_row_groups
pq.read_metadata('/tmp/foo.parquet').num_row_groups
```
For an IPC file you can do:
```
import pyarrow.ipc as ipc

with ipc.RecordBatchFileReader('/tmp/foo.arrow') as reader:
    num_record_batches = reader.num_record_batches
```
For testing purposes, though, I would almost rather just stick with reading in a table, since that's universal across the formats, and the performance difference at this scale should be trivial. Also, this test checks the number of rows in each batch in addition to the number of batches (although one could argue that the feature can be tested solely by the number of batches). A sketch of that approach follows.
There actually is no way to get the size of the batches in an IPC file without reading them in (this has some implications for scanning, and someday I'd like to run some experiments on whether or not a change to the IPC format might help us here). For Parquet, that `metadata` object is rich enough that you can get the size of each row group (`metadata.row_group(0).num_rows`, for example).
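For instance, a minimal sketch that collects every row-group size from the footer alone (the path and the printed sizes are hypothetical; no data pages are read):
```
import pyarrow.parquet as pq

# Only the footer metadata is read here, not the column data
metadata = pq.read_metadata('/tmp/foo.parquet')
sizes = [metadata.row_group(i).num_rows
         for i in range(metadata.num_row_groups)]
print(sizes)  # e.g. [100, 100, 42]
```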