westonpace commented on pull request #11911:
URL: https://github.com/apache/arrow/pull/11911#issuecomment-996972468
> I think there might be a more direct way to count the number of row groups created by inspecting the parquet files, rather than inferring based on the batches that `dataset.to_batches()` returns
For a parquet file you can do:
```
import pyarrow.parquet as pq

# Either works; both read only the footer metadata
pq.ParquetFile('/tmp/foo.parquet').metadata.num_row_groups
pq.read_metadata('/tmp/foo.parquet').num_row_groups
```
For an IPC file you can do:
```
import pyarrow.ipc as ipc

with ipc.RecordBatchFileReader('/tmp/foo.arrow') as reader:
    num_record_batches = reader.num_record_batches
```
For testing purposes, though, I would almost rather just stick with reading in a table, since that's universal across the formats, and the performance difference at this scale should be trivial. Also, this test checks the number of rows in each batch in addition to the number of batches (although one could argue that the feature can be tested solely by the number of batches). A sketch of that approach follows.
There actually is no way to get the size of the batches in an IPC file without reading them in (this has some implications for scanning, and someday I'd like to run some experiments on whether or not a change to the IPC format might help us here). For Parquet, that `metadata` object is rich enough that you can get the size of each row group (`metadata.row_group(0).num_rows`, for example).
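For instance, a minimal sketch that collects every row-group size from the footer alone (the path and the printed sizes are hypothetical; no data pages are read):
```
import pyarrow.parquet as pq

# Only the footer metadata is read here, not the column data
metadata = pq.read_metadata('/tmp/foo.parquet')
sizes = [metadata.row_group(i).num_rows
         for i in range(metadata.num_row_groups)]
print(sizes)  # e.g. [100, 100, 42]
```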