pitrou commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1824694277
Something odd: most columns read from this file come back as a single chunk,
even though the file has 21 row groups. Only `l_returnflag`, `l_linestatus`,
`l_shipinstruct` and `l_shipmode` have 21 chunks. This doesn't look right:
```python
>>> [(name, a.num_chunks) for name, a in zip(tab.column_names, tab.columns)]
[('l_orderkey', 1),
('l_partkey', 1),
('l_suppkey', 1),
('l_linenumber', 1),
('l_quantity', 1),
('l_extendedprice', 1),
('l_discount', 1),
('l_tax', 1),
('l_returnflag', 21),
('l_linestatus', 21),
('l_shipdate', 1),
('l_commitdate', 1),
('l_receiptdate', 1),
('l_shipinstruct', 21),
('l_shipmode', 21),
('l_comment', 1)]
>>> pf = pq.ParquetFile('~/arrow/data/lineitem/lineitem_0002072d-7283-43ae-b645-b26640318053.parquet')
>>> pf.metadata
<pyarrow._parquet.FileMetaData object at 0x7f236076dcb0>
created_by: DuckDB
num_columns: 16
num_rows: 2568534
num_row_groups: 21
format_version: 1.0
serialized_size: 29792
```