Re: [I] Parquet deserialization speeds slower on Linux [arrow]

via GitHub Thu, 23 Nov 2023 08:27:48 -0800


pitrou commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1824694277


   Something odd: most columns read from this file come back as a single 
chunk, even though the file has 21 row groups. This doesn't look right:
   ```python
   >>> [(name, a.num_chunks) for name, a in zip(tab.column_names, tab.columns)]
   [('l_orderkey', 1),
    ('l_partkey', 1),
    ('l_suppkey', 1),
    ('l_linenumber', 1),
    ('l_quantity', 1),
    ('l_extendedprice', 1),
    ('l_discount', 1),
    ('l_tax', 1),
    ('l_returnflag', 21),
    ('l_linestatus', 21),
    ('l_shipdate', 1),
    ('l_commitdate', 1),
    ('l_receiptdate', 1),
    ('l_shipinstruct', 21),
    ('l_shipmode', 21),
    ('l_comment', 1)]
   
   >>> pf = pq.ParquetFile('~/arrow/data/lineitem/lineitem_0002072d-7283-43ae-b645-b26640318053.parquet')
   >>> pf.metadata
   <pyarrow._parquet.FileMetaData object at 0x7f236076dcb0>
     created_by: DuckDB
     num_columns: 16
     num_rows: 2568534
     num_row_groups: 21
     format_version: 1.0
     serialized_size: 29792
   ```
   


