[GitHub] [arrow-datafusion] devinjdangelo commented on pull request #7692: Update Default Parquet Write Compression

via GitHub Fri, 29 Sep 2023 05:54:19 -0700


devinjdangelo commented on PR #7692:
URL: 
https://github.com/apache/arrow-datafusion/pull/7692#issuecomment-1740851565


   I was also a bit suspicious of how identical the performance was. I caught a 
mistake in my setup: both numbers above were for ZSTD, not uncompressed. I 
corrected the mistake and ran two more tests below. Indeed, with local storage, 
uncompressed reads are a good bit faster. I would be interested to compare this 
to remote object storage, where bandwidth may be more of a bottleneck.
   
   Removing the group by / order by:
   
   - Uncompressed: 0.5305
   - Zstd: 0.7015
   
   New Query:
   ```sql
       select
           sum(l_quantity) as sum_qty,
           sum(l_extendedprice) as sum_base_price,
           sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
           sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
           avg(l_quantity) as avg_qty,
           avg(l_extendedprice) as avg_price,
           avg(l_discount) as avg_disc,
           count(*) as count_order
       from
           test
       where
           l_shipdate <= date '1998-09-02'
   ```
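A minimal sketch of how a query like the one above could be timed and averaged over several runs. The comment does not show its benchmarking code, so the helper name `average_seconds` and the stand-in workload `fake_query` are hypothetical and only illustrate the approach.

```python
import time
from statistics import mean


def average_seconds(run_once, runs=5):
    """Call run_once() `runs` times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Hypothetical stand-in workload so the sketch runs anywhere;
# replace it with the call that executes the real query.
def fake_query():
    return sum(range(100_000))


elapsed = average_seconds(fake_query, runs=5)
print(f"average over 5 runs: {elapsed:.4f}s")
```

With the DataFusion Python bindings, `run_once` might be something like `lambda: ctx.sql(query).collect()`, run once against the uncompressed file and once against the Zstd file. That is an assumption about the setup, not something stated in the comment.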
   
   I also timed caching the entire Parquet file into memory (select * from 
test), then df.cache(). I only averaged over 5 runs this time.
   
   - Uncompressed: 6.818s
   - Zstd: 10.052s
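
To put the two sets of timings side by side, the sketch below computes how much longer the Zstd runs took relative to uncompressed. The numbers are copied from the results above. The dictionary labels and the `slowdown` helper are hypothetical names, and since the first pair is reported without a unit, only the unitless ratio is computed.

```python
# Timings as reported in this comment (local storage).
timings = {
    "aggregate query": {"uncompressed": 0.5305, "zstd": 0.7015},
    "cache full table": {"uncompressed": 6.818, "zstd": 10.052},
}


def slowdown(uncompressed, zstd):
    """Return the Zstd time as a multiple of the uncompressed time."""
    return zstd / uncompressed


ratios = {name: slowdown(**t) for name, t in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: Zstd takes {ratio:.2f}x the uncompressed time")
# aggregate query: Zstd takes 1.32x the uncompressed time
# cache full table: Zstd takes 1.47x the uncompressed time
```

These ratios describe this one local-storage setup only. They say nothing about remote object storage, where the smaller compressed files might change the picture.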


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

