devinjdangelo commented on PR #7692: URL: https://github.com/apache/arrow-datafusion/pull/7692#issuecomment-1740814474
I ran a quick test using [Query 1](https://github.com/apache/arrow-datafusion/blob/main/benchmarks/queries/q1.sql), reading the uncompressed parquet file vs. the zstd-compressed one, on local SSD-based storage:

- Uncompressed: 1.8393s
- Zstd: 1.8297s

The performance is nearly identical. The numbers above are averages over 50 runs each, and they converge toward a <1% difference. Run-to-run variance is ~5%, so on any single run either one can come out faster.

Script:

```python
import time

from datafusion import SessionContext

# Uncompressed file, ~3.6 GB on disk
#file = "/home/dev/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem/part-0.parquet"
# Zstd-compressed file, ~1.6 GB on disk
file = "/home/dev/arrow-datafusion/test_out/benchon.parquet"

# Create a DataFusion context and register the parquet file as a table
ctx = SessionContext()
ctx.register_parquet('test', file)

query = """
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    test
where
    l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus
"""

times = []
for i in range(50):
    t = time.time()
    # Execute the SQL query and materialize/print the result
    df = ctx.sql(query)
    df.show()
    elapsed = time.time() - t
    times.append(elapsed)
    print(f"datafusion agg query {elapsed}s")

# Average elapsed time across all runs
print(sum(times) / len(times))
```
