devinjdangelo commented on PR #7692: URL: https://github.com/apache/arrow-datafusion/pull/7692#issuecomment-1740814474
I ran a quick test using [Query 1](https://github.com/apache/arrow-datafusion/blob/main/benchmarks/queries/q1.sql), reading the uncompressed parquet file vs. the zstd-compressed one, on local SSD-based storage:

- Uncompressed: 1.8393s
- Zstd: 1.8297s

The performance is nearly identical. The numbers above are averages over 50 runs each, and they converge toward a <1% difference. Run-to-run variance is ~5%, so on any single run either one can come out faster.

Script:

```python
import time

from datafusion import SessionContext

# Uncompressed file, ~3.6 GB on disk
#file = "/home/dev/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem/part-0.parquet"
# Zstd-compressed file, ~1.6 GB on disk
file = "/home/dev/arrow-datafusion/test_out/benchon.parquet"

# Create a DataFusion context and register the parquet file as a table
ctx = SessionContext()
ctx.register_parquet('test', file)

query = """
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    test
where
    l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus
"""

times = []
for i in range(50):
    t = time.time()
    # Execute the SQL query and materialize/print the result
    df = ctx.sql(query)
    df.show()
    elapsed = time.time() - t
    times.append(elapsed)
    print(f"datafusion agg query {elapsed}s")

# Average elapsed time across all runs
print(sum(times) / len(times))
```
