gopidesupavan commented on PR #62232: URL: https://github.com/apache/airflow/pull/62232#issuecomment-3939582305
Benchmarks: DataFusion already publishes benchmarks with TPC-H here: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/

I also tried the yellow_taxi trips data, with 44.5M records, and ran these aggregation queries against the data stored in S3. It took 1 minute 47s.

```
"""SELECT COUNT(*) AS total_trips FROM yellow_taxi;""",
"""SELECT SUM(total_amount) AS total_revenue FROM yellow_taxi;""",
"""SELECT AVG(trip_distance) AS avg_distance_miles, AVG(fare_amount) AS avg_fare FROM yellow_taxi;""",
"""SELECT MIN(tip_amount) AS min_tip, MAX(tip_amount) AS max_tip, AVG(tip_amount) AS avg_tip FROM yellow_taxi WHERE payment_type = 1;""",
"""SELECT DATE_TRUNC('day', tpep_pickup_datetime) AS pickup_day, COUNT(*) AS trips, AVG(passenger_count) AS avg_passengers FROM yellow_taxi GROUP BY pickup_day ORDER BY pickup_day DESC;""",
"""SELECT DATE_TRUNC('month', tpep_pickup_datetime) AS pickup_month, SUM(fare_amount) AS total_fares, SUM(tolls_amount) AS total_tolls FROM yellow_taxi GROUP BY pickup_month ORDER BY pickup_month DESC;"""
```

<img width="1164" height="827" alt="image" src="https://github.com/user-attachments/assets/ed39ffbf-26a7-4a41-8efa-bc50c59b5dcc" />

With the HITL approval operator, the results can be reviewed before the next task runs:

<img width="1055" height="810" alt="image" src="https://github.com/user-attachments/assets/9b260e2d-8219-409b-a631-7abf3eae040f" />
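
For anyone who wants to reproduce the timing with a per-query breakdown, here is a minimal sketch of a timing harness. It is not the code used for the run above. `time_queries`, `total_seconds`, and the `run_query` callable are hypothetical names introduced for illustration, and the harness measures wall-clock time only, so S3 network latency is included in each figure.

```python
import time
from typing import Any, Callable, Iterable


def time_queries(
    run_query: Callable[[str], Any],
    queries: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> list[tuple[str, float]]:
    """Run each query once and return (query, elapsed_seconds) pairs.

    `run_query` is a hypothetical callable that executes one SQL string
    against whatever engine is being measured.
    """
    timings = []
    for query in queries:
        start = clock()
        run_query(query)  # the result is discarded; only wall time is kept
        timings.append((query, clock() - start))
    return timings


def total_seconds(timings: Iterable[tuple[str, float]]) -> float:
    """Sum the elapsed seconds across all timed queries."""
    return sum(elapsed for _, elapsed in timings)
```

With the DataFusion Python bindings, `run_query` could be something like `lambda q: ctx.sql(q).collect()` on a `SessionContext` that already has `yellow_taxi` registered. That wiring is an assumption and is not taken from this PR.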
