alamb commented on PR #10026: URL: https://github.com/apache/datafusion/pull/10026#issuecomment-2075788643
Sorry for the delay, I spent time trying to reproduce the performance difference in this PR.

# Reproducer

1. Download: https://datasets.clickhouse.com/hits_compatible/hits.parquet
2. Use q28.sql

```shell
cat q28.sql
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM 'hits.parquet' WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
```

# Tests

Build datafusion-cli with

```shell
cd datafusion-cli
cargo build --release
cp target/release/datafusion-cli ~/Downloads/datafusion-cli-distributor
```

Also build it with merge-base

```shell
git checkout `git merge-base HEAD apache/main`
cd datafusion-cli
cargo build --release
cp target/release/datafusion-cli ~/Downloads/datafusion-cli-merge-base
```

Timings on the distributor branch:

```shell
for i in `seq 1 5`; do ./datafusion-cli-distributor -f q28.sql | grep Elapsed ; done
Elapsed 6.576 seconds.
Elapsed 6.591 seconds.
Elapsed 6.499 seconds.
Elapsed 6.599 seconds.
Elapsed 6.515 seconds.
```

Timings on merge-base:

```shell
for i in `seq 1 5`; do ./datafusion-cli-merge-base -f q28.sql | grep Elapsed ; done
Elapsed 6.756 seconds.
Elapsed 6.761 seconds.
Elapsed 6.633 seconds.
Elapsed 6.565 seconds.
Elapsed 6.565 seconds.
```

If anything, this looks better for the distributor branch. I'll see if I can reproduce the results on the GCP machine.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
