alamb commented on PR #10026:
URL: https://github.com/apache/datafusion/pull/10026#issuecomment-2075788643

   Sorry for the delay. I spent time trying to reproduce the performance difference in this PR.
   
   Reproducer
   1. Download: https://datasets.clickhouse.com/hits_compatible/hits.parquet
   2. Use q28.sql
   
   ```shell
   cat q28.sql
   SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
   FROM 'hits.parquet' WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
   ```
   
   
   # Tests
   
   Build datafusion-cli with
   
   ```shell
   cd datafusion-cli
   cargo build --release
   cp target/release/datafusion-cli ~/Downloads/datafusion-cli-distributor
   ```
   
   Also build it with merge-base
   
   ```shell
   git checkout `git merge-base HEAD apache/main`
   cd datafusion-cli
   cargo build --release
   cp target/release/datafusion-cli ~/Downloads/datafusion-cli-merge-base
   ```
   
   ```shell
   for i in `seq 1 5`; do ./datafusion-cli-distributor -f q28.sql | grep Elapsed ; done
   Elapsed 6.576 seconds.
   Elapsed 6.591 seconds.
   Elapsed 6.499 seconds.
   Elapsed 6.599 seconds.
   Elapsed 6.515 seconds.
   ```
   
   ```shell
   for i in `seq 1 5`; do ./datafusion-cli-merge-base -f q28.sql | grep Elapsed ; done
   Elapsed 6.756 seconds.
   Elapsed 6.761 seconds.
   Elapsed 6.633 seconds.
   Elapsed 6.565 seconds.
   Elapsed 6.565 seconds.
   ```
   
   If anything, this looks slightly better for the distributor branch (roughly 6.56 s versus 6.66 s, averaged over the five runs each).
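
To compare the two sets of timings more directly, the per-run `Elapsed` lines can be averaged. The following is only a minimal sketch: `mean_elapsed` is a hypothetical helper name, not part of datafusion-cli, and it assumes the output format shown above (`Elapsed <number> seconds.`).

```shell
# Hypothetical helper (not part of datafusion-cli): reads lines from stdin,
# picks out those starting with "Elapsed", and prints the mean of the second
# field (the number of seconds) with three decimal places.
# Prints nothing if no "Elapsed" lines are seen.
mean_elapsed() {
  awk '/^ *Elapsed/ { sum += $2; n++ } END { if (n > 0) printf "%.3f\n", sum / n }'
}

# Usage, assuming both binaries sit in the current directory as in the loops above:
#   for i in `seq 1 5`; do ./datafusion-cli-distributor -f q28.sql; done | mean_elapsed
#   for i in `seq 1 5`; do ./datafusion-cli-merge-base  -f q28.sql; done | mean_elapsed
```

With only five runs per binary, a mean like this is a rough indicator rather than a statistically rigorous comparison.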
   
   I'll see if I can reproduce the results on the GCP machine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

