alamb commented on issue #6906: URL: https://github.com/apache/datafusion/issues/6906#issuecomment-2355857689
In terms of implementation, what I suggest is: 1. Do a POC implementaiton: wire up just enough `StringView, don't worry about GC, basic unit tests 2. Verify it makes the clickbench query faster 3. Flesh out testing, documentation, add support for StringArrary, etc 4. Merge and profit (bonus points for blogging about it) For the POC here is the reproducer I recommend: ### Step 1. Get `hits_partitioned` using `bench.sh`: ```shell cd benchmarks ./bench.sh data clickbench_partitioned ``` ### Step 2: Prepare a script with reproducer query: ```sql set datafusion.execution.parquet.schema_force_view_types = true; SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits_partitioned WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25; ``` ```shell andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2/benchmarks/data$ cat q28.sql set datafusion.execution.parquet.schema_force_view_types = true; SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits_partitioned WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25; ``` ### Step 3: Run script (with release build of `datafusion-cli`): ```shell datafusion-cli -f q28.sql ``` * `set datafusion.execution.parquet.schema_force_view_types = true;` --> `Elapsed 18.431 seconds.` * `set datafusion.execution.parquet.schema_force_view_types = false;` --> `Elapsed 6.427 seconds.` The goal is to get `set datafusion.execution.parquet.schema_force_view_types = true;` to be the same (or better) than when it is false If you look at the [flamegraph-string-view.svg](https://github.com/user-attachments/assets/e89c5566-c635-467b-9b28-24b7d2d91bfa), you can see most of the time is spent doing GroupsAccumulator  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org