alamb commented on issue #6906:
URL: https://github.com/apache/datafusion/issues/6906#issuecomment-2355857689

   In terms of implementation, what I suggest is:
   
   1. Do a POC implementaiton: wire up just enough `StringView, don't worry 
about GC, basic unit tests
   2. Verify it makes the clickbench query faster
   3. Flesh out testing, documentation, add support for StringArrary, etc
   4. Merge and profit (bonus points for blogging about it)
   
   For the POC here is the reproducer I recommend:
   ### Step 1. Get `hits_partitioned` using `bench.sh`:
   ```shell
   cd benchmarks
   ./bench.sh data clickbench_partitioned
   ```
   
   ### Step 2: Prepare a script with reproducer query:
   ```sql
   set datafusion.execution.parquet.schema_force_view_types = true;
   
   
   SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') 
AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
   FROM hits_partitioned
   WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC 
LIMIT 25;
   ```
   
   
   
   ```shell
   andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2/benchmarks/data$ cat 
q28.sql
   set datafusion.execution.parquet.schema_force_view_types = true;
   
   SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') 
AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
   FROM hits_partitioned
   WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC 
LIMIT 25;
   ```
   
   ### Step 3: Run script (with release build of `datafusion-cli`):
   ```shell
   datafusion-cli -f q28.sql
   ```
   
   * `set datafusion.execution.parquet.schema_force_view_types = true;` -->  
`Elapsed 18.431 seconds.`
   * `set datafusion.execution.parquet.schema_force_view_types = false;` -->  
`Elapsed 6.427 seconds.`
   
   The goal is to get `set datafusion.execution.parquet.schema_force_view_types 
= true;`  to be the same (or better) than when it is false
   
   If you look at the 
[flamegraph-string-view.svg](https://github.com/user-attachments/assets/e89c5566-c635-467b-9b28-24b7d2d91bfa),
 you can see most of the time is spent doing GroupsAccumulator 
   
   ![Screenshot 2024-09-16 at 4 44 50 
PM](https://github.com/user-attachments/assets/89b736a3-d761-42f3-b6dc-6982c445628a)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to