Re: [I] Make ClickBench Q23 Go Faster [datafusion]

via GitHub Wed, 23 Apr 2025 08:58:51 -0700


acking-you commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824790924


   > I tried the rewrite into a Semi join and indeed it is over 2x slower 
(5.3sec vs 12sec)
   > 
   > > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY 
"EventTime" LIMIT 10;
   > Elapsed 5.320 seconds.
   > Here is what I think the rewrite is
   > 
   > > SELECT * from 'hits_partitioned' WHERE "WatchID" IN (
   >   SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' 
ORDER BY "EventTime" LIMIT 10
   > );
   > 
   > Elapsed 12.023 seconds.
   > WatchID is a unique key
   > 
   > > select count(distinct "WatchID"), count(*) from 'hits_partitioned';
   > +------------------------------------------+----------+
   > | count(DISTINCT hits_partitioned.WatchID) | count(*) |
   > +------------------------------------------+----------+
   > | 99997493                                 | 99997497 |
   > +------------------------------------------+----------+
   > I also double checked the output
   > 
   > ## orig
   > datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE 
'%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out
   > 
   > ## rewrite
   >  datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN 
(SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER 
BY \"EventTime\" LIMIT 10);" > rewrite.out
   > 
   > ## check
   > sort orig.out > orig.out.sort
   > sort rewrite.out > rewrite.out.sort
   > diff orig.out.sort rewrite.out.sort
   > 
   > 7c7
   > < Elapsed 5.649 seconds.
   > ---
   > > Elapsed 11.067 seconds.
   
   I recently took a detailed look at this optimization in ClickHouse, and it 
might offer you some insights @alamb .
   
   ## rewrite SQL in ClickHouse
   First, in ClickHouse, each row of data can be located using the two virtual 
columns `_part` and `_part_offset` (this applies only to MergeTree tables).  
   Specifically, for the optimization you mentioned, you can compare the 
performance of the following two queries in ClickHouse:
   ```sql
   -- Q1:
   SELECT * from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
   10 rows in set. Elapsed: 34.907 sec. Processed 18.63 million rows, 11.77 GB 
(533.61 thousand rows/s., 337.25 MB/s.)
   Peak memory usage: 1.31 GiB.
   -- Q2:
   SELECT * FROM hits WHERE (_part,_part_offset) in 
   (SELECT _part,_part_offset from hits WHERE "URL" LIKE '%google%' ORDER BY 
"EventTime" LIMIT 10);
   10 rows in set. Elapsed: 0.334 sec. Processed 18.68 million rows, 3.13 GB 
(55.88 million rows/s., 9.37 GB/s.)
   Peak memory usage: 236.42 MiB.
   ```
   I measured that Q1 took 34 seconds, while Q2 only took 7.2 seconds (both 
cleared the page cache before running).  
   This performance improvement may be attributed to ClickHouse correctly 
utilizing the result of the subquery to filter parts and quickly pinpoint the 
relevant data.
   
   However, in earlier versions of ClickHouse (such as 23.12), the 
aforementioned query Q2 would actually degrade in performance. But at that 
time, if I split Q2 into two separate statements and executed them manually, it 
would still work perfectly fine.
   ```sql
   -- Initial (Clear page cache)
       -- step 1
   SELECT _part,_part_offset  FROM hits WHERE "URL" LIKE '%google%' ORDER BY 
"EventTime" LIMIT 10;
   10 rows in set. Elapsed: 6.254 sec. Processed 18.63 million rows, 3.10 GB 
(2.98 million rows/s., 495.76 MB/s.)
   Peak memory usage: 190.79 MiB.
       -- step 2
   SELECT * FROM hits WHERE
       (_part = 'all_1_210_3' AND _part_offset IN (20223140, 20223142, 
20223144, 19725555, 15188338, 13322137, 19741966, 3076201))
       OR
       (_part = 'all_211_216_1' AND _part_offset IN (692957, 692958));
   10 rows in set. Elapsed: 0.731 sec. Processed 65.04 thousand rows, 36.83 MB 
(89.02 thousand rows/s., 50.41 MB/s.)
   Peak memory usage: 46.23 MiB.
   
   -- The total time is 6.98 seconds.
   ```
   
   ## Conclusion
   - Accessing row_id might require some optimization techniques, such as 
merging row_id conditions to quickly skip through RowGroups
   - After implementing this optimization, the performance improvement in 
ClickBench may not be significant. I noticed that when columns are fully loaded 
into the page cache, the performance gain from lazy materialization is not very 
noticeable. Additionally, ClickBench only clears the page cache before 
executing the first SQL query
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

Reply via email to