acking-you commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824790924
> I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec) > > > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10; > Elapsed 5.320 seconds. > Here is what I think the rewrite is > > > SELECT * from 'hits_partitioned' WHERE "WatchID" IN ( > SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10 > ); > > Elapsed 12.023 seconds. > WatchID is a unique key > > > select count(distinct "WatchID"), count(*) from 'hits_partitioned'; > +------------------------------------------+----------+ > | count(DISTINCT hits_partitioned.WatchID) | count(*) | > +------------------------------------------+----------+ > | 99997493 | 99997497 | > +------------------------------------------+----------+ > I also double checked the output > > ## orig > datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out > > ## rewrite > datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN (SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10);" > rewrite.out > > ## check > sort orig.out > orig.out.sort > sort rewrite.out > rewrite.out.sort > diff orig.out.sort rewrite.out.sort > > 7c7 > < Elapsed 5.649 seconds. > --- > > Elapsed 11.067 seconds. I recently took a detailed look at this optimization in ClickHouse, and it might offer you some insights @alamb . ## rewrite SQL in ClickHouse First, in ClickHouse, each row of data can be located using the two virtual columns `_part` and `_part_offset` (this applies only to MergeTree tables). Specifically, for the optimization you mentioned, you can compare the performance of the following two queries in ClickHouse: ```sql -- Q1: SELECT * from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10; 10 rows in set. Elapsed: 34.907 sec. Processed 18.63 million rows, 11.77 GB (533.61 thousand rows/s., 337.25 MB/s.) Peak memory usage: 1.31 GiB. -- Q2: SELECT * FROM hits WHERE (_part,_part_offset) in (SELECT _part,_part_offset from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10); 10 rows in set. Elapsed: 0.334 sec. Processed 18.68 million rows, 3.13 GB (55.88 million rows/s., 9.37 GB/s.) Peak memory usage: 236.42 MiB. ``` I measured that Q1 took 34 seconds, while Q2 only took 7.2 seconds (both cleared the page cache before running). This performance improvement may be attributed to ClickHouse correctly utilizing the result of the subquery to filter parts and quickly pinpoint the relevant data. However, in earlier versions of ClickHouse (such as 23.12), the aforementioned query Q2 would actually degrade in performance. But at that time, if I split Q2 into two separate statements and executed them manually, it would still work perfectly fine. ```sql -- Initial (Clear page cache) -- step 1 SELECT _part,_part_offset FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10; 10 rows in set. Elapsed: 6.254 sec. Processed 18.63 million rows, 3.10 GB (2.98 million rows/s., 495.76 MB/s.) Peak memory usage: 190.79 MiB. -- step 2 SELECT * FROM hits WHERE (_part = 'all_1_210_3' AND _part_offset IN (20223140, 20223142, 20223144, 19725555, 15188338, 13322137, 19741966, 3076201)) OR (_part = 'all_211_216_1' AND _part_offset IN (692957, 692958)); 10 rows in set. Elapsed: 0.731 sec. Processed 65.04 thousand rows, 36.83 MB (89.02 thousand rows/s., 50.41 MB/s.) Peak memory usage: 46.23 MiB. -- The total time is 6.98 seconds. ``` ## Conclusion - Accessing row_id might require some optimization techniques, such as merging row_id conditions to quickly skip through RowGroups - After implementing this optimization, the performance improvement in ClickBench may not be significant. I noticed that when columns are fully loaded into the page cache, the performance gain from lazy materialization is not very noticeable. Additionally, ClickBench only clears the page cache before executing the first SQL query -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org