acking-you commented on issue #11212: URL: https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617
@alamb I sincerely apologize for not revisiting this issue or pushing forward with that [PR](https://github.com/apache/datafusion/pull/11247) for such a long time. However, this optimization has led to significant performance improvements in one of our internal use cases—nearly 100 times faster. Recently, while seeing the community working on runtime filters, this issue came to mind again. I’m truly grateful for the community’s friendliness and active engagement. ## Application Scenarios ### Scenario 1 I attempted to reproduce our usage scenario using the hits dataset from clickbench hits: ```sql SELECT count(*) FROM hits WHERE -- Core filtering criteria: In most cases, the data can be filtered down to 0 records. "UserID" > 7350907328996404000 AND "URL" = 'http://smeshariki.ru/recipes/search/cuZ29vZ2xlLzcwODthZHdvcmRz&page/make=43;1334313116' -- Other filters AND "EventTime" BETWEEN EXTRACT(epoch FROM '2014-03-23 00:00:00'::timestamp)::BIGINT AND EXTRACT(epoch FROM '2014-04-22 23:59:59'::timestamp)::BIGINT AND (("Age" BETWEEN 18 AND 35 AND "Sex" = 'female') OR ("Income" > 50000 AND ("Interests" & 128) = 128)) AND ( ("IsMobile" = 1 AND "MobilePhoneModel" LIKE 'iPhone%') OR ("IsMobile" = 0 AND "ResolutionWidth" >= 1920) ) AND "IsDownload" = 0 AND "IsNotBounce" = 1 AND "DontCountHits" = 0 AND split_part(split_part("URL", 'make=', 2), ';', 1) = '43' AND "UTMSource" = 'google_ads' AND "UTMCampaign" = 'spring_promo' AND "HTTPError" = CAST(200 AS SMALLINT) AND "JavascriptEnable" = 1 AND "CookieEnable" = 1 AND EXTRACT(hour FROM to_timestamp("EventTime")) BETWEEN 14 AND 18 AND ( regexp_match("UserAgent", 'Chrome/(9[0-9]|10[0-2])') IS NOT NULL OR regexp_match("UserAgent", 'Safari/[6-8]\.') IS NOT NULL ) AND "SocialSourceNetworkID" IN (CAST(5 AS SMALLINT),CAST(12 AS SMALLINT)) AND "SocialAction" = 'share' AND ( ("ClientTimeZone" BETWEEN -5 AND 5 AND "BrowserCountry" <> 'RU') OR ("ClientTimeZone" NOT BETWEEN -5 AND 5 AND "BrowserCountry" = 'RU') ) AND ( strpos("OriginalURL", 'utm_id=') > 0 OR "OpenstatCampaignID" IS NOT NULL ) AND "Robotness" < CAST(0.3 AS FLOAT) AND "IsArtifical" = 0 AND "IsEvent" = 1 AND "IsParameter" = 0; ``` The performance difference of this SQL compared to the main branch without short-circuit optimization is as follows, achieving a nearly 500X improvement: ``` Benchmark clickbench_partitioned.json -------------------- ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Query ┃ main ┃ add_short_circuit ┃ Change ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ QQuery 0 │ 5024.39ms │ 10.16ms │ +494.40x faster │ └──────────────┴───────────┴───────────────────┴─────────────────┘ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ Total Time (main) │ 5024.39ms │ │ Total Time (add_short_circuit) │ 10.16ms │ │ Average Time (main) │ 5024.39ms │ │ Average Time (add_short_circuit) │ 10.16ms │ │ Queries Faster │ 1 │ │ Queries Slower │ 0 │ │ Queries with No Change │ 0 │ └──────────────────────────────────┴───────────┘ ``` ### Scenario 2 I also thought of a scenario where `where in (subquery)` could be used: ```sql SELECT count(*) FROM hits WHERE "UserID" > 7350909328996404000 and "URL" in (select "URL" from hits where "URL" like '%tt%'); ``` 但这个场景并不会对性能产生优化,因为该语句实际上是一个 join: ``` LogicalPlan: Projection: count(Int64(1)) AS count(*) └─ Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] └─ Projection └─ LeftSemi Join: hits.URL = __correlated_sq_1.URL ├─ Projection: hits.URL │ └─ Filter: hits.UserID > Int64(9050909328996404000) │ └─ TableScan: hits │ (projection=[UserID, URL], filter=UserID > 9e18) └─ SubqueryAlias: __correlated_sq_1 └─ Filter: hits.URL LIKE "%tt%" └─ TableScan: hits (projection=[URL], filter=URL contains "tt") PhysicalPlan: ProjectionExec: count(*) └─ AggregateExec (Final) └─ CoalescePartitionsExec └─ AggregateExec (Partial) └─ ProjectionExec └─ CoalesceBatchesExec └─ HashJoinExec (LeftSemi) ├─ RepartitionExec (Hash) │ └─ CoalesceBatchesExec │ └─ FilterExec (UserID > 9e18) │ └─ DataSourceExec │ (parquet/hits, 12 partitions) │ [projection: UserID, URL] │ [pruning: UserID_max > 9e18] └─ RepartitionExec (Hash) └─ CoalesceBatchesExec └─ FilterExec (URL LIKE "%tt%") └─ DataSourceExec (parquet/hits, 12 partitions) [projection: URL] [predicate: URL contains "tt"] ``` This SQL query will perform extremely slowly with or without this optimization. I believe this scenario should benefit from the optimization of Dynamic Join Predicates, as mentioned by @alamb in this [issue](https://github.com/apache/datafusion/issues/7955), referencing the [duckdb blog](https://duckdb.org/2024/09/09/announcing-duckdb-110.html#dynamic-filter-pushdown-from-joins). ## Why TPCH or ClickBench Did Not Improve ### TPCH I went through each of the test SQL statements in TPCH and found that most of them focus on joins rather than point queries. Additionally, the computational load of the predicate conditions does not yield significant benefits, as illustrated by the following SQL: ```sql select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= date '1994-01-01' and l_shipdate < date '1995-01-01' and l_discount between 0.06 - 0.01 and 0.06 + 0.01 and l_quantity < 24; ``` Each predicate is based on simple comparisons and additions of numbers, which does not yield any profit. ### ClickBench There are not many SQL with more than one predicate in ClickBench, including Q21, Q22, Q36, Q37, Q38, Q39, Q40, Q41, and Q42: ```sql SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10; SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10; SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10; SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10; SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000; SELECT "TraficSourceID", "SearchEngineID", "AdvEngineID", CASE WHEN ("SearchEngineID" = 0 AND "AdvEngineID" = 0) THEN "Referer" ELSE '' END AS Src, "URL" AS Dst, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 GROUP BY "TraficSourceID", "SearchEngineID", "AdvEngineID", Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000; SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100; SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000; SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000; ``` From the above SQLs, we can observe that: 1. Most of them are simple predicate conditions. 2. Some of the slightly more complex predicate conditions are followed by `limit 10`, and the computational cost of these predicate conditions is still relatively low (especially when compared to many string-matching scenarios). ## Summary Scenarios that benefit from short-circuit optimization: 1. The initial filter can filter out most of the data. 2. Subsequent filters involve time-consuming computations (e.g., operations that mostly involve matching strings longer than 1k characters, or predicates with a large number of string manipulations). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org