Re: [I] `BinaryExpr` evaluate lacks optimization for `Or` and `And` scenarios [datafusion]

via GitHub Sat, 05 Apr 2025 10:09:59 -0700


acking-you commented on issue #11212:
URL: https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617


   @alamb I sincerely apologize for not revisiting this issue or pushing 
forward with that [PR](https://github.com/apache/datafusion/pull/11247) for 
such a long time. However, this optimization has led to significant performance 
improvements in one of our internal use cases—nearly 100 times faster. 
Recently, while seeing the community working on runtime filters, this issue 
came to mind again. I’m truly grateful for the community’s friendliness and 
active engagement.
   
   ## Application Scenarios
   
   ### Scenario 1
   I attempted to reproduce our usage scenario using the hits dataset from 
clickbench hits:
   ```sql
   SELECT count(*)
   FROM hits
   WHERE 
       -- Core filtering criteria: In most cases, the data can be filtered down 
to 0 records.
       "UserID" > 7350907328996404000
       AND "URL" = 
'http://smeshariki.ru/recipes/search/cuZ29vZ2xlLzcwODthZHdvcmRz&page/make=43;1334313116'
       
       -- Other filters
       AND "EventTime" BETWEEN 
           EXTRACT(epoch FROM '2014-03-23 00:00:00'::timestamp)::BIGINT 
           AND 
           EXTRACT(epoch FROM '2014-04-22 23:59:59'::timestamp)::BIGINT
       AND (("Age" BETWEEN 18 AND 35 AND "Sex" = 'female') 
            OR ("Income" > 50000 AND ("Interests" & 128) = 128))
       AND (
           ("IsMobile" = 1 AND "MobilePhoneModel" LIKE 'iPhone%') 
           OR ("IsMobile" = 0 AND "ResolutionWidth" >= 1920)
       )
       AND "IsDownload" = 0
       AND "IsNotBounce" = 1
       AND "DontCountHits" = 0
       AND split_part(split_part("URL", 'make=', 2), ';', 1) = '43'
       AND "UTMSource" = 'google_ads'
       AND "UTMCampaign" = 'spring_promo'
       AND "HTTPError" = CAST(200 AS SMALLINT)
       AND "JavascriptEnable" = 1
       AND "CookieEnable" = 1
       AND EXTRACT(hour FROM to_timestamp("EventTime")) BETWEEN 14 AND 18
       AND (
           regexp_match("UserAgent", 'Chrome/(9[0-9]|10[0-2])') IS NOT NULL
           OR regexp_match("UserAgent", 'Safari/[6-8]\.') IS NOT NULL
       )
       AND "SocialSourceNetworkID" IN (CAST(5 AS SMALLINT),CAST(12 AS SMALLINT))
       AND "SocialAction" = 'share'
       AND (
           ("ClientTimeZone" BETWEEN -5 AND 5 AND "BrowserCountry" <> 'RU')
           OR ("ClientTimeZone" NOT BETWEEN -5 AND 5 AND "BrowserCountry" = 
'RU')
       )
       AND (
           strpos("OriginalURL", 'utm_id=') > 0 
           OR "OpenstatCampaignID" IS NOT NULL
       )
       AND "Robotness" < CAST(0.3 AS FLOAT)
       AND "IsArtifical" = 0
       AND "IsEvent" = 1
       AND "IsParameter" = 0;
   ```
   The performance difference of this SQL compared to the main branch without 
short-circuit optimization is as follows, achieving a nearly 500X improvement:
   ```
   Benchmark clickbench_partitioned.json
   --------------------
   ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
   ┃ Query        ┃      main ┃ add_short_circuit ┃          Change ┃
   ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
   │ QQuery 0     │ 5024.39ms │           10.16ms │ +494.40x faster │
   └──────────────┴───────────┴───────────────────┴─────────────────┘
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
   ┃ Benchmark Summary                ┃           ┃
   ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
   │ Total Time (main)                │ 5024.39ms │
   │ Total Time (add_short_circuit)   │   10.16ms │
   │ Average Time (main)              │ 5024.39ms │
   │ Average Time (add_short_circuit) │   10.16ms │
   │ Queries Faster                   │         1 │
   │ Queries Slower                   │         0 │
   │ Queries with No Change           │         0 │
   └──────────────────────────────────┴───────────┘
   ```
   ### Scenario 2  
   I also thought of a scenario where `where in (subquery)` could be used:
   ```sql
   SELECT count(*) FROM hits WHERE "UserID" > 7350909328996404000 and  "URL" in 
(select "URL" from hits where "URL" like '%tt%');
    ```
   但这个场景并不会对性能产生优化，因为该语句实际上是一个 join：
   ```
   LogicalPlan:
   Projection: count(Int64(1)) AS count(*)
   └─ Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
      └─ Projection
         └─ LeftSemi Join: hits.URL = __correlated_sq_1.URL
            ├─ Projection: hits.URL
            │  └─ Filter: hits.UserID > Int64(9050909328996404000)
            │     └─ TableScan: hits
            │        (projection=[UserID, URL], filter=UserID > 9e18)
            └─ SubqueryAlias: __correlated_sq_1
               └─ Filter: hits.URL LIKE "%tt%"
                  └─ TableScan: hits
                     (projection=[URL], filter=URL contains "tt")
   
   PhysicalPlan:
   ProjectionExec: count(*) 
   └─ AggregateExec (Final)
      └─ CoalescePartitionsExec
         └─ AggregateExec (Partial)
            └─ ProjectionExec
               └─ CoalesceBatchesExec
                  └─ HashJoinExec (LeftSemi)
                     ├─ RepartitionExec (Hash)
                     │  └─ CoalesceBatchesExec
                     │     └─ FilterExec (UserID > 9e18)
                     │        └─ DataSourceExec
                     │           (parquet/hits, 12 partitions)
                     │           [projection: UserID, URL]
                     │           [pruning: UserID_max > 9e18]
                     └─ RepartitionExec (Hash)
                        └─ CoalesceBatchesExec
                           └─ FilterExec (URL LIKE "%tt%")
                              └─ DataSourceExec
                                 (parquet/hits, 12 partitions)
                                 [projection: URL]
                                 [predicate: URL contains "tt"]
   ```
   
   This SQL query will perform extremely slowly with or without this 
optimization. I believe this scenario should benefit from the optimization of 
Dynamic Join Predicates, as mentioned by @alamb in this 
[issue](https://github.com/apache/datafusion/issues/7955), referencing the 
[duckdb 
blog](https://duckdb.org/2024/09/09/announcing-duckdb-110.html#dynamic-filter-pushdown-from-joins).
   
   
   
   ## Why TPCH or ClickBench Did Not Improve  
   ### TPCH  
   I went through each of the test SQL statements in TPCH and found that most 
of them focus on joins rather than point queries. Additionally, the 
computational load of the predicate conditions does not yield significant 
benefits, as illustrated by the following SQL:
   ```sql
   select
       sum(l_extendedprice * l_discount) as revenue
   from
       lineitem
   where
           l_shipdate >= date '1994-01-01'
     and l_shipdate < date '1995-01-01'
     and l_discount between 0.06 - 0.01 and 0.06 + 0.01
     and l_quantity < 24;
   ```
   Each predicate is based on simple comparisons and additions of numbers, 
which does not yield any profit.
   
   ### ClickBench
   There are not many SQL with more than one predicate in ClickBench, including 
Q21, Q22, Q36, Q37, Q38, Q39, Q40, Q41, and Q42:
   ```sql
   SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE 
'%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC 
LIMIT 10;
   
   SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, 
COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT 
LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c 
DESC LIMIT 10;
   
   SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND 
"EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= 
'2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP 
BY "URL" ORDER BY PageViews DESC LIMIT 10;
   
   SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND 
"EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= 
'2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' 
GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;
   
   SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND 
"EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= 
'2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP 
BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
   
   SELECT "TraficSourceID", "SearchEngineID", "AdvEngineID", CASE WHEN 
("SearchEngineID" = 0 AND "AdvEngineID" = 0) THEN "Referer" ELSE '' END AS Src, 
"URL" AS Dst, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND 
"EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= 
'2013-07-31' AND "IsRefresh" = 0 GROUP BY "TraficSourceID", "SearchEngineID", 
"AdvEngineID", Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
   SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits 
WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND 
"EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" 
IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", 
"EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
   
   SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM 
hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND 
"EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" 
= 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", 
"WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
   
   SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, 
COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND 
"EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= 
'2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY 
DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY 
DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000;
   ```
   From the above SQLs, we can observe that:  
   1. Most of them are simple predicate conditions.  
   2. Some of the slightly more complex predicate conditions are followed by 
`limit 10`, and the computational cost of these predicate conditions is still 
relatively low (especially when compared to many string-matching scenarios).
   
   
   ## Summary
   
   Scenarios that benefit from short-circuit optimization:
   1. The initial filter can filter out most of the data.
   2. Subsequent filters involve time-consuming computations (e.g., operations 
that mostly involve matching strings longer than 1k characters, or predicates 
with a large number of string manipulations).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] `BinaryExpr` evaluate lacks optimization for `Or` and `And` scenarios [datafusion]

Reply via email to