Re: [PR] Enable datafusion.optimizer.filter_null_join_keys by default [datafusion]

via GitHub Mon, 09 Sep 2024 07:19:04 -0700


Dandandan commented on code in PR #12369:
URL: https://github.com/apache/datafusion/pull/12369#discussion_r1750347562



##########
datafusion/sqllogictest/test_files/group_by.slt:
##########
@@ -2868,18 +2879,24 @@ logical_plan
 04)------Projection: s.zip_code, s.country, s.sn, s.ts, s.currency, e.sn, e.amount
 05)--------Inner Join: s.currency = e.currency Filter: s.ts >= e.ts
 06)----------SubqueryAlias: s
-07)------------TableScan: sales_global projection=[zip_code, country, sn, ts, currency]
-08)----------SubqueryAlias: e
-09)------------TableScan: sales_global projection=[sn, ts, currency, amount]
+07)------------Filter: sales_global.currency IS NOT NULL

Review Comment:
   > But in order to skip hashing nulls, the input array would have to be 
"filtered" (aka copy the matching rows)
   
   Correct, but you also save some copying in `RepartitionExec` and in the
   build-side concatenation, as well as the copying / checking of key columns
   on the probe side.
   If there aren't any nulls (even if the column is nullable), no copying
   happens.
   
   Even with CSV / MemTable, in many cases the null filter can be combined with
   existing filter expressions, so no extra copying happens (less copying, in
   fact, as fewer rows need to be copied).
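
To make the argument above concrete, here is a minimal std-only Rust sketch. It is a toy model, not DataFusion's actual join or filter code: the `Row` type and the functions `hash_join`, `drop_null_keys`, and `drop_null_keys_and` are hypothetical names invented for illustration. It shows that an inner equi-join returns the same rows whether or not null keys are filtered out first, that the join touches fewer rows when they are, and that the null check can share a single pass with an existing predicate.

```rust
use std::collections::HashMap;

/// A row with a nullable join key (`None` models SQL NULL) and a payload.
/// Hypothetical type for this sketch only.
type Row = (Option<&'static str>, i64);

/// Toy inner equi-join using a hash table on the build side.
/// Returns the joined payload pairs and the number of build rows visited.
fn hash_join(build: &[Row], probe: &[Row]) -> (Vec<(i64, i64)>, usize) {
    let mut table: HashMap<&str, Vec<i64>> = HashMap::new();
    let mut visited = 0;
    for (key, payload) in build {
        visited += 1;
        // In SQL, NULL = NULL is not true, so a NULL key can never match.
        if let Some(k) = key {
            table.entry(*k).or_default().push(*payload);
        }
    }
    let mut out = Vec::new();
    for (key, payload) in probe {
        if let Some(k) = key {
            if let Some(hits) = table.get(k) {
                for hit in hits {
                    out.push((*hit, *payload));
                }
            }
        }
    }
    (out, visited)
}

/// Stand-alone `key IS NOT NULL` filter: copies only rows with a non-null key.
fn drop_null_keys(rows: &[Row]) -> Vec<Row> {
    rows.iter().copied().filter(|(key, _)| key.is_some()).collect()
}

/// The same null check folded into an existing predicate (`payload >= min`),
/// so both conditions are evaluated in one pass with one copy.
fn drop_null_keys_and(rows: &[Row], min: i64) -> Vec<Row> {
    rows.iter()
        .copied()
        .filter(|(key, payload)| key.is_some() && *payload >= min)
        .collect()
}
```

Calling `hash_join` on the raw rows and on the output of `drop_null_keys` yields identical join results, while the second call visits only the non-null build rows. Whether that saving outweighs the cost of the extra filter in a real engine depends on the factors discussed above (repartitioning, build-side concatenation, and how many nulls are present).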
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

