adriangb commented on code in PR #18393:
URL: https://github.com/apache/datafusion/pull/18393#discussion_r2546595321


##########
datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs:
##########
@@ -333,81 +402,154 @@ impl SharedBuildAccumulator {
                 // CollectLeft: Simple conjunction of bounds and membership check
                 AccumulatedBuildData::CollectLeft { data } => {
                     if let Some(partition_data) = data {
+                        // Create membership predicate (InList for small build sides, hash lookup otherwise)
+                        let membership_expr = create_membership_predicate(
+                            &self.on_right,
+                            partition_data.pushdown.clone(),
+                            &HASH_JOIN_SEED,
+                            self.probe_schema.as_ref(),
+                        )?;
+
                         // Create bounds check expression (if bounds available)
-                        let Some(filter_expr) = create_bounds_predicate(
+                        let bounds_expr = create_bounds_predicate(
                             &self.on_right,
                             &partition_data.bounds,
-                        ) else {
-                            // No bounds available, nothing to update
-                            return Ok(());
+                        );
+
+                        // Combine membership and bounds expressions
+                        let filter_expr = match (membership_expr, bounds_expr) {
+                            (Some(membership), Some(bounds)) => {
+                                // Both available: combine with AND
+                                Arc::new(BinaryExpr::new(
+                                    bounds,
+                                    Operator::And,
+                                    membership,
+                                ))
+                                    as Arc<dyn PhysicalExpr>
+                            }
+                            (Some(membership), None) => membership,
+                            (None, Some(bounds)) => bounds,

Review Comment:
   Yes, I added a note explaining that it's easier to handle this case defensively, so we might as well (as opposed to using `unreachable!`)
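   A standalone sketch of the combination logic in the diff above, using a hypothetical `Predicate` closure type rather than DataFusion's actual `Arc<dyn PhysicalExpr>` machinery, to illustrate the defensive `(None, None)` handling being discussed:

   ```rust
   // Hypothetical simplified sketch (not DataFusion's real types): combine an
   // optional membership predicate and an optional bounds predicate with AND,
   // returning None defensively when neither is available instead of calling
   // unreachable!().
   type Predicate = Box<dyn Fn(i64) -> bool>;

   fn combine(membership: Option<Predicate>, bounds: Option<Predicate>) -> Option<Predicate> {
       match (membership, bounds) {
           // Both available: conjunction, mirroring BinaryExpr::new(bounds, And, membership)
           (Some(m), Some(b)) => Some(Box::new(move |v| b(v) && m(v))),
           (Some(m), None) => Some(m),
           (None, Some(b)) => Some(b),
           // Defensive: nothing to push down; no panic needed
           (None, None) => None,
       }
   }

   fn main() {
       let membership: Option<Predicate> = Some(Box::new(|v| v == 2 || v == 5));
       let bounds: Option<Predicate> = Some(Box::new(|v| (1..=4).contains(&v)));
       let combined = combine(membership, bounds).expect("at least one predicate");
       assert!(combined(2)); // in the set and within bounds
       assert!(!combined(5)); // in the set but out of bounds
       assert!(!combined(3)); // within bounds but not in the set
       assert!(combine(None, None).is_none()); // the defensive case
       println!("ok");
   }
   ```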



##########
datafusion/common/src/config.rs:
##########
@@ -1019,6 +1019,22 @@ config_namespace! {
         /// will be collected into a single partition
         pub hash_join_single_partition_threshold_rows: usize, default = 1024 * 128
 
+        /// Maximum size in bytes for the build side of a hash join to be pushed down as an InList expression for dynamic filtering.
+        /// Build sides larger than this will use hash table lookups instead.
+        /// Set to 0 to always use hash table lookups.
+        ///
+        /// InList pushdown can be more efficient for small build sides because it can result in better
+        /// statistics pruning as well as use any bloom filters present on the scan side.
+        /// InList expressions are also more transparent and easier to serialize over the network in distributed uses of DataFusion.
+        /// On the other hand, InList pushdown requires making a copy of the data and thus adds some overhead to the build side and uses more memory.
+        ///
+        /// This setting is per-partition, so we may end up using `hash_join_inlist_pushdown_max_size` * `target_partitions` bytes of memory in total.
+        ///
+        /// The default is 128 KiB per partition.
+        /// This should allow point lookup joins (e.g. joining on a unique primary key) to use InList pushdown in most cases
+        /// but avoids excessive memory usage or overhead for larger joins.
+        pub hash_join_inlist_pushdown_max_size: usize, default = 128 * 1024

Review Comment:
   Added a config
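   A quick back-of-the-envelope check of the worst-case memory implied by the per-partition note in the doc comment above (`hash_join_inlist_pushdown_max_size` * `target_partitions`). The partition count used here (16) is purely an illustrative assumption, not a DataFusion default:

   ```rust
   // Worst-case InList pushdown memory = per-partition cap * partition count.
   fn main() {
       let inlist_pushdown_max_size: usize = 128 * 1024; // config default: 128 KiB per partition
       let target_partitions: usize = 16; // hypothetical machine; varies by core count

       let worst_case = inlist_pushdown_max_size * target_partitions;
       assert_eq!(worst_case, 2 * 1024 * 1024); // 2 MiB total across all partitions
       println!("worst-case InList pushdown memory: {} bytes", worst_case);
   }
   ```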



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

