Dandandan commented on code in PR #19932:
URL: https://github.com/apache/datafusion/pull/19932#discussion_r2759396181


##########
docs/source/user-guide/configs.md:
##########
@@ -161,6 +161,8 @@ The following configuration settings are available:
 | datafusion.optimizer.hash_join_single_partition_threshold_rows          | 131072                    | The maximum estimated size in rows for one input side of a HashJoin will be collected into a single partition |
 | datafusion.optimizer.hash_join_inlist_pushdown_max_size                 | 131072                    | Maximum size in bytes for the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides larger than this will use hash table lookups instead. Set to 0 to always use hash table lookups. InList pushdown can be more efficient for small build sides because it can result in better statistics pruning as well as use any bloom filters present on the scan side. InList expressions are also more transparent and easier to serialize over the network in distributed uses of DataFusion. On the other hand InList pushdown requires making a copy of the data and thus adds some overhead to the build side and uses more memory. This setting is per-partition, so we may end up using `hash_join_inlist_pushdown_max_size` \* `target_partitions` memory. The default is 128kB per partition. This should allow point lookup joins (e.g. joining on a unique primary key) to use InList pushdown in most cases but avoids excessive memory usage or overhead for larger joins. |
 | datafusion.optimizer.hash_join_inlist_pushdown_max_distinct_values      | 150                       | Maximum number of distinct values (rows) in the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides with more rows than this will use hash table lookups instead. Set to 0 to always use hash table lookups. This provides an additional limit beyond `hash_join_inlist_pushdown_max_size` to prevent very large IN lists that might not provide much benefit over hash table lookups. This uses the deduplicated row count once the build side has been evaluated. The default is 150 values per partition. This is inspired by Trino's `max-filter-keys-per-column` setting. See: <https://trino.io/docs/current/admin/dynamic-filtering.html#dynamic-filter-collection-thresholds> |
+| datafusion.optimizer.hash_join_map_pushdown                             | true                      | When true, pushes down hash table references for membership checks in hash joins when the build side is too large for InList pushdown. When false, no membership filter is created when InList thresholds are exceeded. Default: true |
+| datafusion.optimizer.hash_join_bounds_pushdown                          | true                      | When true, pushes down min/max bounds for join key columns. This enables statistics-based pruning (e.g., Parquet row group skipping). When false, only membership filters (InList or Map) are pushed down. Default: true |

Review Comment:
   Yes - I think as long as it's not evaluating against each row it shouldn't 
be very expensive (or we can optimize it easily).
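
   For reference, here is a rough sketch of how the four settings in this hunk
   appear to interact, going only by the descriptions in the table. It is a toy
   model, not DataFusion code: `PushdownConfig`, `choose_filters`, and the
   string labels are hypothetical names, and it assumes bounds pushdown is
   decided independently of the membership filter.

```python
from dataclasses import dataclass


@dataclass
class PushdownConfig:
    # Hypothetical stand-ins for the datafusion.optimizer.hash_join_* settings,
    # using the defaults listed in the table above.
    inlist_max_size: int = 131072          # bytes, per partition
    inlist_max_distinct_values: int = 150  # deduplicated rows, per partition
    map_pushdown: bool = True
    bounds_pushdown: bool = True


def choose_filters(build_bytes: int, distinct_rows: int, cfg: PushdownConfig) -> list[str]:
    """Return the dynamic filters this toy model would push down for one partition."""
    filters = []

    # A threshold of 0 disables InList pushdown entirely ("always use hash table lookups").
    inlist_ok = (
        cfg.inlist_max_size > 0
        and cfg.inlist_max_distinct_values > 0
        and build_bytes <= cfg.inlist_max_size
        and distinct_rows <= cfg.inlist_max_distinct_values
    )

    if inlist_ok:
        filters.append("inlist")
    elif cfg.map_pushdown:
        # Build side too large for InList: fall back to a hash table reference.
        filters.append("map")
    # Otherwise no membership filter is created.

    # Assumption: min/max bounds are pushed regardless of the membership choice.
    if cfg.bounds_pushdown:
        filters.append("bounds")

    return filters
```

   With the defaults, a small build side (say 1 KiB, 10 distinct keys) gets
   `inlist` plus `bounds`; 1000 distinct keys falls back to `map` plus
   `bounds`; and with `map_pushdown` off only `bounds` remains.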



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

