Fokko commented on code in PR #2943:
URL: https://github.com/apache/iceberg-python/pull/2943#discussion_r2835060643


##########
pyiceberg/table/upsert_util.py:
##########
@@ -23,16 +23,38 @@
 
 from pyiceberg.expressions import (
     AlwaysFalse,
+    AlwaysTrue,
+    And,
     BooleanExpression,
     EqualTo,
+    GreaterThanOrEqual,
     In,
+    LessThanOrEqual,
     Or,
 )
 
+# Threshold for switching from In() predicate to range-based or no filter.
+# When unique keys exceed this, the In() predicate becomes too expensive to process.
+LARGE_FILTER_THRESHOLD = 10_000
+
+# Minimum density (ratio of unique values to range size) for range filter to be effective.
+# Below this threshold, range filters read too much irrelevant data.
+DENSITY_THRESHOLD = 0.1

Review Comment:
   I think these thresholds are problematic: when you cross one, you'll get abruptly different runtime behavior. Maybe we should instead add an option to rewrite `In()` to range-based filtering. Keep in mind that `In` is problematic in Arrow: https://github.com/apache/arrow/issues/36283
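
   For context, the heuristic the diff's two constants encode can be sketched in plain Python. `choose_filter_strategy` below is a hypothetical helper (not part of pyiceberg): it picks `In()` for small key sets, a min/max range filter for large dense ones, and no pre-filter for large sparse ones — which is exactly where the abrupt runtime change at each threshold comes from.

   ```python
   # Sketch of the density heuristic described by the constants in the diff.
   # choose_filter_strategy is illustrative only, not pyiceberg's API.

   LARGE_FILTER_THRESHOLD = 10_000
   DENSITY_THRESHOLD = 0.1

   def choose_filter_strategy(unique_count: int, lo: int, hi: int) -> str:
       """Return 'in', 'range', or 'none' for a set of integer keys in [lo, hi]."""
       if unique_count <= LARGE_FILTER_THRESHOLD:
           return "in"  # small enough for an In() predicate
       range_size = hi - lo + 1
       density = unique_count / range_size
       if density >= DENSITY_THRESHOLD:
           # dense keys: GreaterThanOrEqual(lo) AND LessThanOrEqual(hi)
           return "range"
       # sparse keys over a huge range: a range filter reads mostly
       # irrelevant data, so skip pre-filtering (AlwaysTrue)
       return "none"
   ```

   Note how one extra key around either boundary flips the strategy, and with it the amount of data scanned.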



##########
pyiceberg/table/upsert_util.py:
##########
@@ -23,16 +23,38 @@
 
 from pyiceberg.expressions import (
     AlwaysFalse,
+    AlwaysTrue,
+    And,
     BooleanExpression,
     EqualTo,
+    GreaterThanOrEqual,
     In,
+    LessThanOrEqual,
     Or,
 )
 
+# Threshold for switching from In() predicate to range-based or no filter.
+# When unique keys exceed this, the In() predicate becomes too expensive to process.
+LARGE_FILTER_THRESHOLD = 10_000
+
+# Minimum density (ratio of unique values to range size) for range filter to be effective.
+# Below this threshold, range filters read too much irrelevant data.
+DENSITY_THRESHOLD = 0.1
+
 
 def create_match_filter(df: pyarrow_table, join_cols: list[str]) -> BooleanExpression:
+    """
+    Create an Iceberg BooleanExpression filter that exactly matches rows based on join columns.
+
+    For single-column keys, uses an efficient In() predicate.
+    For composite keys, creates Or(And(...), And(...), ...) for exact row matching.
+    This function should be used when exact matching is required (e.g., overwrite, insert filtering).
+    """
     unique_keys = df.select(join_cols).group_by(join_cols).aggregate([])
 
+    if len(unique_keys) == 0:
+        return AlwaysFalse()

Review Comment:
   Should we remove line 65?
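
   The `Or(And(EqualTo(...), ...))` shape the docstring describes for composite keys can be illustrated without pyiceberg. The sketch below uses plain tuples as stand-ins for the expression classes, and mirrors the early return of `AlwaysFalse()` when there are no unique keys; `composite_match_filter` is a hypothetical name, not the PR's actual function.

   ```python
   # Stand-in illustration of the composite-key filter shape from the diff.
   # Tuples play the role of pyiceberg's Or/And/EqualTo/AlwaysFalse classes.

   def composite_match_filter(rows: list[dict], join_cols: list[str]) -> tuple:
       """Build ('or', [('and', [('eq', col, val), ...]), ...]) for exact row matching."""
       if not rows:
           # mirrors the AlwaysFalse() early return for an empty key set
           return ("always_false",)
       terms = [
           ("and", [("eq", col, row[col]) for col in join_cols])
           for row in rows
       ]
       return ("or", terms)
   ```

   One `And` term per unique key row is what makes this exact, and also what makes it expensive for large key sets — hence the `In()`/range discussion above.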



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
