Re: [PR] API, Spark: Optimize NOT IN and != predicates for single-value files [iceberg]

via GitHub Thu, 20 Nov 2025 01:38:38 -0800


joyhaldar commented on code in PR #14593:
URL: https://github.com/apache/iceberg/pull/14593#discussion_r2544081722



##########
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java:
##########
@@ -327,6 +327,29 @@ public <T> Boolean eq(Bound<T> term, Literal<T> lit) {
     public <T> Boolean notEq(Bound<T> term, Literal<T> lit) {
       // because the bounds are not necessarily a min or max value, this cannot be answered using
       // them. notEq(col, X) with (X, Y) doesn't guarantee that X is a value in col.
+      // However, when min == max and the file has no nulls or NaN values, we can safely prune
+      // if that value equals the literal.
+      int id = term.ref().fieldId();
+      if (mayContainNull(id)) {

Review Comment:
   Thank you for the suggestion, Nandor. I actually tried this initially but had to change it because of test failures.
   
   The problem is that `mayContainNaN(id)`, which would be defined as `nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;`, returns `true` when `nanCounts == null` or when the column has no entry in the map.
   
   Tests that failed with `mayContainNaN(id)`:
   - [testUnpartitionedYears()](https://github.com/apache/iceberg/blob/6e873baa7db22429fa231d552a696248f64ea1f7/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java#L440): expects 5 partitions, gets 10
   - [testUnpartitionedTruncateString()](https://github.com/apache/iceberg/blob/6e873baa7db22429fa231d552a696248f64ea1f7/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java#L763): expects 5 partitions, gets 10
   - [testUnpartitionedOr()](https://github.com/apache/iceberg/blob/6e873baa7db22429fa231d552a696248f64ea1f7/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java#L967): expects 5 partitions, gets 10
   
   These tests use timestamp/string columns, and `mayContainNaN()` returns `true` for them (either because `nanCounts == null` or because the column isn't in the map), preventing the optimization from running.
   
   The current approach checks for NaN in two ways:
   1. `NaNUtil.isNaN(bounds)` - returns `false` for timestamps/strings (they can't be NaN)
   2. `nanCounts.get(id) != 0` - evaluated only when NaN stats actually exist for the column
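
   A minimal sketch of the difference between the two checks, under stated assumptions: `NanCheckSketch`, `mayContainNaNStrict`, `mayContainNaNLenient`, and the local `isNaN` helper are hypothetical names for illustration and are not the code in this PR or in `InclusiveMetricsEvaluator`. The strict variant follows the expression quoted above; the lenient variant is one reading of the two-step check described in the list.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual Iceberg implementation.
class NanCheckSketch {

  // Strict check, as quoted in the comment: missing stats count as "may contain NaN".
  static boolean mayContainNaNStrict(Map<Integer, Long> nanCounts, int id) {
    return nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;
  }

  // Two-step check: (1) are the bounds themselves NaN, (2) is the NaN count
  // non-zero, consulted only when a count exists for the column.
  static boolean mayContainNaNLenient(
      Map<Integer, Long> nanCounts, int id, Object lower, Object upper) {
    if (isNaN(lower) || isNaN(upper)) {
      return true;
    }
    return nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0;
  }

  // Only floating-point values can be NaN; strings and timestamps never are.
  static boolean isNaN(Object value) {
    if (value instanceof Double) {
      return ((Double) value).isNaN();
    }
    if (value instanceof Float) {
      return ((Float) value).isNaN();
    }
    return false;
  }

  public static void main(String[] args) {
    // A string column (field id 1) with min == max and no NaN stats at all.
    Map<Integer, Long> noStats = null;
    System.out.println("strict=" + mayContainNaNStrict(noStats, 1));
    System.out.println("lenient=" + mayContainNaNLenient(noStats, 1, "a", "a"));

    // A double column (field id 2) with a recorded NaN count of 3.
    Map<Integer, Long> stats = new HashMap<>();
    stats.put(2, 3L);
    System.out.println("lenient=" + mayContainNaNLenient(stats, 2, 1.0d, 1.0d));
  }
}
```

   For the string column without NaN stats, the strict check returns `true` and blocks the pruning path, while the two-step check returns `false`; for the double column with a non-zero NaN count, the two-step check still returns `true`.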



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

