joyhaldar commented on code in PR #14593:
URL: https://github.com/apache/iceberg/pull/14593#discussion_r2544081722
##########
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java:
##########
@@ -327,6 +327,29 @@ public <T> Boolean eq(Bound<T> term, Literal<T> lit) {
public <T> Boolean notEq(Bound<T> term, Literal<T> lit) {
// because the bounds are not necessarily a min or max value, this cannot be answered using
// them. notEq(col, X) with (X, Y) doesn't guarantee that X is a value in col.
+ // However, when min == max and the file has no nulls or NaN values, we can safely prune
+ // if that value equals the literal.
+ int id = term.ref().fieldId();
+ if (mayContainNull(id)) {
Review Comment:
Thank you for the suggestion, Nandor. I tried this initially but had to
change it because of test failures.
The problem is that `mayContainNaN(id)`, which would be defined as
`nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;`,
returns `true` when `nanCounts == null` or when the column has no entry in
the map.
Tests that failed with `mayContainNaN(id)`:
- [testUnpartitionedYears()](https://github.com/apache/iceberg/blob/6e873baa7db22429fa231d552a696248f64ea1f7/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java#L440): expects 5 partitions, gets 10
- [testUnpartitionedTruncateString()](https://github.com/apache/iceberg/blob/6e873baa7db22429fa231d552a696248f64ea1f7/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java#L763): expects 5 partitions, gets 10
- [testUnpartitionedOr()](https://github.com/apache/iceberg/blob/6e873baa7db22429fa231d552a696248f64ea1f7/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java#L967): expects 5 partitions, gets 10
These tests use timestamp/string columns, and `mayContainNaN()` returns
`true` for them (either because `nanCounts == null` or the column isn't in the
map), preventing the optimization from running.
The current approach checks for NaN in two ways:
1. `NaNUtil.isNaN(bounds)` - returns `false` for timestamps/strings (they
can't be NaN)
2. `nanCounts.get(id) != 0` - only evaluated when NaN stats actually exist
for the column
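
To make the difference concrete, here is a minimal standalone sketch of the two NaN-count checks. It is not the PR code: the class `NaNCheckSketch` and the method `hasRecordedNaN` are hypothetical names used only for illustration, the `Map<Integer, Long>` type for `nanCounts` is an assumption, and the bounds-based `NaNUtil.isNaN` check is left out.

```java
import java.util.HashMap;
import java.util.Map;

public class NaNCheckSketch {
  // Strict variant quoted above: missing stats count as "may contain NaN".
  public static boolean mayContainNaN(Map<Integer, Long> nanCounts, int id) {
    return nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;
  }

  // Hypothetical lenient variant: reports NaN only when a non-zero count is recorded.
  public static boolean hasRecordedNaN(Map<Integer, Long> nanCounts, int id) {
    if (nanCounts == null) {
      return false;
    }
    Long count = nanCounts.get(id);
    return count != null && count != 0;
  }

  public static void main(String[] args) {
    Map<Integer, Long> nanCounts = new HashMap<>();
    nanCounts.put(1, 0L); // float column with stats, no NaN values
    nanCounts.put(2, 3L); // float column with stats, three NaN values
    // Field 3 stands in for a timestamp/string column with no NaN stats entry.

    for (int id = 1; id <= 3; id++) {
      System.out.println("field " + id
          + ": mayContainNaN=" + mayContainNaN(nanCounts, id)
          + ", hasRecordedNaN=" + hasRecordedNaN(nanCounts, id));
    }
  }
}
```

For field 3 the strict check returns `true` and blocks the pruning path, while the lenient check returns `false`, which matches the behavior described for the timestamp/string columns in the failing tests.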
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]