stevenzwu commented on code in PR #14500:
URL: https://github.com/apache/iceberg/pull/14500#discussion_r2791601942
##########
api/src/main/java/org/apache/iceberg/expressions/Literals.java:
##########
@@ -622,10 +635,23 @@ public <T> Literal<T> to(Type type) {
return null;
}
+ @Override
+ public Comparator<UUID> comparator() {
+ return useSignedComparator ? SIGNED_CMP : RFC_CMP;
+ }
+
@Override
protected Type.TypeID typeId() {
return Type.TypeID.UUID;
}
+
+ /**
+ * Creates a new UUIDLiteral with the signed comparator for backward compatibility with files
+ * written before RFC-compliant UUID comparisons were introduced.
+ */
+ UUIDLiteral withSignedComparator() {
Review Comment:
nit: rename to `useSignedComparator`, to be consistent with the other names in this PR?
##########
api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java:
##########
@@ -74,22 +86,37 @@ private ManifestEvaluator(PartitionSpec spec, Expression partitionFilter, boolea
* @return false if the file cannot contain rows that match the expression, true otherwise.
*/
public boolean eval(ManifestFile manifest) {
- return new ManifestEvalVisitor().eval(manifest);
+ boolean result = new ManifestEvalVisitor().eval(manifest, false);
+
+ // If the RFC-compliant evaluation says rows might match, or there's no signed UUID expression,
+ // return the result.
+ if (result || signedUuidExpr == null) {
+ return result;
+ }
+
+ // Always try with signed UUID comparator as a fallback. There is no reliable way to detect
+ // which comparator was used when the manifest's partition field summaries were written.
+ return new ManifestEvalVisitor().eval(manifest, true);
}
private static final boolean ROWS_MIGHT_MATCH = true;
private static final boolean ROWS_CANNOT_MATCH = false;
private class ManifestEvalVisitor extends BoundExpressionVisitor<Boolean> {
private List<PartitionFieldSummary> stats = null;
+ // Flag to use signed UUID comparator for backward compatibility.
+ // This is needed for the IN predicate because the comparator information is lost
Review Comment:
Is it only for the `IN` predicate?
##########
api/src/main/java/org/apache/iceberg/expressions/Literals.java:
##########
@@ -608,9 +608,22 @@ public String toString() {
}
}
- static class UUIDLiteral extends ComparableLiteral<UUID> {
+ static class UUIDLiteral extends BaseLiteral<UUID> {
+ private static final Comparator<UUID> RFC_CMP =
Review Comment:
nit: rename to `UNSIGNED_CMP` and add a comment explaining that the unsigned comparator is
the RFC-compliant one. Also add a reference to the exact RFC number.
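For readers following the naming discussion, the difference between the two orderings can be shown with a minimal sketch. It is not the PR's code: the class name `UuidComparators` is hypothetical, and the comparators are written out from the two 64-bit halves of a `java.util.UUID` on the assumption that "signed" means comparing the halves as signed longs and "unsigned" means comparing them as unsigned values.

```java
import java.util.Comparator;
import java.util.UUID;

// Hypothetical helper, for illustration only; not part of the PR.
final class UuidComparators {
  // Signed order: the two 64-bit halves are compared as signed longs,
  // so a UUID whose top bit is set sorts before one whose top bit is clear.
  static final Comparator<UUID> SIGNED_CMP =
      (a, b) -> {
        int cmp = Long.compare(a.getMostSignificantBits(), b.getMostSignificantBits());
        return cmp != 0
            ? cmp
            : Long.compare(a.getLeastSignificantBits(), b.getLeastSignificantBits());
      };

  // Unsigned order: the halves are compared as unsigned values, which matches
  // the byte-by-byte order of the 16-byte big-endian representation.
  static final Comparator<UUID> UNSIGNED_CMP =
      (a, b) -> {
        int cmp = Long.compareUnsigned(a.getMostSignificantBits(), b.getMostSignificantBits());
        return cmp != 0
            ? cmp
            : Long.compareUnsigned(a.getLeastSignificantBits(), b.getLeastSignificantBits());
      };

  private UuidComparators() {}
}
```

The two orderings disagree only when the compared halves differ in their top bit, which is why min/max stats written under one ordering can mislead an evaluator using the other.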
##########
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java:
##########
@@ -74,7 +85,17 @@ public InclusiveMetricsEvaluator(Schema schema, Expression unbound, boolean case
*/
public boolean eval(ContentFile<?> file) {
// TODO: detect the case where a column is missing from the file using file's max field id.
- return new MetricsEvalVisitor().eval(file);
+ boolean result = new MetricsEvalVisitor().eval(file, false);
+
+ // If the RFC-compliant evaluation says rows might match, or there's no signed UUID expression,
+ // return the result.
+ if (result || signedUuidExpr == null) {
+ return result;
+ }
+
+ // Always try with signed UUID comparator as a fallback. There is no reliable way to detect
+ // which comparator was used when the file's column metrics were written.
Review Comment:
nit: maybe the comment could be a little clearer, e.g.
```
whether signed or unsigned comparator was used when the UUID column stats were written.
```
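A minimal sketch of why the fallback pass matters, assuming the stats are simple `[lower, upper]` bounds and the predicate is an equality check. The class `UuidBoundsFallback` and its methods are hypothetical and do not mirror the evaluator's real structure; they only show the "try unsigned first, then signed" decision.

```java
import java.util.Comparator;
import java.util.UUID;

// Hypothetical illustration of the two-pass check; not the PR's implementation.
final class UuidBoundsFallback {
  static final Comparator<UUID> SIGNED =
      (a, b) -> {
        int cmp = Long.compare(a.getMostSignificantBits(), b.getMostSignificantBits());
        return cmp != 0
            ? cmp
            : Long.compare(a.getLeastSignificantBits(), b.getLeastSignificantBits());
      };

  static final Comparator<UUID> UNSIGNED =
      (a, b) -> {
        int cmp = Long.compareUnsigned(a.getMostSignificantBits(), b.getMostSignificantBits());
        return cmp != 0
            ? cmp
            : Long.compareUnsigned(a.getLeastSignificantBits(), b.getLeastSignificantBits());
      };

  // True if value could lie within [lower, upper] under the given ordering.
  static boolean inBounds(UUID value, UUID lower, UUID upper, Comparator<UUID> cmp) {
    return cmp.compare(lower, value) <= 0 && cmp.compare(value, upper) <= 0;
  }

  // Rows might match if either ordering admits the value, because the reader
  // cannot tell which ordering produced the stored bounds.
  static boolean mightMatchEq(UUID value, UUID lower, UUID upper) {
    if (inBounds(value, lower, upper, UNSIGNED)) {
      return true;
    }
    return inBounds(value, lower, upper, SIGNED);
  }

  private UuidBoundsFallback() {}
}
```

With bounds written under signed ordering, such as a lower bound whose most significant half is `0x8000000000000000` and an upper bound whose most significant half is `0x1`, a value with half `0x9000000000000000` fails the unsigned check but passes the signed one, so the file is kept rather than wrongly pruned.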
##########
api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java:
##########
@@ -74,22 +86,37 @@ private ManifestEvaluator(PartitionSpec spec, Expression partitionFilter, boolea
* @return false if the file cannot contain rows that match the expression, true otherwise.
*/
public boolean eval(ManifestFile manifest) {
- return new ManifestEvalVisitor().eval(manifest);
+ boolean result = new ManifestEvalVisitor().eval(manifest, false);
+
+ // If the RFC-compliant evaluation says rows might match, or there's no signed UUID expression,
+ // return the result.
+ if (result || signedUuidExpr == null) {
+ return result;
+ }
+
+ // Always try with signed UUID comparator as a fallback. There is no reliable way to detect
+ // which comparator was used when the manifest's partition field summaries were written.
Review Comment:
nit: `which comparator` --> `whether signed or unsigned comparator`
##########
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java:
##########
@@ -86,8 +107,12 @@ private class MetricsEvalVisitor extends ExpressionVisitors.BoundVisitor<Boolean
private Map<Integer, Long> nanCounts = null;
private Map<Integer, ByteBuffer> lowerBounds = null;
private Map<Integer, ByteBuffer> upperBounds = null;
+ // Flag to use signed UUID comparator for backward compatibility.
+ // This is needed for the IN predicate because the comparator information is lost
+ // when binding converts literals to a Set<T> of raw values.
+ private boolean useSignedUuidComparator = false;
- private boolean eval(ContentFile<?> file) {
+ private boolean eval(ContentFile<?> file, boolean signedUuidMode) {
Review Comment:
Can we pass the `Expression` in as the 2nd arg? That would be a little
cleaner, and we could make this inner class `static`.
Do we need the boolean flag because the `IN` evaluation only deals with a
`literalSet` of raw values (not the `Literal` wrapper)? I wonder if we can fix
that first. There is probably a valid reason why
`UnboundPredicate#bindInOperation` converts `List<Literal<T>>` to
`List<T>` raw values. Maybe @rdblue would know.
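To make the question concrete, here is a minimal sketch of how a comparator can get lost when literal wrappers are reduced to raw values. It is an assumption-laden illustration, not Iceberg's binding code: `InSetSketch`, `Lit`, `toRawSet`, and `anyInBounds` are all hypothetical names.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; these types do not exist in Iceberg.
final class InSetSketch {
  // Stand-in for a literal wrapper: a value plus the comparator to order it with.
  static final class Lit<T> {
    final T value;
    final Comparator<T> cmp;

    Lit(T value, Comparator<T> cmp) {
      this.value = value;
      this.cmp = cmp;
    }
  }

  // Unwrapping keeps only the values; the comparator each literal carried is dropped.
  static <T> Set<T> toRawSet(List<Lit<T>> literals) {
    Set<T> raw = new HashSet<>();
    for (Lit<T> lit : literals) {
      raw.add(lit.value);
    }
    return raw;
  }

  // With only raw values left, a bounds check for IN has to be told which
  // ordering to use from outside, e.g. via a flag or an explicit comparator.
  static <T> boolean anyInBounds(Set<T> values, T lower, T upper, Comparator<T> cmp) {
    for (T value : values) {
      if (cmp.compare(lower, value) <= 0 && cmp.compare(value, upper) <= 0) {
        return true;
      }
    }
    return false;
  }

  private InSetSketch() {}
}
```

In this sketch the explicit `cmp` parameter plays the role that the boolean flag plays in the diff. Keeping the wrapper in the set would let each literal carry its own ordering instead.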