Re: [PR] feat: optimize truncate(col) == value to startsWith predicate [iceberg-cpp]

via GitHub Thu, 06 Nov 2025 20:55:04 -0800


wgtmac commented on code in PR #289:
URL: https://github.com/apache/iceberg-cpp/pull/289#discussion_r2501653362



##########
src/iceberg/expression/predicate.cc:
##########
@@ -143,6 +145,27 @@ bool IsFloatingType(TypeId type) {
   return type == TypeId::kFloat || type == TypeId::kDouble;
 }
 
+/// \brief Count the number of UTF-8 code points in a string.
+/// This matches the behavior of TruncateUtils::TruncateUTF8.
+///
+/// NOTE: This counts code points, not grapheme clusters (user-perceived 
characters).
+/// Per the Iceberg spec, combining marks count as separate code points.

Review Comment:
   Iceberg spec does not have this. We should not mislead readers.



##########
src/iceberg/expression/predicate.cc:
##########
@@ -143,6 +145,27 @@ bool IsFloatingType(TypeId type) {
   return type == TypeId::kFloat || type == TypeId::kDouble;
 }
 
+/// \brief Count the number of UTF-8 code points in a string.
+/// This matches the behavior of TruncateUtils::TruncateUTF8.
+///
+/// NOTE: This counts code points, not grapheme clusters (user-perceived 
characters).
+/// Per the Iceberg spec, combining marks count as separate code points.
+/// Example: "é" as e + combining-acute (U+0065 U+0301) = 2 code points,
+///          but "é" as single precomposed character (U+00E9) = 1 code point.
+///
+/// \param str The UTF-8 encoded string
+/// \return The number of code points (not bytes, not graphemes)
+inline int32_t CountUTF8CodePoints(std::string_view str) {

Review Comment:
   Perhaps this should be moved to `string_util.h` or `truncate_util.h`



##########
src/iceberg/expression/predicate.cc:
##########
@@ -246,7 +269,69 @@ Result<std::shared_ptr<Expression>> 
UnboundPredicate<B>::BindLiteralOperation(
     }
   }
 
-  // TODO(gangwu): translate truncate(col) == value to startsWith(value)
+  // Optimize: translate truncate(col, width) == value to col startsWith(value)
+  // This optimization allows better predicate pushdown and index usage
+  // IMPORTANT: Only valid when literal has exactly `width` UTF-8 code points
+  //
+  // NOTE: This rewrite is safe because:
+  // - Iceberg string comparisons are binary (byte-for-byte), no collation
+  // - STARTS_WITH uses the same binary comparison semantics as equality
+  // - truncate(col, w) == "value" ⟺ col STARTS_WITH "value" when len(value) 
== w
+  // - When source has < w code points, truncate returns full string; equality
+  //   implies exact match, so STARTS_WITH remains valid (short-string 
invariance)
+  if (BASE::op() == Expression::Operation::kEq &&
+      bound_term->kind() == Term::Kind::kTransform) {
+    // Safe to cast after kind check confirms it's a transform
+    auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get());
+    if (!transform_term) {
+      // Should never happen after kind check, but be defensive
+      return std::make_shared<BoundLiteralPredicate>(BASE::op(), 
std::move(bound_term),
+                                                     std::move(literal));
+    }
+
+    if (transform_term->transform()->transform_type() == 
TransformType::kTruncate &&
+        literal.type()->type_id() == TypeId::kString &&
+        !literal.IsNull()) {  // Null safety: skip null literals
+
+      // Extract width parameter using type-safe API
+      auto width_opt = transform_term->transform()->param();

Review Comment:
   I think we can greatly simplify this line and below.
   
   We can add `transform_func` like below.
   
   ```cpp
   class ICEBERG_EXPORT BoundTransform : public BoundTerm {
    public:
     const std::shared_ptr<TransformFunction>& transform_func() const { return 
transform_func_; }
   };
   ```
   
   Then we just need to call 
`transform_term->transform_func()->Transform(literal)` and check if its return 
value is same as `literal`. We don't even need `CountUTF8CodePoints` above.



##########
src/iceberg/transform.h:
##########
@@ -141,6 +141,19 @@ class ICEBERG_EXPORT Transform : public util::Formattable {
   /// \brief Returns the transform type.
   TransformType transform_type() const;
 
+  /// \brief Returns the optional parameter for parameterized transforms.
+  ///
+  /// For transforms like bucket(N) or truncate(W), returns the parameter 
value.
+  /// For non-parameterized transforms (identity, year, etc.), returns 
std::nullopt.
+  ///
+  /// \return The parameter if present, otherwise std::nullopt
+  std::optional<int32_t> param() const {

Review Comment:
   Can we remove this function? This makes it harder to evolve the variant in 
the future.



##########
src/iceberg/expression/predicate.cc:
##########
@@ -246,7 +269,69 @@ Result<std::shared_ptr<Expression>> 
UnboundPredicate<B>::BindLiteralOperation(
     }
   }
 
-  // TODO(gangwu): translate truncate(col) == value to startsWith(value)
+  // Optimize: translate truncate(col, width) == value to col startsWith(value)
+  // This optimization allows better predicate pushdown and index usage
+  // IMPORTANT: Only valid when literal has exactly `width` UTF-8 code points
+  //
+  // NOTE: This rewrite is safe because:
+  // - Iceberg string comparisons are binary (byte-for-byte), no collation
+  // - STARTS_WITH uses the same binary comparison semantics as equality
+  // - truncate(col, w) == "value" ⟺ col STARTS_WITH "value" when len(value) 
== w
+  // - When source has < w code points, truncate returns full string; equality
+  //   implies exact match, so STARTS_WITH remains valid (short-string 
invariance)
+  if (BASE::op() == Expression::Operation::kEq &&
+      bound_term->kind() == Term::Kind::kTransform) {
+    // Safe to cast after kind check confirms it's a transform
+    auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get());
+    if (!transform_term) {
+      // Should never happen after kind check, but be defensive
+      return std::make_shared<BoundLiteralPredicate>(BASE::op(), 
std::move(bound_term),
+                                                     std::move(literal));
+    }

Review Comment:
   ```suggestion
       auto* transform_term = 
internal::checked_cast<BoundTransform*>(bound_term.get());
   ```
   
   Let's reuse the existing pattern for cast.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: optimize truncate(col) == value to startsWith predicate [iceberg-cpp]

Reply via email to