Copilot commented on code in PR #748:
URL: https://github.com/apache/iceberg-cpp/pull/748#discussion_r3418276495


##########
src/iceberg/util/string_util.h:
##########
@@ -128,6 +132,18 @@ class ICEBERG_EXPORT StringUtils {
     }
     return value;
   }
+
+ private:
+  // std::tolower/std::toupper require their argument to be representable as
+  // unsigned char (or EOF); passing a raw char with a non-ASCII (negative) 
value is
+  // undefined behavior, so cast through unsigned char first.
+  static char ToLowerAscii(char c) {
+    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
+  }
+
+  static char ToUpperAscii(char c) {
+    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
+  }

Review Comment:
   `ToUpperAscii` is still locale-dependent because it calls `std::toupper`. In 
non-"C" locales, bytes >= 0x80 may be transformed, potentially corrupting UTF-8 
multibyte sequences and breaking the stated "non-ASCII bytes are left 
unchanged" behavior. Use an explicit ASCII-only mapping.



##########
src/iceberg/test/string_util_test.cc:
##########
@@ -41,4 +41,26 @@ TEST(StringUtilsTest, ToUpper) {
   ASSERT_EQ(StringUtils::ToUpper("123"), "123");
 }
 
+// Non-ASCII (multibyte UTF-8) bytes have the high bit set, i.e. are negative
+// when stored in a signed char. Passing them straight to std::tolower/toupper 
is
+// undefined behavior. Conversion only touches ASCII; multibyte bytes pass 
through
+// unchanged. See https://github.com/apache/iceberg-cpp/issues/613.
+TEST(StringUtilsTest, NonAsciiPassThrough) {
+  ASSERT_EQ(StringUtils::ToLower("Naïve"), "naïve");
+  ASSERT_EQ(StringUtils::ToUpper("café"), "CAFé");
+  // Pure non-ASCII input is returned verbatim.
+  ASSERT_EQ(StringUtils::ToLower("日本語"), "日本語");
+  ASSERT_EQ(StringUtils::ToUpper("日本語"), "日本語");

Review Comment:
   These assertions use non-ASCII characters directly in ordinary string 
literals. That makes the test dependent on the compiler/source-file encoding 
(and MSVC may require `/utf-8`). Using explicit UTF-8 byte escapes makes the 
test portable while still validating the "high-bit bytes pass through 
unchanged" behavior.



##########
src/iceberg/test/string_util_test.cc:
##########
@@ -41,4 +41,26 @@ TEST(StringUtilsTest, ToUpper) {
   ASSERT_EQ(StringUtils::ToUpper("123"), "123");
 }
 
+// Non-ASCII (multibyte UTF-8) bytes have the high bit set, i.e. are negative
+// when stored in a signed char. Passing them straight to std::tolower/toupper 
is
+// undefined behavior. Conversion only touches ASCII; multibyte bytes pass 
through
+// unchanged. See https://github.com/apache/iceberg-cpp/issues/613.
+TEST(StringUtilsTest, NonAsciiPassThrough) {
+  ASSERT_EQ(StringUtils::ToLower("Naïve"), "naïve");
+  ASSERT_EQ(StringUtils::ToUpper("café"), "CAFé");
+  // Pure non-ASCII input is returned verbatim.
+  ASSERT_EQ(StringUtils::ToLower("日本語"), "日本語");
+  ASSERT_EQ(StringUtils::ToUpper("日本語"), "日本語");
+}
+
+TEST(StringUtilsTest, EqualsIgnoreCase) {
+  ASSERT_TRUE(StringUtils::EqualsIgnoreCase("AbC", "abc"));
+  ASSERT_TRUE(StringUtils::EqualsIgnoreCase("", ""));
+  ASSERT_FALSE(StringUtils::EqualsIgnoreCase("abc", "abcd"));
+  ASSERT_FALSE(StringUtils::EqualsIgnoreCase("abc", "abd"));
+  // ASCII case is folded; non-ASCII bytes are compared as-is (no UB).
+  ASSERT_TRUE(StringUtils::EqualsIgnoreCase("Café", "café"));
+  ASSERT_FALSE(StringUtils::EqualsIgnoreCase("café", "cafe"));

Review Comment:
   Same portability concern here: using non-ASCII characters directly in string 
literals makes the test dependent on source encoding. Prefer explicit UTF-8 
byte escapes so the test is stable across toolchains.



##########
src/iceberg/util/string_util.h:
##########
@@ -128,6 +132,18 @@ class ICEBERG_EXPORT StringUtils {
     }
     return value;
   }
+
+ private:
+  // std::tolower/std::toupper require their argument to be representable as
+  // unsigned char (or EOF); passing a raw char with a non-ASCII (negative) 
value is
+  // undefined behavior, so cast through unsigned char first.
+  static char ToLowerAscii(char c) {
+    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
+  }

Review Comment:
   `ToLowerAscii` is still locale-dependent because it calls `std::tolower`. In 
a non-"C" locale (e.g., ISO-8859-1), bytes >= 0x80 may be transformed, which 
can corrupt UTF-8 multibyte sequences and contradict the comment that non-ASCII 
bytes are left unchanged. Implement an explicit ASCII-only mapping instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to