akravchukdremio commented on code in PR #50187:
URL: https://github.com/apache/arrow/pull/50187#discussion_r3444127160


##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1895,8 +1864,7 @@ const char* replace_with_max_len_utf8_utf8_utf8(gdv_int64 
context, const char* t
   for (; text_index <= text_len - from_str_len;) {
     if (memcmp(text + text_index, from_str, from_str_len) == 0) {
       if (out_index + text_index - last_match_index + to_str_len > max_length) 
{

Review Comment:
   As far as C++ evaluates this expression left-to-right, `out_index + 
text_index` can overflow before subtracting `last_match_index`, even when the 
final prospective output length is legal. This matters now, because before this 
PR, the wrapper function always passed `max_length = 65535`, but now the 
wrapper function may pass `max_length` close to `INT_MAX`, so the now we can 
have here large `out_index` values. To fix it we can use intermediate variable 
with result type of `gdv_int64` and casting all operands to this type (e.g. 
`static_cast<gdv_int64>(out_index)`). Or we can just do subtraction in first 
order in such way:
   
   
   ```suggestion
         if (out_index + (text_index - last_match_index) + to_str_len > 
max_length) {
   ```
   
   But maybe this is a very rare case and we can ignore it (?)



##########
cpp/src/gandiva/precompiled/string_ops_test.cc:
##########
@@ -1971,6 +1971,62 @@ TEST(TestStringOps, TestReplace) {
   EXPECT_EQ(std::string(out_str, out_len), "TestString");
   EXPECT_FALSE(ctx.has_error());
 
+  // Large output (>64 KB) must not overflow: buffer is sized to the exact 
result.
+  std::string large_in(35000, 'X');
+  std::string large_expected(70000, '\0');
+  for (int i = 0; i < 35000; ++i) {
+    large_expected[2 * i] = 'X';
+    large_expected[2 * i + 1] = 'Y';
+  }
+  out_str = replace_utf8_utf8_utf8(ctx_ptr, large_in.data(),
+                                   static_cast<int32_t>(large_in.size()), "X", 
1, "XY", 2,
+                                   &out_len);
+  EXPECT_EQ(out_len, 70000);
+  EXPECT_EQ(std::string(out_str, out_len), large_expected);
+  EXPECT_FALSE(ctx.has_error());
+
+  // Large shrinking output ("XX" -> "X") on a >64 KB input.
+  std::string large_shrink_in(70000, 'X');
+  std::string large_shrink_expected(35000, 'X');
+  out_str = replace_utf8_utf8_utf8(ctx_ptr, large_shrink_in.data(),
+                                   
static_cast<int32_t>(large_shrink_in.size()), "XX", 2,
+                                   "X", 1, &out_len);
+  EXPECT_EQ(out_len, 35000);
+  EXPECT_EQ(std::string(out_str, out_len), large_shrink_expected);
+  EXPECT_FALSE(ctx.has_error());
+
+  // Edge case: result size of exactly 0 (every byte of text is removed). Takes
+  // the no-scan shrink path (to_str_len <= from_str_len).
+  out_str = replace_utf8_utf8_utf8(ctx_ptr, "aaaa", 4, "a", 1, "", 0, 
&out_len);
+  EXPECT_EQ(out_len, 0);
+  EXPECT_EQ(std::string(out_str, out_len), "");
+  EXPECT_FALSE(ctx.has_error());
+
+  // Edge case: result size one past the INT_MAX boundary. 65536 single-char
+  // matches each expanding to 32768 bytes gives max_length = 65536 * 32768 =
+  // 2^31 = INT_MAX + 1, so it is reported cleanly (guard fires before any 
alloc).

Review Comment:
   nit: we can add test case when we have just max_length = `INT_MAX`, so code 
should proceed without returning error



##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1946,9 +1914,41 @@ const char* replace_utf8_utf8_utf8(gdv_int64 context, 
const char* text,
                                    gdv_int32 text_len, const char* from_str,
                                    gdv_int32 from_str_len, const char* to_str,
                                    gdv_int32 to_str_len, gdv_int32* out_len) {
+  // Size the output buffer to the exact result, so large results are not 
capped
+  // by an arbitrary limit. When the replacement is no longer than the matched
+  // text, the result can only shrink or stay the same, so text_len is a safe
+  // bound and we can skip scanning. Otherwise count non-overlapping matches to
+  // get the exact expanded size.
+  gdv_int64 max_length;
+  if (to_str_len <= from_str_len) {
+    max_length = text_len;
+  } else {
+    gdv_int64 num_matches = 0;
+    if (from_str_len > 0 && from_str_len <= text_len) {
+      for (gdv_int32 i = 0; i <= text_len - from_str_len;) {
+        if (memcmp(text + i, from_str, from_str_len) == 0) {
+          num_matches++;
+          i += from_str_len;
+        } else {
+          i++;
+        }
+      }
+    }
+    max_length =

Review Comment:
   nit: if `num_matches == 0`, we can do early return to avoid calling helper 
method:
   
   ```cpp
   if (num_matches == 0) {
     *out_len = text_len;
     return text;
   }
   ```



##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1931,7 +1899,7 @@ const char* replace_with_max_len_utf8_utf8_utf8(gdv_int64 
context, const char* t
   }
 
   if (out_index + text_len - last_match_index > max_length) {

Review Comment:
   Same here as above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to