akravchukdremio commented on code in PR #50187:
URL: https://github.com/apache/arrow/pull/50187#discussion_r3444127160
##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1895,8 +1864,7 @@ const char* replace_with_max_len_utf8_utf8_utf8(gdv_int64
context, const char* t
for (; text_index <= text_len - from_str_len;) {
if (memcmp(text + text_index, from_str, from_str_len) == 0) {
if (out_index + text_index - last_match_index + to_str_len > max_length)
{
Review Comment:
As far as C++ evaluates this expression left-to-right, `out_index +
text_index` can overflow before subtracting `last_match_index`, even when the
final prospective output length is legal. This matters now, because before this
PR, the wrapper function always passed `max_length = 65535`, but now the
wrapper function may pass `max_length` close to `INT_MAX`, so the now we can
have here large `out_index` values. To fix it we can use intermediate variable
with result type of `gdv_int64` and casting all operands to this type (e.g.
`static_cast<gdv_int64>(out_index)`). Or we can just do subtraction in first
order in such way:
```suggestion
if (out_index + (text_index - last_match_index) + to_str_len >
max_length) {
```
But maybe this is a very rare case and we can ignore it (?)
##########
cpp/src/gandiva/precompiled/string_ops_test.cc:
##########
@@ -1971,6 +1971,62 @@ TEST(TestStringOps, TestReplace) {
EXPECT_EQ(std::string(out_str, out_len), "TestString");
EXPECT_FALSE(ctx.has_error());
+ // Large output (>64 KB) must not overflow: buffer is sized to the exact
result.
+ std::string large_in(35000, 'X');
+ std::string large_expected(70000, '\0');
+ for (int i = 0; i < 35000; ++i) {
+ large_expected[2 * i] = 'X';
+ large_expected[2 * i + 1] = 'Y';
+ }
+ out_str = replace_utf8_utf8_utf8(ctx_ptr, large_in.data(),
+ static_cast<int32_t>(large_in.size()), "X",
1, "XY", 2,
+ &out_len);
+ EXPECT_EQ(out_len, 70000);
+ EXPECT_EQ(std::string(out_str, out_len), large_expected);
+ EXPECT_FALSE(ctx.has_error());
+
+ // Large shrinking output ("XX" -> "X") on a >64 KB input.
+ std::string large_shrink_in(70000, 'X');
+ std::string large_shrink_expected(35000, 'X');
+ out_str = replace_utf8_utf8_utf8(ctx_ptr, large_shrink_in.data(),
+
static_cast<int32_t>(large_shrink_in.size()), "XX", 2,
+ "X", 1, &out_len);
+ EXPECT_EQ(out_len, 35000);
+ EXPECT_EQ(std::string(out_str, out_len), large_shrink_expected);
+ EXPECT_FALSE(ctx.has_error());
+
+ // Edge case: result size of exactly 0 (every byte of text is removed). Takes
+ // the no-scan shrink path (to_str_len <= from_str_len).
+ out_str = replace_utf8_utf8_utf8(ctx_ptr, "aaaa", 4, "a", 1, "", 0,
&out_len);
+ EXPECT_EQ(out_len, 0);
+ EXPECT_EQ(std::string(out_str, out_len), "");
+ EXPECT_FALSE(ctx.has_error());
+
+ // Edge case: result size one past the INT_MAX boundary. 65536 single-char
+ // matches each expanding to 32768 bytes gives max_length = 65536 * 32768 =
+ // 2^31 = INT_MAX + 1, so it is reported cleanly (guard fires before any
alloc).
Review Comment:
nit: we can add test case when we have just max_length = `INT_MAX`, so code
should proceed without returning error
##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1946,9 +1914,41 @@ const char* replace_utf8_utf8_utf8(gdv_int64 context,
const char* text,
gdv_int32 text_len, const char* from_str,
gdv_int32 from_str_len, const char* to_str,
gdv_int32 to_str_len, gdv_int32* out_len) {
+ // Size the output buffer to the exact result, so large results are not
capped
+ // by an arbitrary limit. When the replacement is no longer than the matched
+ // text, the result can only shrink or stay the same, so text_len is a safe
+ // bound and we can skip scanning. Otherwise count non-overlapping matches to
+ // get the exact expanded size.
+ gdv_int64 max_length;
+ if (to_str_len <= from_str_len) {
+ max_length = text_len;
+ } else {
+ gdv_int64 num_matches = 0;
+ if (from_str_len > 0 && from_str_len <= text_len) {
+ for (gdv_int32 i = 0; i <= text_len - from_str_len;) {
+ if (memcmp(text + i, from_str, from_str_len) == 0) {
+ num_matches++;
+ i += from_str_len;
+ } else {
+ i++;
+ }
+ }
+ }
+ max_length =
Review Comment:
nit: if `num_matches == 0`, we can do early return to avoid calling helper
method:
```cpp
if (num_matches == 0) {
*out_len = text_len;
return text;
}
```
##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1931,7 +1899,7 @@ const char* replace_with_max_len_utf8_utf8_utf8(gdv_int64
context, const char* t
}
if (out_index + text_len - last_match_index > max_length) {
Review Comment:
Same here as above
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]