lriggs opened a new pull request, #50187:
URL: https://github.com/apache/arrow/pull/50187

   ### Rationale for this change
   Gandiva's `REPLACE` hardcodes a 65535-byte output buffer and fails with 
"Buffer overflow for output string" whenever the result exceeds that size 
(about 64 KB). The cap is arbitrary: Gandiva's variable-length output column 
already grows dynamically and is bounded only by the int32 offset width 
(~2 GB). Real queries that replace into large concatenated or aggregated 
strings fail unnecessarily.
   
   ### What changes are included in this PR
   `replace_utf8_utf8_utf8` now sizes the output buffer to the exact result 
instead of using a fixed cap. The output length of a replace is deterministic:

   `out_len = text_len + num_matches * (to_str_len - from_str_len)`

   The wrapper does a single counting pass over the input to find the number of 
non-overlapping matches of from_str (mirroring the match loop already used in 
the implementation), computes the exact size in gdv_int64 to avoid intermediate 
overflow, and passes that as max_length.
   
   The internal `replace_with_max_len_utf8_utf8_utf8` is unchanged. Its bounds 
checks now act purely as a correctness backstop (they should never fire with an 
exact bound), and its explicit-max-length signature remains for the existing 
unit tests.

   When `to_str` is shorter than `from_str`, the result shrinks and 
`max_length <= text_len`, so the shrinking path is sized correctly too.

   ### Are these changes tested?
   Yes. Added regression cases to TestStringOps.TestReplace in 
string_ops_test.cc:
   
   - A 35000-char 'X' input with X → XY, producing a 70000-byte result 
(previously overflowed at 65535); asserts no error and exact length/content.
   - A 70000-char shrinking case (XX → X) to cover the shrink path on a >64 KB 
input.

   The full precompiled suite passes locally (132/132), including the existing 
explicit-max_len overflow tests, which call the internal function directly and 
are unaffected.
   
   ### Are there any user-facing changes?
   REPLACE now succeeds on results larger than 64 KB instead of erroring. No 
API or signature changes.

