lriggs opened a new issue, #50186:
URL: https://github.com/apache/arrow/issues/50186

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Gandiva's REPLACE(text, from, to) fails with the error "Buffer overflow for 
output string" whenever the result exceeds 65535 bytes. The function hardcodes 
a 65535-byte output buffer cap, even though Gandiva's variable-length output 
column grows dynamically and is bounded only by the int32 offset width 
(~2 GB).
   
    
   ### Root cause
   In cpp/src/gandiva/precompiled/string_ops.cc, the SQL-facing 
replace_utf8_utf8_utf8 delegates to replace_with_max_len_utf8_utf8_utf8 with a 
hardcoded max_length = 65535. That implementation allocates an arena buffer of 
exactly max_length bytes and raises the error as soon as the running output 
index would exceed it. The cap is arbitrary — nothing downstream requires it.
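
A minimal, self-contained sketch of the failure mode described above. This is not Gandiva's code: `replace_with_cap` is a hypothetical stand-in that writes into a buffer limited to `max_length` bytes and gives up as soon as the next write would pass that limit, which is where the real function reports the overflow error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a capped replace (not the Gandiva implementation).
// Returns false where the real function would raise
// "Buffer overflow for output string".
bool replace_with_cap(const std::string& text, const std::string& from,
                      const std::string& to, std::size_t max_length,
                      std::string* out) {
  out->clear();
  if (from.empty()) {  // nothing to search for: copy the input unchanged
    if (text.size() > max_length) return false;
    *out = text;
    return true;
  }
  std::size_t i = 0;
  while (i < text.size()) {
    // Does `from` occur at position i?
    const bool match = text.compare(i, from.size(), from) == 0;
    const std::size_t piece_len = match ? to.size() : 1;
    // Fail as soon as the running output size would exceed the cap.
    if (out->size() + piece_len > max_length) {
      out->clear();
      return false;
    }
    if (match) {
      out->append(to);
      i += from.size();
    } else {
      out->push_back(text[i]);
      ++i;
    }
  }
  return true;
}
```

With `max_length = 65535`, replacing "X" with "XY" in a string of 35000 'X' characters returns false, because the result would need 70000 bytes; the same call on 30000 characters (60000 bytes of output) succeeds.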
   
   ### To Reproduce
   Any REPLACE whose result exceeds 65535 bytes. Minimal C++:
   
   
   std::string in(35000, 'X');               // 35 KB input
   gdv_int32 out_len = 0;                    // receives the result length
   replace_utf8_utf8_utf8(ctx, in.data(), 35000, "X", 1, "XY", 2, &out_len);
   // -> error: "Buffer overflow for output string" (result would be 70 KB)
   
   SQL repro (Dremio):
   
   
   CREATE TABLE IF NOT EXISTS $scratch.gandiva_repro_seed AS
   SELECT '<Document>line2' AS clrmsgenvlp_msg,
          repeat('X', 35000) AS part_x, repeat('Y', 35000) AS part_y
   FROM (VALUES (1)) AS v(x);
   
   SELECT REPLACE(CONCAT(clrmsgenvlp_msg, part_x, part_y), 'X', 'XY')
   FROM $scratch.gandiva_repro_seed;
   
   CONCAT / CONCAT_WS / LISTAGG are not at fault; the failure comes from the 
REPLACE applied on top of their large output.
   
   ### Expected behavior
   REPLACE should return the full result regardless of size (up to Gandiva's 
normal variable-length limits), not fail at an arbitrary 64 KB threshold.
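
One way the expected behavior could be met is to size the output exactly before writing it. The sketch below is an assumption, not the fix adopted upstream: `replace_exact_size` is a hypothetical helper that counts matches in a first pass, computes the exact result length, rejects only results that would not fit an int32 offset, and then writes the output in a second pass.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical two-pass replace (an illustration, not Gandiva's code).
// Returns false only if the result would not fit in an int32 offset.
bool replace_exact_size(const std::string& text, const std::string& from,
                        const std::string& to, std::string* out) {
  out->clear();
  if (from.empty()) {  // nothing to search for: copy the input unchanged
    *out = text;
    return true;
  }
  // Pass 1: count non-overlapping occurrences of `from`.
  std::size_t matches = 0;
  for (std::size_t pos = text.find(from); pos != std::string::npos;
       pos = text.find(from, pos + from.size())) {
    ++matches;
  }
  // Exact output size, computed in 64 bits to avoid wrap-around.
  const std::uint64_t out_size =
      static_cast<std::uint64_t>(text.size()) -
      static_cast<std::uint64_t>(matches) * from.size() +
      static_cast<std::uint64_t>(matches) * to.size();
  if (out_size > static_cast<std::uint64_t>(
                     std::numeric_limits<std::int32_t>::max())) {
    return false;
  }
  out->reserve(static_cast<std::size_t>(out_size));
  // Pass 2: copy the text between matches and substitute `to` for each match.
  std::size_t start = 0;
  std::size_t pos = text.find(from);
  while (pos != std::string::npos) {
    out->append(text, start, pos - start);
    out->append(to);
    start = pos + from.size();
    pos = text.find(from, start);
  }
  out->append(text, start, std::string::npos);  // trailing remainder
  return true;
}
```

The trade-off is a second scan of the input in exchange for a single allocation of the right size and no arbitrary threshold.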
   
   
   
   ### Component(s)
   
   C++, Gandiva

