lriggs commented on code in PR #50187:
URL: https://github.com/apache/arrow/pull/50187#discussion_r3454682996


##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1946,9 +1914,33 @@ const char* replace_utf8_utf8_utf8(gdv_int64 context, 
const char* text,
                                    gdv_int32 text_len, const char* from_str,
                                    gdv_int32 from_str_len, const char* to_str,
                                    gdv_int32 to_str_len, gdv_int32* out_len) {
+  // Count non-overlapping matches to size the output buffer exactly, so large
+  // results are not capped by an arbitrary limit.
+  gdv_int64 num_matches = 0;

Review Comment:
   I just pushed a new approach that does a hybrid. preallocates when possible 
and the counts when necessary.
   
   <p style="white-space: pre-wrap; margin-top: 0.1em; margin-bottom: 0.2em; 
color: rgb(97, 97, 97); font-family: -apple-system, &quot;system-ui&quot;, 
&quot;Segoe UI&quot;, Roboto, sans-serif; font-size: 13px; font-style: normal; 
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; 
letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; 
text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 
0px; text-decoration-thickness: initial; text-decoration-style: initial; 
text-decoration-color: initial;"><strong>Perf.</strong> Sizing the buffer needs 
the output length up front (the arena allocator is bump-pointer; it can't grow 
in place). The straightforward approach — count matches in a pre-pass — adds an 
O(n) scan. Microbenchmarks (<code style="font-family: monospace; color: 
rgb(163, 21, 21); background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; 
border-radius: 3px; word-break: break-word; font-size: 0.9em;">-O2</code>) s
 howed that scan costing <strong>~27% on dense matches and up to ~95% on 
sparse</strong> matches (it's a genuine second pass, not a debug 
artifact).</p><p style="white-space: pre-wrap; margin-top: 0.1em; 
margin-bottom: 0.2em; color: rgb(97, 97, 97); font-family: -apple-system, 
&quot;system-ui&quot;, &quot;Segoe UI&quot;, Roboto, sans-serif; font-size: 
13px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: 
normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: 
start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; 
-webkit-text-stroke-width: 0px; text-decoration-thickness: initial; 
text-decoration-style: initial; text-decoration-color: initial;">To avoid it, 
output sizing now branches:</p><ul style="padding-inline-start: 2em; color: 
rgb(97, 97, 97); font-family: -apple-system, &quot;system-ui&quot;, &quot;Segoe 
UI&quot;, Roboto, sans-serif; font-size: 13px; font-style: normal; 
font-variant-ligatures: normal; font-variant-ca
 ps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: 
start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; 
-webkit-text-stroke-width: 0px; white-space: normal; text-decoration-thickness: 
initial; text-decoration-style: initial; text-decoration-color: 
initial;"><li><strong>shrink / no-match</strong><span> </span>(<code 
style="font-family: monospace; color: rgb(163, 21, 21); background-color: 
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break: 
break-word; font-size: 0.9em;">to_len ≤ from_len</code>): bound by<span> 
</span><code style="font-family: monospace; color: rgb(163, 21, 21); 
background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; 
word-break: break-word; font-size: 0.9em;">text_len</code>, no 
scan.</li><li><strong>small expansion</strong><span> </span>(<code 
style="font-family: monospace; color: rgb(163, 21, 21); background-color: 
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; 
 word-break: break-word; font-size: 0.9em;">to_len − from_len ≤ 
from_len</code>, i.e. result ≤ ~2× input — the common case incl. the 
original<span> </span><code style="font-family: monospace; color: rgb(163, 21, 
21); background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 
3px; word-break: break-word; font-size: 0.9em;">'X'→'XY'</code>): allocate 
an<span> </span><strong>O(1) upper bound</strong><span> </span><code 
style="font-family: monospace; color: rgb(163, 21, 21); background-color: 
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break: 
break-word; font-size: 0.9em;">text_len + 
(text_len/from_len)·(to_len−from_len)</code><span> </span>and do a single 
pass,<span> </span><strong>no scan</strong>. Over-allocates at most ~<code 
style="font-family: monospace; color: rgb(163, 21, 21); background-color: 
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break: 
break-word; font-size: 0.9em;">text_len</code><span> </span>of tran
 sient arena scratch.</li><li><strong>large expansion</strong>: fall back to 
the exact match-counting scan, since an upper bound would over-allocate 
unboundedly for sparse input (and is gated on<span> </span><code 
style="font-family: monospace; color: rgb(163, 21, 21); background-color: 
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break: 
break-word; font-size: 0.9em;">≤ INT_MAX</code>).</li></ul><p 
style="white-space: pre-wrap; margin-top: 0.1em; margin-bottom: 0.2em; color: 
rgb(97, 97, 97); font-family: -apple-system, &quot;system-ui&quot;, &quot;Segoe 
UI&quot;, Roboto, sans-serif; font-size: 13px; font-style: normal; 
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; 
letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; 
text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 
0px; text-decoration-thickness: initial; text-decoration-style: initial; 
text-decoration-color: initial;">Result: the sca
 n overhead is eliminated for the common case (~0% vs a no-scan single pass), 
and retained only where it's actually needed:</p>
   
   case | before (count-always) | after
   -- | -- | --
   dense a→ab, 4 MB | +26.9% | +1.5%
   sparse a→ab, 4 MB | +86.3% | +1.3%
   dense bigexp a→abcd, 4 MB | — | +24.5% (fallback scan)
   sparse bigexp a→abcd, 4 MB | — | +86.0% (fallback scan)
   
   <p style="white-space: pre-wrap; margin-top: 0.1em; margin-bottom: 0.2em; 
color: rgb(97, 97, 97); font-family: -apple-system, &quot;system-ui&quot;, 
&quot;Segoe UI&quot;, Roboto, sans-serif; font-size: 13px; font-style: normal; 
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; 
letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; 
text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 
0px; text-decoration-thickness: initial; text-decoration-style: initial; 
text-decoration-color: initial;">Known limitation: a <strong>dense</strong> 
big-expansion (e.g. <code style="font-family: monospace; color: rgb(163, 21, 
21); background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 
3px; word-break: break-word; font-size: 0.9em;">a→abcd</code> where every 
position matches) still scans even though the upper bound would have been exact 
— density can't be known without scanning, so the gate is conservative (≤2×
  over-allocation). The gate factor can be widened if dense big-expansions are 
a hot path.</p>
   
   
   
   case (input size / match density) | Before: count-always | After: 
upper-bound + fallback
   -- | -- | --
   small/dense a→ab | +26.8% | −0.5%
   small/sparse a→ab | +58.4% | +0.4%
   medium/dense a→ab | +26.1% | −1.6%
   medium/sparse a→ab | +95.4% | −0.0%
   large/dense a→ab | +26.9% | +1.5%
   large/sparse a→ab | +86.3% | +1.3%
   large/dense shrink ab→a | +0.4% | +0.7%
   large/dense bigexp a→abcd | (n/a) | +24.5%
   large/sparse bigexp a→abcd | (n/a) | +86.0%
   
   <br class="Apple-interchange-newline">



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to