lriggs commented on code in PR #50187:
URL: https://github.com/apache/arrow/pull/50187#discussion_r3454682996
##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1946,9 +1914,33 @@ const char* replace_utf8_utf8_utf8(gdv_int64 context,
const char* text,
gdv_int32 text_len, const char* from_str,
gdv_int32 from_str_len, const char* to_str,
gdv_int32 to_str_len, gdv_int32* out_len) {
+ // Count non-overlapping matches to size the output buffer exactly, so large
+ // results are not capped by an arbitrary limit.
+ gdv_int64 num_matches = 0;
Review Comment:
I just pushed a new approach that does a hybrid. preallocates when possible
and the counts when necessary.
<p style="white-space: pre-wrap; margin-top: 0.1em; margin-bottom: 0.2em;
color: rgb(97, 97, 97); font-family: -apple-system, "system-ui",
"Segoe UI", Roboto, sans-serif; font-size: 13px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px;
text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width:
0px; text-decoration-thickness: initial; text-decoration-style: initial;
text-decoration-color: initial;"><strong>Perf.</strong> Sizing the buffer needs
the output length up front (the arena allocator is bump-pointer; it can't grow
in place). The straightforward approach — count matches in a pre-pass — adds an
O(n) scan. Microbenchmarks (<code style="font-family: monospace; color:
rgb(163, 21, 21); background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px;
border-radius: 3px; word-break: break-word; font-size: 0.9em;">-O2</code>) s
howed that scan costing <strong>~27% on dense matches and up to ~95% on
sparse</strong> matches (it's a genuine second pass, not a debug
artifact).</p><p style="white-space: pre-wrap; margin-top: 0.1em;
margin-bottom: 0.2em; color: rgb(97, 97, 97); font-family: -apple-system,
"system-ui", "Segoe UI", Roboto, sans-serif; font-size:
13px; font-style: normal; font-variant-ligatures: normal; font-variant-caps:
normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align:
start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px;
-webkit-text-stroke-width: 0px; text-decoration-thickness: initial;
text-decoration-style: initial; text-decoration-color: initial;">To avoid it,
output sizing now branches:</p><ul style="padding-inline-start: 2em; color:
rgb(97, 97, 97); font-family: -apple-system, "system-ui", "Segoe
UI", Roboto, sans-serif; font-size: 13px; font-style: normal;
font-variant-ligatures: normal; font-variant-ca
ps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align:
start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px;
-webkit-text-stroke-width: 0px; white-space: normal; text-decoration-thickness:
initial; text-decoration-style: initial; text-decoration-color:
initial;"><li><strong>shrink / no-match</strong><span> </span>(<code
style="font-family: monospace; color: rgb(163, 21, 21); background-color:
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break:
break-word; font-size: 0.9em;">to_len ≤ from_len</code>): bound by<span>
</span><code style="font-family: monospace; color: rgb(163, 21, 21);
background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px;
word-break: break-word; font-size: 0.9em;">text_len</code>, no
scan.</li><li><strong>small expansion</strong><span> </span>(<code
style="font-family: monospace; color: rgb(163, 21, 21); background-color:
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px;
word-break: break-word; font-size: 0.9em;">to_len − from_len ≤
from_len</code>, i.e. result ≤ ~2× input — the common case incl. the
original<span> </span><code style="font-family: monospace; color: rgb(163, 21,
21); background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius:
3px; word-break: break-word; font-size: 0.9em;">'X'→'XY'</code>): allocate
an<span> </span><strong>O(1) upper bound</strong><span> </span><code
style="font-family: monospace; color: rgb(163, 21, 21); background-color:
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break:
break-word; font-size: 0.9em;">text_len +
(text_len/from_len)·(to_len−from_len)</code><span> </span>and do a single
pass,<span> </span><strong>no scan</strong>. Over-allocates at most ~<code
style="font-family: monospace; color: rgb(163, 21, 21); background-color:
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break:
break-word; font-size: 0.9em;">text_len</code><span> </span>of tran
sient arena scratch.</li><li><strong>large expansion</strong>: fall back to
the exact match-counting scan, since an upper bound would over-allocate
unboundedly for sparse input (and is gated on<span> </span><code
style="font-family: monospace; color: rgb(163, 21, 21); background-color:
rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius: 3px; word-break:
break-word; font-size: 0.9em;">≤ INT_MAX</code>).</li></ul><p
style="white-space: pre-wrap; margin-top: 0.1em; margin-bottom: 0.2em; color:
rgb(97, 97, 97); font-family: -apple-system, "system-ui", "Segoe
UI", Roboto, sans-serif; font-size: 13px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px;
text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width:
0px; text-decoration-thickness: initial; text-decoration-style: initial;
text-decoration-color: initial;">Result: the sca
n overhead is eliminated for the common case (~0% vs a no-scan single pass),
and retained only where it's actually needed:</p>
case | before (count-always) | after
-- | -- | --
dense a→ab, 4 MB | +26.9% | +1.5%
sparse a→ab, 4 MB | +86.3% | +1.3%
dense bigexp a→abcd, 4 MB | — | +24.5% (fallback scan)
sparse bigexp a→abcd, 4 MB | — | +86.0% (fallback scan)
<p style="white-space: pre-wrap; margin-top: 0.1em; margin-bottom: 0.2em;
color: rgb(97, 97, 97); font-family: -apple-system, "system-ui",
"Segoe UI", Roboto, sans-serif; font-size: 13px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px;
text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width:
0px; text-decoration-thickness: initial; text-decoration-style: initial;
text-decoration-color: initial;">Known limitation: a <strong>dense</strong>
big-expansion (e.g. <code style="font-family: monospace; color: rgb(163, 21,
21); background-color: rgba(0, 0, 0, 0.1); padding: 2px 4px; border-radius:
3px; word-break: break-word; font-size: 0.9em;">a→abcd</code> where every
position matches) still scans even though the upper bound would have been exact
— density can't be known without scanning, so the gate is conservative (≤2×
over-allocation). The gate factor can be widened if dense big-expansions are
a hot path.</p>
case (input size / match density) | Before: count-always | After:
upper-bound + fallback
-- | -- | --
small/dense a→ab | +26.8% | −0.5%
small/sparse a→ab | +58.4% | +0.4%
medium/dense a→ab | +26.1% | −1.6%
medium/sparse a→ab | +95.4% | −0.0%
large/dense a→ab | +26.9% | +1.5%
large/sparse a→ab | +86.3% | +1.3%
large/dense shrink ab→a | +0.4% | +0.7%
large/dense bigexp a→abcd | (n/a) | +24.5%
large/sparse bigexp a→abcd | (n/a) | +86.0%
<br class="Apple-interchange-newline">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]