lriggs commented on code in PR #50187:
URL: https://github.com/apache/arrow/pull/50187#discussion_r3438187179
##########
cpp/src/gandiva/precompiled/string_ops.cc:
##########
@@ -1946,9 +1914,33 @@ const char* replace_utf8_utf8_utf8(gdv_int64 context,
const char* text,
gdv_int32 text_len, const char* from_str,
gdv_int32 from_str_len, const char* to_str,
gdv_int32 to_str_len, gdv_int32* out_len) {
+ // Count non-overlapping matches to size the output buffer exactly, so large
+ // results are not capped by an arbitrary limit.
+ gdv_int64 num_matches = 0;
Review Comment:
old vs new is measured without a separate build: the old wrapper was
literally replace_with_max_len_utf8_utf8_utf8(…, 65535, …), and that inner
function is unchanged. So new = replace_utf8_utf8_utf8 (counting scan + exact
alloc) and old = replace_with_max_len_utf8_utf8_utf8 given an adequately-sized
buffer (the pre-change algorithm, no scan). The delta is purely the O(n) scan.
Results
| case | text_len | matches | new ns/call | old
ns/call | delta |
|----|------|-----|-----|----------|-----------|
|small/dense |expand a->ab | 256 | 256 | 3013.5 |
2250.3 | +33.9%|
|small/sparse |expand a->ab | 256 | 4 | 1738.6 |
922.1 | +88.5%|
|medium/dense | expand a->ab | 65536 | 65536 | 753132.4 |
573520.9 | +31.3%|
|medium/sparse |expand a->ab | 65536 | 1024 | 436435.6 |
224018.2 | +94.8%|
|large/dense | expand a->ab | 4194304 | 4194304 | 52924267.0 |
40737654.2 | +29.9%|
|large/sparse expand |a->ab | 4194304 | 65536 | 28408644.3 |
14630748.5 | +94.2%|
|large/dense |shrink ab->a | 4194304 | 0 | 12771361.1 |
12659342.5 | +0.9%|
Interpretation:
The scan's relative cost is consistent across sizes (~30% dense, ~90%
sparse), as expected for an O(n) pass — it scales with input length, not match
count.
Sparse is the worst case (+90%): with few matches, the old replace pass is
mostly cheap byte-copying, so adding a full second comparison pass nearly
doubles the work. Dense (+30%): the replace pass already does
expansion-copying, so the scan is a smaller fraction.
Shrink path: +0.9% — confirms the refinement works; when to ≤ from the scan
is skipped and there's no measurable overhead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]