lriggs opened a new pull request, #50112:
URL: https://github.com/apache/arrow/pull/50112
### Rationale for this change
Gandiva's runtime errors are surfaced to SQL users via
`ExecutionContext::set_error_msg`. An audit of all ~90 call sites in
`cpp/src/gandiva/` turned up two recurring problems:
1. **No SQL function name in the message.** Users see `"divide by zero
error"` or `"Output buffer length can't be negative"` with no indication of
which SQL function produced it, making errors hard to localize in long queries.
2. **The offending value is not echoed.** Many messages reject a value
(invalid weekday, bad boolean string, out-of-range index, …) without telling
the user what was supplied.
A good runtime error should answer four questions: *which function*, *what
went wrong*, *what value triggered it*, *what's the valid range*. This PR moves
the highest-impact messages toward that bar.
### What changes are included in this PR?
All edits follow the same pattern:
```
<SQL_FUNCTION_NAME>: <what failed>; <offending value>; <valid range or hint>
```
Where the old message contained a load-bearing substring (e.g. `divide by
zero error`, `Output buffer length can't be negative`), the substring is
preserved so existing substring-based tests still match.
#### REGEXP_EXTRACT (the original case study)
[cpp/src/gandiva/regex_functions_holder.cc:239-247](cpp/src/gandiva/regex_functions_holder.cc#L239-L247)
— the message `"Index to extract out of range"` now reads `"REGEXP_EXTRACT:
invalid group_index '<N>'; must be between 0 and <max> (the number of capture
groups in the pattern)"`, matching the level of detail of the Java path.
#### Divide-by-zero family
Eight call sites in `arithmetic_ops.cc`, `decimal_ops.cc`, and
`extended_math_ops.cc` (covering `DIVIDE`, `DIV`, `MOD`, `PMOD`, decimal
`Divide`/`Mod`, `LOG`) now prefix the SQL function name while preserving the
original `divide by zero error` substring. `mod_float64_float64` additionally
echoes the dividend.
#### Value-echoing
The following sites now include both the SQL function name and the offending
value:
- `CAST_BIT` — echoes the invalid input string, lists expected values
- `CAST_VARCHAR` / `CAST_VARBINARY` length checks — echoes requested length
- `REPEAT` — echoes count (and on overflow: count × input length)
- `CONVERT_REPLACE_INVALID_FROM_UTF8` — echoes byte count
- `LOCATE` — echoes start position
- `FACTORIAL` — echoes input value (both negative and >20 paths)
- `NEGATIVE` (integer overflow on `INT_MIN`, and `negative_daytimeinterval`
out-of-bounds)
- `CRC32` — echoes input length
- `BASE64` / `UNBASE64` — echoes input length
- `AES_ENCRYPT` / `AES_DECRYPT` — echoes data length on negative input;
`AES_ENCRYPT` OOM message also rewritten to name the function clearly
- `NEXT_DAY` — echoes the unrecognized weekday string, lists expected values
- `CAST_INTERVAL_YEAR` — echoes the overflowing source value
- `CAST_*_FROM_HEX` (3 error paths in the macro) — echoes the input hex
string
- `REPLACE` — buffer-overflow message prefixed with `REPLACE:`
### Are these changes tested?
- Strict `EXPECT_EQ(error, "exact string")` assertions converted to
`HasSubstr` / `.find()` so the test contract is the SQL function name plus a
stable phrase, not the exact wording.
- Updated assertions where the new message no longer contains the old
free-text fragment (e.g. `"Factorial of negative"` → `HasSubstr("FACTORIAL") +
HasSubstr("non-negative")`).
- Files touched: `arithmetic_ops_test.cc`, `decimal_ops_test.cc`,
`extended_math_ops_test.cc`, `time_test.cc`, `gdv_function_stubs_test.cc`.
```bash
cd cpp/debug
cmake --build . --target gandiva_shared gandiva-precompiled-test
gandiva-internals-test gandiva-projector-test -j4
./debug/gandiva-precompiled-test # 130/130 pass
./debug/gandiva-internals-test # 156/156 pass
./debug/gandiva-projector-test # 218/218 pass
```
All affected error paths are exercised by the existing unit tests — that is
how each site was located in the audit.
### Are there any user-facing changes?
Yes, improved error messages.
#### Example before / after
```
-- before
SELECT REGEXP_EXTRACT('100-500', '(\d+)-(\d+)', -1);
-- FUNCTION ERROR: Index to extract out of range
-- after
-- FUNCTION ERROR: REGEXP_EXTRACT: invalid group_index '-1';
-- must be between 0 and 2
-- (the number of capture groups in the pattern)
```
```
-- before
SELECT CAST('maybe' AS BOOLEAN);
-- FUNCTION ERROR: Invalid value for boolean.
-- after
-- FUNCTION ERROR: CAST_BIT: Invalid value for boolean: 'maybe'
-- (expected 0, 1, true, false; case-insensitive)
```
```
-- before
SELECT NEXT_DAY(TIMESTAMP '2025-01-01 00:00:00', 'frusday');
-- FUNCTION ERROR: The weekday in this entry is invalid
-- after
-- FUNCTION ERROR: NEXT_DAY: 'frusday' is not a recognized weekday
-- (expected MON|TUE|WED|THU|FRI|SAT|SUN)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]