dmitry-chirkov-dremio commented on issue #49817:
URL: https://github.com/apache/arrow/issues/49817#issuecomment-4284450598
Burned few tokens so you don't have to. Feel free to ignore and apply your
own reasoning.
---
Two layered fixes. They are independent but complementary:
- **Fix 1** gives `Decimal{128,256}::FromString` a clear "lossless-or-bust"
contract, so callers can trust the return value. Small blast radius, strong
invariant.
- **Fix 2** makes Gandiva's `castDECIMAL_utf8` the only Arrow caller that
intentionally handles lossy SQL-style casts by rounding *before* parsing.
Preserves the user-visible behavior of legal SQL like `CAST('1.555…' AS
DECIMAL(38,37))` after fix 1 tightens the parser.
Without fix 2, fix 1 would regress queries that used to return wrong results
but now return an error. Without fix 1, fix 2 only closes the Gandiva code
path; other Arrow callers (compute kernels, CSV converter, Python bindings that
land on `FromString`, …) would still silently corrupt on out-of-range input.
---
## Fix 1 — `Decimal{128,256}::FromString` returns `Status::Invalid` when
input exceeds type range
**File**: `cpp/src/arrow/util/decimal.cc`
**Function**: `DecimalFromString<Decimal>` (and its `Simple*` sibling for
Decimal32/64)
### Change
After the existing `parsed_precision` / `parsed_scale` derivation, validate
against the target decimal type's compile-time limits before writing the value:
```cpp
if (parsed_precision > Decimal::kMaxPrecision) {
return Status::Invalid(
"The string '", s, "' cannot be represented as ", type_name,
": parsed precision ", parsed_precision,
" exceeds maximum precision ", Decimal::kMaxPrecision);
}
if (parsed_scale > Decimal::kMaxScale) {
return Status::Invalid(
"The string '", s, "' cannot be represented as ", type_name,
": parsed scale ", parsed_scale,
" exceeds maximum scale ", Decimal::kMaxScale);
}
```
Place the checks **before** the `ShiftAndAdd` loop so no bits are written to
`out` in the error path.
Apply the same block to `SimpleDecimalFromString<T>` so Decimal32 (precision
≤ 9) and Decimal64 (precision ≤ 18) get the same guarantee.
### Contract after fix
- If `FromString` returns `Status::OK`, the written `Decimal` value is an
exact, lossless representation of the input string.
- Any lossy conversion (rounding, truncation, scale adjustment) is the
caller's responsibility — the parser never makes silent approximations.
### Test additions
Target file: `cpp/src/arrow/util/decimal_test.cc`
```cpp
TEST(Decimal128Test, FromStringRejectsExcessPrecision) {
Decimal128 out;
int32_t precision = 0, scale = 0;
// 38 significant digits — boundary, must succeed.
ASSERT_OK(Decimal128::FromString(
"1.5555555555555555555555555555555555555",
&out, &precision, &scale));
EXPECT_EQ(precision, 38);
EXPECT_EQ(scale, 37);
// 39 significant digits — first disallowed case.
ASSERT_RAISES(Invalid, Decimal128::FromString(
"1.55555555555555555555555555555555555555",
&out, &precision, &scale));
// 51 significant digits — the SQL repro from the issue.
ASSERT_RAISES(Invalid, Decimal128::FromString(
"1.55555555555555555555555555555555555555555555555555",
&out, &precision, &scale));
// Negative sign must not change precision accounting.
ASSERT_RAISES(Invalid, Decimal128::FromString(
"-1.55555555555555555555555555555555555555",
&out, &precision, &scale));
// Leading zeros are not significant; 38 significant digits must pass.
ASSERT_OK(Decimal128::FromString(
"0000001.5555555555555555555555555555555555555",
&out, &precision, &scale));
EXPECT_EQ(precision, 38);
// Integer-only overflow.
ASSERT_RAISES(Invalid, Decimal128::FromString(
"999999999999999999999999999999999999999", // 39 nines
&out, &precision, &scale));
}
TEST(Decimal128Test, FromStringRejectsExcessScale) {
Decimal128 out;
int32_t precision = 0, scale = 0;
// Scientific notation pushing scale past kMaxScale = 38.
ASSERT_RAISES(Invalid, Decimal128::FromString(
"1e-100", &out, &precision, &scale));
// Scale exactly kMaxScale — boundary, must succeed.
ASSERT_OK(Decimal128::FromString(
"1e-38", &out, &precision, &scale));
EXPECT_EQ(scale, 38);
}
TEST(Decimal256Test, FromStringRejectsExcessPrecision) {
Decimal256 out;
int32_t precision = 0, scale = 0;
// 76 significant digits — Decimal256 boundary, must succeed.
std::string sig76(75, '5');
ASSERT_OK(Decimal256::FromString("1." + sig76, &out, &precision, &scale));
EXPECT_EQ(precision, 76);
// 77 significant digits — first disallowed case.
std::string sig77(76, '5');
ASSERT_RAISES(Invalid, Decimal256::FromString(
"1." + sig77, &out, &precision, &scale));
}
TEST(Decimal32Test, FromStringRejectsExcessPrecision) {
Decimal32 out;
int32_t precision = 0, scale = 0;
// 9 significant digits — boundary, must succeed.
ASSERT_OK(Decimal32::FromString("123456789", &out, &precision, &scale));
// 10 significant digits — first disallowed case.
ASSERT_RAISES(Invalid, Decimal32::FromString(
"1234567890", &out, &precision, &scale));
}
TEST(Decimal64Test, FromStringRejectsExcessPrecision) {
Decimal64 out;
int32_t precision = 0, scale = 0;
// 18 significant digits — boundary, must succeed.
ASSERT_OK(Decimal64::FromString(
"123456789012345678", &out, &precision, &scale));
// 19 significant digits — first disallowed case.
ASSERT_RAISES(Invalid, Decimal64::FromString(
"1234567890123456789", &out, &precision, &scale));
}
```
### Risk assessment
Callers that today trust `FromString` to succeed on any syntactically valid
decimal string will now see `Status::Invalid` on out-of-range inputs. Audit
needed:
- `cpp/src/gandiva/gdv_function_stubs.cc::gdv_fn_dec_from_string` — already
propagates status correctly. (Fix #2 below makes this a non-issue in practice.)
- `cpp/src/arrow/csv/converter.cc` — already returns the status to the row
reader; a CSV row with 50-digit decimals will now surface an error instead of
garbage. Desired behavior.
- `cpp/src/arrow/python/decimal.cc` — does not call `FromString` (uses
`DecimalFromPyDecimal`); unaffected.
Existing tests that construct Decimals from strings like
`"9999999999999999999999999999999999999999"` (intentionally out-of-range) will
flip from pass to error. Grep `cpp/src/arrow/util/decimal_test.cc` and fix
expectations as part of this patch.
---
## Fix 2 — Gandiva `castDECIMAL_utf8` rounds to target scale before parsing
**File**: `cpp/src/gandiva/precompiled/decimal_wrapper.cc`
**Function**: `castDECIMAL_utf8` (around line 398)
### Problem solved
After fix 1, `CAST('1.555…(50 fives)' AS DECIMAL(38,37))` would raise an
error instead of returning the correctly rounded value
`1.5555555555555555555555555555555555556`. That's a behavior regression for
legal SQL input. `castDECIMAL_utf8` is the one place where the target
`(out_precision, out_scale)` is known before parsing, so it's the natural home
for caller-side rounding.
This fix also closes a separate, pre-existing bug: even when the input
*does* fit in 128 bits (e.g. 38 significant digits), the current code uses
`decimalops::Convert` to rescale, which **truncates** instead of rounding
half-up. Pre-rounding in the string domain makes Gandiva's VARCHAR→DECIMAL
match the Java `CastVarCharDecimal` path exactly (Java uses
`RoundingMode.HALF_UP` in `java.math.BigDecimal::setScale`).
### Change
Introduce a helper that normalizes the digit string to fit the target scale,
with HALF_UP rounding, before handing off to `Arrow::Decimal128::FromString`:
```cpp
// Round a decimal string to `target_scale` fractional digits using HALF_UP.
// Returns a std::string because rounding can extend the integer part by one
// digit when the rounding carry propagates.
// Operates on the raw digit representation — no 128-bit arithmetic needed.
std::string RoundDecimalStringHalfUp(const char* in, int32_t in_length,
int32_t target_scale);
```
Then in `castDECIMAL_utf8`:
```cpp
void castDECIMAL_utf8(int64_t context, const char* in, int32_t in_length,
int32_t out_precision, int32_t out_scale, int64_t*
out_high,
uint64_t* out_low) {
std::string rounded = RoundDecimalStringHalfUp(in, in_length, out_scale);
int64_t dec_high_from_str;
uint64_t dec_low_from_str;
int32_t precision_from_str;
int32_t scale_from_str;
int32_t status = gdv_fn_dec_from_string(
context, rounded.data(), static_cast<int32_t>(rounded.size()),
&precision_from_str, &scale_from_str,
&dec_high_from_str, &dec_low_from_str);
if (status != 0) {
*out_high = 0;
*out_low = 0;
return;
}
gandiva::BasicDecimalScalar128 x({dec_high_from_str, dec_low_from_str},
precision_from_str, scale_from_str);
bool overflow = false;
auto out = gandiva::decimalops::Convert(x, out_precision, out_scale,
&overflow);
*out_high = out.high_bits();
*out_low = out.low_bits();
}
```
### Helper contract (for review)
After `RoundDecimalStringHalfUp(in, target_scale)`:
- The returned string has **at most `target_scale` fractional digits**.
- The returned string is a valid decimal literal accepted by
`Arrow::Decimal128::FromString`.
- Rounding direction is HALF_UP: at the `target_scale + 1` digit, if digit ≥
5 the truncation point is rounded up; ties go away from zero.
- Leading sign (`+` / `-`) and exponent notation (`e±N`) are preserved /
normalized.
- The integer part may grow by one digit due to carry (e.g. `"9.99"` rounded
to scale 1 → `"10.0"`), but never by more than one. The caller (`Convert`)
handles cases where the resulting precision exceeds `out_precision`
### Test additions
Target file: `cpp/src/gandiva/tests/decimal_test.cc`
Unit-level tests for `RoundDecimalStringHalfUp`:
```cpp
TEST(DecimalStringRounding, HalfUpBasic) {
EXPECT_EQ(RoundDecimalStringHalfUp("1.5555", 4, 3), "1.556");
EXPECT_EQ(RoundDecimalStringHalfUp("1.5554", 4, 3), "1.555");
EXPECT_EQ(RoundDecimalStringHalfUp("1.5555", 4, 0), "2");
EXPECT_EQ(RoundDecimalStringHalfUp("9.99", 3, 1), "10.0"); // carry
extends integer part
EXPECT_EQ(RoundDecimalStringHalfUp("-1.5555", 4, 3), "-1.556");
}
TEST(DecimalStringRounding, HalfUpNoChangeWhenAlreadyShort) {
EXPECT_EQ(RoundDecimalStringHalfUp("1.5", 1, 5), "1.5");
EXPECT_EQ(RoundDecimalStringHalfUp("42", 0, 3), "42");
}
TEST(DecimalStringRounding, HalfUpExponent) {
EXPECT_EQ(RoundDecimalStringHalfUp("1.5555e2", 2, 1),
"155.6");
EXPECT_EQ(RoundDecimalStringHalfUp("1.5e-3", 1, 4),
"0.0015");
}
```
End-to-end tests for `castDECIMAL_utf8` (all expected results match what
`java.math.BigDecimal(str).setScale(scale, HALF_UP)` would produce):
```cpp
TEST(CastDecimalUtf8, FiftyDigitInputToDecimal_38_37) {
auto result = CallCastDecimalUtf8(
"1.55555555555555555555555555555555555555555555555555",
/*out_precision=*/38, /*out_scale=*/37);
EXPECT_EQ(result.ToString(37),
"1.5555555555555555555555555555555555556");
}
TEST(CastDecimalUtf8, HundredDigitInputToDecimal_38_37) {
std::string hundred_ones(100, '1');
auto result = CallCastDecimalUtf8(
"1." + hundred_ones,
/*out_precision=*/38, /*out_scale=*/37);
EXPECT_EQ(result.ToString(37),
"1.1111111111111111111111111111111111111");
}
TEST(CastDecimalUtf8, ThirtyEightDigitInputUsesHalfUpNotTruncate) {
// Regression: pre-fix Gandiva truncated to 1.5555…5 (37 fives).
auto result = CallCastDecimalUtf8(
"1.55555555555555555555555555555555555555", // 38 significant digits
/*out_precision=*/38, /*out_scale=*/37);
EXPECT_EQ(result.ToString(37),
"1.5555555555555555555555555555555555556");
}
TEST(CastDecimalUtf8, NegativeInputHalfUp) {
auto result = CallCastDecimalUtf8(
"-1.55555555555555555555555555555555555555555555555555",
/*out_precision=*/38, /*out_scale=*/37);
EXPECT_EQ(result.ToString(37),
"-1.5555555555555555555555555555555555556");
}
TEST(CastDecimalUtf8, ExactRepresentationBelowTargetScale) {
auto result = CallCastDecimalUtf8(
"1.5", /*out_precision=*/38, /*out_scale=*/37);
EXPECT_EQ(result.ToString(37),
"1.5000000000000000000000000000000000000");
}
TEST(CastDecimalUtf8, CarryIntoIntegerPart) {
// "9.99" rounded to scale 1 → "10.0".
auto result = CallCastDecimalUtf8(
"9.99", /*out_precision=*/38, /*out_scale=*/1);
EXPECT_EQ(result.ToString(1), "10.0");
}
TEST(CastDecimalUtf8, RoundedValueStillExceedsOutputPrecision) {
// After rounding, value has 11 significant digits; output only holds 3
before
// the decimal point. Match pre-existing Gandiva overflow behavior (zero).
auto result = CallCastDecimalUtf8(
"99999999999.5", /*out_precision=*/10, /*out_scale=*/0);
// Whatever the current zero-on-overflow convention is, assert it
explicitly
// so the test locks the contract.
EXPECT_EQ(result.ToString(0), "0");
}
```
### Risk assessment
`castDECIMAL_utf8` is an LLVM-IR-compiled function. Adding a C++ helper
means the bitcode module will pull in `std::string` and digit-walking code —
modest size increase but well within what Gandiva already includes (Arrow
utilities are linked). No ABI change; the function signature is unchanged.
Performance impact: one extra pass over the input string plus one small
allocation. For VARCHAR→DECIMAL casts this is dwarfed by the cost of
`Decimal128::FromString` itself, which already walks the string twice
(component parsing + `ShiftAndAdd`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]