dmitry-chirkov-dremio commented on issue #49817:
URL: https://github.com/apache/arrow/issues/49817#issuecomment-4284450598

   Burned few tokens so you don't have to. Feel free to ignore and apply your 
own reasoning.
   ---
   
   Two layered fixes. They are independent but complementary:
   
   - **Fix 1** gives `Decimal{128,256}::FromString` a clear "lossless-or-bust" 
contract, so callers can trust the return value. Small blast radius, strong 
invariant.
   - **Fix 2** makes Gandiva's `castDECIMAL_utf8` the only Arrow caller that 
intentionally handles lossy SQL-style casts by rounding *before* parsing. 
Preserves the user-visible behavior of legal SQL like `CAST('1.555…' AS 
DECIMAL(38,37))` after fix 1 tightens the parser.
   
   Without fix 2, fix 1 would regress queries that used to return wrong results 
but now return an error. Without fix 1, fix 2 only closes the Gandiva code 
path; other Arrow callers (compute kernels, CSV converter, Python bindings that 
land on `FromString`, …) would still silently corrupt on out-of-range input.
   
   ---
   
   ## Fix 1 — `Decimal{128,256}::FromString` returns `Status::Invalid` when 
input exceeds type range
   
   **File**: `cpp/src/arrow/util/decimal.cc`
   **Function**: `DecimalFromString<Decimal>` (and its `Simple*` sibling for 
Decimal32/64)
   
   ### Change
   
   After the existing `parsed_precision` / `parsed_scale` derivation, validate 
against the target decimal type's compile-time limits before writing the value:
   
   ```cpp
   if (parsed_precision > Decimal::kMaxPrecision) {
     return Status::Invalid(
         "The string '", s, "' cannot be represented as ", type_name,
         ": parsed precision ", parsed_precision,
         " exceeds maximum precision ", Decimal::kMaxPrecision);
   }
   if (parsed_scale > Decimal::kMaxScale) {
     return Status::Invalid(
         "The string '", s, "' cannot be represented as ", type_name,
         ": parsed scale ", parsed_scale,
         " exceeds maximum scale ", Decimal::kMaxScale);
   }
   ```
   
   Place the checks **before** the `ShiftAndAdd` loop so no bits are written to 
`out` in the error path.
   
   Apply the same block to `SimpleDecimalFromString<T>` so Decimal32 (precision 
≤ 9) and Decimal64 (precision ≤ 18) get the same guarantee.
   
   ### Contract after fix
   
   - If `FromString` returns `Status::OK`, the written `Decimal` value is an 
exact, lossless representation of the input string.
   - Any lossy conversion (rounding, truncation, scale adjustment) is the 
caller's responsibility — the parser never makes silent approximations.
   
   ### Test additions
   
   Target file: `cpp/src/arrow/util/decimal_test.cc`
   
   ```cpp
   TEST(Decimal128Test, FromStringRejectsExcessPrecision) {
     Decimal128 out;
     int32_t precision = 0, scale = 0;
   
     // 38 significant digits — boundary, must succeed.
     ASSERT_OK(Decimal128::FromString(
         "1.5555555555555555555555555555555555555",
         &out, &precision, &scale));
     EXPECT_EQ(precision, 38);
     EXPECT_EQ(scale, 37);
   
     // 39 significant digits — first disallowed case.
     ASSERT_RAISES(Invalid, Decimal128::FromString(
         "1.55555555555555555555555555555555555555",
         &out, &precision, &scale));
   
     // 51 significant digits — the SQL repro from the issue.
     ASSERT_RAISES(Invalid, Decimal128::FromString(
         "1.55555555555555555555555555555555555555555555555555",
         &out, &precision, &scale));
   
     // Negative sign must not change precision accounting.
     ASSERT_RAISES(Invalid, Decimal128::FromString(
         "-1.55555555555555555555555555555555555555",
         &out, &precision, &scale));
   
     // Leading zeros are not significant; 38 significant digits must pass.
     ASSERT_OK(Decimal128::FromString(
         "0000001.5555555555555555555555555555555555555",
         &out, &precision, &scale));
     EXPECT_EQ(precision, 38);
   
     // Integer-only overflow.
     ASSERT_RAISES(Invalid, Decimal128::FromString(
         "999999999999999999999999999999999999999",  // 39 nines
         &out, &precision, &scale));
   }
   
   TEST(Decimal128Test, FromStringRejectsExcessScale) {
     Decimal128 out;
     int32_t precision = 0, scale = 0;
   
     // Scientific notation pushing scale past kMaxScale = 38.
     ASSERT_RAISES(Invalid, Decimal128::FromString(
         "1e-100", &out, &precision, &scale));
   
     // Scale exactly kMaxScale — boundary, must succeed.
     ASSERT_OK(Decimal128::FromString(
         "1e-38", &out, &precision, &scale));
     EXPECT_EQ(scale, 38);
   }
   
   TEST(Decimal256Test, FromStringRejectsExcessPrecision) {
     Decimal256 out;
     int32_t precision = 0, scale = 0;
   
     // 76 significant digits — Decimal256 boundary, must succeed.
     std::string sig76(75, '5');
     ASSERT_OK(Decimal256::FromString("1." + sig76, &out, &precision, &scale));
     EXPECT_EQ(precision, 76);
   
     // 77 significant digits — first disallowed case.
     std::string sig77(76, '5');
     ASSERT_RAISES(Invalid, Decimal256::FromString(
         "1." + sig77, &out, &precision, &scale));
   }
   
   TEST(Decimal32Test, FromStringRejectsExcessPrecision) {
     Decimal32 out;
     int32_t precision = 0, scale = 0;
   
     // 9 significant digits — boundary, must succeed.
     ASSERT_OK(Decimal32::FromString("123456789", &out, &precision, &scale));
   
     // 10 significant digits — first disallowed case.
     ASSERT_RAISES(Invalid, Decimal32::FromString(
         "1234567890", &out, &precision, &scale));
   }
   
   TEST(Decimal64Test, FromStringRejectsExcessPrecision) {
     Decimal64 out;
     int32_t precision = 0, scale = 0;
   
     // 18 significant digits — boundary, must succeed.
     ASSERT_OK(Decimal64::FromString(
         "123456789012345678", &out, &precision, &scale));
   
     // 19 significant digits — first disallowed case.
     ASSERT_RAISES(Invalid, Decimal64::FromString(
         "1234567890123456789", &out, &precision, &scale));
   }
   ```
   
   ### Risk assessment
   
   Callers that today trust `FromString` to succeed on any syntactically valid 
decimal string will now see `Status::Invalid` on out-of-range inputs. Audit 
needed:
   
   - `cpp/src/gandiva/gdv_function_stubs.cc::gdv_fn_dec_from_string` — already 
propagates status correctly. (Fix #2 below makes this a non-issue in practice.)
   - `cpp/src/arrow/csv/converter.cc` — already returns the status to the row 
reader; a CSV row with 50-digit decimals will now surface an error instead of 
garbage. Desired behavior.
   - `cpp/src/arrow/python/decimal.cc` — does not call `FromString` (uses 
`DecimalFromPyDecimal`); unaffected.
   
   Existing tests that construct Decimals from strings like 
`"9999999999999999999999999999999999999999"` (intentionally out-of-range) will 
flip from pass to error. Grep `cpp/src/arrow/util/decimal_test.cc` and fix 
expectations as part of this patch.
   
   ---
   
   ## Fix 2 — Gandiva `castDECIMAL_utf8` rounds to target scale before parsing
   
   **File**: `cpp/src/gandiva/precompiled/decimal_wrapper.cc`
   **Function**: `castDECIMAL_utf8` (around line 398)
   
   ### Problem solved
   
   After fix 1, `CAST('1.555…(50 fives)' AS DECIMAL(38,37))` would raise an 
error instead of returning the correctly rounded value 
`1.5555555555555555555555555555555555556`. That's a behavior regression for 
legal SQL input. `castDECIMAL_utf8` is the one place where the target 
`(out_precision, out_scale)` is known before parsing, so it's the natural home 
for caller-side rounding.
   
   This fix also closes a separate, pre-existing bug: even when the input 
*does* fit in 128 bits (e.g. 38 significant digits), the current code uses 
`decimalops::Convert` to rescale, which **truncates** instead of rounding 
half-up. Pre-rounding in the string domain makes Gandiva's VARCHAR→DECIMAL 
match the Java `CastVarCharDecimal` path exactly (Java uses 
`RoundingMode.HALF_UP` in `java.math.BigDecimal::setScale`).
   
   ### Change
   
   Introduce a helper that normalizes the digit string to fit the target scale, 
with HALF_UP rounding, before handing off to `Arrow::Decimal128::FromString`:
   
   ```cpp
   // Round a decimal string to `target_scale` fractional digits using HALF_UP.
   // Returns a std::string because rounding can extend the integer part by one
   // digit when the rounding carry propagates.
   // Operates on the raw digit representation — no 128-bit arithmetic needed.
   std::string RoundDecimalStringHalfUp(const char* in, int32_t in_length,
                                        int32_t target_scale);
   ```
   
   Then in `castDECIMAL_utf8`:
   
   ```cpp
   void castDECIMAL_utf8(int64_t context, const char* in, int32_t in_length,
                         int32_t out_precision, int32_t out_scale, int64_t* 
out_high,
                         uint64_t* out_low) {
     std::string rounded = RoundDecimalStringHalfUp(in, in_length, out_scale);
   
     int64_t dec_high_from_str;
     uint64_t dec_low_from_str;
     int32_t precision_from_str;
     int32_t scale_from_str;
     int32_t status = gdv_fn_dec_from_string(
         context, rounded.data(), static_cast<int32_t>(rounded.size()),
         &precision_from_str, &scale_from_str,
         &dec_high_from_str, &dec_low_from_str);
     if (status != 0) {
       *out_high = 0;
       *out_low = 0;
       return;
     }
   
     gandiva::BasicDecimalScalar128 x({dec_high_from_str, dec_low_from_str},
                                      precision_from_str, scale_from_str);
     bool overflow = false;
     auto out = gandiva::decimalops::Convert(x, out_precision, out_scale, 
&overflow);
     *out_high = out.high_bits();
     *out_low = out.low_bits();
   }
   ```
   
   ### Helper contract (for review)
   
   After `RoundDecimalStringHalfUp(in, target_scale)`:
   
   - The returned string has **at most `target_scale` fractional digits**.
   - The returned string is a valid decimal literal accepted by 
`Arrow::Decimal128::FromString`.
   - Rounding direction is HALF_UP: at the `target_scale + 1` digit, if digit ≥ 
5 the truncation point is rounded up; ties go away from zero.
   - Leading sign (`+` / `-`) and exponent notation (`e±N`) are preserved / 
normalized.
   - The integer part may grow by one digit due to carry (e.g. `"9.99"` rounded 
to scale 1 → `"10.0"`), but never by more than one. The caller (`Convert`) 
handles cases where the resulting precision exceeds `out_precision`
   
   ### Test additions
   
   Target file: `cpp/src/gandiva/tests/decimal_test.cc`
   
   Unit-level tests for `RoundDecimalStringHalfUp`:
   
   ```cpp
   TEST(DecimalStringRounding, HalfUpBasic) {
     EXPECT_EQ(RoundDecimalStringHalfUp("1.5555", 4, 3), "1.556");
     EXPECT_EQ(RoundDecimalStringHalfUp("1.5554", 4, 3), "1.555");
     EXPECT_EQ(RoundDecimalStringHalfUp("1.5555", 4, 0), "2");
     EXPECT_EQ(RoundDecimalStringHalfUp("9.99", 3, 1), "10.0");  // carry 
extends integer part
     EXPECT_EQ(RoundDecimalStringHalfUp("-1.5555", 4, 3), "-1.556");
   }
   
   TEST(DecimalStringRounding, HalfUpNoChangeWhenAlreadyShort) {
     EXPECT_EQ(RoundDecimalStringHalfUp("1.5", 1, 5), "1.5");
     EXPECT_EQ(RoundDecimalStringHalfUp("42", 0, 3), "42");
   }
   
   TEST(DecimalStringRounding, HalfUpExponent) {
     EXPECT_EQ(RoundDecimalStringHalfUp("1.5555e2", 2, 1),
               "155.6");
     EXPECT_EQ(RoundDecimalStringHalfUp("1.5e-3", 1, 4),
               "0.0015");
   }
   ```
   
   End-to-end tests for `castDECIMAL_utf8` (all expected results match what 
`java.math.BigDecimal(str).setScale(scale, HALF_UP)` would produce):
   
   ```cpp
   TEST(CastDecimalUtf8, FiftyDigitInputToDecimal_38_37) {
     auto result = CallCastDecimalUtf8(
         "1.55555555555555555555555555555555555555555555555555",
         /*out_precision=*/38, /*out_scale=*/37);
     EXPECT_EQ(result.ToString(37),
               "1.5555555555555555555555555555555555556");
   }
   
   TEST(CastDecimalUtf8, HundredDigitInputToDecimal_38_37) {
     std::string hundred_ones(100, '1');
     auto result = CallCastDecimalUtf8(
         "1." + hundred_ones,
         /*out_precision=*/38, /*out_scale=*/37);
     EXPECT_EQ(result.ToString(37),
               "1.1111111111111111111111111111111111111");
   }
   
   TEST(CastDecimalUtf8, ThirtyEightDigitInputUsesHalfUpNotTruncate) {
     // Regression: pre-fix Gandiva truncated to 1.5555…5 (37 fives).
     auto result = CallCastDecimalUtf8(
         "1.55555555555555555555555555555555555555",  // 38 significant digits
         /*out_precision=*/38, /*out_scale=*/37);
     EXPECT_EQ(result.ToString(37),
               "1.5555555555555555555555555555555555556");
   }
   
   TEST(CastDecimalUtf8, NegativeInputHalfUp) {
     auto result = CallCastDecimalUtf8(
         "-1.55555555555555555555555555555555555555555555555555",
         /*out_precision=*/38, /*out_scale=*/37);
     EXPECT_EQ(result.ToString(37),
               "-1.5555555555555555555555555555555555556");
   }
   
   TEST(CastDecimalUtf8, ExactRepresentationBelowTargetScale) {
     auto result = CallCastDecimalUtf8(
         "1.5", /*out_precision=*/38, /*out_scale=*/37);
     EXPECT_EQ(result.ToString(37),
               "1.5000000000000000000000000000000000000");
   }
   
   TEST(CastDecimalUtf8, CarryIntoIntegerPart) {
     // "9.99" rounded to scale 1 → "10.0".
     auto result = CallCastDecimalUtf8(
         "9.99", /*out_precision=*/38, /*out_scale=*/1);
     EXPECT_EQ(result.ToString(1), "10.0");
   }
   
   TEST(CastDecimalUtf8, RoundedValueStillExceedsOutputPrecision) {
     // After rounding, value has 11 significant digits; output only holds 3 
before
     // the decimal point. Match pre-existing Gandiva overflow behavior (zero).
     auto result = CallCastDecimalUtf8(
         "99999999999.5", /*out_precision=*/10, /*out_scale=*/0);
     // Whatever the current zero-on-overflow convention is, assert it 
explicitly
     // so the test locks the contract.
     EXPECT_EQ(result.ToString(0), "0");
   }
   ```
   
   ### Risk assessment
   
   `castDECIMAL_utf8` is an LLVM-IR-compiled function. Adding a C++ helper 
means the bitcode module will pull in `std::string` and digit-walking code — 
modest size increase but well within what Gandiva already includes (Arrow 
utilities are linked). No ABI change; the function signature is unchanged.
   
   Performance impact: one extra pass over the input string plus one small 
allocation. For VARCHAR→DECIMAL casts this is dwarfed by the cost of 
`Decimal128::FromString` itself, which already walks the string twice 
(component parsing + `ShiftAndAdd`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to