gruuya commented on code in PR #8700:
URL: https://github.com/apache/arrow-rs/pull/8700#discussion_r2693688957
##########
arrow-cast/src/parse.rs:
##########
@@ -2752,6 +2740,35 @@ mod tests {
let result = parse_decimal::<Decimal256Type>(s, 76, scale);
assert_eq!(i, result.unwrap());
}
+
+ let zero_scale_tests = [
+ ("0.123", 0, 3),
+ ("1.0", 1, 3),
+ ("1.2", 1, 3),
+ ("1.00", 1, 3),
+ ("1.23", 1, 3),
+ ("1.000", 1, 3),
+ ("1.123", 1, 3),
+ ("123.0", 123, 3),
+ ("123.4", 123, 3),
+ ("123.00", 123, 3),
+ ("123.45", 123, 3),
+ ("123.000000000000000000004", 123, 3),
+ ("0.123e2", 12, 3),
+ ("0.123e4", 1230, 10),
+ ("1.23e4", 12300, 10),
+ ("12.3e4", 123000, 10),
+ ("123e4", 1230000, 10),
+ (
+ "20000000000000000000000000000000000002.0",
+ 20000000000000000000000000000000000002,
+ 38,
+ ),
+ ];
+ for (s, i, precision) in zero_scale_tests {
+ let result_128 = parse_decimal::<Decimal128Type>(s, precision,
0).unwrap();
+ assert_eq!(i, result_128);
+ }
Review Comment:
Thanks, these cases turned out to be very valuable.
For one thing `"."` case wasn't being correctly handled for zero scale
decimals, I fixed that now by factoring out various validation points
throughout the code into a common unsupported check at the start
```rust
if matches!(s, "" | "-" | "+" | ".") {
return Err(ArrowError::ParseError(format!(
"can't parse the string value {s} to decimal"
)));
}
```
I found this is a lot easier to grok, since previously these validations
were scattered throughout `parse_decimal` and deduced implicitly based on
side-effects (e.g. we finished parsing a decimal with a decimal point but the
digit count is zero). Let me know if you have performance concerns about this,
and I can resort to the old (implicit) manner on deducing unsupported strings.
For another thing, the test cases `"1e"`, `"1e+"` and `"1e-"` are curious.
Strictly speaking they aren't valid, but I wouldn't be surprised if some system
can parse/generate these. The question is what should `parse_decimal` output
for them. I tested these on both main and this branch:
```
| input | output(main)
| output(this branch) |
|------------------------------------------------|-------------------------------------------------------------------|---------------------|
| parse_decimal::<Decimal128Type>("1e", 5, 0) | Err(ParseError("can't
parse the string value 1e to decimal")) | Ok(1) |
| parse_decimal::<Decimal128Type>("1e+", 5, 0) | Ok(1)
| Ok(1) |
| parse_decimal::<Decimal128Type>("1e-", 5, 0) | Ok(1)
| Ok(1) |
| parse_decimal::<Decimal128Type>("1e", 5, 2) | Err(ParseError("can't
parse the string value 1e to decimal")) | Ok(100) |
| parse_decimal::<Decimal128Type>("1e+", 5, 2) | Ok(100)
| Ok(100) |
| parse_decimal::<Decimal128Type>("1e-", 5, 2) | Ok(100)
| Ok(100) |
```
To make these consistent I think we should error out on all of them (that's
what Postgres does), at least until there's a worthwhile precedent to parse
some of these.
So _if_ the `matches!` approach from above is fine with you, and _if_ you
agree we should make these unsupported, then I can just append these patterns
there. Otherwise, let me know which ones would you think it'd make sense to
parse.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]