liukun4515 commented on issue #2387: URL: https://github.com/apache/arrow-rs/issues/2387#issuecomment-1224004677
> I'm really struggling to review the decimal array work without an answer to this, as the current approach is wildly inconsistent to the point of being effectively meaningless... As it isn't consistently enforced, there is no reliable way to elide validation as you cannot assume validation has actually been performed > > * from_iter - no validation > * with_precision_and_scale - validation if precision less > * DecimalBuilder - disable validation with safe API > > The result is that **it is perfectly possible to construct a DecimalArray with overflowed values using safe APIs,** consequently it is unclear how you can then meaningfully elide validation, or reason about if an operation will overflow, as you can't guarantee the input data wasn't already overflowed... It is also unclear what the meaning of validation is... > > For example, if you use `from_iter` to `collect` an iterator into a `DecimalArray` and then use `with_precision_and_scale` to set the precision and scale, this will elide the validation if the new precision is greater than the default, despite the validation never occurring... > > I can see 3 possible paths forward: > > **Strict Validation** > > * Invariant that no values in a DecimalArray exceed the precision > * Aways validate on construction, and mutation > > This is the most "correct" interpretation, but has significant performance drawbacks, and is much stronger than the guarantees we provide for other arithmetic types. > > **Loose Validation** > > * Mutation operations on DecimalArray saturate at the bounds of the underlying stored type > * Continue to provide APIs to validate no values have overflowed > * Perfectly legal for a DecimalArray to contain overflowed values, and it must not cause UB for this to be the case > > This would allow opt-in overflow detection, but would require the user to perform validation after every arithmetic operation > > **No Validation** > > Continue to provide APIs to validate all values are within bounds, but don't attempt to detect or handle overflow within kernels. This is consistent with how we handle overflow elsewhere and is perfectly well defined. > > > I think that is why the things become complicated. We may need to validate decimal values at some moments > > My 2 cents on this is we only need to validate the invariants that arrow-rs needs to in order to prevent UB. If some other software component has additional invariants, it is on that component to verify them. Provided we clearly document the invariants we uphold, I don't think this is an issue. > > My preference would therefore be to option 3, as it is the simplest and fastest to implement, and is consistent with how we handle overflow elsewhere. In my opinion, the reason of doing validation for the decimal array is that we would like to make sure all the elements in the array are the range for the precision. If we don't care about the if the element in the array is out of bounds, then we don't need to do validation in the kernel. For example, I call the `Cast` method to cast `int32 array` to `int8 array`, do we need to do validation and check the overflow for the casting? If we need to do the check and return the error message to user, we need to do validation in the kernel api. But as far as Decimal data type, If I cast the `Decimal(8,0) array) to `Decimal(12,0) array`, each element can't be overflow or out of the range for the new precision, we don't need to do validation and improve the performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
