liukun4515 commented on issue #2637: URL: https://github.com/apache/arrow-rs/issues/2637#issuecomment-1257323227
> I had a play around implementing this and it appears to be very tractable. What I did is as follows:
>
> * Define `struct i256([u8; 32])` and implement `ArrowNativeType` for it

In order to do unified processing, I suggest using `[u8; 16]` to represent `i128` in the same way as `i256`. We can then implement operations such as comparison and arithmetic directly on the `u8` arrays (a rough sketch of this idea is included at the end of this comment).

> * Implement `ArrowPrimitiveType` for `Decimal128Type` and `Decimal256Type` with `Native` of `i128` and `i256` respectively

Using `i128([u8;16])` and `i256([u8;32])` as the `Native` types is a good point. My concern is the `const DATA_TYPE: DataType` for this: the decimal data types are effectively infinite and can't be enumerated. We can't enumerate every decimal type the way we can enumerate timestamps with their limited set of time units. How do we determine the `DataType` for such an `ArrowPrimitiveType`? This is the key point for me (see the second sketch at the end of this comment).

> * Replace DecimalArray and DecimalBuilder with PrimitiveArray and PrimitiveBuilder
> * Add specialized `impl PrimitiveArray<T: DecimalType>` adding `with_precision_and_scale`
> * Add specialized `impl PrimitiveBuilder<T: DecimalType>` adding `with_precision_and_scale`
> * Add `ArrayData::validate_decimal_values` which validates values are within bounds of decimal precision
> * Remove old decimal logic, including specialized take, comparison kernels, etc...
> * Add public free function to format signed little endian byte arrays with a given precision and scale
> * Implement ArrowNativeOps for `i128` and `i256`
> * Casting precision becomes purely a metadata operation, casting scale can use the normal arithmetic operations
>
> This appears to work quite well, and I'm fairly happy this will achieve the stated aims of massively simplifying the decimal implementation, making it more consistent with other types such as timestamps, and provide an easy path forward to getting full kernel coverage.
>
> There are, however, a couple of major implications of this change:
>
> * The values returned from `PrimitiveArray::value` will be the storage type, and **not** a semantic `Decimal` type. This is the same as timestamps, times, durations, etc... and I think is therefore fine, but is worth highlighting
> * Precision will **not** be validated by default, instead only via explicit calls to `validate_decimal_values` [Decimal Precision Validation #2387](https://github.com/apache/arrow-rs/issues/2387). This makes the builders, `FromIterator` etc... consistently infallible (currently this is inconsistent)
> * Overflow for checked kernels will be at the boundary of the storage type, i.e. `i128` or `i256` and not based on precision, users who wish to catch precision-based overflow will need to use the checked calls and then make an explicit call to validate the result
> * Arithmetic, comparison, etc... across arrays with different scale will be non-trivial (although it would be complicated regardless)

I think we don't need to support operations across arrays with different scales; we should instead add a checker that enforces this assumption. If the `Native` type for a decimal carried its data type (precision and scale), could this be resolved?

> If people are happy with this path forward, I will polish up this approach over the coming week, but I'm loathe to work on this further if the consensus isn't there. Even better would be if someone is willing to help out with this effort 😄?
>
> Thoughts @alamb @liukun4515 @HaoYang670 @kmitchener @viirya?

I personally still lean towards implementing the decimal data types on their own, rather than combining them with all the primitive data types.
That said, this is also an approach I have considered before, and I basically agree with your proposal, with the comments and questions above. @tustvold
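
To make the byte-array suggestion above a little more concrete, here is a rough, self-contained sketch of a 256-bit signed value stored as 32 little-endian bytes, with comparison implemented directly on the bytes; the same pattern would cover `i128` as `[u8; 16]`. The type and method names are illustrative only, not the actual arrow-rs API:

```rust
use std::cmp::Ordering;

/// Illustrative stand-in for the proposed `struct i256([u8; 32])`:
/// a 256-bit signed integer stored as 32 little-endian bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct I256([u8; 32]);

impl I256 {
    /// Sign-extend an `i128` into the 32-byte little-endian representation.
    pub fn from_i128(v: i128) -> Self {
        let mut bytes = [if v < 0 { 0xFF } else { 0x00 }; 32];
        bytes[..16].copy_from_slice(&v.to_le_bytes());
        I256(bytes)
    }
}

impl Ord for I256 {
    fn cmp(&self, other: &Self) -> Ordering {
        // Two's-complement trick: flip the sign bit of the most significant
        // byte, then compare bytes from most significant to least significant.
        let flip = |b: &[u8; 32]| {
            let mut v = *b;
            v[31] ^= 0x80;
            v
        };
        flip(&self.0).iter().rev().cmp(flip(&other.0).iter().rev())
    }
}

impl PartialOrd for I256 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    assert!(I256::from_i128(-1) < I256::from_i128(0));
    assert!(I256::from_i128(42) > I256::from_i128(-42));
}
```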
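
And to illustrate the `const DATA_TYPE` concern: the associated constant pins one `DataType` per Rust type, which works for something like `Int32` but not for a family parameterized by precision and scale. The enum and trait below are simplified stand-ins, not the real arrow definitions; the resolution shown (a placeholder value at the type level, with the concrete precision and scale carried on the array instead) is only one hypothetical option:

```rust
#![allow(dead_code)]

// Simplified stand-ins for arrow's `DataType` and `ArrowPrimitiveType`,
// only to show why a single associated constant is awkward for decimals.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    // A decimal data type carries precision and scale, so there is no single
    // canonical `DataType` value covering all Decimal128 arrays.
    Decimal128(u8, i8),
}

trait ArrowPrimitiveType {
    type Native;
    const DATA_TYPE: DataType;
}

struct Int32Type;
impl ArrowPrimitiveType for Int32Type {
    type Native = i32;
    const DATA_TYPE: DataType = DataType::Int32; // unambiguous
}

struct Decimal128Type;
impl ArrowPrimitiveType for Decimal128Type {
    type Native = i128;
    // Hypothetical resolution: a placeholder (maximum precision, zero scale)
    // at the type level; the actual precision and scale would live on the
    // array metadata via something like `with_precision_and_scale`.
    const DATA_TYPE: DataType = DataType::Decimal128(38, 0);
}

fn main() {
    assert_eq!(Decimal128Type::DATA_TYPE, DataType::Decimal128(38, 0));
}
```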
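
Finally, a minimal sketch of the quoted point that casting precision is metadata-only while casting scale is ordinary arithmetic on the native value: changing the scale just multiplies or divides by a power of ten. The function name, truncation behaviour, and the absence of precision checks here are assumptions for illustration:

```rust
// Rescale a raw i128 decimal value from one scale to another.
// `checked_*` keeps overflow observable instead of panicking or wrapping.
fn rescale_i128(value: i128, from_scale: u32, to_scale: u32) -> Option<i128> {
    if to_scale >= from_scale {
        value.checked_mul(10_i128.checked_pow(to_scale - from_scale)?)
    } else {
        value.checked_div(10_i128.checked_pow(from_scale - to_scale)?)
    }
}

fn main() {
    // 1.23 at scale 2 becomes 1.2300 at scale 4 (raw value 12300).
    assert_eq!(rescale_i128(123, 2, 4), Some(12_300));
    // 1.239 at scale 3 truncates to 1.23 at scale 2 (raw value 123).
    assert_eq!(rescale_i128(1_239, 3, 2), Some(123));
}
```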
