Hi all, I’d like to propose the following changes to the in-memory Decimal128 format and solicit feedback.
1. When converting to and from an array of bytes, the input bytes are *assumed* to be in big-endian order and the output bytes are *guaranteed* to be in big-endian order. Additionally, the number of bytes is no longer assumed to be 16, and the length must be passed to the Decimal128 constructor.

2. The byte width of the underlying FixedSizeBinary builder is the minimum number of bytes necessary to represent a Decimal128 of a given precision. For example, if my decimal type has precision 21, the byte width of my FixedSizeBinary builder will be 9. (A rough sketch of this mapping is included at the end of this message.)

Both of these changes are motivated by the Parquet file format's decimal specification and the Impala Parquet reader/writer.

Pros:

1. Parquet writes from Arrow arrays write the minimum number of bytes to disk, saving space, and no additional logic is required in parquet-cpp.

2. parquet-cpp would be able to easily read Parquet files written by Impala and easily write Impala-compatible Parquet files.

Cons:

1. Additional work is required in the Java implementation, since right now decimals there are always 16 bytes.

2. Integration tests are broken because decimal byte widths are no longer reliably 16 bytes.

3. Possibly worse in-memory performance because array-element byte widths are not powers of two.

Possible alternatives:

- Implement the logic only in parquet-cpp. I originally went down this road, but it was significantly more complicated to do this in parquet-cpp than it was to change the Arrow decimal format. I'd be willing to push harder on this if changing the Arrow decimal format in the proposed way is not a viable option.

I have a WIP PR up that implements this on the Arrow side, but I'm happy to can it (or not) based on this discussion: https://github.com/apache/arrow/pull/1108.

Thanks,
Phillip
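
P.S. To make point 2 concrete, here is a minimal sketch of how the precision-to-byte-width mapping could look. This is my own illustration, not code from the PR, and the name MinBytesForPrecision is hypothetical: it returns the smallest number of bytes whose signed two's-complement range covers 10^precision - 1 and its negation.

#include <cmath>
#include <cstdint>
#include <iostream>

// Hypothetical helper: smallest byte width whose signed two's-complement
// range can hold every value with the given decimal precision.
int32_t MinBytesForPrecision(int32_t precision) {
  int32_t num_bytes = 1;
  // 2^(8 * num_bytes - 1) - 1 is the largest value representable in
  // num_bytes signed bytes; grow until it can hold 10^precision - 1.
  while (std::pow(2.0, 8 * num_bytes - 1) < std::pow(10.0, precision)) {
    ++num_bytes;
  }
  return num_bytes;
}

int main() {
  // precision 21 -> 9 bytes, matching the example above;
  // precision 38 (the Decimal128 maximum) -> 16 bytes.
  for (int32_t precision : {1, 9, 18, 21, 38}) {
    std::cout << "precision " << precision << " -> "
              << MinBytesForPrecision(precision) << " bytes\n";
  }
  return 0;
}

Under this scheme the FixedSizeBinary builder's byte width would come from a mapping like the one above rather than being fixed at 16, which is what lets the Parquet writer emit the minimum number of bytes without extra logic in parquet-cpp.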