yshcz opened a new issue, #1981: URL: https://github.com/apache/iceberg-rust/issues/1981
### Apache Iceberg Rust version None ### Describe the bug A value like `-1` has multiple valid two's complement representations: * 1 byte: 0xFF * 2 bytes: 0xFF 0xFF * 3 bytes: 0xFF 0xFF * and so on and the spec requires using the minimum representation for hashing: > Decimal values are hashed using the minimum number of bytes required to hold the unscaled value as a two's complement big-endian The current implementation converts i128 to bytes using https://doc.rust-lang.org/std/primitive.i128.html#method.to_be_bytes, which always returns a fixed [u8; 16] array. It then tries to find the minimum representation by skipping leading 0x00 bytes: ```rust let bytes = v.to_be_bytes(); // always 16 bytes if let Some(start) = bytes.iter().position(|&x| x != 0) { Self::hash_bytes(&bytes[start..]) } ... ``` For -1, to_be_bytes() returns [0xFF; 16]. Since no byte is 0x00, all 16 bytes are hashed instead of just [0xFF]. This causes hash mismatches with Java and Python implementations. For example, rows written by Spark with bucket[N] partitioning on a decimal column will be assigned to different buckets by iceberg-rust, causing incorrect partition pruning or writes to wrong files. A possible fix is to use num_bigint: ```rust use num_bigint::BigInt; fn hash_decimal(v: i128) -> i32 { let bytes = BigInt::from(v).to_signed_bytes_be(); Self::hash_bytes(&bytes) } ``` But this allocates a Vec\<u8\> on every call. Would be better to compute the minimum representation directly from the [u8; 16] ### To Reproduce PyIceberg: ```Python from decimal import Decimal from pyiceberg.utils.decimal import decimal_to_bytes import mmh3 print(mmh3.hash(decimal_to_bytes(Decimal(1)))) # -463810133 print(mmh3.hash(decimal_to_bytes(Decimal(-1)))) # -43192051 ``` iceberg-rust: ```Rust #[test] fn test_hash_decimal_with_negative_value() { assert_eq!(Bucket::hash_decimal(1), -463810133); // passes assert_eq!(Bucket::hash_decimal(-1), -43192051); // fails } ``` ### Expected behavior _No response_ ### Willingness to contribute None -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
