yshcz opened a new issue, #1981:
URL: https://github.com/apache/iceberg-rust/issues/1981

   ### Apache Iceberg Rust version
   
   None
   
   ### Describe the bug
   
   A value like `-1` has multiple valid two's complement representations:
   * 1 byte: 0xFF
   * 2 bytes: 0xFF 0xFF
   * 3 bytes: 0xFF 0xFF
   * and so on
   
   and the spec requires using the minimum representation for hashing:
   
   > Decimal values are hashed using the minimum number of bytes required to 
hold the unscaled value as a two's complement big-endian
   
   
   The current implementation converts i128 to bytes using 
https://doc.rust-lang.org/std/primitive.i128.html#method.to_be_bytes, which 
always returns a fixed [u8; 16] array. It then tries to find the minimum 
representation by skipping leading 0x00 bytes:
   
   ```rust
     let bytes = v.to_be_bytes();  // always 16 bytes
     if let Some(start) = bytes.iter().position(|&x| x != 0) {
         Self::hash_bytes(&bytes[start..])
     } ...
   ```
   
   For -1, to_be_bytes() returns [0xFF; 16]. Since no byte is 0x00, all 16 
bytes are hashed instead of just [0xFF].
   
   This causes hash mismatches with Java and Python implementations. For 
example, rows written by Spark with bucket[N] partitioning on a decimal column 
will be assigned to different buckets by iceberg-rust, causing incorrect 
partition pruning or writes to wrong files.
   
   A possible fix is to use num_bigint:
   
   ```rust
   use num_bigint::BigInt;
   
   fn hash_decimal(v: i128) -> i32 {
       let bytes = BigInt::from(v).to_signed_bytes_be();
       Self::hash_bytes(&bytes)
   }
   ```
   
   But this allocates a Vec\<u8\> on every call. Would be better to compute the 
minimum representation directly from the [u8; 16]
   
   ### To Reproduce
   
   PyIceberg:
   ```Python
   from decimal import Decimal
   from pyiceberg.utils.decimal import decimal_to_bytes
   import mmh3
   
   print(mmh3.hash(decimal_to_bytes(Decimal(1))))   # -463810133
   print(mmh3.hash(decimal_to_bytes(Decimal(-1))))  # -43192051
   ```
   
   iceberg-rust:
   ```Rust
   #[test]
   fn test_hash_decimal_with_negative_value() {
       assert_eq!(Bucket::hash_decimal(1), -463810133);  // passes
       assert_eq!(Bucket::hash_decimal(-1), -43192051);  // fails
   }
   ```
   
   
   ### Expected behavior
   
   _No response_
   
   ### Willingness to contribute
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to