Re: [PR] Adds additional type support to arrow-avro writer [arrow-rs]

via GitHub Wed, 17 Sep 2025 11:32:03 -0700


nathaniel-d-ef commented on code in PR #8298:
URL: https://github.com/apache/arrow-rs/pull/8298#discussion_r2344300369



##########
arrow-avro/src/writer/encoder.rs:
##########
@@ -69,6 +73,77 @@ fn write_bool<W: Write + ?Sized>(out: &mut W, v: bool) -> 
Result<(), ArrowError>
         .map_err(|e| ArrowError::IoError(format!("write bool: {e}"), e))
 }
 
+/// Minimal two's-complement big-endian representation helper for Avro decimal 
(bytes).
+///
+/// For positive numbers, trim leading 0x00 while the next byte's MSB is 0.
+/// For negative numbers, trim leading 0xFF while the next byte's MSB is 1.
+/// The resulting slice still encodes the same signed value.
+///
+/// See Avro spec: decimal over `bytes` uses two's-complement big-endian
+/// representation of the unscaled integer value. 1.11.1 specification.
+#[inline]
+fn minimal_twos_complement(be: &[u8]) -> &[u8] {
+    if be.is_empty() {
+        return be;
+    }
+    let mut i = 0usize;
+    let sign = (be[0] & 0x80) != 0;
+    while i + 1 < be.len() {
+        let b = be[i];
+        let next = be[i + 1];
+        let trim_pos = !sign && b == 0x00 && (next & 0x80) == 0;
+        let trim_neg = sign && b == 0xFF && (next & 0x80) != 0;
+        if trim_pos || trim_neg {
+            i += 1;
+        } else {
+            break;
+        }
+    }
+    &be[i..]
+}
+
+/// Sign-extend (or validate/truncate) big-endian integer bytes to exactly `n` 
bytes.
+///
+/// If `src_be` is longer than `n`, ensure that dropped leading bytes are all 
sign bytes,
+/// and that the MSB of the first kept byte matches the sign; otherwise return 
an overflow error.
+/// If shorter than `n`, left-pad with the sign byte.
+///
+/// Used for Avro decimal over `fixed(N)`.
+#[inline]
+fn sign_extend_to_exact(src_be: &[u8], n: usize) -> Result<Vec<u8>, 
ArrowError> {
+    let len = src_be.len();
+    let sign_byte = if len > 0 && (src_be[0] & 0x80) != 0 {
+        0xFF
+    } else {
+        0x00
+    };
+    if len == n {
+        return Ok(src_be.to_vec());
+    }
+    if len > n {
+        let extra = len - n;
+        if src_be[..extra].iter().any(|&b| b != sign_byte) {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "Decimal value with {} bytes cannot be represented in {} bytes 
without overflow",
+                len, n
+            )));
+        }
+        if n > 0 {
+            let first_kept = src_be[extra];
+            if ((first_kept ^ sign_byte) & 0x80) != 0 {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Decimal value with {} bytes cannot be represented in {} 
bytes without overflow",
+                    len, n
+                )));
+            }
+        }
+        return Ok(src_be[extra..].to_vec());
+    }
+    let mut out = vec![sign_byte; n];
+    out[n - len..].copy_from_slice(src_be);
+    Ok(out)
+}

Review Comment:
   Can't argue with that performance improvement. I found similar results after 
implementing. Excellent suggestion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Adds additional type support to arrow-avro writer [arrow-rs]

Reply via email to