QuakeWang commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2937976765


##########
crates/paimon/src/spec/data_file.rs:
##########
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 
<https://github.com/apache/paimon/blob/release-0.8.2/paimon-common/src/main/java/org/apache/paimon/data/BinaryRow.java>
-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
     arity: i32,
     null_bits_size_in_bytes: i32,
+
+    /// Raw binary data backing this row. Empty when constructed via `new()`.
+    /// Populated via `from_bytes()` for partition data from manifest entries.
+    #[serde(skip)]

Review Comment:
   > won't `SerializationUtils.serializeBinaryRow` serialize the data bytes?
   
   Yes, `SerializationUtils.serializeBinaryRow` does serialize the full row 
bytes (including the 4-byte arity prefix), but that is a separate binary codec 
used to fill a parent bytes field such as `_PARTITION`.
   
   On the Rust side, `_PARTITION` is first deserialized into 
`ManifestEntry.partition: Vec<u8>`, and only then converted to `BinaryRow` via 
`BinaryRow::from_bytes(...)` at runtime.
   
   `#[serde(skip)]` only affects `BinaryRow`'s own derived 
`Serialize`/`Deserialize`. I added it so introducing this runtime backing 
buffer does not change BinaryRow's current serde shape or the existing 
empty-row roundtrip behavior. If we ever need serde to roundtrip a populated 
`BinaryRow`, we should switch to a dedicated custom serializer instead of 
relying on the derived struct serde.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to