luoyuxia commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2938009613


##########
crates/paimon/src/spec/data_file.rs:
##########
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 
<https://github.com/apache/paimon/blob/release-0.8.2/paimon-common/src/main/java/org/apache/paimon/data/BinaryRow.java>
-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
     arity: i32,
     null_bits_size_in_bytes: i32,
+
+    /// Raw binary data backing this row. Empty when constructed via `new()`.
+    /// Populated via `from_bytes()` for partition data from manifest entries.
+    #[serde(skip)]

Review Comment:
   Currently BinaryRow's Serialize/Deserialize is not used anywhere — DataSplit 
only derives Debug, and DataFileMeta._MIN_KEY/_MAX_KEY are Vec<u8>, not 
BinaryRow.
   
   The #[serde(skip)] on data is problematic: if we later need DataSplit (which 
contains BinaryRow) to be serializable (e.g. for distributed task dispatch), 
the data field must participate in serde — otherwise the receiver gets an empty 
partition. Skipping it now risks a subtle bug later.
   
   I'd suggest either:
   
   Remove Serialize/Deserialize and #[serde(skip)] entirely since nothing uses 
them today, or
   Keep Serialize/Deserialize but remove #[serde(skip)] (use #[serde(with = 
"serde_bytes")] instead) so that when DataSplit eventually derives 
Serialize/Deserialize, BinaryRow is already correct and we don't forget to fix 
this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to