QuakeWang commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2938214535
##########
crates/paimon/src/spec/data_file.rs:
##########
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly;
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by
offset+length from fixed part.
///
/// Impl Reference:
<https://github.com/apache/paimon/blob/release-0.8.2/paimon-common/src/main/java/org/apache/paimon/data/BinaryRow.java>
-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct BinaryRow {
arity: i32,
null_bits_size_in_bytes: i32,
+
+ /// Raw binary data backing this row. Empty when constructed via `new()`.
+ /// Populated via `from_bytes()` for partition data from manifest entries.
+ #[serde(skip)]
Review Comment:
> Currently BinaryRow's Serialize/Deserialize is not used anywhere —
DataSplit only derives Debug, and DataFileMeta._MIN_KEY/_MAX_KEY are Vec, not
BinaryRow.
>
> The #[serde(skip)] on data is problematic: if we later need DataSplit
(which contains BinaryRow) to be serializable (e.g. for distributed task
dispatch), the data field must participate in serde — otherwise the receiver
gets an empty partition. Skipping it now risks a subtle bug later.
>
> I'd suggest either:
>
> Remove Serialize/Deserialize and #[serde(skip)] entirely since nothing
uses them today, or Keep Serialize/Deserialize but remove #[serde(skip)] (use
#[serde(with = "serde_bytes")] instead) so that when DataSplit eventually
derives Serialize/Deserialize, BinaryRow is already correct and we don't forget
to fix this.
Good point, thanks for the review. I adopted option B by replacing
`#[serde(skip)` with `#[serde(with = "serde_bytes")]`, so `BinaryRow.data` is
preserved in Rust serde roundtrips.
I also added test_serde_roundtrip_populated to verify that a populated
BinaryRow roundtrips with its backing bytes intact, in addition to the
empty-row case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]