QuakeWang commented on code in PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#discussion_r3362567040
##########
core/src/bucket_writer.rs:
##########
@@ -566,6 +643,56 @@ impl BucketWriter {
}
}
+ for lc in &self.list_columns {
+ let col = lc.physical_col;
+ let values_paged = lc.values_writer.finish_paged();
+
+ let mut composite = Vec::new();
+ varint::encode(&mut composite, lc.total_elements as u32);
+
+ if let Some(ref lengths_page) = column_pages[col] {
Review Comment:
This writes only the raw lengths column page data into the ARRAY composite
slot, but `parse_list_column_slot()` later passes it to
`parse_simple_column_slot()`, which expects a full page_content buffer starting
with `[encoding, flags, const/meta...]`. With paged layout forced
(`page_size_threshold = 1`), ARRAY roundtrip fails with `InvalidData: column
page truncated: plain fixed-width data`. Please either store a complete
simple-column page_content for lengths/values, or teach the reader to parse the
exact raw page format written here, and add a forced-paged ARRAY regression
test.
##########
core/src/bucket_writer.rs:
##########
@@ -926,6 +1057,24 @@ fn i128_to_biginteger_bytes(val: i128) -> Vec<u8> {
bytes[start..].to_vec()
}
+fn extract_list_lengths(list_array: &ListArray) -> Int32Array {
+ let offsets = list_array.value_offsets();
+ let num_rows = list_array.len();
+ let mut lengths = Vec::with_capacity(num_rows);
+ for i in 0..num_rows {
+ lengths.push(offsets[i + 1] - offsets[i]);
Review Comment:
For null list rows, this still records the physical offset delta as the
length. Arrow permits null list slots to have a non-empty offset range, but
semantically those child values should not be consumed by the logical ARRAY
column. Combined with `flatten_list_values()` slicing the full offsets range, a
later non-null row can read values that belonged to a null row. Please write
length 0 for null list rows and flatten only non-null list ranges; add a
regression test with a null list row whose offsets cover child values.
##########
core/src/types.rs:
##########
@@ -258,6 +263,10 @@ pub fn deserialize_field(name: &str, buf: &[u8], pos: &mut
usize) -> Result<Fiel
DataType::Timestamp(TimeUnit::Nanosecond, Some(tz))
}
}
+ 18 => {
+ let element_field = deserialize_field("element", buf, pos)?;
Review Comment:
This hard-codes the deserialized list child field name to `"element"`, so
schema roundtrip does not preserve Arrow's original child field name, e.g.
`List(Field { name: "custom_item", ... })` becomes `List(Field { name:
"element", ... })`. If Mosaic's schema contract is intended to preserve Arrow
schema fidelity, the serialized ARRAY descriptor needs to include the child
field name, or the limitation should be documented and tested explicitly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]