QuakeWang commented on code in PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#discussion_r3362567040


##########
core/src/bucket_writer.rs:
##########
@@ -566,6 +643,56 @@ impl BucketWriter {
             }
         }
 
+        for lc in &self.list_columns {
+            let col = lc.physical_col;
+            let values_paged = lc.values_writer.finish_paged();
+
+            let mut composite = Vec::new();
+            varint::encode(&mut composite, lc.total_elements as u32);
+
+            if let Some(ref lengths_page) = column_pages[col] {

Review Comment:
   This writes only the raw lengths column page data into the ARRAY composite 
slot, but `parse_list_column_slot()` later passes it to 
`parse_simple_column_slot()`, which expects a full page_content buffer starting 
with `[encoding, flags, const/meta...]`. With paged layout forced 
(`page_size_threshold = 1`), ARRAY roundtrip fails with `InvalidData: column 
page truncated: plain fixed-width data`. Please either store a complete 
simple-column page_content for lengths/values, or teach the reader to parse the 
exact raw page format written here, and add a forced-paged ARRAY regression 
test.



##########
core/src/bucket_writer.rs:
##########
@@ -926,6 +1057,24 @@ fn i128_to_biginteger_bytes(val: i128) -> Vec<u8> {
     bytes[start..].to_vec()
 }
 
+fn extract_list_lengths(list_array: &ListArray) -> Int32Array {
+    let offsets = list_array.value_offsets();
+    let num_rows = list_array.len();
+    let mut lengths = Vec::with_capacity(num_rows);
+    for i in 0..num_rows {
+        lengths.push(offsets[i + 1] - offsets[i]);

Review Comment:
   For null list rows, this still records the physical offset delta as the 
length. Arrow permits null list slots to have a non-empty offset range, but 
semantically those child values should not be consumed by the logical ARRAY 
column. Combined with `flatten_list_values()` slicing the full offsets range, a 
later non-null row can read values that belonged to a null row. Please write 
length 0 for null list rows and flatten only non-null list ranges; add a 
regression test with a null list row whose offsets cover child values.



##########
core/src/types.rs:
##########
@@ -258,6 +263,10 @@ pub fn deserialize_field(name: &str, buf: &[u8], pos: &mut 
usize) -> Result<Fiel
                 DataType::Timestamp(TimeUnit::Nanosecond, Some(tz))
             }
         }
+        18 => {
+            let element_field = deserialize_field("element", buf, pos)?;

Review Comment:
   This hard-codes the deserialized list child field name to `"element"`, so 
schema roundtrip does not preserve Arrow's original child field name, e.g. 
`List(Field { name: "custom_item", ... })` becomes `List(Field { name: 
"element", ... })`. If Mosaic's schema contract is intended to preserve Arrow 
schema fidelity, the serialized ARRAY descriptor needs to include the child 
field name, or the limitation should be documented and tested explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to