Re: [PR] Support ARRAY type with ORC-style flattened columnar storage [paimon-mosaic]

via GitHub Fri, 05 Jun 2026 19:31:52 -0700


QuakeWang commented on code in PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#discussion_r3366415324



##########
core/src/reader.rs:
##########
@@ -979,12 +1080,41 @@ impl RowGroupReader {
 
             match state {
                 BucketState::Paged { column_readers } => {
+                    // Read all physical columns (N+C)
+                    let mut phys_arrays: Vec<ArrayRef> = Vec::new();
+                    for cr_opt in column_readers {
+                        if let Some(ref cr) = cr_opt {
+                            phys_arrays.push(cr.read_all()?);
+                        } else {
+                            phys_arrays

Review Comment:
   This still breaks paged partial projection when a bucket contains multiple 
ARRAY columns and only one of them is projected. Unprojected physical columns 
are materialized as Int32 null placeholders here, but the code below still 
calls reassemble_list_columns_pub for all ARRAY children in the bucket. For 
example, with arr_a: ARRAY<INT32> and arr_b: ARRAY<INT64> in the same paged 
bucket, projecting only arr_a tries to rebuild arr_b with an Int32 child 
placeholder and panics with "ListArray expected data type Int64 got Int32". 
Please either reassemble only projected ARRAY parents, or create placeholders 
using the expanded physical type and correct child row count. A regression test 
should cover multiple ARRAY columns + paged layout + projecting only one ARRAY.



##########
docs/design.html:
##########
@@ -739,9 +739,10 @@ <h3>TypeDescriptor</h3>
                     <tr><td>15</td><td>TIME</td><td>varint precision</td></tr>
                     <tr><td>16</td><td>TIMESTAMP</td><td>varint 
precision</td></tr>
                     <tr><td>17</td><td>TIMESTAMP_LTZ</td><td>varint precision, 
varint timezoneLength, bytes timezone</td></tr>
+                    <tr><td>18</td><td>ARRAY</td><td>TypeDescriptor (recursive 
element type)</td></tr>

Review Comment:
   This ARRAY type descriptor is stale after the schema roundtrip fix. The 
implementation now serializes the ARRAY element field name before the recursive 
element Field descriptor (`name length + name bytes + 
serialize_field(element_field)`), but the format spec still says only 
"TypeDescriptor". Please update this row so cross-language readers/writers know 
the exact on-disk schema encoding. Also note that the value serialization table 
below still mentions a "values sub-bucket", while the current layout stores 
ARRAY child values as physical columns in the same bucket.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Support ARRAY type with ORC-style flattened columnar storage [paimon-mosaic]

Reply via email to