QuakeWang commented on code in PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#discussion_r3366415324
##########
core/src/reader.rs:
##########
@@ -979,12 +1080,41 @@ impl RowGroupReader {
match state {
BucketState::Paged { column_readers } => {
+ // Read all physical columns (N+C)
+ let mut phys_arrays: Vec<ArrayRef> = Vec::new();
+ for cr_opt in column_readers {
+ if let Some(ref cr) = cr_opt {
+ phys_arrays.push(cr.read_all()?);
+ } else {
+ phys_arrays
Review Comment:
This still breaks paged partial projection when a bucket contains multiple
ARRAY columns and only one of them is projected. Unprojected physical columns
are materialized as Int32 null placeholders here, but the code below still
calls reassemble_list_columns_pub for all ARRAY children in the bucket. For
example, with arr_a: ARRAY<INT32> and arr_b: ARRAY<INT64> in the same paged
bucket, projecting only arr_a tries to rebuild arr_b with an Int32 child
placeholder and panics with "ListArray expected data type Int64 got Int32".
Please either reassemble only projected ARRAY parents, or create placeholders
using the expanded physical type and correct child row count. A regression test
should cover multiple ARRAY columns + paged layout + projecting only one ARRAY.
##########
docs/design.html:
##########
@@ -739,9 +739,10 @@ <h3>TypeDescriptor</h3>
<tr><td>15</td><td>TIME</td><td>varint precision</td></tr>
<tr><td>16</td><td>TIMESTAMP</td><td>varint
precision</td></tr>
<tr><td>17</td><td>TIMESTAMP_LTZ</td><td>varint precision,
varint timezoneLength, bytes timezone</td></tr>
+ <tr><td>18</td><td>ARRAY</td><td>TypeDescriptor (recursive
element type)</td></tr>
Review Comment:
This ARRAY type descriptor is stale after the schema roundtrip fix. The
implementation now serializes the ARRAY element field name before the recursive
element Field descriptor (`name length + name bytes +
serialize_field(element_field)`), but the format spec still says only
"TypeDescriptor". Please update this row so cross-language readers/writers know
the exact on-disk schema encoding. Also note that the value serialization table
below still mentions a "values sub-bucket", while the current layout stores
ARRAY child values as physical columns in the same bucket.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]