JingsongLi commented on PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#issuecomment-4637773488

   I re-reviewed the latest head (`60ef99a`). The previous ARRAY DICT, 
multi-ARRAY reassembly, and Date32 null-parent issues look fixed, and the 
flattened ORC-style storage design itself looks reasonable: representing ARRAY 
as a primary INT32 lengths column plus child physical columns in the same 
bucket is simple, projection-friendly, and lets lengths/values reuse the 
existing CONST/DICT/PLAIN encoders.
   
   I still found two issues that should be addressed:
   
   1. The format version still stayed at 1 even though the bucket payload 
layout changed for all files.
   
   `BucketWriter::finish()` now always prefixes monolithic bucket payloads with 
`varint(num_primary) + varint(num_children) + child counts`, even when the 
schema has no ARRAY columns. Paged buckets also now have the child-count header 
before the directory. However `spec::VERSION` remains `1`, so older v1 readers 
will accept new files and parse the new bucket header as old encoding flags / 
directory bytes, while the new reader will also try to parse old v1 files as if 
the new header existed.
   
   This is not only an ARRAY compatibility issue; a new file with only 
primitive columns is no longer v1-compatible. Please bump the on-disk format 
version and either reject old v1 files cleanly or add a v1/v2 bucket parser 
split.
   
   2. `ARRAY<legacy timestamp nanos struct>` is accepted by schema validation 
but can still panic on null parent rows.
   
   `validate_data_type` accepts the legacy timestamp nanos struct 
(`Struct{millis: Int64, nanos_of_milli: Int32}`), and `DataType::List(field)` 
recursively accepts it as an ARRAY element type. The normal write path can 
downcast that struct element type, but when a list has null parent rows, 
`flatten_list_values()` calls `take_array()`. `take_array()` has no 
`Struct(fields) if is_timestamp_nanos_struct(fields)` branch and now panics in 
the fallback.
   
   I reproduced this with a temporary test using `ListArray<Struct(millis, 
nanos_of_milli)>` with offsets `[0, 2, 4, 5]`, validity `[valid, null, valid]`, 
and struct child values. The write panics with:
   
   ```text
   take_array: unsupported DataType Struct([Field { name: "millis", data_type: 
Int64 }, Field { name: "nanos_of_milli", data_type: Int32 }])
   ```
   
   Please either reject legacy timestamp nanos structs when they appear inside 
ARRAY, or add a `take_array` branch for this accepted struct type and a 
regression test with a null parent list row.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to