JingsongLi commented on PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#issuecomment-4637773488
I re-reviewed the latest head (`60ef99a`). The previous ARRAY DICT,
multi-ARRAY reassembly, and Date32 null-parent issues look fixed, and the
flattened ORC-style storage design itself looks reasonable: representing ARRAY
as a primary INT32 lengths column plus child physical columns in the same
bucket is simple, projection-friendly, and lets lengths/values reuse the
existing CONST/DICT/PLAIN encoders.
I still found two issues that should be addressed:
1. The format version still stayed at 1 even though the bucket payload
layout changed for all files.
`BucketWriter::finish()` now always prefixes monolithic bucket payloads with
`varint(num_primary) + varint(num_children) + child counts`, even when the
schema has no ARRAY columns. Paged buckets also now have the child-count header
before the directory. However `spec::VERSION` remains `1`, so older v1 readers
will accept new files and parse the new bucket header as old encoding flags /
directory bytes, while the new reader will also try to parse old v1 files as if
the new header existed.
This is not only an ARRAY compatibility issue; a new file with only
primitive columns is no longer v1-compatible. Please bump the on-disk format
version and either reject old v1 files cleanly or add a v1/v2 bucket parser
split.
2. `ARRAY<legacy timestamp nanos struct>` is accepted by schema validation
but can still panic on null parent rows.
`validate_data_type` accepts the legacy timestamp nanos struct
(`Struct{millis: Int64, nanos_of_milli: Int32}`), and `DataType::List(field)`
recursively accepts it as an ARRAY element type. The normal write path can
downcast that struct element type, but when a list has null parent rows,
`flatten_list_values()` calls `take_array()`. `take_array()` has no
`Struct(fields) if is_timestamp_nanos_struct(fields)` branch and now panics in
the fallback.
I reproduced this with a temporary test using `ListArray<Struct(millis,
nanos_of_milli)>` with offsets `[0, 2, 4, 5]`, validity `[valid, null, valid]`,
and struct child values. The write panics with:
```text
take_array: unsupported DataType Struct([Field { name: "millis", data_type:
Int64 }, Field { name: "nanos_of_milli", data_type: Int32 }])
```
Please either reject legacy timestamp nanos structs when they appear inside
ARRAY, or add a `take_array` branch for this accepted struct type and a
regression test with a null parent list row.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]