sdf-jkl commented on PR #10015: URL: https://github.com/apache/arrow-rs/pull/10015#issuecomment-4633398313
> Meanwhile -- requesting changes because AFAIK this tweak (hack?) only applies to top-level fields of a shredded variant column. We have no testing for nested uuid columns, but I'm pretty sure they would still write out to parquet as bare binary arrays? Parquet writer recursively travels through each field and runs `arrow_to_parquet_type` on it, so nested fields ext type is preserved. (I added a nested uuid unit test that passes - https://github.com/apache/arrow-rs/pull/10015/commits/70d2242b5615e70151d7926cd03e688e189a2c00) https://github.com/apache/arrow-rs/blob/e5e66fa05ce986fa6a3f61c36c11ff6b476cb76c/parquet/src/arrow/schema/mod.rs#L776-L783 ---- > Probably not even validated (because VariantArray::from_parts does not invoke canonicalize_shredded_types)? I think the point of `from_parts` is that we assume we created valid parts for it to build from. `canonicalize_shredded_types` is more for validating already created `VariantArray`s that came from somewhere and we don't know if they're valid (that's why used in `VariantArray::try_new`). ---- > Thinking more about this -- should canonicalize_and_verify_data_type add the UUID extension type to FixedSizeBinary(16) fields it encounters? (it already has a match arm that enforces the length). In theory it would be easier to plumb the necessary Field information through that private helper, but I'm not sure how it would work with top-level UUID fields? > > Also -- why is VariantArray::from_parts the correct (and only) place we should be imposing the UUID extension type? Do other code paths need similar treatment so the behavior is consistent? I added uuid support to `canonicalize_and_verify_field`. It will add uuid ext type on read for `VariantArray` that come from different writers that potentially omit it. I'm not sure about that last change, so let's think about it :nerd_face: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
