BlakeOrth commented on code in PR #8801:
URL: https://github.com/apache/arrow-rs/pull/8801#discussion_r2501098970
##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -76,16 +77,18 @@ pub(crate) fn parquet_to_arrow_schema_and_fields(
key_value_metadata: Option<&Vec<KeyValue>>,
) -> Result<(Schema, Option<ParquetField>)> {
let mut metadata =
parse_key_value_metadata(key_value_metadata).unwrap_or_default();
- let maybe_schema = metadata
+ let mut maybe_schema = metadata
.remove(super::ARROW_SCHEMA_META_KEY)
.map(|value| get_arrow_schema_from_metadata(&value))
.transpose()?;
// Add the Arrow metadata to the Parquet metadata skipping keys that
collide
- if let Some(arrow_schema) = &maybe_schema {
+ if let Some(arrow_schema) = maybe_schema.as_mut() {
arrow_schema.metadata().iter().for_each(|(k, v)| {
metadata.entry(k.clone()).or_insert_with(|| v.clone());
});
+ #[cfg(feature = "geospatial")]
+ parquet_geospatial::crs::parquet_to_arrow(arrow_schema, &metadata)
Review Comment:
The primary question I have is related to these lines here. I started off
trying to solve the most complex case of geospatial CRS metdata noted by
@paleolimbot in this comment:
- https://github.com/apache/arrow-rs/pull/8222#discussion_r2299643001
The issue is when there's an Arrow metadata key of `projjson:<key>` the
actual `projjson` CRS data exists in the Parquet `KeyValue` metdata. When
reading parquet, this seems to be the last place in the schema conversion chain
that `metadata` is still available.
If we want to avoid breaking the existing public API for building extension
types I think converting from the parquet CRS to an Arrow CRS has to happen
here since we need access to `metadata`. An alternate approach would be passing
the parquet metadata further down the stack so it's available for users to
reference. At the current time it looks like only `GEOMETRY` and `GEOGRAPHY`
would be consumers of that, but it could be a more future-proof solution if
we'd prefer to take that route. (edit) I also considered trying to place the
metadata on the `hint` Field, but each field requires an owned copy of the
metadata and I didn't think cloning the metadata once for each field in the
schema seemed desirable.
If my review of the PRs was correct, I believe at least some of this tooling
is relatively new for the Variant type support from @alamb, so any guidance on
a preferred way to handle this case would be appreciated!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]