scovich commented on code in PR #8365:
URL: https://github.com/apache/arrow-rs/pull/8365#discussion_r2362174150


##########
parquet/src/arrow/schema/extension.rs:
##########
@@ -0,0 +1,62 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Arrow Extension Type Support for Parquet
+//!
+//! This module contains mapping code to map Parquet [`LogicalType`]s to/from
+//! Arrow [`ExtensionType`]s.
+//!
+//! Extension types are represented using the metadata from Arrow [`Field`]s
+//! with the key "ARROW:extension:name".
+
+use crate::basic::LogicalType;
+use crate::schema::types::Type;
+use arrow_schema::extension::ExtensionType;
+use arrow_schema::Field;
+
+/// Adds extension type metadata, if necessary, based on the Parquet field's
+/// [`LogicalType`]
+///
+/// Some Parquet logical types, such as Variant, do not map directly to an
+/// Arrow DataType, and instead are represented by an Arrow ExtensionType.
+/// Extension types are attached to Arrow Fields via metadata.
+pub(crate) fn add_extension_type(arrow_field: Field, parquet_type: &Type) -> Field {
+    let result = match parquet_type.get_basic_info().logical_type() {
+        #[cfg(feature = "variant_experimental")]
+        Some(LogicalType::Variant) => {
+            arrow_field.with_extension_type(parquet_variant_compute::VariantType)
+        }
+        // TODO add other LogicalTypes here

Review Comment:
   Out of curiosity, how do the other parquet logical types map to arrow types? Seems like JSON, BSON, UUID, etc. will face the same problem as Variant?
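   For reference, a rough sketch of what such additional arms might look like, assuming the canonical extension types in `arrow_schema::extension` (behind arrow-schema's `canonical_extension_types` feature) are the intended targets; the exact constructors are not confirmed against this PR:

   ```rust
   use arrow_schema::Field;
   use arrow_schema::extension::{Json, Uuid};
   use parquet::basic::LogicalType;

   // Hypothetical sketch, not part of this PR: map a couple more logical
   // types to Arrow canonical extension types, falling through otherwise.
   fn add_more_extension_types(field: Field, logical: Option<LogicalType>) -> Field {
       match logical {
           Some(LogicalType::Json) => field.with_extension_type(Json::default()),
           Some(LogicalType::Uuid) => field.with_extension_type(Uuid),
           _ => field,
       }
   }
   ```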



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -242,19 +243,15 @@ fn shredded_get_path(
 /// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or
 /// list and then try to assemble the results.
 pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> {
-    let variant_array: &VariantArray = input.as_any().downcast_ref().ok_or_else(|| {
-        ArrowError::InvalidArgumentError(
-            "expected a VariantArray as the input for variant_get".to_owned(),
-        )
-    })?;
+    let variant_array = VariantArray::try_new(&input)?;

Review Comment:
   Isn't `&input` a double-ref?
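   (A minimal illustration of why the extra reference still compiles, assuming `try_new` accepts `&dyn Array`; the actual signature isn't shown in this hunk:)

   ```rust
   use std::sync::Arc;
   use arrow_array::{Array, ArrayRef, Int32Array};

   // `&&Arc<dyn Array>` still coerces to `&dyn Array` through deref coercion,
   // so passing `&input` when `input` is already a reference compiles; it is
   // just a redundant extra reference.
   fn takes_dyn(a: &dyn Array) -> usize {
       a.len()
   }

   fn main() {
       let input: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
       let input_ref: &ArrayRef = &input;
       assert_eq!(takes_dyn(&input_ref), 3); // double reference, still works
   }
   ```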



##########
parquet/src/arrow/schema/extension.rs:
##########
@@ -0,0 +1,62 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Arrow Extension Type Support for Parquet
+//!
+//! This module contains mapping code to map Parquet [`LogicalType`]s to/from
+//! Arrow [`ExtensionType`]s.
+//!
+//! Extension types are represented using the metadata from Arrow [`Field`]s
+//! with the key "ARROW:extension:name".
+
+use crate::basic::LogicalType;
+use crate::schema::types::Type;
+use arrow_schema::extension::ExtensionType;
+use arrow_schema::Field;
+
+/// Adds extension type metadata, if necessary, based on the Parquet field's
+/// [`LogicalType`]
+///
+/// Some Parquet logical types, such as Variant, do not map directly to an
+/// Arrow DataType, and instead are represented by an Arrow ExtensionType.
+/// Extension types are attached to Arrow Fields via metadata.
+pub(crate) fn add_extension_type(arrow_field: Field, parquet_type: &Type) -> Field {
+    let result = match parquet_type.get_basic_info().logical_type() {
+        #[cfg(feature = "variant_experimental")]
+        Some(LogicalType::Variant) => {
+            arrow_field.with_extension_type(parquet_variant_compute::VariantType)
+        }
+        // TODO add other LogicalTypes here
+        _ => arrow_field,
+    };
+    result
+}
+
+/// Return the Parquet logical type to use for the specified Arrow field, if any.
+#[cfg(feature = "variant_experimental")]
+pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
+    use parquet_variant_compute::VariantType;
+    if field.extension_type_name()? == VariantType::NAME {
+        return Some(LogicalType::Variant);
+    };
+    None

Review Comment:
   But shouldn't we follow the example in other parts of the code, and use `try_extension_type` rather than merely checking the name? Otherwise, somebody could slap the variant annotation on some completely irrelevant struct, with no validation to catch the problem.
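   Roughly what that alternative could look like, reusing the imports already in this file (a sketch, not the PR code):

   ```rust
   #[cfg(feature = "variant_experimental")]
   pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
       use parquet_variant_compute::VariantType;
       // try_extension_type resolves and validates the extension type instead
       // of only comparing the metadata name string
       field
           .try_extension_type::<VariantType>()
           .ok()
           .map(|_| LogicalType::Variant)
   }
   ```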



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -1651,7 +1633,7 @@ mod test {
             }
         }
 
-        Arc::new(builder.build())
+        builder.build().into()

Review Comment:
   aside: ironically, `into` would have worked with the original code as well, but I had chosen `Arc::new` as more readable.
   
   But what decides between `into` (here) vs. `ArrayRef::from` (below)?
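   As a general rule (shown here with `String` rather than `ArrayRef`, purely for illustration), both forms go through the same `From` impl; `.into()` relies on the target type being inferable from context, while `T::from` names it explicitly:

   ```rust
   // Illustration of the style trade-off only.
   fn takes_string(_s: String) {}

   fn main() {
       let s = "abc";
       takes_string(s.into());        // target type inferred from the parameter
       takes_string(String::from(s)); // target type named explicitly
   }
   ```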



##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -2295,7 +2295,7 @@ mod tests {
         let batch = record_reader.next().unwrap().unwrap();
         assert_eq!(batch.num_rows(), 1);
 
-        let expected_schema = Schema::new(Fields::from(vec![Field::new(
+        let expected_schema = Schema::new(vec![Field::new(

Review Comment:
   I guess `Fields::from` was never needed?
   (seems unrelated to the VariantArray changes?)
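   Right, `Schema::new` takes `impl Into<Fields>`, and `Vec<Field>` converts to `Fields` directly, so the wrapper is redundant. For example:

   ```rust
   use arrow_schema::{DataType, Field, Schema};

   // Vec<Field> implements Into<Fields>, so no explicit Fields::from is needed.
   fn main() {
       let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
       assert_eq!(schema.fields().len(), 1);
   }
   ```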



##########
parquet/src/variant.rs:
##########
@@ -108,6 +98,202 @@
 //! ```
 //!
 //! # Example: Reading a Parquet file with Variant column
-//! (TODO: add example)
+//!
+//! Use the [`VariantType`] extension type to find the Variant column:
+//!
+//! ```
+//! # use std::sync::Arc;
+//! # use std::path::PathBuf;
+//! # use arrow_array::{ArrayRef, RecordBatch, RecordBatchReader};
+//! # use parquet::variant::{Variant, VariantArray, VariantType};
+//! # use parquet::arrow::arrow_reader::ArrowReaderBuilder;
+//! # fn main() -> Result<(), parquet::errors::ParquetError> {
+//! # use arrow_array::StructArray;
+//! # fn file_path() -> PathBuf { // return a testing file path
+//! #    PathBuf::from(arrow::util::test_util::parquet_test_data())
+//! #   .join("..")
+//! #   .join("shredded_variant")
+//! #   .join("case-075.parquet")
+//! # }
+//! // Read the Parquet file using standard Arrow Parquet reader
+//! let file = std::fs::File::open(file_path())?;
+//! let mut reader = ArrowReaderBuilder::try_new(file)?.build()?;
+//!
+//! // You can find the Variant using the VariantType extension type

Review Comment:
   "find" ?



##########
parquet/src/arrow/schema/extension.rs:
##########
@@ -0,0 +1,62 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Arrow Extension Type Support for Parquet
+//!
+//! This module contains mapping code to map Parquet [`LogicalType`]s to/from
+//! Arrow [`ExtensionType`]s.
+//!
+//! Extension types are represented using the metadata from Arrow [`Field`]s
+//! with the key "ARROW:extension:name".
+
+use crate::basic::LogicalType;
+use crate::schema::types::Type;
+use arrow_schema::extension::ExtensionType;
+use arrow_schema::Field;
+
+/// Adds extension type metadata, if necessary, based on the Parquet field's
+/// [`LogicalType`]
+///
+/// Some Parquet logical types, such as Variant, do not map directly to an
+/// Arrow DataType, and instead are represented by an Arrow ExtensionType.
+/// Extension types are attached to Arrow Fields via metadata.
+pub(crate) fn add_extension_type(arrow_field: Field, parquet_type: &Type) -> Field {
+    let result = match parquet_type.get_basic_info().logical_type() {
+        #[cfg(feature = "variant_experimental")]
+        Some(LogicalType::Variant) => {
+            arrow_field.with_extension_type(parquet_variant_compute::VariantType)
+        }
+        // TODO add other LogicalTypes here
+        _ => arrow_field,
+    };
+    result
+}
+
+/// Return the Parquet logical type to use for the specified Arrow field, if any.
+#[cfg(feature = "variant_experimental")]
+pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
+    use parquet_variant_compute::VariantType;
+    if field.extension_type_name()? == VariantType::NAME {
+        return Some(LogicalType::Variant);
+    };
+    None

Review Comment:
   It seems a bit odd that this API call's existence would be gated by the variant feature?
   
   ```suggestion
   pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
       let type_name = field.extension_type_name()?;
   
       #[cfg(feature = "variant_experimental")]
       {
           use parquet_variant_compute::VariantType;
           if type_name == VariantType::NAME {
               return Some(LogicalType::Variant);
           }
       }
       None
   ```



##########
parquet/src/variant.rs:
##########
@@ -108,6 +98,202 @@
 //! ```
 //!
 //! # Example: Reading a Parquet file with Variant column
-//! (TODO: add example)
+//!
+//! Use the [`VariantType`] extension type to find the Variant column:
+//!
+//! ```
+//! # use std::sync::Arc;
+//! # use std::path::PathBuf;
+//! # use arrow_array::{ArrayRef, RecordBatch, RecordBatchReader};
+//! # use parquet::variant::{Variant, VariantArray, VariantType};
+//! # use parquet::arrow::arrow_reader::ArrowReaderBuilder;
+//! # fn main() -> Result<(), parquet::errors::ParquetError> {
+//! # use arrow_array::StructArray;
+//! # fn file_path() -> PathBuf { // return a testing file path
+//! #    PathBuf::from(arrow::util::test_util::parquet_test_data())
+//! #   .join("..")
+//! #   .join("shredded_variant")
+//! #   .join("case-075.parquet")
+//! # }
+//! // Read the Parquet file using standard Arrow Parquet reader
+//! let file = std::fs::File::open(file_path())?;
+//! let mut reader = ArrowReaderBuilder::try_new(file)?.build()?;
+//!
+//! // You can find the Variant using the VariantType extension type
+//! let schema = reader.schema();
+//! let field = schema.field_with_name("var")?;
+//! assert!(field.try_extension_type::<VariantType>().is_ok());
+//!
+//! // The reader will yield RecordBatches with a StructArray
+//! // to convert them to VariantArray, use VariantArray::try_new
+//! let batch = reader.next().unwrap().unwrap();
+//! let col = batch.column_by_name("var").unwrap();
+//! let var_array = VariantArray::try_new(&col)?;
+//! assert_eq!(var_array.len(), 1);
+//! let var_value: Variant = var_array.value(0);
+//! assert_eq!(var_value, Variant::from("iceberg")); // the value in case-075.parquet
+//! # Ok(())
+//! # }
+//! ```
 pub use parquet_variant::*;
-pub use parquet_variant_compute as compute;
+pub use parquet_variant_compute::*;
+
+#[cfg(test)]
+mod tests {
+    use crate::arrow::arrow_reader::ArrowReaderBuilder;
+    use crate::arrow::ArrowWriter;
+    use crate::file::metadata::{ParquetMetaData, ParquetMetaDataReader};
+    use crate::file::reader::ChunkReader;
+    use arrow::util::test_util::parquet_test_data;
+    use arrow_array::{ArrayRef, RecordBatch};
+    use arrow_schema::Schema;
+    use bytes::Bytes;
+    use parquet_variant::{Variant, VariantBuilderExt};
+    use parquet_variant_compute::{VariantArray, VariantArrayBuilder, VariantType};
+    use std::path::PathBuf;
+    use std::sync::Arc;
+
+    #[test]
+    fn roundtrip_basic() {
+        roundtrip(variant_array());
+    }
+
+    /// Ensure a file with Variant LogicalType, written by another writer in
+    /// parquet-testing, can be read as a VariantArray
+    #[test]
+    fn read_logical_type() {

Review Comment:
   Isn't this the ~same as the doctest above? Do we really need both?



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -154,11 +155,11 @@ fn shredded_get_path(
 
     // Peel away the prefix of path elements that traverses the shredded parts of this variant
     // column. Shredding will traverse the rest of the path on a per-row basis.
-    let mut shredding_state = input.shredding_state();
+    let mut shredding_state = input.shredding_state().clone();

Review Comment:
   Cloning is a cheap reference count bump on the two internal arrays, right?
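   (For illustration of the premise only, not the `ShreddingState` internals themselves: cloning an `ArrayRef` clones the `Arc`, i.e. a reference-count bump rather than a buffer copy:)

   ```rust
   use std::sync::Arc;
   use arrow_array::{Array, ArrayRef, Int32Array};

   fn main() {
       let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
       let b = a.clone();
       // Same underlying buffers, two handles; no data was copied.
       assert_eq!(Arc::strong_count(&a), 2);
       assert_eq!(b.len(), a.len());
   }
   ```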



##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -87,20 +88,20 @@ pub(crate) fn follow_shredded_path_element<'a>(
                 return Ok(missing_path_step());
             };
 
-            let field = field
-                .as_any()
-                .downcast_ref::<ShreddedVariantFieldArray>()
-                .ok_or_else(|| {
-                    // TODO: Should we blow up? Or just end the traversal and let the normal
-                    // variant pathing code sort out the mess that it must anyway be
-                    // prepared to handle?
-                    ArrowError::InvalidArgumentError(format!(
-                        "Expected a ShreddedVariantFieldArray, got {:?} instead",
-                        field.data_type(),
-                    ))
-                })?;
-
-            Ok(ShreddedPathStep::Success(field.shredding_state()))
+            let struct_array = field.as_struct_opt().ok_or_else(|| {
+                // TODO: Should we blow up? Or just end the traversal and let the normal
+                // variant pathing code sort out the mess that it must anyway be
+                // prepared to handle?
+                ArrowError::InvalidArgumentError(format!(
+                    "Expected Struct array while following path, got {}",
+                    field.data_type(),
+                ))
+            })?;
+
+            let shredding_state =
+                ShreddedVariantFieldArray::shredding_state_from_struct_array(struct_array)?;

Review Comment:
   Why not just `ShreddingState::from`, out of curiosity?
   
   The operation is actually infallible, it just has a fallible signature at the moment.
   And AFAIK there's no difference in how we create the shredding state for top-level vs. object field.
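   A sketch of the suggested alternative, assuming `ShreddingState` offers (or could offer) the infallible `From` conversion this comment implies; the names are illustrative, not confirmed against the PR:

   ```rust
   // Hypothetical alternative to the fallible conversion above: an infallible
   // From impl would drop the `?` on an operation that cannot actually fail.
   let shredding_state = ShreddingState::from(struct_array);
   Ok(ShreddedPathStep::Success(shredding_state))
   ```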



##########
parquet/src/arrow/schema/extension.rs:
##########
@@ -0,0 +1,62 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Arrow Extension Type Support for Parquet
+//!
+//! This module contains mapping code to map Parquet [`LogicalType`]s to/from
+//! Arrow [`ExtensionType`]s.
+//!
+//! Extension types are represented using the metadata from Arrow [`Field`]s
+//! with the key "ARROW:extension:name".
+
+use crate::basic::LogicalType;
+use crate::schema::types::Type;
+use arrow_schema::extension::ExtensionType;
+use arrow_schema::Field;
+
+/// Adds extension type metadata, if necessary, based on the Parquet field's
+/// [`LogicalType`]
+///
+/// Some Parquet logical types, such as Variant, do not map directly to an
+/// Arrow DataType, and instead are represented by an Arrow ExtensionType.
+/// Extension types are attached to Arrow Fields via metadata.
+pub(crate) fn add_extension_type(arrow_field: Field, parquet_type: &Type) -> Field {
+    let result = match parquet_type.get_basic_info().logical_type() {
+        #[cfg(feature = "variant_experimental")]
+        Some(LogicalType::Variant) => {
+            arrow_field.with_extension_type(parquet_variant_compute::VariantType)
+        }
+        // TODO add other LogicalTypes here
+        _ => arrow_field,
+    };
+    result
+}
+
+/// Return the Parquet logical type to use for the specified Arrow field, if any.
+#[cfg(feature = "variant_experimental")]
+pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
+    use parquet_variant_compute::VariantType;
+    if field.extension_type_name()? == VariantType::NAME {
+        return Some(LogicalType::Variant);
+    };
+    None

Review Comment:
   Ah, there's a cfg-not version below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
