alamb opened a new issue, #7423:
URL: https://github.com/apache/arrow-rs/issues/7423

   - Part of https://github.com/apache/arrow-rs/issues/6736
   
   
   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The first part of supporting the Variant type in Parquet and Arrow is
   programmatic access to values encoded with the binary format described in
   [VariantEncoding.md]. This ticket covers the API for reading such values. It
   does not cover creating them or representing them in Arrow or Parquet, which
   are covered in other tickets.
   
   **Describe the solution you'd like**
   I would like a Rust API for dynamically accessing variant values, similar to
   Json::Value and related APIs.
   
   Here is some example binary data for testing:
   * https://github.com/apache/parquet-testing/pull/76
   
   
   **Describe alternatives you've considered**
   
   I think a Rust enum approach with references would be a good model.
   
   I suggest creating a new crate, `arrow-variant`, and marking it as
   experimental, with a note that it will contain breaking changes for the next
   several releases (maybe we can even version it 0.1, etc.).
   
   For example:
   
   ## Sketch of structures
   
   ```rust
   /// Variant value. May contain references to metadata and value
   /// 'a is lifetime for metadata
   /// 'b is lifetime for value
   pub enum Variant<'a, 'b> {
     Null,
     Int8(i8),
     // ... (remaining primitive types elided)
     // strings are stored in the value and thus have references to that value
     String(&'b str),
     ShortString(&'b str),
     // Objects and Arrays need the metadata and values, so store both.
     Object(VariantObject<'a, 'b>),
     Array(VariantArray<'a, 'b>),
   }
   
   /// Wrapper over Variant Metadata
   pub struct VariantMetadata<'a> {
     metadata: &'a[u8],
     // perhaps access to header fields like dict length and is_sorted
   }
   
   /// Represents a Variant Object with references to the underlying metadata
   /// and value fields
   pub struct VariantObject<'a, 'b> {
     // pointer to metadata
     metadata: VariantMetadata<'a>,
     // pointer to value
     value: &'b [u8],
   }
   ```
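
The "header fields like dict length and is_sorted" comment above could be fleshed out roughly as follows. This is only a sketch: it assumes the metadata header layout as I read it in VariantEncoding.md (version in the low 4 bits, the sorted flag at bit 4, `offset_size_minus_one` in the top two bits, then a little-endian dictionary size), and `VariantError` plus all method names are hypothetical.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum VariantError {
    TooShort,
}

/// Zero-copy wrapper over Variant metadata bytes.
#[derive(Debug, Clone, Copy)]
pub struct VariantMetadata<'a> {
    metadata: &'a [u8],
}

impl<'a> VariantMetadata<'a> {
    /// Checks only that the header byte and dictionary size are present;
    /// the dictionary strings themselves are not inspected here.
    pub fn try_new(metadata: &'a [u8]) -> Result<Self, VariantError> {
        let header = *metadata.first().ok_or(VariantError::TooShort)?;
        let offset_size = ((header >> 6) & 0b11) as usize + 1;
        if metadata.len() < 1 + offset_size {
            return Err(VariantError::TooShort);
        }
        Ok(Self { metadata })
    }

    /// Format version (assumed: low 4 bits of the header).
    pub fn version(&self) -> u8 {
        self.metadata[0] & 0x0F
    }

    /// Whether the dictionary strings are sorted (assumed: bit 4).
    pub fn is_sorted(&self) -> bool {
        self.metadata[0] & 0x10 != 0
    }

    /// Width in bytes of each offset, between 1 and 4 (assumed: top 2 bits, plus one).
    pub fn offset_size(&self) -> usize {
        ((self.metadata[0] >> 6) & 0b11) as usize + 1
    }

    /// Number of dictionary entries, read as a little-endian integer.
    pub fn dictionary_size(&self) -> usize {
        let n = self.offset_size();
        self.metadata[1..1 + n]
            .iter()
            .rev()
            .fold(0usize, |acc, b| (acc << 8) | *b as usize)
    }
}
```

Because `try_new` only borrows the buffer and checks its length, constructing the wrapper needs no allocation or copy.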
   
   ## Creating `Variants` from buffers
   ```rust
   // Each variant has a metadata and value buffer:
   let metadata: &[u8] = ...;
   let value: &[u8] = ...;
   // The Rust API should NOT require allocations or copy the metadata/values
   let variant = Variant::try_new(metadata, value)?;
   ```
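
To make the "no allocations or copies" goal concrete, here is a rough sketch of what `try_new` might look like for a handful of value types. It assumes the value header layout as I read it in VariantEncoding.md (basic type in the low 2 bits; primitive type id 0 for null and 3 for int8; short string length in the upper 6 bits). The error type and the simplified `Object` payload are hypothetical stand-ins, not a proposed final API.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum VariantError {
    Truncated,
    InvalidUtf8,
    Unsupported(u8),
}

/// Cut-down Variant: 'a borrows the metadata buffer, 'b the value buffer.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Variant<'a, 'b> {
    Null,
    Int8(i8),
    ShortString(&'b str),
    /// Stand-in for VariantObject: keeps both buffers, decodes nothing yet.
    Object(&'a [u8], &'b [u8]),
}

impl<'a, 'b> Variant<'a, 'b> {
    /// Decodes the value header and borrows from the input buffers.
    pub fn try_new(metadata: &'a [u8], value: &'b [u8]) -> Result<Self, VariantError> {
        let header = *value.first().ok_or(VariantError::Truncated)?;
        let basic_type = header & 0b11; // assumed: 0 primitive, 1 short string, 2 object
        let type_header = header >> 2; // remaining 6 bits
        match (basic_type, type_header) {
            (0, 0) => Ok(Variant::Null),
            (0, 3) => {
                let byte = *value.get(1).ok_or(VariantError::Truncated)?;
                Ok(Variant::Int8(byte as i8))
            }
            (1, len) => {
                // For short strings the 6-bit header is the byte length.
                let bytes = value
                    .get(1..1 + len as usize)
                    .ok_or(VariantError::Truncated)?;
                let s = std::str::from_utf8(bytes).map_err(|_| VariantError::InvalidUtf8)?;
                Ok(Variant::ShortString(s))
            }
            (2, _) => Ok(Variant::Object(metadata, value)),
            _ => Err(VariantError::Unsupported(header)),
        }
    }
}
```

The returned `ShortString` points directly into `value`, which is why the enum needs a separate lifetime for that buffer.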
   
   ## Working with Primitive `Variants`
   
   ```rust
   // Act based on the type of variant
   match variant {
     Variant::Int8(val) => println!("The value was int8: {val}"),
     ...
     Variant::ShortString(val) => println!("The value was a short string: {val}"),
     ...
     Variant::Object(object) => {
       println!("The variant was an object. The fields are:");
       for (field_name, field_value) in object.fields()? {
         // The inner field value is also a variant
         match field_value {
           Variant::...
         }
       }
     }
     // similarly for Variant::Array
   }
   ```
   
   I personally suggest doing this over a few PRs:
   1. Scaffolding: `Variant` struct/enum, support for a few basic variant primitive types
   2. Basic nested type support: basic support for objects
   3. Array support: support for arrays
   4. Complete APIs, etc
   
   
   **Additional context**
   * @PinkCrow007's PR: https://github.com/apache/arrow-rs/pull/7404 (see also https://github.com/apache/arrow-rs/blob/0220e97b407cf6690ac8b51d53fa0e92273f1d8c/arrow-schema/src/extension/canonical/variant.rs#L38-L37)
   * Spark implementation: https://github.com/apache/spark/blob/007c31df6da7b741f3a2a43c859ebed9f801dcfa/common/variant/src/main/java/org/apache/spark/types/variant/Variant.java#L43
   * Python implementation (also in Spark): https://github.com/apache/spark/blob/master/python/pyspark/sql/variant_utils.py
   * Examples from @jonhoo and @wjones127 in https://github.com/datafusion-contrib/datafusion-functions-variant
   
   
   **Open Questions**: When should validation be done?
   
   I do think there should be an API like:
   ```rust
   /// Ensure that metadata and value are valid according to the Variant spec;
   /// returns an error if not.
   Variant::validate(&metadata, &value)?;
   ```
   
   However, the API sketched above proposes validating lazily, when the values
   are accessed. An alternative approach would be to validate everything on
   creation and then use unchecked APIs during access.
   
   I think validating once upfront is better if most fields are accessed or if
   certain fields are read multiple times. For the use case where only some
   fields are read, I think verifying on access would be faster.
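
The trade-off can be illustrated with a toy model. This is not the Variant encoding: `RawFields` is a hypothetical container in which each field is just a byte slice that must be valid UTF-8.

```rust
use std::str::{from_utf8, Utf8Error};

/// Hypothetical stand-in for a set of encoded fields.
pub struct RawFields<'a> {
    raw: &'a [&'a [u8]],
}

impl<'a> RawFields<'a> {
    pub fn new(raw: &'a [&'a [u8]]) -> Self {
        Self { raw }
    }

    /// Validate on access: only the requested field is checked, and it is
    /// re-checked every time it is read.
    pub fn get(&self, i: usize) -> Option<Result<&'a str, Utf8Error>> {
        self.raw.get(i).copied().map(|bytes| from_utf8(bytes))
    }

    /// Validate upfront: every field is checked once, including fields
    /// the caller may never read.
    pub fn validate(&self) -> Result<(), Utf8Error> {
        for bytes in self.raw {
            from_utf8(bytes)?;
        }
        Ok(())
    }
}
```

With `get`, a corrupt field that is never read costs nothing and goes unnoticed. With `validate`, it is rejected immediately, at the price of scanning every field.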
   
   The spec also allows metadata to contain dictionary values that do not appear
   as field names in the variant value itself, so eager validation would
   potentially verify string data unnecessarily.
   
   
   
   I suggest starting with an API that is fallible (i.e. creating a Variant or
   accessing a field returns `Result<Variant>`). We can always add unsafe versions
   of the APIs for use cases where validation overhead is significant (e.g. UTF-8
   validation of field names when writing JSON) and justified with benchmarks.
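
As a rough illustration of that layering, a fallible accessor can sit next to an `unsafe` unchecked one. `FieldName` and its methods are hypothetical; only `std::str::from_utf8` and `std::str::from_utf8_unchecked` are real standard-library functions.

```rust
use std::str::Utf8Error;

/// Hypothetical view over the raw bytes of one field name.
pub struct FieldName<'a> {
    bytes: &'a [u8],
}

impl<'a> FieldName<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        Self { bytes }
    }

    /// Default accessor: validates UTF-8 on every call and reports failure.
    pub fn as_str(&self) -> Result<&'a str, Utf8Error> {
        std::str::from_utf8(self.bytes)
    }

    /// Opt-in fast path that skips validation.
    ///
    /// # Safety
    /// The caller must guarantee that the bytes are valid UTF-8, for example
    /// because the whole buffer was validated earlier.
    pub unsafe fn as_str_unchecked(&self) -> &'a str {
        unsafe { std::str::from_utf8_unchecked(self.bytes) }
    }
}
```

Keeping the checked accessor as the default means the unchecked variant only needs to exist where a benchmark shows the validation cost matters.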
   

