alamb opened a new issue, #7423: URL: https://github.com/apache/arrow-rs/issues/7423
- Part of https://github.com/apache/arrow-rs/issues/6736

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The first part of supporting the Variant type in Parquet and Arrow is programmatic access to values encoded with the binary format described in [VariantEncoding.md].

This ticket covers the API to read such values. It does not cover creating such values or representing them in Arrow or Parquet, which are covered in other tickets.

**Describe the solution you'd like**

I would like a Rust API, similar to `Json::Value` and comparable APIs, to dynamically access variant values.

Here is some example binary data for testing:
* https://github.com/apache/parquet-testing/pull/76

**Describe alternatives you've considered**

I think a Rust enum approach with references would be a good model.

I suggest creating a new crate, `arrow-variant`, and marking it as experimental, with a note that it will contain breaking changes for the next several releases (maybe we can even version it 0.1, etc).

For example:

## Sketch of structures

```rust
/// Variant value. May contain references to metadata and value
/// 'a is lifetime for metadata
/// 'b is lifetime for value
pub enum Variant<'a, 'b> {
    Null,
    Int8(i8),
    // ...

    // strings are stored in the value and thus have references to that value
    String(&'b str),
    ShortString(&'b str),

    // Objects and Arrays need the metadata and values, so store both.
    Object(VariantObject<'a, 'b>),
    Array(VariantArray<'a, 'b>),
}

/// Wrapper over Variant Metadata
pub struct VariantMetadata<'a> {
    metadata: &'a [u8],
    // perhaps access to header fields like dict length and is_sorted
}

/// Represents a Variant Object with references to the underlying metadata
/// and value fields
pub struct VariantObject<'a, 'b> {
    // pointer to metadata
    metadata: VariantMetadata<'a>,
    // pointer to value
    value: &'b [u8],
}
```

## Creating `Variants` from buffers

```rust
// Each variant has a metadata and value buffer:
let metadata: &[u8] = ...;
let value: &[u8] = ...;

// The Rust API should NOT require allocations or copy the metadata/values
let variant = Variant::try_new(metadata, value)?;
```

## Working with Primitive `Variants`

```rust
// Act based on the type of variant
match variant {
    Variant::Int8(val) => println!("The value was int8: {val}"),
    // ...
    Variant::ShortString(val) => println!("The value was a short string: {val}"),
    // ...
    Variant::Object(object) => {
        println!("The variant was an object. The fields are:");
        for (field_name, field_value) in object.fields()? {
            // The inner field value is also a variant
            match field_value {
                // Variant::...
                _ => {}
            }
        }
    }
    // similarly for Variant::Array
}
```

I personally suggest doing this over a few PRs:
1. Scaffolding: `Variant` struct/enum, support for a few basic variant primitive types
2. Basic nested type support: basic support for objects
3. Array support: support for arrays
4. Complete APIs, etc

**Additional context**
* @PinkCrow007s: https://github.com/apache/arrow-rs/pull/7404.
https://github.com/apache/arrow-rs/blob/0220e97b407cf6690ac8b51d53fa0e92273f1d8c/arrow-schema/src/extension/canonical/variant.rs#L38-L37
* Spark implementation: https://github.com/apache/spark/blob/007c31df6da7b741f3a2a43c859ebed9f801dcfa/common/variant/src/main/java/org/apache/spark/types/variant/Variant.java#L43
* Python implementation (also in Spark): https://github.com/apache/spark/blob/master/python/pyspark/sql/variant_utils.py
* Examples from @jonhoo and @wjones127 in https://github.com/datafusion-contrib/datafusion-functions-variant

**Open Questions**: When should validation be done?

I do think there should be an API like:

```rust
/// Ensure that metadata and value are valid according to the Variant spec; returns an error if not.
Variant::validate(&metadata, &value)?;
```

However, the API sketched above proposes validating on access (when the values are read). An alternative approach would be to validate everything on creation and then use unchecked APIs during access.

I think validating once upfront is better if most fields are accessed or certain fields are read multiple times. For the use case where only some fields are read, I think validating on access would be faster. The spec also allows metadata to contain dictionary values that do not appear as object field names in the variant value itself, so eager validation would potentially validate string data unnecessarily.

I suggest starting with an API that is fallible (that is, creating a `Variant` or accessing a field returns `Result<Variant>`). We can always add unsafe versions of the APIs for use cases where validation overhead is significant (e.g. UTF-8 validation of field names when writing JSON), provided they are justified with benchmarks.
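To make the "fallible, validate on access" option concrete, here is a minimal sketch of what a zero-copy `Variant::try_new` could look like for a handful of primitive types. It is not the proposed implementation. The `VariantError` type, the reduced set of enum variants, and the single `'b` lifetime are hypothetical simplifications. The header layout is an assumption based on my reading of VariantEncoding.md: the basic type sits in the low 2 bits of the first value byte and the type info in the upper 6 bits, with primitive ids 0 to 3 being null, true, false, and int8. Check it against the spec before relying on it.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum VariantError {
    Truncated,
    UnsupportedVersion(u8),
    UnsupportedBasicType(u8),
    UnsupportedPrimitive(u8),
    InvalidUtf8,
}

/// Reduced version of the proposed enum: primitives and short strings only,
/// so only the value lifetime `'b` is needed here.
#[derive(Debug, PartialEq)]
pub enum Variant<'b> {
    Null,
    Boolean(bool),
    Int8(i8),
    ShortString(&'b str),
}

impl<'b> Variant<'b> {
    /// Decodes the top-level value without allocating or copying.
    /// Only the bytes that are actually read get validated.
    pub fn try_new(metadata: &[u8], value: &'b [u8]) -> Result<Self, VariantError> {
        // Assumed metadata header layout: version in the low 4 bits.
        let meta_header = *metadata.first().ok_or(VariantError::Truncated)?;
        let version = meta_header & 0x0F;
        if version != 1 {
            return Err(VariantError::UnsupportedVersion(version));
        }

        // Assumed value header layout: basic type in the low 2 bits,
        // type-specific info in the upper 6 bits.
        let (&header, rest) = value.split_first().ok_or(VariantError::Truncated)?;
        let basic_type = header & 0b11;
        let type_info = header >> 2;

        match basic_type {
            // Primitive: type_info is the primitive type id.
            0 => match type_info {
                0 => Ok(Variant::Null),
                1 => Ok(Variant::Boolean(true)),
                2 => Ok(Variant::Boolean(false)),
                3 => {
                    let byte = *rest.first().ok_or(VariantError::Truncated)?;
                    Ok(Variant::Int8(byte as i8))
                }
                other => Err(VariantError::UnsupportedPrimitive(other)),
            },
            // Short string: type_info is the byte length (0..=63).
            1 => {
                let len = type_info as usize;
                let bytes = rest.get(..len).ok_or(VariantError::Truncated)?;
                let s = std::str::from_utf8(bytes).map_err(|_| VariantError::InvalidUtf8)?;
                Ok(Variant::ShortString(s))
            }
            // Objects and arrays are out of scope for this sketch.
            other => Err(VariantError::UnsupportedBasicType(other)),
        }
    }
}
```

With a minimal metadata buffer such as `[0x01, 0x00, 0x00]` (version 1, empty dictionary under the assumed layout), a value buffer of `[0x0C, 0x2A]` would decode to `Variant::Int8(42)`, and a truncated buffer returns an error instead of panicking. An unchecked variant of this function could later skip the UTF-8 check if benchmarks justify it.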