alamb opened a new issue, #7424: URL: https://github.com/apache/arrow-rs/issues/7424
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

- Part of https://github.com/apache/arrow-rs/issues/6736
- https://github.com/apache/arrow-rs/issues/7423 tracks the API for **Reading** Variant values.

Part of supporting the Variant type in Parquet and Arrow is programmatically **creating** values in the binary format described in [VariantEncoding.md]. This is important in the short term for writing tests, as well as for converting from other types (specifically JSON).

Note this ticket covers the API to create such values, but not reading them (see https://github.com/apache/arrow-rs/issues/7423) or reading/writing variant values to JSON.

[VariantEncoding.md]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

**Describe the solution you'd like**

What I would like is a Rust API that can efficiently create such values. I think it is also important to design an API that supports reusing the metadata.

**Describe alternatives you've considered**

What I suggest is a Builder-style API, modeled on the Arrow array builder APIs such as [StringBuilder], that can efficiently create Variant values.

[StringBuilder]: https://github.com/apache/arrow-rs/blob/936dc59968a3be6698ebf51aa17c46b2d4eddc80/arrow-array/src/builder/mod.rs#L417-L416

For example:

```rust
// Location to write metadata
// Should be anything that implements std::io::Write or a trait
let mut metadata_buffer = vec![];
// Create a builder for constructing variant values
let builder = VariantBuilder::new(&mut metadata_buffer);
```

## Example creating a `Variant` object with primitive values:

```rust
// Create the equivalent of {"foo": 1, "bar": 100}
let mut value_buffer = vec![];
let mut object_builder = builder.new_object(&mut value_buffer); // object_builder has reference to builder
object_builder.append_value("foo", 1);
object_builder.append_value("bar", 100);
object_builder.finish();
// value_buffer now contains a valid variant 🎉
// builder contains a metadata header with fields "foo" and "bar"
```

## Example of creating a nested `VariantValue`:

Here is how we might create a nested Object:

```rust
// Create nested object: the equivalent of {"foo": {"bar": 100}}
// note we haven't finalized the metadata yet so we reuse it here
let mut value_buffer2 = vec![];
let mut object_builder2 = builder.new_object(&mut value_buffer2);
let mut foo_object_builder = object_builder2.append_object("foo"); // builder for "foo"
foo_object_builder.append_value("bar", 100);
foo_object_builder.finish();
object_builder2.finish();
// value_buffer2 contains a valid variant
```

## Finish the builder to finalize the metadata

When the builder is finished, it finalizes / writes metadata as needed.

```rust
// complete writing the metadata
builder.finish();
// metadata_buffer contains valid variant metadata bytes
```

# Considerations:

## Reusing metadata

The metadata mostly contains a dictionary of field names, and so I believe an important optimization will be reusing the same metadata to create multiple values.
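
A minimal sketch of that reuse, under stated assumptions: `FieldNameDictionary` and `encode_object` are hypothetical names invented for this illustration, and the `(field id, value)` pairs are a simplification, not the Variant binary layout. It only shows that several values can share one dictionary, so each field name is stored once.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the metadata dictionary: maps each distinct
/// field name to a small integer id, assigned in first-seen order.
#[derive(Default)]
struct FieldNameDictionary {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl FieldNameDictionary {
    /// Return the id for `name`, adding it to the dictionary if it is new.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.ids.insert(name.to_string(), id);
        self.names.push(name.to_string());
        id
    }
}

/// Encode one object as (field id, value) pairs against a shared dictionary.
/// Illustration only: this is not the Variant binary encoding.
fn encode_object(dict: &mut FieldNameDictionary, fields: &[(&str, i64)]) -> Vec<(u32, i64)> {
    fields
        .iter()
        .map(|&(name, value)| (dict.intern(name), value))
        .collect()
}

fn main() {
    // One dictionary is shared by every value created below.
    let mut dict = FieldNameDictionary::default();
    let v1 = encode_object(&mut dict, &[("foo", 1), ("bar", 100)]);
    let v2 = encode_object(&mut dict, &[("foo", 2), ("bar", 200)]);
    let v3 = encode_object(&mut dict, &[("foo", 3)]);
    println!("dictionary: {:?}", dict.names);
    println!("values: {:?} {:?} {:?}", v1, v2, v3);
}
```

Because ids are handed out in first-seen order, adding a new name never changes the ids already written into earlier values.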

For example, the following three JSON values can use the same metadata (with field names "foo" and "bar"):

```json
{
  "foo": 1,
  "bar": 100
}
```

```json
{
  "foo": 2,
  "bar": 200
}
```

```json
{
  "foo": 3
}
```

## Sorted dictionaries:

The metadata encoding spec permits writing [sorted dictionaries] in the metadata header. However, when writing sorted dictionaries, once an object has been created, it is in general not possible to add new metadata dictionary values: the variant object value itself contains offsets into the dictionary, so inserting a new value into the sorted metadata could shift existing entries and invalidate those offsets.

One API that might work would be to supply pre-existing metadata to the builder, reusing it when possible and creating new metadata when it isn't.

[sorted dictionaries]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org