alamb opened a new issue, #7424:
URL: https://github.com/apache/arrow-rs/issues/7424

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   - Part of https://github.com/apache/arrow-rs/issues/6736
   - https://github.com/apache/arrow-rs/issues/7423 tracks the API for **Reading** Variant values.
   
   Part of supporting the Variant type in Parquet and Arrow is programmatically
   **creating** values in the binary format described in [VariantEncoding.md].
   This is important in the short term for writing tests, as well as for
   converting from other types (specifically JSON).
   
   Note this ticket covers the API to create such values, but not reading them
   (see https://github.com/apache/arrow-rs/issues/7423) or reading/writing
   variant values to JSON.
   
   [VariantEncoding.md]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
   
   
   **Describe the solution you'd like**
   
   What I would like is a Rust API that can efficiently create such values. I
   think it is also important to design an API that supports reusing the
   metadata.
   
   
   **Describe alternatives you've considered**
   
   What I suggest is a Builder-style API, modeled on the Arrow array builder
   APIs such as [StringBuilder], that can efficiently create Variant values.
   
   [StringBuilder]: https://github.com/apache/arrow-rs/blob/936dc59968a3be6698ebf51aa17c46b2d4eddc80/arrow-array/src/builder/mod.rs#L417-L416
   
   For example:
   ```rust
   // Location to write metadata
   // Should be anything that implements std::io::Write or a trait
   let mut metadata_buffer = vec![];
   // Create a builder for constructing variant values
   let mut builder = VariantBuilder::new(&mut metadata_buffer);
   ```
   
   ## Example creating a `Variant` object with primitive values:
   ```rust
   // Create the equivalent of {"foo": 1, "bar": 100}
   let mut value_buffer = vec![];
   // object_builder has a reference to builder
   let mut object_builder = builder.new_object(&mut value_buffer);
   object_builder.append_value("foo", 1);
   object_builder.append_value("bar", 100);
   object_builder.finish();
   // value_buffer now contains a valid variant 🎉
   // builder contains a metadata header with fields "foo" and "bar"
   ```
   
   ## Example of creating a nested `VariantValue`:
   
   Here is how we might create an Object:
   
   ```rust
   // Create nested object: the equivalent of {"foo": {"bar": 100}}
   // note we haven't finalized the metadata yet so we reuse it here
   let mut value_buffer2 = vec![];
   let mut object_builder2 = builder.new_object(&mut value_buffer2);
   // builder for the nested object stored under "foo"
   let mut foo_object_builder = object_builder2.append_object("foo");
   foo_object_builder.append_value("bar", 100);
   foo_object_builder.finish();
   object_builder2.finish();
   // value_buffer2 contains a valid variant
   ```
   
   ## Finish the builder to finalize the metadata
   When the builder is finished, it finalizes / writes metadata as needed.
   ```rust
   // complete writing the metadata
   builder.finish();
   // metadata_buffer contains valid variant metadata bytes
   ```
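
To make the ownership structure of the examples above concrete, here is a minimal, self-contained sketch. It is only an illustration under stated assumptions: the `VariantBuilder` and `ObjectBuilder` types below are hypothetical stand-ins (no such API exists in arrow-rs at the time of this issue), only `i64` field values are supported, and the bytes written are a toy encoding, not the format in VariantEncoding.md.

```rust
use std::io::Write;

/// Hypothetical top-level builder. It owns the field-name dictionary and
/// writes it to `metadata` on `finish`. Toy encoding, NOT the Variant spec.
struct VariantBuilder<W: Write> {
    metadata: W,
    names: Vec<String>,
}

impl<W: Write> VariantBuilder<W> {
    fn new(metadata: W) -> Self {
        Self { metadata, names: Vec::new() }
    }

    /// Return the dictionary id for `name`, adding it on first use.
    fn field_id(&mut self, name: &str) -> u8 {
        if let Some(pos) = self.names.iter().position(|n| n == name) {
            return pos as u8;
        }
        self.names.push(name.to_string());
        (self.names.len() - 1) as u8
    }

    /// Start an object whose bytes go to `value`. The returned builder
    /// mutably borrows `self` so it can register new field names.
    fn new_object<'a, V: Write>(&'a mut self, value: V) -> ObjectBuilder<'a, W, V> {
        ObjectBuilder { parent: self, value, fields: Vec::new() }
    }

    /// Write the dictionary as length-prefixed names (toy layout).
    fn finish(mut self) -> std::io::Result<()> {
        for name in &self.names {
            self.metadata.write_all(&[name.len() as u8])?;
            self.metadata.write_all(name.as_bytes())?;
        }
        Ok(())
    }
}

/// Hypothetical object builder holding a reference to its parent.
struct ObjectBuilder<'a, W: Write, V: Write> {
    parent: &'a mut VariantBuilder<W>,
    value: V,
    fields: Vec<(u8, i64)>,
}

impl<'a, W: Write, V: Write> ObjectBuilder<'a, W, V> {
    fn append_value(&mut self, name: &str, v: i64) {
        let id = self.parent.field_id(name);
        self.fields.push((id, v));
    }

    /// Write each field as one id byte plus a little-endian i64 (toy layout).
    fn finish(mut self) -> std::io::Result<()> {
        for (id, v) in &self.fields {
            self.value.write_all(&[*id])?;
            self.value.write_all(&v.to_le_bytes())?;
        }
        Ok(())
    }
}
```

Because `new_object` takes `&mut self`, only one object builder can be live at a time in this sketch; each one must be finished before the next `new_object` call or the final `builder.finish()`, which matches the order of calls in the examples above.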
   
   # Considerations:
   
   ## Reusing metadata
   
   The metadata mostly contains a dictionary of field names, and so I believe an
   important optimization will be reusing the same metadata to create multiple
   values. For example the three following JSON values can use the same metadata
   (with field names "foo" and "bar"):
   
   ```json
   {
   "foo": 1,
   "bar": 100
   }
   ```
   
   ```json
   {
   "foo": 2,
   "bar": 200
   }
   ```
   
   
   ```json
   {
   "foo": 3
   }
   ```
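
The sharing described above can be sketched with a small interning dictionary. This is an assumption-laden illustration, not a proposed API: `FieldDictionary` and `encode_object` are hypothetical names, and the `(field id, value)` pairs stand in for the real binary value encoding.

```rust
use std::collections::HashMap;

/// Hypothetical field-name dictionary shared by several values.
#[derive(Default)]
struct FieldDictionary {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl FieldDictionary {
    /// Return the id for `name`, adding it to the dictionary on first use.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }
}

/// Encode an object as (field id, value) pairs against the shared dictionary.
/// This is a stand-in for the real Variant value bytes.
fn encode_object(dict: &mut FieldDictionary, fields: &[(&str, i64)]) -> Vec<(u32, i64)> {
    fields
        .iter()
        .map(|&(name, value)| (dict.intern(name), value))
        .collect()
}
```

Encoding the three JSON objects above in order against one `FieldDictionary` leaves the dictionary with just the two names "foo" and "bar"; the second and third objects add no new entries.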
   
   ## Sorted dictionaries:
   
   The metadata encoding spec permits writing [sorted dictionaries] in the
   metadata header. However, when writing sorted dictionaries, once an object
   has been created, it is in general not possible to add new metadata
   dictionary values, because the variant object value itself contains offsets
   into the dictionary, and thus inserting any new values into the metadata
   would invalidate those offsets.
   
   One API that might work would be to supply pre-existing metadata to the
   builder, reuse it when possible, and create new metadata when it isn't.
   [sorted dictionaries]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding
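
The invalidation problem with sorted dictionaries can be shown with a few lines. This is a sketch only: `insert_sorted` is a hypothetical helper, and a plain `Vec<String>` index stands in for the dictionary position that an already-written object value would refer to.

```rust
/// Insert `name` into a sorted dictionary, keeping it sorted, and return
/// the index at which `name` now lives (hypothetical helper).
fn insert_sorted(dict: &mut Vec<String>, name: &str) -> usize {
    match dict.binary_search_by(|probe| probe.as_str().cmp(name)) {
        // Already present: reuse the existing position.
        Ok(idx) => idx,
        // Absent: insert at the sorted position, shifting later entries.
        Err(idx) => {
            dict.insert(idx, name.to_string());
            idx
        }
    }
}
```

If "foo" is inserted first it gets index 0, and a value written at that point would record 0 for "foo". Inserting "bar" afterwards also returns 0 and shifts "foo" to index 1, so the index stored in the earlier value now resolves to the wrong name.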
   
   
   

