alamb opened a new issue, #7425:
URL: https://github.com/apache/arrow-rs/issues/7425

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   - Part of https://github.com/apache/arrow-rs/issues/6736
   - Depends on  https://github.com/apache/arrow-rs/issues/7424
   
   A major use case for Variant values in Parquet and Arrow is efficiently
   processing JSON encoded data. Thus an important capability is being able to
   efficiently read JSON encoded bytes into the Variant binary encoding
   described in [VariantEncoding.md]. This ticket covers an API to parse one
   JSON value to one `Variant` value. Other tickets will cover converting
   Variants to JSON, converting to/from Arrow Utf8* arrays and `Variant`
   arrays, and writing these to/from Parquet.
   
   [VariantEncoding.md]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
   
   
   **Describe the solution you'd like**
   
   I would like an API to convert JSON encoded bytes to Variant encoded bytes
   
   
   
   **Describe alternatives you've considered**
   
   
   I suggest an API like this:
   ```rust
   // Provide location to write metadata, and value output
   // (should be anything that implements `std::io::Write` or some trait)
   let mut metadata_buffer = vec![];
   let mut value_buffer = vec![];
   // Input json encoded bytes
   let json_data: &[u8] = ...;
   // Call the new API
   json_to_variant(&mut metadata_buffer, &mut value_buffer, json_data)?;
   // metadata_buffer and value_buffer contain the variant information
   ```
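
   To make the shape of that signature concrete, here is a minimal sketch that
   is generic over `std::io::Write` for both outputs. It is only an
   illustration, not the proposed implementation: it handles just the JSON
   literals `null`, `true`, and `false`, and the error type (`io::Error`) is an
   assumption. The byte values follow my reading of [VariantEncoding.md]
   (primitive header is `type_id << 2`, empty metadata is version 1 with a
   zero-length dictionary) and should be checked against the spec.

   ```rust
   use std::io::{self, Write};

   /// Hypothetical sketch of `json_to_variant`, generic over any `Write` sink.
   /// Handles only the JSON literals `null`, `true`, and `false`.
   fn json_to_variant<M: Write, V: Write>(
       metadata: &mut M,
       value: &mut V,
       json: &[u8],
   ) -> io::Result<()> {
       let text = std::str::from_utf8(json)
           .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?
           // Strip the four whitespace characters JSON allows around a value
           .trim_matches(|c| matches!(c, ' ' | '\t' | '\n' | '\r'));

       // Assumed primitive type ids: 0 = null, 1 = true, 2 = false.
       // The low two header bits are the basic type (0 = primitive).
       let type_id: u8 = match text {
           "null" => 0,
           "true" => 1,
           "false" => 2,
           other => {
               return Err(io::Error::new(
                   io::ErrorKind::InvalidData,
                   format!("unsupported JSON in this sketch: {other}"),
               ))
           }
       };

       // Assumed empty metadata: header (version 1), dictionary size 0,
       // and a single offset of 0.
       metadata.write_all(&[0x01, 0x00, 0x00])?;
       value.write_all(&[type_id << 2])?;
       Ok(())
   }
   ```

   Because `Vec<u8>` implements `Write`, the call in the example above works
   unchanged with `vec![]` buffers, and the same function could write to a
   file or a pre-allocated cursor.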
   
   Prior art:
   - @scovich's PR here: #7403
   
   **Additional context**
   
   # Considerations:
   
   ## Reusing metadata across values?
   
   I think it will be common for the same metadata to be used across many
   different variant values (e.g. because the schema of the JSON documents is
   the same). Thus we should probably permit reusing validated metadata somehow
   (rather than requiring that it be recreated for each decoded JSON value).
   
   One option would be to add a `json` function to the `VariantBuilder`
   envisioned in the ticket linked above for reading `Variant` values.
   
   ```rust
   // Location to write metadata
   let mut metadata_buffer = vec![];
   // Create a builder
   let builder = VariantBuilder::new(&mut metadata_buffer);
   // Location to write the output variant value
   let mut value_buffer = vec![];
   builder.json(&mut value_buffer, json_data)?;
   // value_buffer contains the result of converting json_data to `Variant`
   ```
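
   To illustrate why reuse pays off: the Variant metadata holds a dictionary of
   field names, so documents with the same keys can share one dictionary. The
   sketch below is a hypothetical, simplified stand-in for that dictionary
   (`FieldNameDictionary` and `intern` are names made up here, not part of any
   existing API). It shows that a second document with the same schema adds no
   new entries.

   ```rust
   use std::collections::HashMap;

   /// Hypothetical stand-in for the field-name dictionary kept in the
   /// Variant metadata. Not the real encoding, only the interning idea.
   #[derive(Default)]
   struct FieldNameDictionary {
       ids: HashMap<String, u32>,
       names: Vec<String>,
   }

   impl FieldNameDictionary {
       /// Returns the id for `name`, adding an entry only on first use.
       fn intern(&mut self, name: &str) -> u32 {
           if let Some(&id) = self.ids.get(name) {
               return id;
           }
           let id = self.names.len() as u32;
           self.ids.insert(name.to_string(), id);
           self.names.push(name.to_string());
           id
       }

       /// Number of distinct field names seen so far.
       fn len(&self) -> usize {
           self.names.len()
       }
   }
   ```

   With a shared dictionary like this, converting a thousand documents that
   all have the fields `id` and `name` would store each name once, and every
   value buffer would refer to the fields by the same small ids.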
   
   ## Support for "streaming" / a push API
   
   As sketched above, this API would require the entire JSON value in a single
   buffer. A potentially more efficient API might be a "push" API, similar to
   how the arrow JSON reader works, which would support smaller buffer sizes
   and lower peak memory usage, as well as interleaving variant parsing with
   IO fetches.
   
   Perhaps something like
   
   ```rust
   let mut metadata_buffer = vec![];
   let builder = VariantBuilder::new(&mut metadata_buffer);
   let mut value_buffer = vec![];
   let mut parser = builder.json_parser(&mut value_buffer)?;
   // json data comes in from some source
   while let Some(json_data) = source.next() {
     parser.push(json_data); // incrementally parses json
   }
   parser.finish(); // complete in-progress variant
   // value_buffer contains the result of converting json_data to Variant
   ```
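
   One thing a push API has to get right is carrying parser state across chunk
   boundaries, since a chunk can end in the middle of a string or an escape
   sequence. The sketch below is a hypothetical illustration of just that part
   (`JsonStructureTracker` is a made-up name): it keeps a few bytes of state
   instead of buffering the input, and reports when a top-level object or
   array has been closed. A real parser would also emit Variant bytes as it
   goes; top-level scalars are not handled here.

   ```rust
   /// Hypothetical tracker for the state a push parser must keep between
   /// chunks. It does not build a Variant; it only follows JSON nesting.
   #[derive(Default)]
   struct JsonStructureTracker {
       depth: usize,     // currently open objects/arrays
       in_string: bool,  // inside a JSON string literal
       escaped: bool,    // previous byte was a backslash inside a string
       started: bool,    // a top-level object or array has been opened
   }

   impl JsonStructureTracker {
       /// Feeds one chunk of JSON bytes; chunks may split anywhere.
       fn push(&mut self, chunk: &[u8]) {
           for &b in chunk {
               if self.in_string {
                   if self.escaped {
                       // This byte is escaped, so it cannot end the string
                       self.escaped = false;
                   } else if b == b'\\' {
                       self.escaped = true;
                   } else if b == b'"' {
                       self.in_string = false;
                   }
               } else {
                   match b {
                       b'"' => self.in_string = true,
                       b'{' | b'[' => {
                           self.depth += 1;
                           self.started = true;
                       }
                       b'}' | b']' => self.depth = self.depth.saturating_sub(1),
                       _ => {}
                   }
               }
           }
       }

       /// True once a top-level object or array has been fully closed.
       fn is_complete(&self) -> bool {
           self.started && self.depth == 0 && !self.in_string
       }
   }
   ```

   Scanning bytes rather than characters is safe here because the quote,
   backslash, brace, and bracket bytes never occur inside a multi-byte UTF-8
   sequence.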
   
   


