scovich commented on issue #6522: URL: https://github.com/apache/arrow-rs/issues/6522#issuecomment-2634918695
> I prototyped this last month for polars, could share, it's a lot

This seems a bit surprising, given that the feature request is just to define new `num_buffered_rows` and `has_partial_record` methods that publicly expose information the JSON parser already tracks internally.

> one big issue though is the struct field isn't suited for json, because struct needs a schema and assumes json documents are homogenous.

This is definitely a general problem when parsing arbitrary JSON data, but IMO solving it is out of scope for the main part of this feature request, especially given that arrow-rs/json already has public API methods that parse JSON data with a homogeneous schema. Spark and other systems have the same limitation. It's just that the existing arrow-rs support is a pain to use if the JSON bytes come from a `StringArray` instead of a file.

Thus, the basic ask is super simple: expose a utility that maps from `StringArray` to `StructArray` using the _exact same existing capability_ (with all the same limitations) that [arrow_json::reader](https://arrow.apache.org/rust/arrow_json/reader/index.html) already provides. The only difference is the source of the raw JSON bytes (a rough sketch is at the end of this comment).

> for arbitrary json, like mappings with heterogenous keys, nested lists or list values in mappings, offsets arrays don't make sense for deeply nested paths. Also, heterogeneous flat leaf values with no keys, are valid json.

and

> I suggest adding a new datatype to Arrow which is identical to string datatype except it is named "json" to facilitate different handling of that kind of string (with serde)

Might I suggest taking a look at the new "variant" data type that [spark](https://github.com/apache/spark/blob/master/common/variant/README.md) added last year, and which will likely become an official [parquet](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) data type soon? It's specifically designed to handle deeply nested and strongly heterogeneous data as efficiently as possible. It looks like there's already a general tracking issue for arrow (https://github.com/apache/arrow/issues/42069), and people are already exploring adding that support to arrow-rs parquet (https://github.com/apache/arrow-rs/issues/6736).
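For concreteness, here is a rough sketch of the kind of utility I have in mind, built entirely on the existing `arrow_json` streaming `Decoder`. The function name `json_to_struct`, its signature, and the (non-)handling of null elements are just illustrative assumptions, not a proposed API:

```rust
use std::sync::Arc;

use arrow_array::{Array, RecordBatch, StringArray, StructArray};
use arrow_json::reader::ReaderBuilder;
use arrow_schema::{ArrowError, Schema};

/// Hypothetical helper: parse each element of a `StringArray` as one JSON
/// document with the given schema and return the results as a `StructArray`.
fn json_to_struct(json: &StringArray, schema: Arc<Schema>) -> Result<StructArray, ArrowError> {
    // Reuse the existing streaming decoder; size the batch so that every
    // decoded row fits in a single flush.
    let mut decoder = ReaderBuilder::new(schema.clone())
        .with_batch_size(json.len().max(1))
        .build_decoder()?;
    // Null elements are silently skipped here, so the output can have fewer
    // rows than the input -- a real API would need to pick a policy for that.
    for value in json.iter().flatten() {
        decoder.decode(value.as_bytes())?;
    }
    // `flush` returns None if no rows were decoded.
    let batch = decoder
        .flush()?
        .unwrap_or_else(|| RecordBatch::new_empty(schema));
    Ok(batch.into())
}
```

The point is that everything above already exists in `arrow_json`; the only new piece is the loop that feeds array elements (rather than file or stream chunks) to the decoder.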
