scovich commented on issue #6522: URL: https://github.com/apache/arrow-rs/issues/6522#issuecomment-2634918695
> I prototyped this last month for polars, could share, it's a lot

This seems a bit surprising, given that the feature request is just to define new `num_buffered_rows` and `has_partial_record` methods that publicly expose information the JSON parser already tracks internally.

> one big issue though is the struct field isn't suited for json, because struct needs a schema and assumes json documents are homogenous.

This is definitely a general problem when parsing arbitrary JSON data, but IMO solving it is out of scope for the main part of this feature request, especially given that arrow-rs/json already has public API methods that parse JSON data with a homogeneous schema. Spark and other systems have the same limitation. It's just that the existing arrow-rs support is a pain to use if the JSON bytes come from a `StringArray` instead of a file.

Thus, the basic ask is super simple: expose a utility that maps from `StringArray` to `StructArray` using the _exact same existing capability_ (with all the same limitations) that [arrow_json::reader](https://arrow.apache.org/rust/arrow_json/reader/index.html) already provides. The only difference is the source of the raw JSON bytes (a rough sketch is at the end of this comment).

> for arbitrary json, like mappings with heterogenous keys, nested lists or list values in mappings, offsets arrays don't make sense for deeply nested paths. Also, heterogeneous flat leaf values with no keys, are valid json.

and

> I suggest adding a new datatype to Arrow which is identical to string datatype except it is named "json" to facilitate different handling of that kind of string (with serde)

Might I suggest taking a look at the new "variant" data type that [spark](https://github.com/apache/spark/blob/master/common/variant/README.md) added last year, and which will likely become an official [parquet](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) data type soon? It's specifically designed to handle deeply nested and strongly heterogeneous data as efficiently as possible. It looks like there's already a general tracking issue for arrow (https://github.com/apache/arrow/issues/42069), and people are already exploring adding that support to arrow-rs parquet (https://github.com/apache/arrow-rs/issues/6736).
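For concreteness, here is a rough sketch of the kind of utility I have in mind, built entirely on the existing `arrow_json` streaming `Decoder`. The function name `json_to_struct`, its signature, and the (non-)handling of null elements are just illustrative assumptions, not a proposed API:

```rust
use std::sync::Arc;

use arrow_array::{Array, RecordBatch, StringArray, StructArray};
use arrow_json::reader::ReaderBuilder;
use arrow_schema::{ArrowError, Schema};

/// Hypothetical helper: parse each element of a `StringArray` as one JSON
/// document with the given schema and return the results as a `StructArray`.
fn json_to_struct(json: &StringArray, schema: Arc<Schema>) -> Result<StructArray, ArrowError> {
    // Reuse the existing streaming decoder; size the batch so that every
    // decoded row fits in a single flush.
    let mut decoder = ReaderBuilder::new(schema.clone())
        .with_batch_size(json.len().max(1))
        .build_decoder()?;
    // Null elements are silently skipped here, so the output can have fewer
    // rows than the input -- a real API would need to pick a policy for that.
    for value in json.iter().flatten() {
        decoder.decode(value.as_bytes())?;
    }
    // `flush` returns None if no rows were decoded.
    let batch = decoder
        .flush()?
        .unwrap_or_else(|| RecordBatch::new_empty(schema));
    Ok(batch.into())
}
```

The point is that everything above already exists in `arrow_json`; the only new piece is the loop that feeds array elements (rather than file or stream chunks) to the decoder.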
