scovich opened a new issue, #7230:
URL: https://github.com/apache/arrow-rs/issues/7230

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   JSON is notoriously semistructured, but as far as I can tell arrow-json's 
[`ReaderBuilder`](https://arrow.apache.org/rust/arrow_json/reader/struct.ReaderBuilder.html)
 only allows a very strict parsing of JSON rows to a specific homogeneous 
schema. There is an option to ignore unwanted columns 
(`with_strict_mode(false)`), and for converting boolean/numeric values to 
strings (`with_coerce_primitive(true)`), but any other kind of schema mismatch 
on even a single row produces a hard error for the whole batch.
   
   **Describe the solution you'd like**
   
   It would be nice to have some `ReaderBuilder` option that converted 
wrong-type values to NULL instead of forcing a parsing error. For example, 
Spark's (woefully underdocumented) 
[`from_json`](https://spark.apache.org/docs/latest/api/sql/index.html#from_json)
 function does this by default. 
   
   NOTE: This is _NOT_ a request to handle malformed JSON. Input that is not 
valid JSON should still be an error. This is about handling the case where e.g. 
the schema requested an int and the JSON provides an array of ints.
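   
   To make the requested semantics concrete, here is a minimal std-only sketch. It does not use arrow-json at all: the `Json` enum and the `decode_int_strict`, `decode_int_lenient`, and `decode_column` helpers are hypothetical names invented for illustration, and the `lenient` flag stands in for whatever `ReaderBuilder` option might eventually be added. It assumes the JSON has already been parsed successfully, so malformed input is out of scope here and would still be an error.

```rust
/// Hypothetical stand-in for an already-parsed JSON value.
/// This is NOT arrow-json's internal representation.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Int(i64),
    Str(String),
    Array(Vec<Json>),
}

/// Strict decoding of a nullable int column value:
/// any type mismatch is a hard error.
fn decode_int_strict(value: &Json) -> Result<Option<i64>, String> {
    match value {
        Json::Null => Ok(None),
        Json::Int(i) => Ok(Some(*i)),
        other => Err(format!("expected int, got {other:?}")),
    }
}

/// Lenient decoding: a type mismatch becomes NULL instead of an error.
fn decode_int_lenient(value: &Json) -> Option<i64> {
    decode_int_strict(value).ok().flatten()
}

/// Decode a whole column. In strict mode, one bad row fails the batch;
/// in lenient mode, the bad row is replaced by NULL and the rest survive.
fn decode_column<'a>(
    rows: impl IntoIterator<Item = &'a Json>,
    lenient: bool,
) -> Result<Vec<Option<i64>>, String> {
    rows.into_iter()
        .map(|v| {
            if lenient {
                Ok(decode_int_lenient(v))
            } else {
                decode_int_strict(v)
            }
        })
        .collect()
}
```

   With rows `1`, `[2, 3]`, and `null`, the strict path returns an error for the whole column, while the lenient path yields `[Some(1), None, None]`. A string value such as `"7"` would also become NULL in this sketch, since it does no primitive coercion.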
   
   Errors like this frequently come up in the context of delta-kernel-rs (and 
delta-spark as well), when different clients format JSON data in slightly 
different ways. 
   
   **Describe alternatives you've considered**
   
   One alternative might be to request a schema with all-string leaf fields and 
manually parse values afterward. Spark's `from_json` has such an option. But 
that doesn't handle the case where the data provides an array or object where 
the schema expected a leaf, and it also doesn't handle the case where the 
schema expected a non-leaf column like a struct and got an array or primitive 
instead.
   
   **Additional context**
   
   Some examples where it would be helpful to tolerate partially incompatible 
schemas:
   https://github.com/delta-io/delta/issues/2419
   https://github.com/delta-io/delta-kernel-rs/issues/501
   https://github.com/delta-io/delta-kernel-rs/issues/712
   

