[ 
https://issues.apache.org/jira/browse/ARROW-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510116#comment-17510116
 ] 

Ben Schmidt commented on ARROW-15978:
-------------------------------------

That is a reasonable concern. For my specific case a useful in-between option 
would be to make automatic list promotion an option to json.ParseOptions. 

To give some context, I work a lot with library records and other cultural 
heritage data, and I find myself encountering this all of the time in otherwise 
high-quality data. My hunch is that it often happens in sets where fields used 
to be in XML, and listiness was inferred from how many times they appear on a 
per-record basis.
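
To illustrate that hunch, here is a minimal sketch of a naive XML-to-dict converter of the kind that could produce this pattern. It is not taken from any particular dataset's pipeline; the `element_to_dict` helper and the sample XML are hypothetical. A tag that appears once stays a bare value, and a repeated tag becomes a list.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Naive conversion (hypothetical helper): repeated child tags become
    a list, while a tag that appears only once stays a bare value."""
    if len(elem) == 0:
        # Leaf element: just return its text.
        return elem.text
    out = {}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in out:
            # Second or later occurrence: switch the field to a list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


one = element_to_dict(ET.fromstring(
    "<record><author>Hunter Thompson</author></record>"))
two = element_to_dict(ET.fromstring(
    "<record><author>Bob Woodward</author>"
    "<author>Carl Bernstein</author></record>"))
```

Here `one["author"]` ends up as a plain string and `two["author"]` as a list of strings, which is exactly the mixed shape described in the issue.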

As a more real-world example of a case where this would help, here's a [single 
record|https://gist.github.com/bmschmidt/e038634264135fb25b6cb46ca1631f36] 
out of several million from the National Archives and Records Administration of 
the United States. (To be clear, this is *one record* split over several lines 
for clarity–NARA released several million rows of json like this). Although the 
schema is extremely complicated, pyarrow's json reader is generally able to 
correctly infer the times and convert the whole thing to a nested set of 
struct-columns that I can then flatten and work with.

It's a miracle that it does so well, and I am in awe of the Arrow group's work! 
But one thing holding it back right now is the key in line 151 of that gist, 
where the field at `description.specificRecordsTypeArray.specificRecordsType` 
(if I'm following it right) might be an object or an array of objects. There 
are many others like this in the dataset. Although schema-following would make 
it possible to read without rewriting the underlying json, inferring that 
schema would require reading the first few rows to get a starting schema, then 
reading until it encountered a single error, altering the schema, reading until 
the next error, fixing that one, etc.
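
For comparison, the "rewrite the underlying json" route can be sketched in a few lines of standard-library Python. This is only a rough workaround under stated assumptions, not anything pyarrow provides: it assumes the dotted paths of the mixed fields are already known, and `promote_path` and `normalize_ndjson` are hypothetical helper names.

```python
import json


def promote_path(value, path):
    """Wrap the value found at `path` (a list of keys) in a list,
    unless it is already a list or is null. Hypothetical helper."""
    if not path:
        if value is None or isinstance(value, list):
            return value
        return [value]
    if isinstance(value, list):
        # Intermediate list: apply the same path to every element.
        return [promote_path(v, path) for v in value]
    if isinstance(value, dict) and path[0] in value:
        value = dict(value)  # shallow copy so the input is not mutated
        value[path[0]] = promote_path(value[path[0]], path[1:])
    return value


def normalize_ndjson(lines, dotted_paths):
    """Yield ndjson lines in which every listed field is always an array."""
    paths = [p.split(".") for p in dotted_paths]
    for line in lines:
        record = json.loads(line)
        for path in paths:
            record = promote_path(record, path)
        yield json.dumps(record)


sample = [
    '{"author": "Hunter Thompson"}',
    '{"author": ["Bob Woodward", "Carl Bernstein"]}',
]
normalized = list(normalize_ndjson(sample, ["author"]))
```

The normalized lines could then be written to a file or an in-memory buffer and handed to `pyarrow.json.read_json`. The catch is that the list of paths has to come from somewhere, which is the same discovery problem described above.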

> [C++] Have JSON reader treat mixed singleton/array fields as arrays.
> --------------------------------------------------------------------
>
>                 Key: ARROW-15978
>                 URL: https://issues.apache.org/jira/browse/ARROW-15978
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python, R
>    Affects Versions: 7.0.0
>            Reporter: Ben Schmidt
>            Priority: Minor
>
> I frequently encounter real-world files that mix array types and singletons 
> across entries for a single field. For example, consider an ndjson file 
> consisting of:
>  
> {code:java}
> {"author": "Hunter Thompson"}
> {"author": ["Bob Woodward", "Carl Bernstein"]}
> {code}
> Widely used specs promote writing JSON like this, where a singleton isn't 
> wrapped in array brackets. For example, the 'target' field in the w3 
> annotation model [may be a string or an array of 
> strings.|https://www.w3.org/TR/annotation-model/#:~:text=The%20body%20and/or%20target%20relationships%20of%20the%20Annotation%20may%20be%20arrays%20rather%20than%20a%20single%20object.]
>  
>  
> Currently I see no way to read this sort of data with the C++ json reader. It 
> would be nice if arrow's ndjson reader could do two things to support data 
> like this.
>  # When inferring types, silently promote entries of type <T> to type <T[]> 
> if the column is mixed;
>  # When passed an explicit schema that includes a ListArray, promote all 
> instances of the field to an array if they aren't already. 
> My sense is that this might be pretty simple.
> Thanks to everyone who works on this project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
