Ben Schmidt created ARROW-15978:
-----------------------------------

             Summary: Have JSON reader treat mixed singleton/array fields as 
arrays.
                 Key: ARROW-15978
                 URL: https://issues.apache.org/jira/browse/ARROW-15978
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Python, R
    Affects Versions: 7.0.0
            Reporter: Ben Schmidt


I frequently encounter real-world files that mix array types and singletons 
across entries for a single field. For example, consider an ndjson file 
consisting of:

 
{code:java}
{"author": "Hunter Thompson"}
{"author": ["Bob Woodward", "Carl Bernstein"]} {code}
Widely used specs promote writing JSON like this, where a singleton isn't 
wrapped in array brackets. For example, the 'target' field in the W3C annotation 
model [may be a string or an array of 
strings.|https://www.w3.org/TR/annotation-model/#:~:text=The%20body%20and/or%20target%20relationships%20of%20the%20Annotation%20may%20be%20arrays%20rather%20than%20a%20single%20object.]
 

 

Currently I see no way to read this sort of data with the C++ JSON reader. It 
would be nice if Arrow's ndjson reader could do two things to support data like 
this:
 # When inferring types, silently promote entries of type <T> to type <T[]> if 
the column mixes both;
 # When passed an explicit schema in which the field has a list type, promote 
all singleton instances of that field to one-element arrays.
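
To make the two behaviours above concrete, here is a rough sketch in plain Python. It is not Arrow code and does not use any Arrow API; the helper names {{find_mixed_fields}} and {{promote_singletons}} are hypothetical, and it only handles top-level fields. The first helper mimics the inference case (detect fields that appear both as a singleton and as an array), the second mimics the explicit-schema case (wrap singletons for fields known to be lists).

```python
import json


def find_mixed_fields(records):
    """Return the top-level fields seen both as a list and as a non-list value.

    Hypothetical helper: stands in for the "inferring types" case.
    None values are ignored, since null is compatible with either form.
    """
    seen_list, seen_scalar = set(), set()
    for record in records:
        for key, value in record.items():
            if isinstance(value, list):
                seen_list.add(key)
            elif value is not None:
                seen_scalar.add(key)
    return seen_list & seen_scalar


def promote_singletons(records, list_fields):
    """Return copies of the records with singleton values wrapped in a list.

    Hypothetical helper: stands in for the "explicit schema" case, where
    list_fields plays the role of the fields declared with a list type.
    """
    promoted = []
    for record in records:
        fixed = dict(record)  # shallow copy, leave the input untouched
        for key in list_fields:
            value = fixed.get(key)
            if value is not None and not isinstance(value, list):
                fixed[key] = [value]
        promoted.append(fixed)
    return promoted


ndjson = "\n".join([
    '{"author": "Hunter Thompson"}',
    '{"author": ["Bob Woodward", "Carl Bernstein"]}',
])

records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
mixed = find_mixed_fields(records)
normalized = promote_singletons(records, mixed)

# Re-serialize as ndjson; every "author" value is now an array.
print("\n".join(json.dumps(r) for r in normalized))
```

As a workaround today, the re-serialized output could presumably be handed to the existing reader, since each field then has a single consistent type, but that costs an extra pass over the data outside Arrow, which is what this request is trying to avoid.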

My sense is that this might be pretty simple.

Thanks to everyone who works on this project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
