Johan Forsberg created ARROW-7647:
-------------------------------------

             Summary: Problem with read_json and arrays
                 Key: ARROW-7647
                 URL: https://issues.apache.org/jira/browse/ARROW-7647
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment: Ubuntu Linux 18.04, Python 3.7.5

            Reporter: Johan Forsberg


Hi! I'm trying to load some nested JSON data and am running into a problem with 
arrays. I can reproduce it with a slightly modified example from the 
documentation:
{code:python}
from pyarrow import json
import pyarrow as pa

with open("test.json", "w") as f:
    test_json = """{"a": [1], "b": {"c": true, "d": "1991-02-03"}}
{"a": [], "b": {"c": false, "d": "2019-04-01"}}
"""
    f.write(test_json)

json.read_json("test.json")
{code}
Running this code with pyarrow 0.15.1 (I also tried 0.14) gives the following 
error:
{code:java}
Traceback (most recent call last):
  File "issue.py", line 11, in <module>
    ccs = json.read_json("test.json")
  File "pyarrow/_json.pyx", line 195, in pyarrow._json.read_json
  File "pyarrow/public-api.pxi", line 285, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0 named a expected length 2 but got length 1
{code}
After trying various combinations, it seems the error appears only when the 
*total* number of elements across all the "a" arrays is less than the number of 
*rows* in the file. I did not expect any relationship between those two 
quantities, and I found nothing about one in the documentation. Is this 
intentional? If not, I suspect a problem in the validation step.
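
To make the pattern easier to check, here is a rough sketch of how the combinations could be tried systematically. The helper names (write_case, fits_hypothesis, try_read, run_cases) are made up for this illustration and are not part of pyarrow; the only pyarrow calls used are json.read_json and the ArrowInvalid exception from the traceback above. It assumes pyarrow is installed and has only been reasoned through against the case reported here.

```python
import json
import os
import tempfile


def write_case(path, lengths):
    """Write one JSON object per line; row i gets an "a" array with lengths[i] items."""
    with open(path, "w") as f:
        for n in lengths:
            row = {"a": list(range(n)), "b": {"c": True, "d": "1991-02-03"}}
            f.write(json.dumps(row) + "\n")


def fits_hypothesis(lengths):
    """True when the guess above predicts a failure: total elements < number of rows."""
    return sum(lengths) < len(lengths)


def try_read(lengths):
    """Return None if read_json succeeds, otherwise the ArrowInvalid message."""
    # Imported lazily so the helpers above work without pyarrow installed.
    import pyarrow as pa
    from pyarrow import json as pajson

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "case.json")
        write_case(path, lengths)
        try:
            pajson.read_json(path)
        except pa.ArrowInvalid as exc:
            return str(exc)
    return None


def run_cases(cases):
    """Print, for each case, whether a failure is predicted and what actually happened."""
    for lengths in cases:
        print(lengths, "predicted failure:", fits_hypothesis(lengths),
              "error:", try_read(lengths))
```

Calling run_cases([(1, 0), (1, 1), (2, 0), (1, 0, 0), (3, 0, 0)]) covers the case from this report, (1, 0), along with a few neighbours on either side of the "total elements < rows" boundary. If the guess is right, the "predicted failure" and "error" columns should agree.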


