[ 
https://issues.apache.org/jira/browse/ARROW-9547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwin Jung updated ARROW-9547:
------------------------------
    Environment: Debian in Docker. Python 2 and Python 3 (was: Debian in Docker)

> json.read_json crashes due to possible race
> -------------------------------------------
>
>                 Key: ARROW-9547
>                 URL: https://issues.apache.org/jira/browse/ARROW-9547
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0, 0.17.1
>         Environment: Debian in Docker. Python 2 and Python 3
>            Reporter: Edwin Jung
>            Priority: Major
>
> Simple calls to `read_json` fail with an exception like the one below. The
> failure can be non-deterministic, depending on the input file.
> ---
> Traceback (most recent call last):
>  File "test_arrow.py", line 11, in <module>
>  data = json.read_json(f, json.ReadOptions(use_threads=True))
>  File "pyarrow/_json.pyx", line 193, in pyarrow._json.read_json
>  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: JSON conversion to struct<continent: 
> timestamp[s], subcontinent: timestamp[s], country: timestamp[s]> is not 
> supported
> ---
> The input file is several thousand lines of ndjson, where each record looks 
> similar to:
> ```
> {
>   "title": "Black Friday 2019: Our Tips for Finding the Best Deals",
>   "text": ".... <bunch of text with arbitrary length>",
>   <bunch of other string and integer fields with arbitrary length>
>   "geoLocations": [
>     {
>       "continent": "Americas",
>       "subcontinent": "Northern America",
>       "country": "United States"
>     }
>   ]
> }
> ```
> Any given record may have an empty `geoLocations` array.
> Workarounds include:
> * shuffling the input file (not guaranteed to work)
> * partitioning the input file into separate pieces (not guaranteed to work)
> * disabling threaded reading (always works)
> * changing block size (not guaranteed to work)
> Other things that stop the crash include:
> * deleting fields from the input records
> I'm guessing that anything that changes how the data is partitioned or how
> it is split across threads affects automatic schema inference, and that
> conflicting inferred schemas are the source of the error. Supplying an
> explicit schema may also be a workaround.
> It's arguable that this is not a bug, but updating the API docs with a 
> warning would be very helpful.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
