[ 
https://issues.apache.org/jira/browse/ARROW-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick updated ARROW-12065:
----------------------------
    Description: 
I noticed this while analyzing a structurally simple but reasonably large 
JSON file, and I've reduced it to a fairly minimal reproduction:

```
import pyarrow.json

# Segfaults on pyarrow 3.0.0 with the test.json shown below
pyarrow.json.read_json('test.json')
```

where `test.json` contains:

```
{"A":"<0 repeated 1.6 million times>"}
{"B":[]}
```
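For anyone who wants to recreate the input, here is a minimal sketch of a generator for it. It assumes the placeholder above stands for a literal string of 1.6 million `0` characters and that the file is newline-delimited JSON with no extra whitespace; the helper name `write_repro_file` is hypothetical and not part of pyarrow.

```python
import json


def write_repro_file(path, n_zeros=1_600_000):
    """Write a two-line newline-delimited JSON file shaped like test.json.

    Assumption: the long value is simply the character "0" repeated n_zeros times.
    """
    # Compact separators keep the output byte-for-byte in the form shown above.
    compact = {"separators": (",", ":")}
    with open(path, "w") as f:
        # Line 1: a single key "A" holding one very long string.
        f.write(json.dumps({"A": "0" * n_zeros}, **compact) + "\n")
        # Line 2: a different key "B" holding an empty list.
        f.write(json.dumps({"B": []}, **compact) + "\n")
```

Calling `write_repro_file('test.json')` produces a file of about 1.6 MB, which can then be passed to `pyarrow.json.read_json` as in the snippet above.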

This file seems like it shouldn't be too large to load into memory all at once, 
so I'm surprised that there is a segfault.

Running under gdb and getting a backtrace gives:

```
(gdb) bt
#0 0x00007ffff5c1965d in std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<arrow::Buffer, (__gnu_cxx::_Lock_policy)2> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#1 0x00007ffff5ca8d9e in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr<arrow::Field> const&, std::shared_ptr<arrow::Array> const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#2 0x00007ffff5cabcc8 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr<arrow::ChunkedArray>*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#3 0x00007ffff5c1fc16 in arrow::json::TableReaderImpl::Read() () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300
#4 0x00007fffcf73da69 in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/_json.cpython-39-x86_64-linux-gnu.so
#5 0x00007ffff7d35a43 in ?? () from /usr/lib/libpython3.9.so.1.0
#6 0x00007ffff7d1be6d in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0
#7 0x00007ffff7d17b3a in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#8 0x00007ffff7d119ad in ?? () from /usr/lib/libpython3.9.so.1.0
#9 0x00007ffff7d11371 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.9.so.1.0
#10 0x00007ffff7dd3f83 in PyEval_EvalCode () from /usr/lib/libpython3.9.so.1.0
#11 0x00007ffff7de43dd in ?? () from /usr/lib/libpython3.9.so.1.0
#12 0x00007ffff7ddfc7b in ?? () from /usr/lib/libpython3.9.so.1.0
#13 0x00007ffff7cf38ab in ?? () from /usr/lib/libpython3.9.so.1.0
#14 0x00007ffff7cf3a63 in PyRun_InteractiveLoopFlags () from /usr/lib/libpython3.9.so.1.0
#15 0x00007ffff7c81f6b in PyRun_AnyFileExFlags () from /usr/lib/libpython3.9.so.1.0
#16 0x00007ffff7c7665c in ?? () from /usr/lib/libpython3.9.so.1.0
#17 0x00007ffff7dc6fa9 in Py_BytesMain () from /usr/lib/libpython3.9.so.1.0
#18 0x00007ffff7a43b25 in __libc_start_main () from /usr/lib/libc.so.6
#19 0x000055555555504e in _start ()
(gdb)
```

 



> segfault in pyarrow read_json
> -----------------------------
>
>                 Key: ARROW-12065
>                 URL: https://issues.apache.org/jira/browse/ARROW-12065
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: arch linux, 31G ram
>            Reporter: Patrick
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
