hrushikesh198 opened a new issue #11044:
URL: https://github.com/apache/arrow/issues/11044


   Hi,
   
   I am trying to load a ~300 MB, 1.4M-line file in JSONL format (one JSON object per line). The load fails with a segmentation fault. I saw that past issues reported this and that fixes were merged, but I still see the problem with pyarrow 3.0.0/4.0.0/5.0.0.
   When I load a subset of the file it works fine; with the complete data it fails.
   
   Thank you for any help you can offer.
   
   Here is a sample of the data (I cannot share the full file because it is private to my company):
   ```
   $ head -n3 data.json
   {"item_id": "100000663", "product_type": "Facial Masks", "brand": "Andalou Naturals", "color": "Other", "gender": "Unisex", "product_name": "Andalou Naturals Face Mask, Instant Luminous, 0.28 Oz"}
   {"item_id": "100001838", "product_type": "Dining Tables", "brand": "Liberty Furniture", "color": "Gray", "gender": "", "product_name": "Liberty Furniture Industries Summer House Rectangular Dining Table"}
   {"item_id": "100002700", "product_type": "Facial Treatments", "brand": "SkinCeuticals", "color": "", "gender": "Male", "product_name": "SkinCeuticals B3 Metacell Renewal 1.7 Oz"}
   ```
   I am using Python 3.7.10 and pyarrow 4.0.1, installed using conda (4.9.2) on a Debian 10 machine.
   
   Here is the tiny Python script:
   ```
   import faulthandler
   
   # import datasets
   from pyarrow import json as paj
   
   faulthandler.enable()  # print stack trace for seg faults
   
   if __name__ == '__main__':
       f1 = "data.json"
       paj.read_json(f1)  # fails with seg fault
   ```
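
Since a subset of the file loads fine while the full file crashes, one way to narrow the problem down is to search for the smallest prefix of the file that still triggers the crash. The following is a minimal sketch, not a confirmed reproduction procedure. The function names `crashes_on_prefix` and `smallest_crashing_prefix` are hypothetical, and the sketch assumes two things that may not hold: that crashing is monotonic in the prefix length (if the first N lines crash, so do the first N+1), and that a negative return code from the child process means it was killed by a signal such as SIGSEGV (the POSIX behavior of `subprocess`). Each attempt runs in a child process so that a segmentation fault does not take down the search itself.

```python
import os
import subprocess
import sys
import tempfile
from itertools import islice


def crashes_on_prefix(path, n_lines):
    """Return True if reading the first n_lines of `path` kills the child process.

    Assumption: a negative return code means the child died from a signal
    (for example SIGSEGV), which is how subprocess reports it on POSIX.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as tmp:
        with open(path, encoding="utf-8") as src:
            tmp.writelines(islice(src, n_lines))
    try:
        child = subprocess.run(
            [
                sys.executable,
                "-c",
                "import sys; from pyarrow import json as paj; "
                "paj.read_json(sys.argv[1])",
                tmp.name,
            ]
        )
        return child.returncode < 0
    finally:
        os.unlink(tmp.name)


def smallest_crashing_prefix(total_lines, crashes):
    """Binary search for the smallest n in [1, total_lines] where crashes(n) is True.

    Assumes crashes(total_lines) is True and that crashing is monotonic in n.
    """
    lo, hi = 1, total_lines
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid  # the crash already happens within the first `mid` lines
        else:
            lo = mid + 1  # the first `mid` lines load fine
    return lo


# Hypothetical usage against the file from the script above:
# n = smallest_crashing_prefix(1_400_000, lambda k: crashes_on_prefix("data.json", k))
# print("smallest crashing prefix:", n)
```

With about 1.4M lines the search needs roughly 21 reads. If the monotonicity assumption does not hold (for example, if the crash depends on where block boundaries fall), the result is only a hint and not a minimal reproducer.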
   
   
   Here is the stacktrace from gdb:
   ```
   (gdb) run segf.py
   Starting program: segf.py
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
   [New Thread 0x7ffff47ff700 (LWP 17933)]
   [New Thread 0x7ffff0b77700 (LWP 17934)]
   [New Thread 0x7fffdb3fb700 (LWP 17935)]
   [New Thread 0x7fffda2ff700 (LWP 17936)]
   [New Thread 0x7fffd9afe700 (LWP 17937)]
   [New Thread 0x7fffd8dff700 (LWP 17938)]
   [New Thread 0x7fffcbfff700 (LWP 17939)]
   [New Thread 0x7fffcb7fe700 (LWP 17940)]
   [New Thread 0x7fffcabff700 (LWP 17941)]
   [New Thread 0x7fffc9b7f700 (LWP 17942)]
   [New Thread 0x7fffc8dff700 (LWP 17943)]
   [New Thread 0x7fffafbff700 (LWP 17944)]
   [New Thread 0x7fffae5ff700 (LWP 17945)]
   [New Thread 0x7fffadbfe700 (LWP 17946)]
   [New Thread 0x7fffa3fff700 (LWP 17947)]
   [New Thread 0x7fffa37fe700 (LWP 17948)]
   [New Thread 0x7fffa27fd700 (LWP 17949)]
   
   Thread 1 "python" received signal SIGSEGV, Segmentation fault.
   0x00007ffff68ca21a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
   (gdb) backtrace
   #0  0x00007ffff68ca21a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
   #1  0x00007ffff70b84f2 in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr<arrow::Field> const&, std::shared_ptr<arrow::Array> const&) ()
      from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
   #2  0x00007ffff70b6d86 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr<arrow::ChunkedArray>*) ()
      from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
   #3  0x00007ffff70e9d13 in arrow::json::TableReaderImpl::Read() () from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
   #4  0x00007ffff0b82eba in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /opt/conda/lib/python3.7/site-packages/pyarrow/_json.cpython-37m-x86_64-linux-gnu.so
   #5  0x00005555556e2427 in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=0x0, args=0x7ffff79a85c8, nargs=<optimized out>, kwnames=<optimized out>)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:693
   #6  0x00005555556e3ad8 in _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, args=0x7ffff79a85c8, func=0x7ffff32f9af0)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:723
   #7  call_function (pp_stack=0x7fffffffd8e0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4568
   #8  0x000055555570e74a in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3093
   #9  0x0000555555651af2 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff79a8450) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
   #10 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>,
       kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>,
       name=<optimized out>, qualname=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
   #11 0x0000555555652d09 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>,
       kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3959
   #12 0x000055555572d8ab in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:524
   #13 0x0000555555791f53 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff7a11eb0, locals=0x7ffff7a11eb0, flags=<optimized out>, arena=<optimized out>)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/pythonrun.c:1035
   #14 0x000055555579bfd7 in PyRun_FileExFlags (fp=0x555555925d30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7a11eb0, locals=0x7ffff7a11eb0, closeit=1,
       flags=0x7fffffffdbd0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/pythonrun.c:988
   #15 0x000055555579c1ac in PyRun_SimpleFileExFlags (fp=0x555555925d30, filename=<optimized out>, closeit=1, flags=0x7fffffffdbd0)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/pythonrun.c:429
   #16 0x000055555579c709 in pymain_run_file (p_cf=0x7fffffffdbd0, filename=<optimized out>, fp=0x555555925d30)
       at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:456
   #17 pymain_run_filename (cf=0x7fffffffdbd0, pymain=0x7fffffffdce0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:1646
   #18 pymain_run_python (pymain=0x7fffffffdce0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:2907
   #19 pymain_main (pymain=0x7fffffffdce0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:3068
   #20 0x000055555579c85c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:3103
   #21 0x00007ffff7c6b09b in __libc_start_main (main=0x555555631100 <main>, argc=2, argv=0x7fffffffde38, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
       stack_end=0x7fffffffde28) at ../csu/libc-start.c:308
   #22 0x0000555555719901 in _start () at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Parser/parser.c:325
   (gdb)
   ```
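
Frame #1 of the backtrace is in `arrow::json::ChunkedListArrayBuilder::Insert`, while the three sample records contain only string values. That may indicate (this is an assumption, not a diagnosis) that some record in the full file holds a list, or otherwise differs in type from the sample. A quick way to check is to tally the JSON value types seen per key using only the standard library. The helper name `summarize_value_types` is hypothetical, and the sketch assumes every non-empty line is a JSON object.

```python
import json
from collections import Counter, defaultdict


def summarize_value_types(lines):
    """Count, per key, the Python types of the decoded JSON values.

    `lines` is any iterable of JSONL lines (for example an open file).
    Assumption: each non-empty line decodes to a JSON object.
    """
    types_by_key = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            types_by_key[key][type(value).__name__] += 1
    return {key: dict(counts) for key, counts in types_by_key.items()}


# Hypothetical usage against the file from the script above:
# with open("data.json", encoding="utf-8") as f:
#     for key, counts in summarize_value_types(f).items():
#         print(key, counts)
```

A key that reports more than one type (for example both `str` and `list`) would be a candidate for the records worth isolating into a small reproducer.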

