alippai opened a new issue, #36389:
URL: https://github.com/apache/arrow/issues/36389
### Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I cannot get a simple pd.DataFrame -> pq.write_to_dataset write working on
pyarrow 11.0.0, 12.0.0, or 12.0.1.
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({"a": [0.0] * 100})
pq.write_to_dataset(pa.Table.from_pandas(df), "/tmp/dump", use_threads=False)
```
Running this script yields:
```
terminate called without an active exception
Aborted (core dumped)
```
instantly.
Setting `use_threads`, partitioning, `set_cpu_count`, or any of the environment
variables from https://arrow.apache.org/docs/cpp/env_vars.html#cpp-env-vars does
not change the behavior. Python 3.10 and 3.11 both produce the error.
If I write:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({"a": [0.0] * 100})
t = pa.Table.from_pandas(df)
pq.write_to_dataset(t, "/tmp/dump", use_threads=False)
```
or
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import time
df = pd.DataFrame({"a": [0.0] * 100})
pq.write_to_dataset(pa.Table.from_pandas(df), "/tmp/dump", use_threads=False)
time.sleep(0.001)
```
With either variant the crashes become less frequent or disappear, so I assume
the cause is an interaction between pyarrow and the Python GC (some GIL magic?).
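Because the crash is intermittent, comparing the variants above is easier with a crash count than with single runs. The following is a minimal sketch, not part of the original report: `count_crashes`, `REPRO`, and `REPRO_WITH_SLEEP` are hypothetical names, and it assumes a POSIX system, where a child killed by a signal such as SIGABRT reports a negative return code.

```python
import subprocess
import sys


def count_crashes(snippet: str, runs: int = 20) -> int:
    """Run `snippet` in `runs` fresh interpreters and count signal-terminated runs."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a negative return code means the child was killed by a
        # signal (SIGABRT for "terminate called without an active exception").
        if result.returncode < 0:
            crashes += 1
    return crashes


# The failing script from the report, unchanged.
REPRO = """
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({"a": [0.0] * 100})
pq.write_to_dataset(pa.Table.from_pandas(df), "/tmp/dump", use_threads=False)
"""

# The variant from the report that sleeps briefly before the interpreter exits.
REPRO_WITH_SLEEP = REPRO + "import time\ntime.sleep(0.001)\n"
```

Calling `count_crashes(REPRO)` and `count_crashes(REPRO_WITH_SLEEP)` in the affected environment should show whether the sleep changes the crash rate; no expected numbers are given here because they depend on the machine.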
I tried multiple package versions, all from conda, e.g.
`conda create -n arrowcrash python=3.11 pyarrow=12.0.1 pandas`, on regular
Linux x86_64 with many cores.
Limiting the process to a few cores with `taskset` also reduces the number of
crashes.
Downgrading to pyarrow=10.0.1 fixes the issue as well.
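For reference, the `taskset` restriction can be approximated from inside Python. This is only a sketch under assumptions: `limit_cpus` is a hypothetical helper, it relies on the Linux-only `os.sched_setaffinity`, and whether it changes the crash frequency the same way `taskset` does has not been verified. It would need to run before pyarrow is imported so that threads created later inherit the restricted CPU set.

```python
import os


def limit_cpus(max_cpus: int) -> set[int]:
    """Restrict the calling thread to at most `max_cpus` of its allowed CPUs (Linux only)."""
    if max_cpus < 1:
        raise ValueError("max_cpus must be at least 1")
    # Start from the CPUs the process may currently use, so an outer
    # taskset or cgroup restriction is respected.
    allowed = sorted(os.sched_getaffinity(0))
    chosen = set(allowed[:max_cpus])
    os.sched_setaffinity(0, chosen)
    return chosen
```

Calling `limit_cpus(2)` as the first statement of the script, before the pyarrow imports, is roughly comparable to launching the script under `taskset` with two cores.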
Here is some sanitized gdb output, in case it helps:
```
(gdb) bt
#0 0x00007ffff6e9037f in raise () from /lib64/libc.so.6
#1 0x00007ffff6e7adb5 in abort () from /lib64/libc.so.6
#2 0x00007ffff4ec0ed0 in __gnu_cxx::__verbose_terminate_handler () at
/home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007ffff4ebf40c in __cxxabiv1::__terminate (handler=<optimized out>)
at
/home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4 0x00007ffff4ebf45e in std::terminate () at
/home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5 0x00007ffff4ebf0d9 in __cxxabiv1::__gxx_personality_v0
(version=<optimized out>, actions=10, exception_class=0,
ue_header=0x7ff979ffbd70, context=0x7ff979ff9990) at
../../../../libstdc++-v3/libsupc++/unwind-pe.h:681
#6 0x00007ffff7e248ed in _Unwind_ForcedUnwind_Phase2
(exc=exc@entry=0x7ff979ffbd70, context=context@entry=0x7ff979ff9990,
frames_p=frames_p@entry=0x7ff979ff9898) at ../../../libgcc/gthr-default.h:183
#7 0x00007ffff7e24c50 in _Unwind_ForcedUnwind (exc=0x7ff979ffbd70,
stop=0x7ffff7bc13c0 <unwind_stop>, stop_argument=<optimized out>) at
../../../libgcc/gthr-default.h:218
#8 0x00007ffff7bc1556 in __pthread_unwind () from /lib64/libpthread.so.0
#9 0x00007ffff7bb940b in pthread_exit () from /lib64/libpthread.so.0
#10 0x00005555556e8bd2 in PyThread_exit_thread () at
/usr/local/src/conda/python-3.11.4/Include/internal/object.h:366
#11 0x0000555555647b09 in take_gil (tstate=<optimized out>) at
/usr/local/src/conda/python-3.11.4/Programs/pystate.c:226
#12 0x000055555573e352 in PyEval_RestoreThread (tstate=0x7ff9680056a0) at
/usr/local/src/conda/python-3.11.4/Programs/ceval_gil.h:535
#13 0x00005555558249cd in PyGILState_Ensure () at
/usr/local/src/conda/python-3.11.4/Modules/obmalloc.c:1708
#14 0x00007ffff689e49e in arrow::py::NumPyBuffer::~NumPyBuffer() () from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/libarrow_python.so
#15 0x00007ffff6878a53 in std::_Sp_counted_ptr_inplace<arrow::ArrayData,
std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/libarrow_python.so
#16 0x00007ffff5452c22 in arrow::SimpleRecordBatch::~SimpleRecordBatch() ()
from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow.so.1100
#17 0x00007ffb186e4ca2 in
arrow::dataset::InMemoryFragment::~InMemoryFragment() () from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
#18 0x00007ffb186e3dda in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use_cold()
() from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
#19 0x00007ffb186e31d2 in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() [clone .part.0]
() from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
#20 0x00007ffb186e7800 in
std::_Function_handler<arrow::Future<std::shared_ptr<arrow::RecordBatch> > (),
arrow::dataset::InMemoryFragment::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions>
const&)::Generator>::_M_manager(std::_Any_data&, std::_Any_data const&,
std::_Manager_operation) () from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
#21 0x00007ffb18740f19 in
std::_Sp_counted_ptr_inplace<arrow::DefaultIfEmptyGenerator<std::shared_ptr<arrow::RecordBatch>
>::State, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
from
/miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
```
```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff21ff700 (LWP 2586858)]
[New Thread 0x7ffff12d0700 (LWP 2587561)]
...
[New Thread 0x7ff9797fa700 (LWP 2587921)]
[New Thread 0x7ff963fff700 (LWP 2587922)]
[New Thread 0x7ff9637fe700 (LWP 2587923)]
terminate called without an active exception
Thread ... "python3.11" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ff979ffb700 (LWP 2587920)]
0x00007ffff6e9037f in raise () from /lib64/libc.so.6
```
### Component(s)
Parquet, Python