I'm not sure of the exact error you are getting but I suspect this may
be related to something I am currently working on [1].  I can reproduce
it fairly easily without GCS:

```
import pyarrow as pa
import pyarrow.dataset as ds

my_dataset = ds.dataset(['/some/big/file.csv'], format='csv')
batch_iter = my_dataset.to_batches()
```

This program will occasionally emit a segmentation fault on exit.  The
call to `to_batches` starts reading the dataset in the background
(this is true even if use_threads=False, which is itself another bug [2]).
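
For example, something like this (a sketch assuming the same `my_dataset`
from the snippet above) still kicks off background reads today:

```
# Per ARROW-15732, passing use_threads=False does not currently prevent
# the scan from starting in the background, so the same crash on exit
# can still occur.
batch_iter = my_dataset.to_batches(use_threads=False)
```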

When the program exits this is what should happen:

 * The batch_iter destructor should signal cancellation / abort in the
underlying scanner.  Any pending, but not yet submitted, tasks should
be purged.  The destructor should then block until the remaining
in-progress tasks have completed.
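
Put as rough pseudo-Python (the method names here are made up for
illustration; they are not real pyarrow APIs, just stand-ins for the
internal C++ scanner operations described above):

```
# Illustrative sketch of the intended destructor sequence only.
def iterator_destructor(scanner):
    scanner.request_cancel()      # signal cancellation / abort
    scanner.purge_pending()       # drop queued-but-not-submitted tasks
    scanner.join_in_progress()    # block until running tasks finish
```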

However, today, the destructor does not block until the remaining
in-progress tasks have completed.  These tasks generally capture
copies of the state that they rely on and so they can safely execute
even after the iterator is destroyed.  However, there are a few global
resources (global CPU executor, global memory pool, maybe some GCS or
python resources?) that these tasks assume always exist and the tasks
do not (and cannot) keep these global resources alive.  So, at
shutdown, depending on the order in which things shut down, the
in-progress tasks may try to access now-deleted global resources and
fail.
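
Until the fix lands, one way to dodge the crash (going by the
observations in the thread below rather than a guaranteed workaround) is
to make sure no scan tasks are still in flight when the interpreter
starts finalizing, e.g. by fully consuming or dropping the iterator
before exit:

```
import pyarrow.dataset as ds

my_dataset = ds.dataset(['/some/big/file.csv'], format='csv')
batch_iter = my_dataset.to_batches()

# Draining the iterator (or sleeping briefly) before exit was reported
# to make the crash go away because the scan finishes before shutdown.
for batch in batch_iter:
    pass
del batch_iter
```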

Last week and this week I have been actively working on [1] (there are
a few subtasks as this is a fairly involved change to the scanner) and
hope to return to [2] soon.  My current hope is to have something
ready by the end of the month.

[1] https://issues.apache.org/jira/browse/ARROW-16072
[2] https://issues.apache.org/jira/browse/ARROW-15732

On Wed, Aug 10, 2022 at 1:15 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hi - Gently bump this. I suspect this is an upstream issue and wonder if
> this is a known issue. Is there any other information we can provide? (I
> think the repro is pretty straightforward but let us know otherwise)
>
> On Mon, Aug 8, 2022 at 8:16 PM Alex Libman <alex.lib...@twosigma.com> wrote:
>
> > Hi,
> >
> > I've hit an issue in Python (3.9.12) where creating a Pyarrow dataset over
> > a remote filesystem (such as GCS filesystem), and then opening a batch
> > iterator over the dataset and having the program immediately exit /
> > clean-up afterwards causes a PyGILState_Release error to get thrown. This
> > is with pyarrow version v7.0.0.
> >
> > The error looks like:
> > Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380 must
> > be current when releasing
> > Python runtime state: finalizing (tstate=0x55a079959380)
> >
> > Thread 0x00007fbfff5ee400 (most recent call first):
> > <no Python frame>
> >
> >
> > Example reproduce code:
> >
> > import pandas as pd
> > import pyarrow.dataset as ds
> >
> > # Get GCS fsspec filesystem
> > fs = get_gcs_fs()
> >
> > dummy_df = pd.DataFrame({"a": [1,2,3]})
> >
> > # Write out some dummy data for us to load a dataset from
> > data_path = "test-bucket/debug-arrow-datasets/data.parquet"
> > with fs.open(data_path, "wb") as f:
> >     dummy_df.to_parquet(f)
> >
> > dummy_ds = ds.dataset([data_path], filesystem=fs)
> >
> > batch_iter = dummy_ds.to_batches()
> > # Program finish
> >
> > # Putting some buffer time after the iterator is opened causes the
> > # issue to go away
> > # import time
> > # time.sleep(1)
> >
> > Using local parquet files for the dataset, adding some buffer time between
> > iterator open and program exit (via time.sleep or something else), or
> > consuming the entire iterator seems to make the issue go away. Is this
> > reproducible if you swap in your own GCS filesystem?
> >
> > Thanks,
> > Alex
> >
