Thanks Weston. (Alex can correct me if I'm wrong.) So far we have only seen this in tests, not in production usage. Glad to know this is being worked on.
On Wed, Aug 10, 2022 at 8:22 PM Weston Pace <weston.p...@gmail.com> wrote:

> I'm not sure of the exact error you are getting, but I suspect it may
> be related to something I am currently working on [1]. I can reproduce
> it fairly easily without GCS:
>
> ```
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> my_dataset = ds.dataset(['/some/big/file.csv'], format='csv')
> batch_iter = my_dataset.to_batches()
> ```
>
> This program will occasionally emit a segmentation fault on exit. The
> call to `to_batches` starts reading the dataset in the background
> (this is true even if use_threads=False, which is another bug [2]).
>
> When the program exits, this is what should happen:
>
> * The batch_iter destructor should signal cancellation / abort in the
> underlying scanner. Any pending, but not yet submitted, tasks should
> be purged. The destructor should then block until the remaining
> in-progress tasks have completed.
>
> However, today, the destructor does not block until the remaining
> in-progress tasks have completed. These tasks generally capture
> copies of the state that they rely on, so they can safely execute
> even after the iterator is destroyed. However, there are a few global
> resources (the global CPU executor, the global memory pool, maybe some
> GCS or Python resources?) that these tasks assume always exist, and
> the tasks do not (and cannot) keep these global resources alive. So,
> at shutdown, depending on the order in which things shut down, the
> in-progress tasks may try to access now-deleted global resources and
> fail.
>
> Last week and this week I have been actively working on [1] (there are
> a few subtasks, as this is a fairly involved change to the scanner)
> and hope to return to [2] soon. My current hope is to have something
> ready by the end of the month.
>
> [1] https://issues.apache.org/jira/browse/ARROW-16072
> [2] https://issues.apache.org/jira/browse/ARROW-15732
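To make the intended shutdown sequence concrete, here is a minimal sketch of the pattern Weston describes, written in plain Python rather than Arrow's actual C++ scanner; `BackgroundScan`, its work items, and its methods are hypothetical stand-ins, not Arrow internals:

```
import concurrent.futures


class BackgroundScan:
    """Hypothetical stand-in for a scanner that reads in the background."""

    def __init__(self, work_items, max_workers=4):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers)
        # Submit everything up front; some tasks start running
        # immediately, the rest sit queued (pending) in the pool.
        self._futures = [self._pool.submit(self._scan_one, item)
                         for item in work_items]

    def _scan_one(self, item):
        # Stand-in for reading one batch; the real tasks would touch
        # shared resources such as the CPU executor or memory pool.
        return item

    def close(self):
        # Purge pending, not-yet-started tasks. cancel() is a no-op
        # for tasks that are already running.
        for fut in self._futures:
            fut.cancel()
        # Block until the remaining in-progress tasks have completed
        # before the rest of teardown proceeds.
        self._pool.shutdown(wait=True)
```

The bug, as described above, is that the final blocking step (the `shutdown(wait=True)` in this sketch) is effectively skipped in the destructor today, so in-progress tasks can outlive the global resources they depend on.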
> On Wed, Aug 10, 2022 at 1:15 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > Hi - Gently bumping this. I suspect this is an upstream issue and
> > wonder if it is a known issue. Is there any other information we can
> > provide? (I think the repro is pretty straightforward, but let us
> > know otherwise.)
> >
> > On Mon, Aug 8, 2022 at 8:16 PM Alex Libman <alex.lib...@twosigma.com> wrote:
> > >
> > > Hi,
> > >
> > > I've hit an issue in Python (3.9.12) where creating a PyArrow
> > > dataset over a remote filesystem (such as a GCS filesystem), then
> > > opening a batch iterator over the dataset and having the program
> > > immediately exit / clean up afterwards, causes a PyGILState_Release
> > > error to be thrown. This is with pyarrow version 7.0.0.
> > >
> > > The error looks like:
> > >
> > > Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380
> > > must be current when releasing
> > > Python runtime state: finalizing (tstate=0x55a079959380)
> > >
> > > Thread 0x00007fbfff5ee400 (most recent call first):
> > > <no Python frame>
> > >
> > > Example reproduce code:
> > >
> > > import pandas as pd
> > > import pyarrow.dataset as ds
> > >
> > > # Get GCS fsspec filesystem
> > > fs = get_gcs_fs()
> > >
> > > dummy_df = pd.DataFrame({"a": [1, 2, 3]})
> > >
> > > # Write out some dummy data for us to load a dataset from
> > > data_path = "test-bucket/debug-arrow-datasets/data.parquet"
> > > with fs.open(data_path, "wb") as f:
> > >     dummy_df.to_parquet(f)
> > >
> > > dummy_ds = ds.dataset([data_path], filesystem=fs)
> > >
> > > batch_iter = dummy_ds.to_batches()
> > > # Program finish
> > >
> > > # Putting some buffer time after the iterator is opened causes the
> > > # issue to go away
> > > # import time
> > > # time.sleep(1)
> > >
> > > Using local parquet files for the dataset, adding some buffer time
> > > between iterator open and program exit (via time.sleep or something
> > > else), or consuming the entire iterator seems to make the issue go
> > > away. Is this reproducible if you swap in your own GCS filesystem?
> > >
> > > Thanks,
> > > Alex
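Until the fix in ARROW-16072 lands, the observations in the thread point to a practical workaround: make sure the background scan has finished before the interpreter begins finalizing, for example by draining the iterator. A minimal sketch, reusing the hypothetical `get_gcs_fs()` helper from Alex's repro:

```
import pyarrow.dataset as ds

fs = get_gcs_fs()  # hypothetical helper from the repro above
dummy_ds = ds.dataset(["test-bucket/debug-arrow-datasets/data.parquet"],
                      filesystem=fs)

# Reported in the thread to avoid the crash: consume the iterator fully
# so no background read tasks are still in flight at interpreter exit.
for batch in dummy_ds.to_batches():
    pass  # process each batch here
```

Adding buffer time before exit (e.g. `time.sleep(1)`) also avoids the crash, per Alex's report, but draining the iterator is deterministic rather than a timing guess.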