I'm not sure of the exact error you are getting, but I suspect this may be related to something I am currently working on [1]. I can reproduce it fairly easily without GCS:
```
import pyarrow as pa
import pyarrow.dataset as ds

my_dataset = ds.dataset(['/some/big/file.csv'], format='csv')
batch_iter = my_dataset.to_batches()
```

This program will occasionally emit a segmentation fault on exit. The call to `to_batches` starts reading the dataset in the background (this is true even if use_threads=False, which is another bug [2]). When the program exits, this is what should happen:

* The batch_iter destructor should signal cancellation / abort in the underlying scanner.
* Any pending, but not yet submitted, tasks should be purged.
* The destructor should then block until the remaining in-progress tasks have completed.

However, today, the destructor does not block until the remaining in-progress tasks have completed. These tasks generally capture copies of the state they rely on, so they can safely execute even after the iterator is destroyed. However, there are a few global resources (the global CPU executor, the global memory pool, maybe some GCS or Python resources?) that these tasks assume always exist, and the tasks do not (and cannot) keep those global resources alive. So, at shutdown, depending on the order in which things shut down, the in-progress tasks may try to access now-deleted global resources and fail.
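In the meantime, the practical workaround is to make sure the scan is no longer running in the background by the time the interpreter starts finalizing, which lines up with the observation further down that sleeping briefly or consuming the entire iterator makes the error go away. A rough sketch (the /tmp/example.csv path is just a placeholder I made up for illustration):

```
import pyarrow.dataset as ds

# Placeholder path for illustration only; substitute your own data source.
my_dataset = ds.dataset(['/tmp/example.csv'], format='csv')
batch_iter = my_dataset.to_batches()

# Workaround 1: consume the iterator fully, so no background read task is
# still in flight when the interpreter starts tearing down global resources.
for batch in batch_iter:
    pass  # real processing would go here

# Workaround 2 (if consuming everything isn't practical): drop the iterator
# explicitly and give any already-submitted tasks a moment to finish.
# del batch_iter
# import time; time.sleep(1)
```

Neither of those is a real fix, of course; the real fix is to have the destructor cancel and then block on (or safely detach) the in-progress tasks, which is what [1] is about.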
Last week and this week I have been actively working on [1] (there are a few subtasks, as this is a fairly involved change to the scanner) and I hope to return to [2] soon. My current hope is to have something ready by the end of the month.

[1] https://issues.apache.org/jira/browse/ARROW-16072
[2] https://issues.apache.org/jira/browse/ARROW-15732

On Wed, Aug 10, 2022 at 1:15 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hi - Gently bump this. I suspect this is an upstream issue and wonder if
> this is a known issue. Is there any other information we can provide? (I
> think the repro is pretty straightforward, but let us know otherwise.)
>
> On Mon, Aug 8, 2022 at 8:16 PM Alex Libman <alex.lib...@twosigma.com> wrote:
> >
> > Hi,
> >
> > I've hit an issue in Python (3.9.12) where creating a PyArrow dataset over
> > a remote filesystem (such as a GCS filesystem), then opening a batch
> > iterator over the dataset and having the program immediately exit /
> > clean up afterwards, causes a PyGILState_Release error to be thrown. This
> > is with pyarrow version 7.0.0.
> >
> > The error looks like:
> >
> > Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380 must
> > be current when releasing
> > Python runtime state: finalizing (tstate=0x55a079959380)
> >
> > Thread 0x00007fbfff5ee400 (most recent call first):
> > <no Python frame>
> >
> > Example reproduce code:
> >
> > import pandas as pd
> > import pyarrow.dataset as ds
> >
> > # Get GCS fsspec filesystem
> > fs = get_gcs_fs()
> >
> > dummy_df = pd.DataFrame({"a": [1, 2, 3]})
> >
> > # Write out some dummy data for us to load a dataset from
> > data_path = "test-bucket/debug-arrow-datasets/data.parquet"
> > with fs.open(data_path, "wb") as f:
> >     dummy_df.to_parquet(f)
> >
> > dummy_ds = ds.dataset([data_path], filesystem=fs)
> >
> > batch_iter = dummy_ds.to_batches()
> > # Program finish
> >
> > # Putting some buffer time after the iterator is opened causes the issue
> > # to go away:
> > # import time
> > # time.sleep(1)
> >
> > Using local parquet files for the dataset, adding some buffer time between
> > iterator open and program exit (via time.sleep or something else), or
> > consuming the entire iterator seems to make the issue go away. Is this
> > reproducible if you swap in your own GCS filesystem?
> >
> > Thanks,
> > Alex