Thanks Weston. (Alex can correct me if I'm wrong.) So far we have only seen this in tests, not in production usage. Glad to know this is being worked on.
On Wed, Aug 10, 2022 at 8:22 PM Weston Pace <weston.p...@gmail.com> wrote:

> I'm not sure of the exact error you are getting, but I suspect it may
> be related to something I am currently working on [1]. I can reproduce
> it fairly easily without GCS:
>
> ```
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> my_dataset = ds.dataset(['/some/big/file.csv'], format='csv')
> batch_iter = my_dataset.to_batches()
> ```
>
> This program will occasionally emit a segmentation fault on exit. The
> call to `to_batches` starts reading the dataset in the background
> (this is true even if use_threads=False, which is another bug [2]).
>
> When the program exits, this is what should happen:
>
> * The batch_iter destructor should signal cancellation / abort in the
> underlying scanner. Any pending, but not yet submitted, tasks should
> be purged. The destructor should then block until the remaining
> in-progress tasks have completed.
>
> However, today, the destructor does not block until the remaining
> in-progress tasks have completed. These tasks generally capture
> copies of the state that they rely on, so they can safely execute
> even after the iterator is destroyed. However, there are a few global
> resources (the global CPU executor, the global memory pool, maybe some
> GCS or Python resources?) that these tasks assume always exist, and
> the tasks do not (and cannot) keep these global resources alive. So,
> at shutdown, depending on the order in which things shut down, the
> in-progress tasks may try to access now-deleted global resources and
> fail.
>
> Last week and this week I have been actively working on [1] (there are
> a few subtasks, as this is a fairly involved change to the scanner)
> and hope to return to [2] soon. My current hope is to have something
> ready by the end of the month.
>
> [1] https://issues.apache.org/jira/browse/ARROW-16072
> [2] https://issues.apache.org/jira/browse/ARROW-15732
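To make the intended shutdown sequence concrete, here is a minimal sketch of the pattern Weston describes, written in plain Python rather than Arrow's actual C++ scanner; `BackgroundScan`, its work items, and its methods are hypothetical stand-ins, not Arrow internals:

```
import concurrent.futures


class BackgroundScan:
    """Hypothetical stand-in for a scanner that reads in the background."""

    def __init__(self, work_items, max_workers=4):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers)
        # Submit everything up front; some tasks start running
        # immediately, the rest sit queued (pending) in the pool.
        self._futures = [self._pool.submit(self._scan_one, item)
                         for item in work_items]

    def _scan_one(self, item):
        # Stand-in for reading one batch; the real tasks would touch
        # shared resources such as the CPU executor or memory pool.
        return item

    def close(self):
        # Purge pending, not-yet-started tasks. cancel() is a no-op
        # for tasks that are already running.
        for fut in self._futures:
            fut.cancel()
        # Block until the remaining in-progress tasks have completed
        # before the rest of teardown proceeds.
        self._pool.shutdown(wait=True)
```

The bug, as described above, is that the final blocking step (the `shutdown(wait=True)` in this sketch) is effectively skipped in the destructor today, so in-progress tasks can outlive the global resources they depend on.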
> On Wed, Aug 10, 2022 at 1:15 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > Hi - Gently bumping this. I suspect this is an upstream issue and
> > wonder if it is a known issue. Is there any other information we can
> > provide? (I think the repro is pretty straightforward, but let us
> > know otherwise.)
> >
> > On Mon, Aug 8, 2022 at 8:16 PM Alex Libman <alex.lib...@twosigma.com> wrote:
> > >
> > > Hi,
> > >
> > > I've hit an issue in Python (3.9.12) where creating a PyArrow
> > > dataset over a remote filesystem (such as a GCS filesystem), then
> > > opening a batch iterator over the dataset and having the program
> > > immediately exit / clean up afterwards, causes a PyGILState_Release
> > > error to be thrown. This is with pyarrow version 7.0.0.
> > >
> > > The error looks like:
> > >
> > > Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380
> > > must be current when releasing
> > > Python runtime state: finalizing (tstate=0x55a079959380)
> > >
> > > Thread 0x00007fbfff5ee400 (most recent call first):
> > > <no Python frame>
> > >
> > > Example reproduce code:
> > >
> > > import pandas as pd
> > > import pyarrow.dataset as ds
> > >
> > > # Get GCS fsspec filesystem
> > > fs = get_gcs_fs()
> > >
> > > dummy_df = pd.DataFrame({"a": [1, 2, 3]})
> > >
> > > # Write out some dummy data for us to load a dataset from
> > > data_path = "test-bucket/debug-arrow-datasets/data.parquet"
> > > with fs.open(data_path, "wb") as f:
> > >     dummy_df.to_parquet(f)
> > >
> > > dummy_ds = ds.dataset([data_path], filesystem=fs)
> > >
> > > batch_iter = dummy_ds.to_batches()
> > > # Program finish
> > >
> > > # Putting some buffer time after the iterator is opened causes the
> > > # issue to go away
> > > # import time
> > > # time.sleep(1)
> > >
> > > Using local parquet files for the dataset, adding some buffer time
> > > between iterator open and program exit (via time.sleep or something
> > > else), or consuming the entire iterator seems to make the issue go
> > > away. Is this reproducible if you swap in your own GCS filesystem?
> > >
> > > Thanks,
> > > Alex
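Until the fix in ARROW-16072 lands, the observations in the thread point to a practical workaround: make sure the background scan has finished before the interpreter begins finalizing, for example by draining the iterator. A minimal sketch, reusing the hypothetical `get_gcs_fs()` helper from Alex's repro:

```
import pyarrow.dataset as ds

fs = get_gcs_fs()  # hypothetical helper from the repro above
dummy_ds = ds.dataset(["test-bucket/debug-arrow-datasets/data.parquet"],
                      filesystem=fs)

# Reported in the thread to avoid the crash: consume the iterator fully
# so no background read tasks are still in flight at interpreter exit.
for batch in dummy_ds.to_batches():
    pass  # process each batch here
```

Adding buffer time before exit (e.g. `time.sleep(1)`) also avoids the crash, per Alex's report, but draining the iterator is deterministic rather than a timing guess.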