Hi - gently bumping this. I suspect this is an upstream issue and am wondering whether it's already known. Is there any other information we can provide? (I think the repro is pretty straightforward, but let us know otherwise.)
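For completeness, here is a minimal sketch of the get_gcs_fs() helper the repro assumes. This is an illustration rather than our exact code - any fsspec-compatible GCS filesystem should reproduce it; we happen to use the gcsfs library:

import gcsfs

def get_gcs_fs():
    # Illustrative only: gcsfs provides an fsspec-compatible GCS
    # filesystem. Credentials are inferred from the environment
    # (e.g. GOOGLE_APPLICATION_CREDENTIALS or gcloud auth).
    return gcsfs.GCSFileSystem()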
On Mon, Aug 8, 2022 at 8:16 PM Alex Libman <alex.lib...@twosigma.com> wrote:
> Hi,
>
> I've hit an issue in Python (3.9.12) where creating a PyArrow dataset over
> a remote filesystem (such as a GCS filesystem), opening a batch iterator
> over the dataset, and then having the program immediately exit and clean
> up afterwards causes a PyGILState_Release error to be thrown. This is
> with pyarrow version v7.0.0.
>
> The error looks like:
>
> Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380 must
> be current when releasing
> Python runtime state: finalizing (tstate=0x55a079959380)
>
> Thread 0x00007fbfff5ee400 (most recent call first):
> <no Python frame>
>
> Example reproduction code:
>
> import pandas as pd
> import pyarrow.dataset as ds
>
> # Get a GCS fsspec filesystem
> fs = get_gcs_fs()
>
> dummy_df = pd.DataFrame({"a": [1, 2, 3]})
>
> # Write out some dummy data for us to load a dataset from
> data_path = "test-bucket/debug-arrow-datasets/data.parquet"
> with fs.open(data_path, "wb") as f:
>     dummy_df.to_parquet(f)
>
> dummy_ds = ds.dataset([data_path], filesystem=fs)
>
> batch_iter = dummy_ds.to_batches()
> # Program finish
>
> # Putting some buffer time after the iterator is opened makes the issue
> # go away:
> # import time
> # time.sleep(1)
>
> Using local parquet files for the dataset, adding some buffer time between
> opening the iterator and program exit (via time.sleep or something else),
> or consuming the entire iterator all seem to make the issue go away. Is
> this reproducible if you swap in your own GCS filesystem?
>
> Thanks,
> Alex