Hi - gently bumping this. I suspect it's an upstream issue and am wondering
whether it's already known. Is there any other information we can provide?
(I think the repro is pretty straightforward, but let us know otherwise.)

On Mon, Aug 8, 2022 at 8:16 PM Alex Libman <alex.lib...@twosigma.com> wrote:

> Hi,
>
> I've hit an issue in Python (3.9.12) where creating a PyArrow dataset over
> a remote filesystem (such as a GCS filesystem), then opening a batch
> iterator over the dataset and letting the program immediately exit and
> clean up afterwards causes a fatal PyGILState_Release error. This is with
> pyarrow 7.0.0.
>
> The error looks like:
> Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380 must
> be current when releasing
> Python runtime state: finalizing (tstate=0x55a079959380)
>
> Thread 0x00007fbfff5ee400 (most recent call first):
> <no Python frame>
>
>
> Example repro code:
>
> import pandas as pd
> import pyarrow.dataset as ds
>
> # Get a GCS fsspec filesystem
> fs = get_gcs_fs()
>
> dummy_df = pd.DataFrame({"a": [1, 2, 3]})
>
> # Write out some dummy data for us to load a dataset from
> data_path = "test-bucket/debug-arrow-datasets/data.parquet"
> with fs.open(data_path, "wb") as f:
>     dummy_df.to_parquet(f)
>
> # Open the dataset and a batch iterator over it, then let the program exit
> dummy_ds = ds.dataset([data_path], filesystem=fs)
> batch_iter = dummy_ds.to_batches()
> # Program finish
>
> # Putting some buffer time after the iterator is opened makes the issue go away:
> # import time
> # time.sleep(1)
>
>
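> For reference, get_gcs_fs() above is just our helper for constructing an
> fsspec GCS filesystem; any equivalent should work. A minimal sketch,
> assuming gcsfs is installed and default credentials are available:
>
> import gcsfs
>
> def get_gcs_fs():
>     # Hypothetical stand-in for our internal helper; any fsspec-compatible
>     # remote filesystem should reproduce the issue.
>     return gcsfs.GCSFileSystem()
>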
> Using local Parquet files for the dataset, adding some buffer time between
> opening the iterator and program exit (via time.sleep or similar), or
> consuming the entire iterator all seem to make the issue go away. Is this
> reproducible if you swap in your own GCS filesystem?
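>
> For completeness, the consume-the-entire-iterator workaround is just:
>
> for batch in batch_iter:
>     pass  # fully draining the iterator before exit avoids the crash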
>
> Thanks,
> Alex
>
