Joris Van den Bossche created ARROW-11983:
---------------------------------------------
Summary: [Python] ImportError calling pyarrow from_pandas within
ThreadPool
Key: ARROW-11983
URL: https://issues.apache.org/jira/browse/ARROW-11983
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
Fix For: 4.0.0
>From https://github.com/dask/dask/issues/7334
The referenced issue report is about an ImportError they get using Python 3.9
(and I can reproduce it). As far as I know how dask works, it's basically
calling `pa.Table.from_pandas` within a ThreadPool, and inside `from_pandas` we
do a `with futures.ThreadPoolExecutor`, which then fails with this error:
{code}
>>> df2.to_parquet('test99.parquet', engine='pyarrow-dataset')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/core.py",
line 4127, in to_parquet
return to_parquet(self, path, *args, **kwargs)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py",
line 671, in to_parquet
out = out.compute(**compute_kwargs)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/base.py",
line 283, in compute
(result,) = compute(self, traverse=False, **kwargs)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/base.py",
line 565, in compute
results = schedule(dsk, keys, **kwargs)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/threaded.py",
line 76, in get
results = get_async(
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py",
line 487, in get_async
raise_exception(exc, tb)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py",
line 317, in reraise
raise exc
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py",
line 222, in execute_task
result = _execute_task(task, data)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/core.py",
line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/utils.py",
line 35, in apply
return func(*args, **kwargs)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py",
line 841, in write_partition
t = cls._pandas_to_arrow_table(df, preserve_index=preserve_index,
schema=schema)
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py",
line 814, in _pandas_to_arrow_table
table = pa.Table.from_pandas(df, preserve_index=preserve_index,
schema=schema)
File "pyarrow/table.pxi", line 1479, in pyarrow.lib.Table.from_pandas
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
line 596, in dataframe_to_arrays
with futures.ThreadPoolExecutor(nthreads) as executor:
File
"/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/concurrent/futures/__init__.py",
line 49, in __getattr__
from .thread import ThreadPoolExecutor as te
ImportError: cannot import name 'ThreadPoolExecutor' from partially initialized
module 'concurrent.futures.thread' (most likely due to a circular import)
(/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/concurrent/futures/thread.py)
{code}
We can probably avoid that by moving the import top-level (not inline inside
dataframe_to_arrays)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)