Nicholas Kappel created ARROW-16864:
---------------------------------------
Summary: pyarrow - external_id for S3 incorrectly set
Key: ARROW-16864
URL: https://issues.apache.org/jira/browse/ARROW-16864
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 8.0.0
Reporter: Nicholas Kappel
It looks like any attempt to read from S3 via pyarrow fails when access is
supposed to go through Assume Role and no `external_id` is passed to
S3FileSystem.
As I understand it, `external_id` is an optional string that is passed to the
AWS API. However, `external_id=None` is the default in `__init__`, and the
value is later passed through `tobytes()`, so the call fails whenever
`external_id` is None.
https://github.com/apache/arrow/blob/c72f84a48b4952796ab78a0c33b84a9fc8f893db/python/pyarrow/_s3fs.pyx#L230
This then leads to an exception like this:
{code}
(...)
    df = cursor.execute(query+';').as_pandas()
  File "/opt/conda/lib/python3.9/site-packages/pyathena/util.py", line 37, in _wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/cursor.py", line 157, in execute
    self.result_set = AthenaPandasResultSet(
  File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/result_set.py", line 72, in __init__
    self._fs = self.__s3_file_system()
  File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/result_set.py", line 86, in __s3_file_system
    fs = fs.S3FileSystem(
  File "pyarrow/_s3fs.pyx", line 217, in pyarrow._s3fs.S3FileSystem.__init__
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, NoneType found
{code}
This exception comes from using pyarrow together with the pyathena library, whose code
[does not pass any external_id|https://github.com/laughingman7743/PyAthena/blob/e055742b4742e2efaa0f4b3ff1c5af1146e25b23/pyathena/pandas/result_set.py#L86].
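
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It does not call pyarrow: `to_std_string` is a hypothetical stand-in for the string conversion done in the Cython layer, and `normalize_external_id` is a hypothetical helper showing one way an unset value could be handled before conversion. Neither name exists in pyarrow.

```python
def to_std_string(value):
    """Hypothetical stand-in for the conversion done in the Cython layer.

    str is encoded to bytes; anything that is still not bytes is rejected,
    mirroring the "expected bytes, NoneType found" error in the traceback.
    """
    if isinstance(value, str):
        value = value.encode("utf8")
    if not isinstance(value, bytes):
        raise TypeError(f"expected bytes, {type(value).__name__} found")
    return value


def normalize_external_id(external_id=None):
    """Hypothetical helper: map an unset external_id to an empty string."""
    return "" if external_id is None else external_id


# Passing None straight through reproduces the failure mode.
try:
    to_std_string(None)
except TypeError as exc:
    error_message = str(exc)

# Normalizing first lets the conversion succeed.
converted = to_std_string(normalize_external_id(None))
```

Whether an empty `external_id` is treated the same as an omitted one by pyarrow and the AWS API is an assumption I have not verified, so passing `external_id=""` from the caller side should be seen only as a possible workaround to try.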