Nicholas Kappel created ARROW-16864:
---------------------------------------

             Summary: pyarrow - external_id for S3 incorrectly set
                 Key: ARROW-16864
                 URL: https://issues.apache.org/jira/browse/ARROW-16864
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 8.0.0
            Reporter: Nicholas Kappel


Any attempt to read from S3 via pyarrow appears to fail when access is 
supposed to go through Assume Role and no `external_id` is passed to 
S3FileSystem.

As I understand it, `external_id` is an optional string passed to the AWS 
API. However, `__init__` defaults it to `external_id=None` and later applies 
`tobytes()` to it, which fails when `external_id` is None:
https://github.com/apache/arrow/blob/c72f84a48b4952796ab78a0c33b84a9fc8f893db/python/pyarrow/_s3fs.pyx#L230

This then leads to an exception like this:
{code}
(...)
    df = cursor.execute(query+';').as_pandas()
  File "/opt/conda/lib/python3.9/site-packages/pyathena/util.py", line 37, in _wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/cursor.py", line 157, in execute
    self.result_set = AthenaPandasResultSet(
  File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/result_set.py", line 72, in __init__
    self._fs = self.__s3_file_system()
  File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/result_set.py", line 86, in __s3_file_system
    fs = fs.S3FileSystem(
  File "pyarrow/_s3fs.pyx", line 217, in pyarrow._s3fs.S3FileSystem.__init__
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, NoneType found
{code}
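
The failure mode described above can be sketched in isolation, without 
pyarrow or AWS. This is only an illustration under assumptions: `tobytes` and 
`to_cpp_string` below are stand-ins written for this example (they mimic the 
behaviour implied by the traceback, not the actual pyarrow or Cython code), 
and the guarded variant is one possible fix, not necessarily the one the 
project would choose.

```python
def tobytes(value):
    # Stand-in for pyarrow's helper (assumed behaviour): encode str to
    # UTF-8 bytes and pass every other object, including None, through.
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


def to_cpp_string(value):
    # Stand-in for the Cython bytes -> std::string conversion, which
    # accepts bytes only and rejects everything else.
    if not isinstance(value, bytes):
        raise TypeError(f"expected bytes, {type(value).__name__} found")
    return value


def convert_external_id(external_id=None):
    # Unguarded path: the default None survives tobytes() and is then
    # rejected by the conversion step.
    return to_cpp_string(tobytes(external_id))


def convert_external_id_guarded(external_id=None):
    # Guarded path (hypothetical fix): treat a missing external_id as
    # an empty string before converting.
    if external_id is None:
        external_id = ""
    return to_cpp_string(tobytes(external_id))
```

With these stand-ins, `convert_external_id()` raises the same `TypeError` 
message as in the traceback, while `convert_external_id_guarded()` returns 
empty bytes.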

This exception comes from using pyarrow through the PyAthena library, whose 
code [does not pass any 
external_id|https://github.com/laughingman7743/PyAthena/blob/e055742b4742e2efaa0f4b3ff1c5af1146e25b23/pyathena/pandas/result_set.py#L86].
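
Until this is fixed in pyarrow, a caller could in principle avoid the None 
default by always passing a string. The sketch below is an untested 
assumption, not a confirmed workaround: `assume_role_kwargs` is a 
hypothetical helper, the role ARN is a placeholder, and whether an empty 
`external_id` is accepted on the AWS side has not been verified here.

```python
def assume_role_kwargs(role_arn, external_id=None):
    """Build S3FileSystem keyword arguments without a None external_id.

    Hypothetical helper: replaces a missing external_id with an empty
    string so that the str -> bytes conversion has something to encode.
    """
    return {
        "role_arn": role_arn,
        "external_id": external_id if external_id is not None else "",
    }


# Placeholder ARN, for illustration only.
kwargs = assume_role_kwargs("arn:aws:iam::123456789012:role/example")

# Intended use (requires pyarrow and valid AWS credentials):
#   from pyarrow import fs
#   s3 = fs.S3FileSystem(**kwargs)
```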



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
