jafournier opened a new issue, #39304:
URL: https://github.com/apache/arrow/issues/39304
### Describe the bug, including details regarding any error messages, version, and platform.
I have an AWS assume-role configuration on an AWS EC2 instance (NB: the role is in a different account than the instance).
```
[profile default]
role_arn = arn:aws:iam::123456768910:role/super-role
credential_source = Ec2InstanceMetadata
role_session_name = me
region = [some region]
```
When using pyarrow's ParquetDataset
```
from pyarrow.parquet import ParquetDataset
path = "s3://somebucket/jfo_test.parquet"
ParquetDataset(path, use_legacy_dataset=False)
```
I get the following stack trace:
```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[15], line 3
      1 from pyarrow.parquet import ParquetDataset
      2 path = "s3://somebucket/jfo_test.parquet"
----> 3 ParquetDataset(path, use_legacy_dataset=False)

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in ParquetDataset.__new__(cls, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit, thrift_string_size_limit, thrift_container_size_limit)
   1790     use_legacy_dataset = False
   1792 if not use_legacy_dataset:
-> 1793     return _ParquetDatasetV2(
   1794         path_or_paths, filesystem=filesystem,
   1795         filters=filters,
   1796         partitioning=partitioning,
   1797         read_dictionary=read_dictionary,
   1798         memory_map=memory_map,
   1799         buffer_size=buffer_size,
   1800         pre_buffer=pre_buffer,
   1801         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1802         # unsupported keywords
   1803         schema=schema, metadata=metadata,
   1804         split_row_groups=split_row_groups,
   1805         validate_schema=validate_schema,
   1806         metadata_nthreads=metadata_nthreads,
   1807         thrift_string_size_limit=thrift_string_size_limit,
   1808         thrift_container_size_limit=thrift_container_size_limit,
   1809     )
   1810 warnings.warn(
   1811     "Passing 'use_legacy_dataset=True' to get the legacy behaviour is "
   1812     "deprecated as of pyarrow 11.0.0, and the legacy implementation "
   1813     "will be removed in a future version." + extra_msg,
   1814     FutureWarning, stacklevel=2)
   1815 self = object.__new__(cls)

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/parquet/core.py:2482, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2480 except ValueError:
   2481     filesystem = LocalFileSystem(use_mmap=memory_map)
-> 2482 finfo = filesystem.get_file_info(path_or_paths)
   2483 if finfo.is_file:
   2484     single_file = path_or_paths

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/_fs.pyx:581, in pyarrow._fs.FileSystem.get_file_info()

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: When getting information for key 'jfo_test.parquet' in bucket 'somebucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
```
However, I am able to perform read/HeadObject operations on this object!
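For reference, here is a minimal sketch of the check I mean, assuming boto3 is installed and resolves the same `default` profile (bucket and key are the placeholders from above):
```
import boto3

# boto3 resolves the assume-role profile from ~/.aws/config,
# including credential_source = Ec2InstanceMetadata.
session = boto3.Session(profile_name="default")
s3 = session.client("s3")

# The same HeadObject call that fails through pyarrow's native S3 filesystem.
resp = s3.head_object(Bucket="somebucket", Key="jfo_test.parquet")
print(resp["ContentLength"])  # succeeds with the assumed role's credentials
```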
Moreover, using the legacy dataset
```
from pyarrow.parquet import ParquetDataset
path = "s3://somebucket/jfo_test.parquet"
ParquetDataset(path, use_legacy_dataset=True) # works!
```
or using the s3fs filesystem
```
from pyarrow.parquet import ParquetDataset
from s3fs.core import S3FileSystem
path = "s3://somebucket/jfo_test.parquet"
s3 = S3FileSystem()
ParquetDataset(path.removeprefix("s3://"), filesystem=s3,
               use_legacy_dataset=False)  # works!
```
both work perfectly fine!
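A possible workaround along the same lines (a sketch only, assuming boto3 is available; `access_key`, `secret_key`, `session_token`, and `region` are documented `pyarrow.fs.S3FileSystem` keywords, and the region value is a placeholder) is to let boto3 perform the AssumeRole and hand the temporary credentials to pyarrow's native filesystem:
```
import boto3
from pyarrow.fs import S3FileSystem
from pyarrow.parquet import ParquetDataset

# Let boto3 resolve the assume-role profile, then freeze the resulting
# temporary credentials and pass them to pyarrow's native S3 filesystem.
creds = boto3.Session(profile_name="default").get_credentials().get_frozen_credentials()
fs = S3FileSystem(access_key=creds.access_key,
                  secret_key=creds.secret_key,
                  session_token=creds.token,
                  region="eu-west-1")  # placeholder: use the profile's region
ParquetDataset("somebucket/jfo_test.parquet", filesystem=fs,
               use_legacy_dataset=False)
```
The caveat is that frozen credentials never refresh, so a long-running job would have to rebuild the filesystem once the STS session expires.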
Gathering a bit more AWS info by running
```
from pyarrow.parquet import ParquetDataset
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)
path = "s3://somebucket/jfo_test.parquet"
ParquetDataset(path, use_legacy_dataset=False)
```
I see in the logs:
```
[INFO] 2023-12-19 14:46:51.366 ProcessCredentialsProvider [140552489178752] Failed to find credential process's profile: default
```
Unfortunately, the credentials used in the HeadObject call are the instance security credentials retrieved by this earlier call:
```
[DEBUG] 2023-12-19 14:46:51.361 EC2MetadataClient [140552489178752] Calling EC2MetadataService resource http://169.254.169.254/latest/meta-data/iam/security-credentials/[my-instance-profile] with token.
```
and the instance profile isn't allowed to access the bucket directly...
How can we make pyarrow._s3fs.S3FileSystem compatible with an assume-role config?
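For now, the closest thing to a workaround I can sketch (untested here; `role_arn` and `session_name` are documented `pyarrow.fs.S3FileSystem` keywords, and the region is a placeholder) is to repeat the profile's settings in code so the filesystem performs the AssumeRole call itself:
```
from pyarrow.fs import S3FileSystem
from pyarrow.parquet import ParquetDataset

# Duplicate the ~/.aws/config settings explicitly: pyarrow's native S3
# filesystem uses the instance-profile credentials to assume the role.
fs = S3FileSystem(role_arn="arn:aws:iam::123456768910:role/super-role",
                  session_name="me",
                  region="eu-west-1")  # placeholder: the profile's region
ParquetDataset("somebucket/jfo_test.parquet", filesystem=fs,
               use_legacy_dataset=False)
```
But that duplicates the config file in code, so native support for resolving `role_arn` / `credential_source` from the profile would still be the real fix.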
I am using pyarrow 14.0.1 with Python 3.9.
### Component(s)
Python