jafournier opened a new issue, #39304:
URL: https://github.com/apache/arrow/issues/39304
### Describe the bug, including details regarding any error messages, version, and platform.
I have an AWS assume-role configuration on an AWS EC2 instance (NB: the role is in a different account than the instance).
```
[profile default]
role_arn = arn:aws:iam::123456768910:role/super-role
credential_source = Ec2InstanceMetadata
role_session_name = me
region = [some region]
```
When using pyarrow's ParquetDataset
```
from pyarrow.parquet import ParquetDataset
path = "s3://somebucket/jfo_test.parquet"
ParquetDataset(path, use_legacy_dataset=False)
```
I get the following stack trace:
```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[15], line 3
      1 from pyarrow.parquet import ParquetDataset
      2 path = "s3://somebucket/jfo_test.parquet"
----> 3 ParquetDataset(path, use_legacy_dataset=False)

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in ParquetDataset.__new__(cls, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit, thrift_string_size_limit, thrift_container_size_limit)
   1790     use_legacy_dataset = False
   1792 if not use_legacy_dataset:
-> 1793     return _ParquetDatasetV2(
   1794         path_or_paths, filesystem=filesystem,
   1795         filters=filters,
   1796         partitioning=partitioning,
   1797         read_dictionary=read_dictionary,
   1798         memory_map=memory_map,
   1799         buffer_size=buffer_size,
   1800         pre_buffer=pre_buffer,
   1801         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1802         # unsupported keywords
   1803         schema=schema, metadata=metadata,
   1804         split_row_groups=split_row_groups,
   1805         validate_schema=validate_schema,
   1806         metadata_nthreads=metadata_nthreads,
   1807         thrift_string_size_limit=thrift_string_size_limit,
   1808         thrift_container_size_limit=thrift_container_size_limit,
   1809     )
   1810 warnings.warn(
   1811     "Passing 'use_legacy_dataset=True' to get the legacy behaviour is "
   1812     "deprecated as of pyarrow 11.0.0, and the legacy implementation "
   1813     "will be removed in a future version." + extra_msg,
   1814     FutureWarning, stacklevel=2)
   1815 self = object.__new__(cls)

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/parquet/core.py:2482, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2480 except ValueError:
   2481     filesystem = LocalFileSystem(use_mmap=memory_map)
-> 2482 finfo = filesystem.get_file_info(path_or_paths)
   2483 if finfo.is_file:
   2484     single_file = path_or_paths

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/_fs.pyx:581, in pyarrow._fs.FileSystem.get_file_info()

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.pyenv/versions/3.9.13/envs/test-jfo/lib/python3.9/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: When getting information for key 'jfo_test.parquet' in bucket 'somebucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
```
However, I am able to perform read/HeadObject operations on this object!
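For reference, here is a minimal sketch of the check I mean, assuming boto3 is installed and resolves the same `default` profile (bucket and key are the placeholders from above):
```
import boto3

# boto3 resolves the assume-role profile from ~/.aws/config,
# including credential_source = Ec2InstanceMetadata.
session = boto3.Session(profile_name="default")
s3 = session.client("s3")

# The same HeadObject call that fails through pyarrow's native S3 filesystem.
resp = s3.head_object(Bucket="somebucket", Key="jfo_test.parquet")
print(resp["ContentLength"])  # succeeds with the assumed role's credentials
```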
Moreover, using the legacy dataset
```
from pyarrow.parquet import ParquetDataset
path = "s3://somebucket/jfo_test.parquet"
ParquetDataset(path, use_legacy_dataset=True) # works!
```
or using the s3fs filesystem
```
from pyarrow.parquet import ParquetDataset
from s3fs.core import S3FileSystem
path = "s3://somebucket/jfo_test.parquet"
s3 = S3FileSystem()
ParquetDataset(path.removeprefix("s3://"), filesystem=s3,
               use_legacy_dataset=False)  # works!
```
both work perfectly fine!
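A possible workaround along the same lines (a sketch only, assuming boto3 is available; `access_key`, `secret_key`, `session_token`, and `region` are documented `pyarrow.fs.S3FileSystem` keywords, and the region value is a placeholder) is to let boto3 perform the AssumeRole and hand the temporary credentials to pyarrow's native filesystem:
```
import boto3
from pyarrow.fs import S3FileSystem
from pyarrow.parquet import ParquetDataset

# Let boto3 resolve the assume-role profile, then freeze the resulting
# temporary credentials and pass them to pyarrow's native S3 filesystem.
creds = boto3.Session(profile_name="default").get_credentials().get_frozen_credentials()
fs = S3FileSystem(access_key=creds.access_key,
                  secret_key=creds.secret_key,
                  session_token=creds.token,
                  region="eu-west-1")  # placeholder: use the profile's region
ParquetDataset("somebucket/jfo_test.parquet", filesystem=fs,
               use_legacy_dataset=False)
```
The caveat is that frozen credentials never refresh, so a long-running job would have to rebuild the filesystem once the STS session expires.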
Gathering a bit more AWS info by running
```
from pyarrow.parquet import ParquetDataset
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)
path = "s3://somebucket/jfo_test.parquet"
ParquetDataset(path, use_legacy_dataset=False)
```
I see in the logs:
```
[INFO] 2023-12-19 14:46:51.366 ProcessCredentialsProvider [140552489178752] Failed to find credential process's profile: default
```
Unfortunately, the credentials used in the HeadObject call are the instance security credentials retrieved by this earlier call:
```
[DEBUG] 2023-12-19 14:46:51.361 EC2MetadataClient [140552489178752] Calling EC2MetadataService resource http://169.254.169.254/latest/meta-data/iam/security-credentials/[my-instance-profile] with token.
```
and the instance profile isn't allowed to access the bucket directly...
How can we make pyarrow._s3fs.S3FileSystem compatible with an assume-role config?
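For now, the closest thing to a workaround I can sketch (untested here; `role_arn` and `session_name` are documented `pyarrow.fs.S3FileSystem` keywords, and the region is a placeholder) is to repeat the profile's settings in code so the filesystem performs the AssumeRole call itself:
```
from pyarrow.fs import S3FileSystem
from pyarrow.parquet import ParquetDataset

# Duplicate the ~/.aws/config settings explicitly: pyarrow's native S3
# filesystem uses the instance-profile credentials to assume the role.
fs = S3FileSystem(role_arn="arn:aws:iam::123456768910:role/super-role",
                  session_name="me",
                  region="eu-west-1")  # placeholder: the profile's region
ParquetDataset("somebucket/jfo_test.parquet", filesystem=fs,
               use_legacy_dataset=False)
```
But that duplicates the config file in code, so native support for resolving `role_arn` / `credential_source` from the profile would still be the real fix.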
I am using pyarrow 14.0.1 with Python 3.9.
### Component(s)
Python