Eric Kim created ARROW-14325:
--------------------------------
Summary: S3FileSystem enable automatic temporary credential
refreshing for AWS Instance Profile
Key: ARROW-14325
URL: https://issues.apache.org/jira/browse/ARROW-14325
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 5.0.0
Reporter: Eric Kim
*Context*: I am running pyarrow==5.0.0 on an AWS EC2 instance that is set up
with an IAM role. AWS S3 credentials are provided via Instance Profiles, where
my python application code (eg pyarrow) receives *temporary* S3 credentials
(with a limited lifetime, eg 4 hours).
For more info on this credential setup, see:
[https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html#roles-usingrole-ec2instance-roles]
*Problem*: I am running a long-running pyarrow script on my EC2 instance (eg
one that exceeds 4 hours in duration) that is streaming data from S3 the entire
time. After ~4 hours, the script fails with a token expiration error:
{code:java}
...
  File "pyarrow/_dataset.pyx", line 3042, in _iterator
  File "pyarrow/_dataset.pyx", line 2813, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Could not open Parquet input source 'some-bucket/some/path/to/some_file.parquet': AWS Error [code 100]: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
{code}
Digging into the source code, I suspect that pyarrow's S3FileSystem is
currently doing the following:
{code:java}
# Highly simplified pseudocode (mixes Python and C++ names; not the real implementation)
class S3FileSystem:
    def __init__(self):
        self.credentials_provider = Aws::Auth::DefaultAWSCredentialsProviderChain()
        ...

# in pyarrow.dataset code
def create_dataset(s3_path: str, s3fs: S3FileSystem) -> pyarrow.dataset.Dataset:
    # Creates TEMPORARY credentials that will expire in ~4 hours.
    # Notably, pyarrow never tries to REFRESH these temp creds, which means
    # that this returned Dataset will start failing after cred expiration, eg
    # after ~4 hours
    aws_session_token, aws_secret_access_key, aws_access_key_id = \
        s3fs.credentials_provider.get_credentials()
    return create_dataset_from_s3(s3_path, s3fs, aws_session_token,
                                  aws_secret_access_key, aws_access_key_id)
{code}
*Feature request*: it'd be really great if pyarrow.fs.S3FileSystem could
auto-refresh temporary credentials.
It's worth noting that with "typical" usage of the AWS S3 SDK/tools, the S3
temporary credentials are refreshed transparently and continually when using IAM
roles + Instance Profiles. When the temp credentials expire, the AWS S3 SDK
should auto-regenerate them from the IMDS, eg see:
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials]
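To make the expected behavior concrete, here is a minimal Python sketch of the "refresh shortly before expiry" pattern described above. It is not how the AWS SDK or pyarrow implements it; the names TempCredentials, RefreshingCredentials, fetch, margin and clock are all hypothetical, and fetch stands in for whatever call would obtain new temporary credentials (eg from the IMDS).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class TempCredentials:
    """Hypothetical container for one set of temporary credentials."""
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: datetime  # timezone-aware expiry time


class RefreshingCredentials:
    """Caches temporary credentials and re-fetches them shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], TempCredentials],
        margin: timedelta = timedelta(minutes=5),
        clock: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
    ):
        self._fetch = fetch    # assumption: returns fresh credentials on each call
        self._margin = margin  # refresh this long before the stated expiry
        self._clock = clock    # injectable so the logic can be tested
        self._cached: Optional[TempCredentials] = None

    def get(self) -> TempCredentials:
        cached = self._cached
        # Fetch on first use, or when the cached credentials are about to expire.
        if cached is None or self._clock() >= cached.expiration - self._margin:
            cached = self._cached = self._fetch()
        return cached
```

With something like this, callers would ask for credentials on every request via get() instead of capturing them once, which is the behavior this issue is asking S3FileSystem to have for Instance Profile credentials.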
Additional notes:
I did some initial digging into how S3FileSystem uses the AWS SDK credential
providers, and I'm 99% sure that the current default credential provider does
NOT support auto credential refreshing:
# pyarrow by default will use this DefaultAWSCredentialsProviderChain, which will (in my case) fall back to EC2ContainerCredentialsProviderWrapper:
https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L208
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.java#L46
# which uses this InstanceProfileCredentialsProvider:
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceProfileCredentialsProvider.java#L34
# I THINK that this does NOT implement temp cred refreshing, which could explain why my job died after a few hours:
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceMetadataServiceCredentialsFetcher.java#L70
# On the other hand, pyarrow's arn_role code path follows a different chain: it uses the StsAssumeRoleCredentialsProvider, notably passes a `load_frequency` arg, and does seem to have temp cred refresh enabled:
https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L227
https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L43
Finally: this PR did add support for automatic temporary credential refreshing,
but this is ONLY for the "arn_role" (assume ARN IAM role) code path:
[https://github.com/apache/arrow/pull/7803]
Sadly, for my use case I can't use the "arn_role" code path since my EC2
instance has already assumed the required IAM role, and AWS does not play
nicely with assuming the same role you already have.
I'm not aware of any workarounds, other than possibly "hot swapping" out the
S3FileSystem credential provider instance with a "fresh" one when my user code
detects that the temporary credentials have expired. Not sure if that's even
possible though.
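For what it's worth, the "hot swapping" idea could look roughly like the Python sketch below. This is untested against real S3 and rests on assumptions: that constructing a new filesystem object actually picks up fresh credentials, and that the failure always surfaces as an OSError whose message contains "ExpiredToken" (as in the traceback above). HotSwappingFileSystem, make_fs and is_expired_token_error are hypothetical names, not pyarrow APIs.

```python
from typing import Callable, Generic, TypeVar

FS = TypeVar("FS")
T = TypeVar("T")


def is_expired_token_error(exc: BaseException) -> bool:
    # Heuristic only: matches the error text seen in the traceback above.
    return isinstance(exc, OSError) and "ExpiredToken" in str(exc)


class HotSwappingFileSystem(Generic[FS]):
    """Rebuilds the wrapped filesystem when a call fails with an expired token."""

    def __init__(self, make_fs: Callable[[], FS]):
        self._make_fs = make_fs  # hypothetical factory, eg lambda: S3FileSystem()
        self._fs = make_fs()

    def call(self, operation: Callable[[FS], T], max_rebuilds: int = 1) -> T:
        for attempt in range(max_rebuilds + 1):
            try:
                return operation(self._fs)
            except OSError as exc:
                # Re-raise unrelated errors, or give up once rebuilds are used up.
                if not is_expired_token_error(exc) or attempt == max_rebuilds:
                    raise
                # Swap in a fresh filesystem and retry the whole operation.
                self._fs = self._make_fs()
        raise AssertionError("unreachable")
```

In user code, make_fs would presumably be something like a lambda returning pyarrow.fs.S3FileSystem(), and operation a short, idempotent read against the filesystem it is given. A dataset scan that is already in progress holds on to the old filesystem, so it would likely have to be restarted from the failed file rather than resumed.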