Eric Kim created ARROW-14325:
--------------------------------

             Summary: S3FileSystem: enable automatic temporary credential refreshing for AWS Instance Profiles
                 Key: ARROW-14325
                 URL: https://issues.apache.org/jira/browse/ARROW-14325
             Project: Apache Arrow
          Issue Type: Improvement
    Affects Versions: 5.0.0
            Reporter: Eric Kim


*Context*: I am running pyarrow==5.0.0 on an AWS EC2 instance that is set up 
with an IAM role. AWS S3 credentials are provided via Instance Profiles, where 
my Python application code (e.g. pyarrow) receives *temporary* S3 credentials 
with a limited lifetime (e.g. 4 hours).

For more info on this credential setup, see: 
[https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html#roles-usingrole-ec2instance-roles]

 

*Problem*: I am running a long-running pyarrow script on my EC2 instance (e.g. 
one that exceeds 4 hours in duration) that is streaming data from S3 the entire 
time. After ~4 hours, the script fails with a token expiration error:

 
{code:python}
...
  File "pyarrow/_dataset.pyx", line 3042, in _iterator
  File "pyarrow/_dataset.pyx", line 2813, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Could not open Parquet input source 'some-bucket/some/path/to/some_file.parquet': AWS Error [code 100]: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
{code}
 

Digging into the source code, I suspect that pyarrow's S3FileSystem currently 
does the following:

 
{code:python}
# Highly simplified pseudocode
class S3FileSystem:
    def __init__(self):
        # In the C++ sources this is roughly:
        #   std::make_shared<Aws::Auth::DefaultAWSCredentialsProviderChain>()
        self.credentials_provider = DefaultAWSCredentialsProviderChain()
    ...

# in pyarrow.dataset code
def create_dataset(s3_path: str, s3fs: S3FileSystem) -> pyarrow.dataset.Dataset:
    # Fetches TEMPORARY credentials that will expire in ~4 hours.
    # Notably, pyarrow never tries to REFRESH these temp creds, which means
    # that the returned Dataset will start failing after cred expiration,
    # e.g. after ~4 hours.
    aws_session_token, aws_secret_access_key, aws_access_key_id = \
        s3fs.credentials_provider.get_credentials()
    return create_dataset_from_s3(s3_path, s3fs, aws_session_token,
                                  aws_secret_access_key, aws_access_key_id)
{code}
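For contrast, the behavior I'm asking for would re-fetch credentials from the instance metadata service whenever the cached ones are close to expiry. A minimal sketch of that idea (the `fetch_from_imds` callable, the field names, and the 5-minute refresh window are my own stand-ins, not actual pyarrow or AWS SDK API):

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TempCredentials:
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: float  # unix timestamp when these creds stop working


class RefreshingCredentialsProvider:
    """Re-fetches temporary credentials shortly before they expire."""

    # Refresh once fewer than 5 minutes of lifetime remain (assumed window).
    REFRESH_WINDOW_SECONDS = 300

    def __init__(self, fetch_from_imds: Callable[[], TempCredentials],
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch_from_imds  # hypothetical IMDS fetcher
        self._clock = clock
        self._cached: Optional[TempCredentials] = None

    def get_credentials(self) -> TempCredentials:
        now = self._clock()
        if (self._cached is None
                or self._cached.expiration - now < self.REFRESH_WINDOW_SECONDS):
            # Transparently refresh instead of handing out expired tokens.
            self._cached = self._fetch()
        return self._cached
```

With a provider along these lines, a long-running Dataset scan would pick up fresh tokens instead of eventually failing with ExpiredToken.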
 

*Feature request*: it'd be really great if pyarrow.fs.S3FileSystem could 
auto-refresh temporary credentials.

It's worth noting that with "typical" usage of the AWS S3 SDKs/tools, temporary 
S3 credentials are transparently refreshed when using IAM roles + Instance 
Profiles: when the temporary credentials expire, the AWS SDK auto-regenerates 
them from the IMDS. For example, see:

[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials]

 

Additional notes:

I did some initial digging into how S3FileSystem uses the AWS SDK credential 
providers, and I'm 99% sure that the current default credential provider does 
NOT support auto credential refreshing:

# pyarrow by default will use this DefaultAWSCredentialsProviderChain, which 
will (in my case) fall back to EC2ContainerCredentialsProviderWrapper
https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L208
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.java#L46

# which uses this InstanceProfileCredentialsProvider
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceProfileCredentialsProvider.java#L34

# I THINK that this does NOT implement temp cred refreshing, which could 
explain why my job died after a few hours:
https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceMetadataServiceCredentialsFetcher.java#L70

# on the other hand, pyarrow's "role_arn" path follows a different chain: it 
uses the StsAssumeRoleCredentialsProvider, notably passes a `load_frequency` 
arg, and does seem to have temp cred refreshing enabled:
https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L227
https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L43
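As a rough illustration of what that `load_frequency`-based path appears to do, here is a provider that simply re-runs the credential fetch on a fixed period. The `assume_role` callable and the dict shape are my own stand-ins for the STS AssumeRole call, not the actual SDK API:

```python
from typing import Callable, Optional


class PeriodicallyRefreshingProvider:
    """Re-runs the credential fetch every `load_frequency` seconds,
    roughly how the STS assume-role provider appears to behave."""

    def __init__(self, assume_role: Callable[[], dict],
                 load_frequency: float,
                 clock: Callable[[], float]):
        self._assume_role = assume_role  # stand-in for an STS AssumeRole call
        self._load_frequency = load_frequency
        self._clock = clock
        self._creds: Optional[dict] = None
        self._loaded_at = float("-inf")

    def get_credentials(self) -> dict:
        now = self._clock()
        if now - self._loaded_at >= self._load_frequency:
            # Refresh on schedule, regardless of whether the old creds
            # have actually expired yet.
            self._creds = self._assume_role()
            self._loaded_at = now
        return self._creds
```

The key difference from the Instance Profile path above is that this refresh happens unconditionally on a timer, so credentials never sit around long enough to expire.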

 

Finally: this PR did add support for automatic temporary credential refreshing, 
but ONLY for the "role_arn" (assume IAM role by ARN) code path: 
[https://github.com/apache/arrow/pull/7803]

Sadly, for my use case I can't use the "role_arn" code path, since my EC2 
instance has already assumed the required IAM role, and AWS does not play 
nicely with assuming the same role you already have.

 

I'm not aware of any workarounds, other than possibly "hot swapping" out the 
S3FileSystem credential provider instance with a "fresh" one when my user code 
detects that the temporary credentials have expired. Not sure if that's even 
possible though.
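For what it's worth, the hot-swap idea could be approximated in user code with a retry wrapper that rebuilds everything (fresh S3FileSystem, hence fresh temporary credentials) when an ExpiredToken error surfaces. A rough sketch, where `make_stream` is a hypothetical user-supplied factory that constructs a new filesystem + record-batch iterator from scratch:

```python
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def with_credential_retry(make_stream: Callable[[], Iterator[T]],
                          max_restarts: int = 3) -> Iterator[T]:
    """Consume an S3-backed stream, rebuilding it from scratch (and thereby
    picking up fresh temporary credentials) on ExpiredToken errors.

    Caveat: this naive version restarts from the beginning, so real user
    code would need to track progress and skip already-processed batches.
    """
    restarts = 0
    while True:
        try:
            for item in make_stream():  # fresh filesystem/creds each call
                yield item
            return  # stream finished cleanly
        except OSError as exc:
            if "ExpiredToken" not in str(exc) or restarts >= max_restarts:
                raise
            restarts += 1  # rebuild and try again
```

This is obviously a workaround, not a fix: the restart-from-scratch semantics make it awkward for exactly the long streaming jobs that hit this bug in the first place.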

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
