[
https://issues.apache.org/jira/browse/ARROW-14325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-14325:
------------------------------------
Summary: [C++] S3FileSystem enable automatic temporary credential
refreshing for AWS Instance Profile (was: S3FileSystem enable automatic
temporary credential refreshing for AWS Instance Profile)
> [C++] S3FileSystem enable automatic temporary credential refreshing for AWS
> Instance Profile
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-14325
> URL: https://issues.apache.org/jira/browse/ARROW-14325
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 5.0.0
> Reporter: Eric Kim
> Priority: Major
> Labels: S3, S3FileSystem
>
> *Context*: I am running pyarrow==5.0.0 on an AWS EC2 instance that is set up
> with an IAM role. AWS S3 credentials are provided via Instance Profiles,
> where my Python application code (e.g. pyarrow) receives *temporary* S3
> credentials with a limited lifetime (e.g. 4 hours).
> For more info on this credential setup, see:
> [https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html#roles-usingrole-ec2instance-roles]
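For reference, the temporary credentials described above are served by the instance metadata service as a JSON document with an explicit Expiration timestamp, so a refreshing provider only has to re-fetch before that point. A minimal sketch of reading that expiry (field names are from the AWS IMDS docs; the values below are made up, no network call is made):

```python
import json
from datetime import datetime, timezone

# Shape of the document returned by the instance metadata service at
# http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>
# (all values here are fabricated for illustration).
imds_response = """{
    "Code": "Success",
    "LastUpdated": "2021-10-14T12:00:00Z",
    "Type": "AWS-HMAC",
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "secret",
    "Token": "session-token",
    "Expiration": "2021-10-14T18:00:00Z"
}"""

def seconds_until_expiry(imds_json: str, now: datetime) -> float:
    """Return how many seconds the temporary credentials remain valid."""
    creds = json.loads(imds_json)
    expiration = datetime.strptime(
        creds["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (expiration - now).total_seconds()

now = datetime(2021, 10, 14, 14, 0, 0, tzinfo=timezone.utc)
print(seconds_until_expiry(imds_response, now))  # 4 hours left -> 14400.0
```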
>
> *Problem*: I am running a long-running pyarrow script on my EC2 instance
> (e.g. one that exceeds 4 hours in duration) that streams data from S3 the
> entire time. After ~4 hours, the script fails with a token-expiration error:
>
> {code:java}
> ...
>   File "pyarrow/_dataset.pyx", line 3042, in _iterator
>   File "pyarrow/_dataset.pyx", line 2813, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Could not open Parquet input source 'some-bucket/some/path/to/some_file.parquet': AWS Error [code 100]: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
> {code}
>
> Digging into the source code, I suspect that pyarrow's S3FileSystem is
> currently doing the following:
>
> {code:java}
> # Highly simplified
> class S3FileSystem:
>     def __init__(self):
>         credentials_provider = Aws::Auth::DefaultAWSCredentialsProviderChain()
>         ...
>
> # in pyarrow.dataset code
> def create_dataset(s3_path: str, s3fs: S3FileSystem) -> pyarrow.dataset.Dataset:
>     # Creates TEMPORARY credentials that will expire in ~4 hours.
>     # Notably, pyarrow never tries to REFRESH these temp creds, which means
>     # that this returned Dataset will start failing after cred expiration,
>     # e.g. after ~4 hours.
>     aws_session_token, aws_secret_access_key, aws_access_key_id = \
>         s3fs.credentials_provider.get_credentials()
>     return create_dataset_from_s3(s3_path, s3fs, aws_session_token,
>                                   aws_secret_access_key, aws_access_key_id)
> {code}
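For illustration, the refresh behavior that the default chain appears to lack can be sketched in a few lines. Everything here is hypothetical: the class name and the injected `fetch` callable stand in for the real IMDS request, so the sketch can be exercised without AWS.

```python
import time

class RefreshingCredentials:
    """Sketch of a refreshing credential provider. `fetch` is a hypothetical
    callable standing in for the IMDS request; it returns a tuple of
    (access_key_id, secret_key, session_token, expiry_epoch_seconds)."""

    def __init__(self, fetch, grace_seconds=300, clock=time.time):
        self._fetch = fetch
        self._grace = grace_seconds   # re-fetch this long before expiry
        self._clock = clock
        self._creds = None
        self._expiry = 0.0

    def get_credentials(self):
        # Re-fetch when uninitialized or within the grace window of expiry;
        # otherwise hand back the cached temporary credentials.
        if self._creds is None or self._clock() >= self._expiry - self._grace:
            access_key, secret, token, self._expiry = self._fetch()
            self._creds = (access_key, secret, token)
        return self._creds

# Exercising it with a fake fetcher and a fake clock:
now = [0.0]
fetches = []

def fake_fetch():
    fetches.append(now[0])
    # New creds valid for 4 hours from "now".
    return ("AKID", "secret", "token-%d" % len(fetches), now[0] + 4 * 3600)

provider = RefreshingCredentials(fake_fetch, clock=lambda: now[0])
first = provider.get_credentials()      # initial fetch
now[0] = 2 * 3600
cached = provider.get_credentials()     # still valid: cached, no re-fetch
now[0] = 4 * 3600 - 60
refreshed = provider.get_credentials()  # inside grace window: re-fetched
```

The key design point is that the provider owns the expiry bookkeeping, so callers like `create_dataset` above could keep calling `get_credentials()` indefinitely.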
>
> *Feature request*: it would be really great if pyarrow.fs.S3FileSystem could
> auto-refresh temporary credentials.
> It's worth noting that with "typical" usage of the AWS S3 SDK/tools, the
> temporary S3 credentials are constantly and transparently refreshed when
> using IAM roles + Instance Profiles (the AWS S3 SDK should, when the temp
> credentials expire, auto-regenerate the credentials from the IMDS), e.g. see:
> [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials]
>
> Additional notes:
> I did some initial digging into how S3FileSystem uses the AWS SDK credential
> providers, and I'm 99% sure that the current default credential provider does
> NOT support automatic credential refreshing:
> # pyarrow by default uses this DefaultAWSCredentialsProviderChain, which will (in my case) fall back to EC2ContainerCredentialsProviderWrapper:
> https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L208
> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
> https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.java#L46
> # which uses this InstanceProfileCredentialsProvider:
> https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceProfileCredentialsProvider.java#L34
> # I THINK that this does NOT implement temp cred refreshing, which could explain why my job died after a few hours:
> https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceMetadataServiceCredentialsFetcher.java#L70
> # on the other hand, pyarrow's arn_role follows a different chain, using the StsAssumeRoleCredentialsProvider, notably passes a `load_frequency` arg, and does seem to have temp cred refresh enabled:
> https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L227
> https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L43
>
> Finally: this PR did add support for automatic temporary credential
> refreshing, but ONLY for the "arn_role" (assume ARN IAM role) code path:
> [https://github.com/apache/arrow/pull/7803]
> Sadly, for my use case I can't use the "arn_role" code path, since my EC2
> instance has already assumed the required IAM role, and AWS does not play
> nicely with assuming the same role you already have.
>
> I'm not aware of any workarounds, other than possibly "hot swapping" the
> S3FileSystem credential provider instance with a "fresh" one when my user
> code detects that the temporary credentials have expired. Not sure if that's
> even possible, though.
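That hot-swap idea can at least be prototyped at the application level: catch the expiration error, rebuild the filesystem (which re-runs the default credential chain and picks up fresh temporary creds), and retry. All names below are hypothetical; in practice `make_fs` would be something like `lambda: pyarrow.fs.S3FileSystem(region=...)`, and here it is exercised with fakes:

```python
def with_refreshed_fs(make_fs, operation, is_expired_error):
    """Run operation(fs); on a credential-expiration error, rebuild the
    filesystem and retry once. Both callables are supplied by the caller."""
    fs = make_fs()
    try:
        return operation(fs)
    except OSError as exc:
        if not is_expired_error(exc):
            raise
        return operation(make_fs())  # fresh filesystem, fresh credentials

# Simulating the failure mode without AWS:
built = []

def make_fake_fs():
    built.append(object())
    return len(built)  # "filesystem" #1, #2, ...

def read_table(fs):
    # The first filesystem's creds have "expired"; the rebuilt one works.
    if fs == 1:
        raise OSError("AWS Error [code 100]: ... ExpiredToken ...")
    return "table-data"

result = with_refreshed_fs(
    make_fake_fs, read_table, lambda e: "ExpiredToken" in str(e))
```

This only helps operations that can be cheaply restarted; a long-lived streaming Dataset whose scan is already in flight would still need the refresh to happen inside the credential provider itself.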
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)