[ 
https://issues.apache.org/jira/browse/ARROW-14325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14325:
------------------------------------
    Summary: [C++] S3FileSystem enable automatic temporary credential 
refreshing for AWS Instance Profile  (was: S3FileSystem enable automatic 
temporary credential refreshing for AWS Instance Profile)

> [C++] S3FileSystem enable automatic temporary credential refreshing for AWS 
> Instance Profile
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14325
>                 URL: https://issues.apache.org/jira/browse/ARROW-14325
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 5.0.0
>            Reporter: Eric Kim
>            Priority: Major
>              Labels: S3, S3FileSystem
>
> *Context*: I am running pyarrow==5.0.0 on an AWS EC2 instance that is set up 
> with an IAM role. AWS S3 credentials are provided via an instance profile, 
> so my Python application code (e.g. pyarrow) receives *temporary* S3 
> credentials with a limited lifetime (e.g. 4 hours).
> For more info on this credential setup, see: 
> [https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html#roles-usingrole-ec2instance-roles]
>  
> *Problem*: I am running a long-running pyarrow script on my EC2 instance 
> (e.g. one that exceeds 4 hours in duration) that streams data from S3 the 
> entire time. After ~4 hours, the script fails with a token expiration error:
>  
> {code:java}
> ...
>   File "pyarrow/_dataset.pyx", line 3042, in _iterator
>   File "pyarrow/_dataset.pyx", line 2813, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Could not open Parquet input source 'some-bucket/some/path/to/some_file.parquet': AWS Error [code 100]: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.
> {code}
>  
> Digging into the source code, I suspect that pyarrow's S3FileSystem is 
> currently doing the following:
>  
> {code:java}
> # Highly simplified pseudocode of the suspected behavior
> class S3FileSystem:
>     def __init__(self):
>         # In the C++ code: Aws::Auth::DefaultAWSCredentialsProviderChain
>         self.credentials_provider = DefaultAWSCredentialsProviderChain()
>     ...
> # in pyarrow.dataset code
> def create_dataset(s3_path: str, s3fs: S3FileSystem) -> pyarrow.dataset.Dataset:
>     # Creates TEMPORARY credentials that will expire in ~4 hours.
>     # Notably, pyarrow never tries to REFRESH these temp creds, which means
>     # that the returned Dataset will start failing after cred expiration,
>     # e.g. after ~4 hours.
>     aws_session_token, aws_secret_access_key, aws_access_key_id = (
>         s3fs.credentials_provider.get_credentials()
>     )
>     return create_dataset_from_s3(
>         s3_path, s3fs, aws_session_token, aws_secret_access_key, aws_access_key_id
>     )
> {code}
>  
> *Feature request*: it'd be really great if pyarrow.fs.S3FileSystem could 
> auto-refresh temporary credentials.
> It's worth noting that with "typical" usage of the AWS SDK/tools, the S3 
> temporary credentials are transparently refreshed when using IAM 
> roles + instance profiles: when the temporary credentials expire, the AWS 
> SDK should automatically regenerate them from the IMDS, e.g. see:
> [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials]
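
The refresh behavior described above can be sketched roughly as follows. This is only a minimal illustration, not how the AWS SDK or Arrow implements it: `TemporaryCredentials`, `RefreshingCredentialsProvider`, the `fetch` callable (standing in for a call to the IMDS) and the 5-minute `refresh_margin` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class TemporaryCredentials:
    """Hypothetical container for one set of temporary credentials."""
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: datetime  # timezone-aware, UTC


class RefreshingCredentialsProvider:
    """Hypothetical provider: caches credentials and re-fetches them
    shortly before they expire instead of holding them forever."""

    def __init__(
        self,
        fetch: Callable[[], TemporaryCredentials],  # assumed stand-in for an IMDS call
        refresh_margin: timedelta = timedelta(minutes=5),
        now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._now = now
        self._cached: Optional[TemporaryCredentials] = None

    def get_credentials(self) -> TemporaryCredentials:
        # Re-fetch on first use, or once we are within the margin of expiry.
        if (
            self._cached is None
            or self._now() >= self._cached.expiration - self._margin
        ):
            self._cached = self._fetch()
        return self._cached
```

The point of the sketch is that callers ask the provider for credentials on every request rather than copying them once, so an expired set is replaced without the caller noticing.
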
>  
> Additional notes:
> I did some initial digging into how S3FileSystem uses the AWS SDK credential 
> providers, and I'm 99% sure that the current default credential provider does 
> NOT support auto credential refreshing:
> # pyarrow by default will use this DefaultAWSCredentialsProviderChain, which 
> will (in my case) fall back to EC2ContainerCredentialsProviderWrapper
> https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L208
> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java
> https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.java#L46
> # which uses this InstanceProfileCredentialsProvider
> https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceProfileCredentialsProvider.java#L34
> # I THINK that this does NOT implement temp cred refreshing, which could 
> explain why my job died after a few hours:
> https://github.com/aws/aws-sdk-java/blob/f275e02d99543886ec584f4978b01bdc1d149906/aws-java-sdk-core/src/main/java/com/amazonaws/auth/InstanceMetadataServiceCredentialsFetcher.java#L70
> # on the other hand, pyarrow's arn_role code path follows a different chain: 
> it uses the StsAssumeRoleCredentialsProvider, notably passes a 
> `load_frequency` arg, and does seem to have temp cred refresh enabled
> https://github.com/apache/arrow/blob/5c6f05f2bcc779a9ba82ba6920acfb7fd1ab6cd9/cpp/src/arrow/filesystem/s3fs.cc#L227
> https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L43
>  
> Finally: this PR did add support for automatic temporary credential 
> refreshing, but this is ONLY for the "arn_role" (assume ARN IAM role) code 
> path: [https://github.com/apache/arrow/pull/7803]
> Sadly, for my use case I can't use the "arn_role" code path since my EC2 
> instance has already assumed the required IAM role, and AWS does not play 
> nicely with assuming the same role you already have.
>  
> I'm not aware of any workarounds, other than possibly "hot swapping" out the 
> S3FileSystem credential provider instance with a "fresh" one when my user 
> code detects that the temporary credentials have expired. Not sure if that's 
> even possible though.
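
One way the "hot swapping" idea might look in user code is sketched below. It is untested against real S3 and rests on two assumptions: that constructing a fresh `pyarrow.fs.S3FileSystem` picks up fresh instance-profile credentials, and that expiry always surfaces as an `OSError` whose message contains `ExpiredToken` (as in the traceback above). `ReconnectingFileSystem`, `is_expired_token_error` and `make_fs` are hypothetical names, not pyarrow APIs.

```python
from typing import Callable, Generic, TypeVar

FS = TypeVar("FS")
T = TypeVar("T")


def is_expired_token_error(exc: OSError) -> bool:
    # Assumption: expiry is reported with "ExpiredToken" in the message,
    # matching the error text quoted in this issue.
    return "ExpiredToken" in str(exc)


class ReconnectingFileSystem(Generic[FS]):
    """Hypothetical wrapper: rebuilds the filesystem object once and
    retries when an operation fails with an expired-token error."""

    def __init__(self, make_fs: Callable[[], FS]) -> None:
        self._make_fs = make_fs
        self._fs = make_fs()

    def call(self, operation: Callable[[FS], T]) -> T:
        try:
            return operation(self._fs)
        except OSError as exc:
            if not is_expired_token_error(exc):
                raise
            # Swap in a fresh filesystem and retry exactly once.
            self._fs = self._make_fs()
            return operation(self._fs)


# Possible usage with pyarrow (not executed here):
#   import pyarrow.fs
#   s3 = ReconnectingFileSystem(lambda: pyarrow.fs.S3FileSystem())
#   f = s3.call(lambda fs: fs.open_input_file("some-bucket/some/path/to/some_file.parquet"))
```

Note that a Dataset or iterator created earlier would still hold the old filesystem object, so this only helps where each read goes through the wrapper.
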
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
