[jira] [Commented] (HADOOP-14556) S3A to support Delegation Tokens

Steve Loughran (JIRA) Wed, 15 Aug 2018 17:23:45 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-14556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581777#comment-16581777
 ]


Steve Loughran commented on HADOOP-14556:
-----------------------------------------

h2. Review of the Oauth S3A delegation token design.

h3. Token representation

{{org.apache.hadoop.fs.s3a.S3SessionToken}} is the token. The token identifier 
{{S3SessionToken.Identifier}} uses a URI to represent the bucket as

{code}
s3a://accesskey@bucket#sessionId
{code}

The session secret is marshalled in the base token, in its password field.
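
To make that layout concrete, here is a minimal sketch of packing and unpacking 
an identifier URI of this shape with plain {{java.net.URI}}. The class and 
method names ({{SessionIdentifierSketch}}, {{toUri}}, {{parse}}) are invented 
for illustration and are not the patch's {{S3SessionToken.Identifier}}; the 
key and session values are dummies.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative only: mirrors the s3a://accesskey@bucket#sessionId layout. */
final class SessionIdentifierSketch {
  final String accessKey;
  final String bucket;
  final String sessionId;

  SessionIdentifierSketch(String accessKey, String bucket, String sessionId) {
    this.accessKey = accessKey;
    this.bucket = bucket;
    this.sessionId = sessionId;
  }

  /** Pack the three fields into a URI; the session secret is not part of it. */
  URI toUri() {
    try {
      // userInfo = access key, host = bucket, fragment = session ID
      return new URI("s3a", accessKey, bucket, -1, null, null, sessionId);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Cannot build identifier URI", e);
    }
  }

  /** Unpack a URI that follows the same layout. */
  static SessionIdentifierSketch parse(URI uri) {
    return new SessionIdentifierSketch(
        uri.getUserInfo(), uri.getHost(), uri.getFragment());
  }

  public static void main(String[] args) {
    URI uri = new SessionIdentifierSketch("ACCESSKEY", "bucket", "session1").toUri();
    System.out.println(uri);
    SessionIdentifierSketch id = parse(uri);
    System.out.println(id.accessKey + " / " + id.bucket + " / " + id.sessionId);
  }
}
```

One caveat with this approach: {{URI.getHost()}} returns null for bucket names 
that are not valid hostnames (underscores, for example), so a real 
implementation would need to handle that case.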

h3. Credential retrieval: {{S3SessionToken.CredentialsProvider}}

This class implements {{AWSSessionCredentialsProvider}} and can be directly 
used for authenticating AWS Calls.

It gets the token for the canonical service of the FS URI passed in its 
constructor, and if it is the right kind, extracts the (access key, session 
secret, session ID) values. These are used to build the 
{{BasicSessionCredentials}} which is returned from 
{{AWSSessionCredentialsProvider.getCredentials}}.

It implements the {{refresh()}} method to re-read the credentials: if (somehow) 
that UGI token were updated, it would update the credentials.

h3. Canonical service URI, {{S3AUtils.getCanonicalServiceURI}}

This is used to create a URI for mapping tokens:

{code}
public static URI getCanonicalServiceURI(URI uri) {
  String sessionKey = uri.getUserInfo();
  if (sessionKey != null) {
    sessionKey = sessionKey.split(":")[0];
  }
  if (sessionKey == null || sessionKey.isEmpty()) {
    sessionKey = "default";
  }
  return URI.create("s3://" + sessionKey + "@" + uri.getHost());
}
{code}


That is: if the URI comes in with a sessionKey as its user info, that is 
included in the canonicalisation; if not, it goes to default.

So {{getCanonicalServiceURI("s3a://bucket")}} is "s3://default@bucket", while 
"s3a://sessionId@bucket" would map to "s3://sessionId@bucket".

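The mapping can be checked by hand. The sketch below copies the method quoted 
above into a standalone class ({{CanonicalServiceSketch}} is a made-up name, 
not something in the patch) and runs it over a few URIs, including a 
user:pass one.

```java
import java.net.URI;

/** Standalone copy of the canonicalisation logic, for experimentation only. */
final class CanonicalServiceSketch {

  static URI getCanonicalServiceURI(URI uri) {
    String sessionKey = uri.getUserInfo();
    if (sessionKey != null) {
      // anything after ':' in the user info (e.g. a secret) is dropped
      sessionKey = sessionKey.split(":")[0];
    }
    if (sessionKey == null || sessionKey.isEmpty()) {
      sessionKey = "default";
    }
    // path, query and fragment are discarded; the scheme becomes s3
    return URI.create("s3://" + sessionKey + "@" + uri.getHost());
  }

  public static void main(String[] args) {
    // prints s3://default@bucket
    System.out.println(getCanonicalServiceURI(URI.create("s3a://bucket/path/file")));
    // prints s3://session2@bucket
    System.out.println(getCanonicalServiceURI(URI.create("s3a://session2@bucket/")));
    // prints s3://user@bucket
    System.out.println(getCanonicalServiceURI(URI.create("s3a://user:pass@bucket/")));
  }
}
```

The last case shows that a user:pass URI is canonicalised with the user name 
treated as if it were a session key.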

# Issue: why s3 over s3a? Assume they're using the s3:// prefix to ease 
migration from EMR s3 to ASF s3a; URLs can be shared. (Maybe we should think 
about easing that, or at least test that you can do it).
# Issue: when would you put a sessionId in a URL? It'd allow you to have 
different bindings to the same bucket from the same user. This seems like a 
complication.

h2. Job-submission time binding

{{S3AFileSystem.getCanonicalServiceName()}} returns that canonical URI.

{code}
public String getCanonicalServiceName() {
  return S3AUtils.getCanonicalServiceURI(uri).toString();
}
{code}

{{S3AFileSystem.getDelegationToken()}} looks at the current S3 client (the 
{{s3}} field) and gets its credentials. These are then returned as a new 
{{S3SessionToken}}.


{code}

public Token<?> getDelegationToken(String renewer) throws IOException {
  Token<?> token = null;
  if (s3 instanceof AWSSessionCredentialsProvider) {
    AWSSessionCredentials sessionCreds =
      ((AWSSessionCredentialsProvider)s3).getCredentials();
    token = S3SessionToken.newInstance(
        getBucket(), sessionCreds, getCanonicalServiceName());
  }
  return token;
}
{code}

h2. Session-aware AWS client: {{AmazonS3ClientWithSTS}}


This is used as the S3 client by the filesystem; it is the one called in 
{{getDelegationToken()}} to get those session tokens for marshalling.

{code}
AmazonS3ClientWithSTS extends AmazonS3Client
  implements AWSSessionCredentialsProvider {
...
  public AWSSessionCredentials getCredentials() {
    // fetch session credentials if the current credentials are not
    // session credentials.
    AWSCredentials creds = awsCredentialsProvider.getCredentials();
    AWSSessionCredentials sessionCredentials;
    if (creds instanceof AWSSessionCredentials) {
      sessionCredentials = (AWSSessionCredentials)creds;
    } else {
      sessionCredentials = getSessionCredentials(lifetime);
    }
    return sessionCredentials;
  }
...
}
{code}

That is: if the first credentials returned from the provider list are session 
credentials, they are returned for propagation. If not, the current credentials 
are used to create a connection to STS and request those session credentials:

{code}
// AmazonS3ClientWithSTS
private AWSSessionCredentials getSessionCredentials(int duration) {
  AWSSecurityTokenService stsClient = new AWSSecurityTokenServiceClient(
      awsCredentialsProvider, clientConfiguration);
  GetSessionTokenRequest tokenRequest =
      new GetSessionTokenRequest().withDurationSeconds(duration);
  Credentials stsCredentials =
      stsClient.getSessionToken(tokenRequest).getCredentials();
  return new BasicSessionCredentials(
      stsCredentials.getAccessKeyId(),
      stsCredentials.getSecretAccessKey(),
      stsCredentials.getSessionToken());
}
{code}

Note also, {{S3AUtils.createAWSCredentialProviderSet}} always inserts an 
instance of {{S3SessionToken.CredentialsProvider()}} at the top of the provider 
list, so, *irrespective of what your providers listed in 
"fs.s3a.aws.credentials.provider" do*, token extraction and retrieval takes 
priority.

{code}
// in S3AUtils
public static AWSCredentialProviderList createAWSCredentialProviderSet(
    URI binding, Configuration conf) throws IOException {
  AWSCredentialProviderList credentials = new AWSCredentialProviderList();
  // add credential provider to search ugi for tokens.
  credentials.add(new S3SessionToken.CredentialsProvider(binding));

  Class<?>[] awsClasses = loadAWSProviderClasses(conf,
      AWS_CREDENTIALS_PROVIDER);
  ...
{code}

If the token for that (canonicalised) binding URI is found, those session 
credentials take priority over everything else.

Otherwise, if any other provider offers up session tokens, they are picked up 
and propagated. This includes IAM roles, as that is what 
{{InstanceProfileCredentialsProvider}} returns (it picks up session creds from 
the EC2 metadata web service, remember).
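
The ordering rule can be sketched without the AWS SDK. In the example below, 
credentials are plain strings and each provider is a 
{{Supplier<Optional<String>>}}; these types, and the names 
{{ProviderOrderSketch}}, {{buildChain}} and {{resolve}}, are stand-ins invented 
for illustration. It assumes the list hands back credentials from the first 
provider that has any.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Illustrative model of "token provider first, configured providers after". */
final class ProviderOrderSketch {

  /** Return credentials from the first provider in the list that has any. */
  static Optional<String> resolve(List<Supplier<Optional<String>>> providers) {
    for (Supplier<Optional<String>> provider : providers) {
      Optional<String> creds = provider.get();
      if (creds.isPresent()) {
        return creds;
      }
    }
    return Optional.empty();
  }

  /** Build the list in the order described: token lookup, then configuration. */
  static List<Supplier<Optional<String>>> buildChain(
      Optional<String> tokenCreds,
      List<Supplier<Optional<String>>> configured) {
    List<Supplier<Optional<String>>> chain = new ArrayList<>();
    chain.add(() -> tokenCreds);  // stand-in for S3SessionToken.CredentialsProvider
    chain.addAll(configured);     // stand-in for fs.s3a.aws.credentials.provider
    return chain;
  }

  public static void main(String[] args) {
    List<Supplier<Optional<String>>> configured = new ArrayList<>();
    configured.add(() -> Optional.of("creds-from-configured-provider"));

    // A token is present: it wins regardless of the configured providers.
    System.out.println(
        resolve(buildChain(Optional.of("creds-from-token"), configured)).get());
    // No token: the configured providers are consulted in order.
    System.out.println(
        resolve(buildChain(Optional.empty(), configured)).get());
  }
}
```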

h2. Overall analysis

(Corrections welcomed here)

The code always runs with DTs enabled, so whenever a client (job submit, 
spark-submit, etc.) asks for DTs, AWS session credentials are created on 
demand or, if the process is already running as a session, the existing set 
is propagated.

Because any session creds from the current session list are used, if you submit 
a job from an EC2 VM, the current IAM session credentials are marshalled over. 
This would let you do tricks like: submit work from an EC2 VM with more rights 
than the (shared?) execution cluster, and your greater rights would trickle 
over.

Similarly, it should work if you have a session set in the AWS environment 
variables, because the {{EnvironmentVariableCredentialsProvider}} class on the 
default credential chain automatically creates session creds if the env var 
AWS_SESSION_TOKEN is set. Which means that if you create a local session 
through env vars (maybe with MFA authentication too), that will propagate.
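
A rough way to check whether an environment would be treated as a session is 
sketched below. It is a simplification and not the SDK's code: it only looks 
at the three standard variable names and takes the environment as a map so it 
can be tested; {{EnvSessionSketch}} and {{hasSessionCredentials}} are made-up 
names.

```java
import java.util.Map;

/** Simplified check for session credentials in environment variables. */
final class EnvSessionSketch {

  /** True if the map holds an access key, a secret key and a session token. */
  static boolean hasSessionCredentials(Map<String, String> env) {
    return isSet(env, "AWS_ACCESS_KEY_ID")
        && isSet(env, "AWS_SECRET_ACCESS_KEY")
        && isSet(env, "AWS_SESSION_TOKEN");
  }

  private static boolean isSet(Map<String, String> env, String name) {
    String value = env.get(name);
    return value != null && !value.trim().isEmpty();
  }

  public static void main(String[] args) {
    // Reports on the environment of the current process.
    System.out.println("session credentials in env: "
        + hasSessionCredentials(System.getenv()));
  }
}
```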


The use of a session identifier in name canonicalization would allow you to 
bypass this in various ways, such as:

* job executors asking for a new s3a://session2@bucket/ and not having the 
tokens for an existing s3a://default@bucket/ session used.
* multiple session IDs being used, each with its own session tokens.
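
A minimal sketch of why this works: tokens are looked up by canonical service 
name, so a different session ID gives a different lookup key. Here a plain 
{{Map}} stands in for the user's token set, the canonicalisation follows the 
{{getCanonicalServiceURI}} logic quoted earlier, and {{TokenLookupSketch}} and 
{{lookup}} are invented names.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Illustrative token lookup keyed on the canonical service name. */
final class TokenLookupSketch {

  /** Same rule as the quoted getCanonicalServiceURI, returned as a string. */
  static String canonicalService(URI uri) {
    String sessionKey = uri.getUserInfo();
    if (sessionKey != null) {
      sessionKey = sessionKey.split(":")[0];
    }
    if (sessionKey == null || sessionKey.isEmpty()) {
      sessionKey = "default";
    }
    return "s3://" + sessionKey + "@" + uri.getHost();
  }

  /** Returns the token stored for this filesystem URI, or null if none. */
  static String lookup(Map<String, String> tokens, URI fsUri) {
    return tokens.get(canonicalService(fsUri));
  }

  public static void main(String[] args) {
    Map<String, String> tokens = new HashMap<>();
    tokens.put("s3://default@bucket", "token-for-default-session");

    // Found: no session ID in the URI maps to the default entry.
    System.out.println(lookup(tokens, URI.create("s3a://bucket/data")));
    // Not found (null): session2 is a different key for the same bucket.
    System.out.println(lookup(tokens, URI.create("s3a://session2@bucket/data")));
  }
}
```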

Other points

* use of a URI for marshalling aws account and session secret keys simplifies 
marshalling these in a token.
* session token lifetime is set in "fs.s3a.session.token.max-lifetime"; default 
is 36h (the max allowed)
* Although the patch predates HADOOP-15151 and the 
AssumedRoleCredentialProvider, that provider returns a set of session 
credentials too, so if getDelegationToken() is called on an FS running under an 
assumed role, the existing assumed role credentials will be included in the DT.
* With HADOOP-15583 the S3A FS is asked for its credential set to auth DDB, so 
S3Guard will work. This is not an accident; the merging of the auth chains was 
done to prepare S3Guard for DT support.
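
On the session lifetime point: STS {{GetSessionToken}} accepts durations from 
900 seconds to 129600 seconds (36h) for IAM users, so a configured lifetime 
needs to end up in that range. A sketch of that clamping follows, assuming the 
value is handled in seconds, as the {{withDurationSeconds(duration)}} call 
suggests; {{SessionLifetimeSketch}} and {{clamp}} are hypothetical names and 
not part of the patch.

```java
/** Illustrative clamp of a requested session lifetime to the STS range. */
final class SessionLifetimeSketch {

  /** Shortest duration GetSessionToken accepts: 15 minutes. */
  static final int STS_MIN_SECONDS = 900;

  /** Longest duration GetSessionToken accepts for IAM users: 36 hours. */
  static final int STS_MAX_SECONDS = 36 * 60 * 60;

  /** Force the requested lifetime into [STS_MIN_SECONDS, STS_MAX_SECONDS]. */
  static int clamp(long requestedSeconds) {
    return (int) Math.max(STS_MIN_SECONDS,
        Math.min(STS_MAX_SECONDS, requestedSeconds));
  }

  public static void main(String[] args) {
    System.out.println(clamp(60));       // below the minimum
    System.out.println(clamp(3600));     // within range, unchanged
    System.out.println(clamp(200_000));  // above the maximum
  }
}
```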

h3. Limitations

This won't work with non-AWS infrastructures, because it always generates a DT, 
talking to STS. We need to make this DT support optional.

It only supports sessions. I want some flexibility here, specifically the 
option to switch to assumed roles, ideally with automatic restriction of role 
perms to the bucket & ddb table used. That is, I want something more 
complicated and harder to get working :)

Doesn't propagate encryption options. That's not really needed, I've just done 
it in my patch to allow client-side encryption options (including SSE-C 
secrets) to move in job submissions to a shared cluster.

Not sure how it works with s3a://user:pass@bucket/ URIs, because it's reusing 
the userinfo field in canonical names. Proposed: cut that entire feature, 
because we've been telling people to stop using it since Hadoop 2.8, it's a 
security hole, and S3Guard doesn't like it either.

Also not sure how that s3a://session@bucket/ stuff interacts with other things, 
including the logic in {{S3xLoginHelper}} to strip out secrets from paths. 
Again, cutting that stuff obviates that issue.

*irrespective of the user:secret issue, I don't think I fully understand why 
there's support for multiple sessions here*

h2. Merging this with the other HADOOP-14556 patch

Some initial thoughts on actions, with a goal of getting the stuff from here 
into the patch I've been working on

# cut the user:pass support from s3a (make separate JIRA tho')
# compare token marshalling logic, see what bits I can lift.
# take {{S3AUtils.getCanonicalServiceURI}} as is.

I want to make the auth mechanism pluggable, so we can support DT auth for 
sessions, roles, in-house stuff, which means I'll need to pull out everything 
needed for this (tokens themselves, token creator, credential provider), and 
allow people to switch to them. I'd have to do this in a way which makes it 
trivial to line them all up, and to easily detect/reject mismatches. This 
session token code, especially the logic to extract session secrets, can be the 
core code for a session-DT; the role auth can follow.
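
One possible shape for that plug-in point, purely as a sketch: each mechanism 
declares a token kind, and a token is only handed to a binding of the same 
kind. Every name here ({{PluggableBindingSketch}}, {{TokenBinding}}, 
{{SessionBinding}}, {{bind}}) is hypothetical, and tokens and credentials are 
reduced to strings to keep it self-contained.

```java
/** Hypothetical plug-in model for delegation token bindings. */
final class PluggableBindingSketch {

  /** One implementation per auth mechanism (session, role, in-house...). */
  interface TokenBinding {
    /** Kind of token this binding issues and understands. */
    String kind();

    /** Turn a token payload of the right kind into credentials. */
    String credentialsFrom(String payload);
  }

  /** Toy session binding; a real one would unmarshal session secrets. */
  static final class SessionBinding implements TokenBinding {
    @Override
    public String kind() {
      return "session";
    }

    @Override
    public String credentialsFrom(String payload) {
      return "session:" + payload;
    }
  }

  /** Reject tokens whose kind does not match the configured binding. */
  static String bind(TokenBinding configured, String tokenKind, String payload) {
    if (!configured.kind().equals(tokenKind)) {
      throw new IllegalStateException("Token of kind '" + tokenKind
          + "' cannot be used with binding '" + configured.kind() + "'");
    }
    return configured.credentialsFrom(payload);
  }

  public static void main(String[] args) {
    TokenBinding binding = new SessionBinding();
    System.out.println(bind(binding, "session", "abc"));
    try {
      bind(binding, "role", "abc");
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```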



> S3A to support Delegation Tokens
> --------------------------------
>
>                 Key: HADOOP-14556
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14556
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-14556-001.patch, HADOOP-14556-002.patch, 
> HADOOP-14556-003.patch, HADOOP-14556-004.patch, HADOOP-14556.oath.patch
>
>
> S3A to support delegation tokens where
> * an authenticated client can request a token via 
> {{FileSystem.getDelegationToken()}}
> * Amazon's token service is used to request short-lived session secret & id; 
> these will be saved in the token and  marshalled with jobs
> * A new authentication provider will look for a token for the current user 
> and authenticate the user if found
> This will not support renewals; the lifespan of a token will be limited to 
> the initial duration. Also, as you can't request an STS token from a 
> temporary session, IAM instances won't be able to issue tokens.



