[ https://issues.apache.org/jira/browse/HADOOP-17198?focusedWorklogId=656706&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-656706 ]
ASF GitHub Bot logged work on HADOOP-17198:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 28/Sep/21 20:12
Start Date: 28/Sep/21 20:12
Worklog Time Spent: 10m
Work Description: steveloughran commented on a change in pull request #3260:
URL: https://github.com/apache/hadoop/pull/3260#discussion_r717567393
##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -743,11 +775,24 @@ protected void verifyBucketExists()
*/
@Retries.RetryTranslated
protected void verifyBucketExistsV2()
- throws UnknownStoreException, IOException {
+ throws UnknownStoreException, IOException {
    if (!invoker.retry("doesBucketExistV2", bucket, true,
        trackDurationOfOperation(getDurationTrackerFactory(),
            STORE_EXISTS_PROBE.getSymbol(),
-           () -> s3.doesBucketExistV2(bucket)))) {
+           () -> {
+             // Bug in SDK always returns `true` for AccessPoint ARNs with `doesBucketExistV2()`;
+             // expanding implementation to use ARNs and buckets correctly
+             try {
+               s3.getBucketAcl(bucket);
+             } catch (AmazonServiceException ex) {
+               int statusCode = ex.getStatusCode();
+               if (statusCode == 404 || (statusCode == 403 && accessPoint != null)) {
Review comment:
Starting to think we should use constants here, just to make it easier to track
down where they come from. I see we don't do that elsewhere
(e.g. S3AUtils.translateException()), but that doesn't mean we shouldn't start.
Could you add constants for HTTP_RESPONSE_404 & 403 in
InternalConstants and refer to them here? Then we could retrofit and extend elsewhere.
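A rough sketch of what that could look like, assuming the constants land in `org.apache.hadoop.fs.s3a.impl.InternalConstants` (the class shape and the `SC_403`/`SC_404` names are illustrative, not the PR's actual change):
```java
// Illustrative only: hypothetical status-code constants for InternalConstants.
public final class InternalConstants {

  /** HTTP status code 403: forbidden. */
  public static final int SC_403 = 403;

  /** HTTP status code 404: not found. */
  public static final int SC_404 = 404;

  private InternalConstants() {
  }
}
```
The probe above would then read `if (statusCode == SC_404 || (statusCode == SC_403 && accessPoint != null))`, and `S3AUtils.translateException()` could be retrofitted to the same constants later.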
##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -1167,7 +1216,10 @@ public String getBucketLocation(String bucketName)
throws IOException {
    final String region = trackDurationAndSpan(
        STORE_EXISTS_PROBE, bucketName, null, () ->
            invoker.retry("getBucketLocation()", bucketName, true, () ->
-               s3.getBucketLocation(bucketName)));
+               // If accessPoint is set, the region is already known from the ARN
+               accessPoint != null
Review comment:
Should we pull this up to L1216 so we can skip the entire overhead of
duration tracking, retry, etc.?
Currently it's overkill to wrap, but wrapping does add it to the iostats, so
maybe it's best to leave as is.
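For illustration, a sketch of the hoisted form under discussion, assuming the `accessPoint` field exposes the region parsed from its ARN (`getRegion()` is a hypothetical accessor here):
```java
// Hypothetical short-circuit: answer from the ARN before entering the
// duration-tracking/retry wrapper at all.
public String getBucketLocation(String bucketName) throws IOException {
  if (accessPoint != null) {
    // Region is already known from the Access Point ARN: no S3 call,
    // but also no STORE_EXISTS_PROBE entry in the IOStatistics.
    return accessPoint.getRegion();
  }
  return trackDurationAndSpan(
      STORE_EXISTS_PROBE, bucketName, null, () ->
          invoker.retry("getBucketLocation()", bucketName, true, () ->
              s3.getBucketLocation(bucketName)));
}
```
The trade-off named in the comment is visible here: the early return skips the wrapper overhead, but it also drops the probe from the iostats.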
##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -2570,6 +2614,11 @@ protected S3ListResult continueListObjects(S3ListRequest request,
        OBJECT_CONTINUE_LIST_REQUEST,
        () -> {
          if (useListV1) {
+           if (accessPoint != null) {
+             // AccessPoints are not compatible with V1List
+             throw new InvalidRequestException("ListV1 is not supported by AccessPoints");
Review comment:
I see you've reverted this & are letting the SDK fail it. worksforme
##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
##########
@@ -1580,6 +1580,68 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures
that if the default endpoint is changed to a new region, data stored in
US-east is still reachable.
+## <a name="accesspoints"></a>Configuring S3 AccessPoints usage with S3A
+S3A now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage, which
+improves VPC integration with S3 and simplifies your data's permission model, because different
+policies can now be applied at the Access Point level. For more information about why to use and
+how to create them, make sure to read the official documentation.
+
+Accessing data through an access point is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN property using the following per-bucket configuration property:
+```xml
+<property>
+ <name>fs.s3a.sample-bucket.accesspoint.arn</name>
+ <value> {ACCESSPOINT_ARN_HERE} </value>
+ <description>Configure S3A traffic to use this AccessPoint</description>
+</property>
+```
+
+This configures access to the `sample-bucket` bucket for S3A to go through the
+new Access Point ARN. So, for example, `s3a://sample-bucket/key` will now use your
+configured ARN when getting data from S3 instead of your bucket.
+
+You can also use an Access Point name as a path URI such as `s3a://finance-team-access/key`, by
+configuring the `.accesspoint.arn` property as a per-bucket override:
+```xml
+<property>
+ <name>fs.s3a.finance-team-access.accesspoint.arn</name>
+ <value> {ACCESSPOINT_ARN_HERE} </value>
+ <description>Configure S3A traffic to use this AccessPoint</description>
+</property>
+```
+
+The `fs.s3a.accesspoint.required` property can also be used to require all access to S3
+to go through Access Points. This has the advantage of increasing security inside a VPN / VPC,
+as you only allow access to known sources of data defined through Access Points. In case there is
+a need to access a bucket directly (without Access Points), you can use per-bucket overrides to
+disable this setting on a bucket-by-bucket basis, i.e. `fs.s3a.{YOUR-BUCKET}.accesspoint.required`.
+
+```xml
+<!-- Require access point only access -->
+<property>
+ <name>fs.s3a.accesspoint.required</name>
+ <value>true</value>
+</property>
+<!-- Disable it on a per-bucket basis if needed -->
+<property>
+ <name>fs.s3a.example-bucket.accesspoint.required</name>
+ <value>false</value>
+</property>
+```
+
+Before using Access Points make sure you're not impacted by the following:
+- `ListObjectsV1` is not supported; it is also deprecated on AWS S3 for performance reasons;
+- The endpoint for S3 requests will automatically change from `s3.amazonaws.com` to use
+`s3-accesspoint.REGION.amazonaws.{com | com.cn}` depending on the Access Point ARN. This **only**
+happens if the `fs.s3a.endpoint` property isn't set. The endpoint property overwrites any changes,
Review comment:
this still true?
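If the behaviour the docs describe still holds, the interplay would look roughly like this (the ARN, account id, and endpoint values here are made up for illustration):
```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: illustrates the precedence the docs describe, not the PR's code.
public class EndpointPrecedenceSketch {
  public static Configuration accessPointConf() {
    Configuration conf = new Configuration();
    // With no endpoint set, S3A would derive
    // s3-accesspoint.eu-west-1.amazonaws.com from the ARN's region.
    conf.set("fs.s3a.sample-bucket.accesspoint.arn",
        "arn:aws:s3:eu-west-1:123456789012:accesspoint/sample-ap");
    // An explicitly set endpoint takes precedence over the derived one,
    // which is what allows FIPS or DualStack endpoints:
    conf.set("fs.s3a.endpoint", "s3-accesspoint-fips.eu-west-1.amazonaws.com");
    return conf;
  }
}
```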
##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
##########
@@ -1576,6 +1576,81 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures
that if the default endpoint is changed to a new region, data stored in
US-east is still reachable.
+## <a name="accesspoints"></a>Configuring S3 AccessPoints usage with S3A
+S3A now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage, which
+improves VPC integration with S3 and simplifies your data's permission model, because different
+policies can now be applied at the Access Point level. For more information about why to use them,
+make sure to read the official documentation.
+
+Accessing data through an access point is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN property using the following configuration property:
+```xml
+<property>
+ <name>fs.s3a.accesspoint.arn</name>
+ <value> {ACCESSPOINT_ARN_HERE} </value>
+ <description>Configure S3A traffic to use this AccessPoint</description>
+</property>
+```
+
+Be mindful that this configures **all access** to S3A, and in turn S3, to go through that ARN.
+So, for example, `s3a://yourbucket/key` will now use your configured ARN when getting data from S3
+instead of your bucket. The flip side to this is that if you're working with multiple buckets,
+`s3a://yourbucket` and `s3a://yourotherbucket`, requests for both will go through the same
+Access Point ARN. To configure different Access Point ARNs, per-bucket overrides can be used with
+access point names instead of bucket names, as follows:
+
+- Let's assume you have an existing workflow with the paths `s3a://data-bucket` and
+`s3a://output-bucket`, and you want to work with a new Access Point called `finance-accesspoint`.
+All you would then need to add is the following per-bucket configuration change:
+```xml
+<property>
+ <name>fs.s3a.bucket.finance-accesspoint.accesspoint.arn</name>
+ <value> arn:aws:s3:eu-west-1:123456789101:accesspoint/finance-accesspoint </value>
+</property>
+```
+
+Meanwhile, keep the global `accesspoint.arn` property set to empty `" "`, which is the default.
Review comment:
Sometimes " " can be set in a conf to force an override, which getTrimmed()
will then downgrade to "". No need to worry about that in these docs.
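A minimal, runnable sketch of that `getTrimmed()` behaviour, using the real `org.apache.hadoop.conf.Configuration` API:
```java
import org.apache.hadoop.conf.Configuration;

public class TrimSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    // " " is a present, non-empty raw value, so it can force an override...
    conf.set("fs.s3a.accesspoint.arn", " ");
    System.out.println("get:        [" + conf.get("fs.s3a.accesspoint.arn") + "]");
    // ...but getTrimmed() collapses it back down to "".
    System.out.println("getTrimmed: [" + conf.getTrimmed("fs.s3a.accesspoint.arn") + "]");
  }
}
```
Running this prints `get: [ ]` but `getTrimmed: []`, which is the downgrade the comment refers to.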
##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
##########
@@ -1580,6 +1580,68 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures
that if the default endpoint is changed to a new region, data stored in
US-east is still reachable.
+## <a name="accesspoints"></a>Configuring S3 AccessPoints usage with S3A
+S3A now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage, which
+improves VPC integration with S3 and simplifies your data's permission model, because different
+policies can now be applied at the Access Point level. For more information about why to use and
+how to create them, make sure to read the official documentation.
+
+Accessing data through an access point is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN property using the following per-bucket configuration property:
+```xml
+<property>
+ <name>fs.s3a.sample-bucket.accesspoint.arn</name>
+ <value> {ACCESSPOINT_ARN_HERE} </value>
+ <description>Configure S3A traffic to use this AccessPoint</description>
+</property>
+```
+
+This configures access to the `sample-bucket` bucket for S3A to go through the
+new Access Point ARN. So, for example, `s3a://sample-bucket/key` will now use your
+configured ARN when getting data from S3 instead of your bucket.
+
+You can also use an Access Point name as a path URI such as `s3a://finance-team-access/key`, by
+configuring the `.accesspoint.arn` property as a per-bucket override:
+```xml
+<property>
+ <name>fs.s3a.finance-team-access.accesspoint.arn</name>
+ <value> {ACCESSPOINT_ARN_HERE} </value>
+ <description>Configure S3A traffic to use this AccessPoint</description>
+</property>
+```
+
+The `fs.s3a.accesspoint.required` property can also be used to require all access to S3
+to go through Access Points. This has the advantage of increasing security inside a VPN / VPC,
+as you only allow access to known sources of data defined through Access Points. In case there is
+a need to access a bucket directly (without Access Points), you can use per-bucket overrides to
+disable this setting on a bucket-by-bucket basis, i.e. `fs.s3a.{YOUR-BUCKET}.accesspoint.required`.
+
+```xml
+<!-- Require access point only access -->
+<property>
+ <name>fs.s3a.accesspoint.required</name>
+ <value>true</value>
+</property>
+<!-- Disable it on a per-bucket basis if needed -->
+<property>
+ <name>fs.s3a.example-bucket.accesspoint.required</name>
+ <value>false</value>
+</property>
+```
+
+Before using Access Points make sure you're not impacted by the following:
+- `ListObjectsV1` is not supported; it is also deprecated on AWS S3 for performance reasons;
+- The endpoint for S3 requests will automatically change from `s3.amazonaws.com` to use
+`s3-accesspoint.REGION.amazonaws.{com | com.cn}` depending on the Access Point ARN. This **only**
+happens if the `fs.s3a.endpoint` property isn't set. The endpoint property overwrites any changes;
+this is intentional so FIPS or DualStack endpoints can be set. While considering endpoints, if you
+have any custom signers that use the host endpoint property, make sure to update them if needed;
+- Access Point names don't have to be globally unique, in the way that bucket names do.
Review comment:
As we support it per-bucket only, this bullet point can be cut.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 656706)
Time Spent: 13h (was: 12h 50m)
> Support S3 Access Points
> ------------------------
>
> Key: HADOOP-17198
> URL: https://issues.apache.org/jira/browse/HADOOP-17198
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.3.0
> Reporter: Steve Loughran
> Assignee: Bogdan Stolojan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 13h
> Remaining Estimate: 0h
>
> Improve VPC integration by supporting access points for buckets
> https://docs.aws.amazon.com/AmazonS3/latest/dev/access-points.html
> Not sure how to do this *at all*;
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]