[
https://issues.apache.org/jira/browse/HADOOP-19044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812690#comment-17812690
]
ASF GitHub Bot commented on HADOOP-19044:
-----------------------------------------
ahmarsuhail commented on code in PR #6479:
URL: https://github.com/apache/hadoop/pull/6479#discussion_r1472764364
##########
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEndpointRegion.java:
##########
@@ -257,6 +283,83 @@ public void testWithVPCE() throws Throwable {
expectInterceptorException(client);
}
+ @Test
+ public void testCentralEndpointWithUSWest2Region() throws Throwable {
+ describe("Access bucket using central endpoint and us-west-2 region");
+ final Configuration conf = getConfiguration();
+ removeBaseAndBucketOverrides(conf, ENDPOINT, AWS_REGION);
+
+ final Configuration newConf = new Configuration(conf);
+
+ newConf.set(ENDPOINT, CENTRAL_ENDPOINT);
+ newConf.set(AWS_REGION, US_WEST_2);
+
+ newFS = new S3AFileSystem();
+ newFS.initialize(getFileSystem().getUri(), newConf);
+
+ assertOpsUsingNewFs();
+ }
+
+ @Test
+ public void testCentralEndpointWithEUWest2Region() throws Throwable {
Review Comment:
The value of having both `testCentralEndpointWithUSWest2Region` and
`testCentralEndpointWithEUWest2Region` only comes in when your test bucket is
in us-west-2; then, with the eu-west-2 test, you check whether cross-region
access behaves as expected. Since we can't guarantee which region users test
against, I don't think we need both tests, unless we end up using one of the
public buckets, whose region we do know.
##########
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEndpointRegion.java:
##########
@@ -146,7 +150,29 @@ public void testCentralEndpoint() throws Throwable {
describe("Create a client with the central endpoint");
Configuration conf = getConfiguration();
- S3Client client = createS3Client(conf, CENTRAL_ENDPOINT, null, US_EAST_1, false);
+ S3Client client = createS3Client(conf, CENTRAL_ENDPOINT, null, US_EAST_2, false);
+
+ expectInterceptorException(client);
+ }
+
+ @Test
+ public void testCentralEndpointWithRegion() throws Throwable {
+ describe("Create a client with the central endpoint but also specify region");
+ Configuration conf = getConfiguration();
+
+ S3Client client = createS3Client(conf, CENTRAL_ENDPOINT, US_WEST_2, US_WEST_2, false);
Review Comment:
It's not clear to me what these different test cases are doing. It looks like
they all check that if you set the endpoint to central and also configure a
region, it's always the configured region that gets set. Do we really need all
of them?
##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/aws_sdk_v2_changelog.md:
##########
@@ -93,6 +93,20 @@ the bucket. The behaviour has now been changed to:
response. Otherwise it will throw an error with status code 301 permanently
moved. This error
contains the region of the bucket in its header, which we can then use to
configure the client.
Review Comment:
A lot of what is in this doc is outdated; it was written when we raised the
initial PR for the upgrade, to help with reviewing. For example, we no longer
make the HEAD call to get the region of the bucket. We had to do that when
cross-region access was not supported in SDK V2.
Let's create a new section in `index.md` which explains region handling,
something like:
* `fs.s3a.endpoint` and `fs.s3a.endpoint.region` can be used to set values
for the endpoint and region.
* If a value is set in `fs.s3a.endpoint.region`, S3A will configure the S3
client to use it. If this is set to a region that does not match your bucket,
you will receive a 301 redirect response.
* If no region is set but an endpoint is, S3A will attempt to parse the
region from the endpoint.
* If your endpoint is set to the central `s3.amazonaws.com`, S3A will enable
cross-region access. This means that even if `fs.s3a.endpoint.region` is
incorrect, the SDK will determine the region itself: it makes the request,
and if it receives a 301 redirect, it determines the region at the cost of a
HEAD request and caches it.
* If no region is set and none could be determined from the endpoint, S3A
will use US_EAST_2 as the default region and enable cross-region access.
Again, this means that requests will not fail, but an initial HEAD request
will be made to get the region of your bucket.
* If the configured region is an empty string, we fall back to the SDK
region [resolution
chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
I don't think we need to explain in too much detail what the code is doing,
rather just the end behaviour the user will see.
> AWS SDK V2 - Update S3A region logic
> -------------------------------------
>
> Key: HADOOP-19044
> URL: https://issues.apache.org/jira/browse/HADOOP-19044
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.4.0
> Reporter: Ahmar Suhail
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> If both fs.s3a.endpoint & fs.s3a.endpoint.region are empty, Spark will set
> fs.s3a.endpoint to s3.amazonaws.com here:
> [https://github.com/apache/spark/blob/9a2f39318e3af8b3817dc5e4baf52e548d82063c/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L540]
>
>
> HADOOP-18908 updated the region logic so that if fs.s3a.endpoint.region is
> set, or if a region can be parsed from fs.s3a.endpoint (which will happen in
> this case; the region will be US_EAST_1), cross-region access is not enabled.
> This will cause 400 errors if the bucket is not in US_EAST_1.
>
> Proposed: update the logic so that if the endpoint is the global
> s3.amazonaws.com, cross-region access is enabled.
>
>
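The resolution order proposed in the issue and review above can be sketched as
a small standalone method. This is not the actual S3A implementation; the
class name, method name, and the endpoint-parsing regex are illustrative
assumptions, and only the ordering (explicit region, then central endpoint,
then parse from endpoint, then SDK chain) follows the discussion:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the region-resolution order discussed in
// HADOOP-19044; names and regex are illustrative, not S3A internals.
public class RegionResolutionSketch {
    static final String CENTRAL_ENDPOINT = "s3.amazonaws.com";

    // Matches regional endpoints such as s3.eu-west-2.amazonaws.com
    // and captures the region token.
    static final Pattern REGIONAL =
        Pattern.compile("s3[.-]([a-z0-9-]+)\\.amazonaws\\.com");

    /**
     * Returns the region to configure on the client, or empty to fall
     * back to the SDK's own region resolution chain.
     */
    static Optional<String> resolveRegion(String endpoint, String configuredRegion) {
        if (configuredRegion != null && !configuredRegion.isEmpty()) {
            // An explicit fs.s3a.endpoint.region always wins.
            return Optional.of(configuredRegion);
        }
        if (endpoint == null || endpoint.isEmpty()
            || endpoint.equals(CENTRAL_ENDPOINT)) {
            // Central or unset endpoint: pick a default region and rely on
            // cross-region access to redirect to the bucket's real region.
            return Optional.of("us-east-2");
        }
        Matcher m = REGIONAL.matcher(endpoint);
        if (m.matches()) {
            // Region parsed from a regional endpoint.
            return Optional.of(m.group(1));
        }
        // Unknown endpoint shape: defer to the SDK resolution chain.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolveRegion("s3.amazonaws.com", null).get());
        System.out.println(resolveRegion("s3.eu-west-2.amazonaws.com", null).get());
        System.out.println(resolveRegion("s3.amazonaws.com", "us-west-2").get());
    }
}
```

With this ordering, the Spark case from the issue (central endpoint set, no
region) resolves to a default region with cross-region access, instead of
locking the client to US_EAST_1.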
--
This message was sent by Atlassian Jira
(v8.20.10#820010)