[
https://issues.apache.org/jira/browse/HADOOP-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607121#comment-17607121
]
Steve Loughran commented on HADOOP-18448:
-----------------------------------------
bq. how do you know it is an emr problem?
# EMR is Amazon's own private fork of Hadoop.
# fs.s3a.endpoint.region was added in HADOOP-17705, with multiple follow-up
changes until it worked (a usage sketch follows at the end of this comment).
# We have no idea whether that commit chain has been cherry-picked
into the EMR fork. Talk to them and see what they say.
"Invalid" or "Cannot reproduce" is exactly the same response you'd get if anyone
filed the same issue against HDInsight, Cloudera products, or anyone else's
distribution: you have to talk to the vendor about their private fork. Now, if you
did have a problem with CDH it'd end up with me or my colleagues, but there'd be
internal tracking/escalation JIRAs and we would have the source in front of us;
we'd know whether the feature was in that release, etc.
For EMR, the EMR team have to handle it. If they find a bug in the ASF
hadoop-aws code, provide PRs against our branches, test them, etc., we are all
happy to review and merge. But they will still be the people to talk to when
you have problems with their own releases.
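For reference, here is a minimal PySpark sketch of how per-bucket endpoint/region
overrides are normally wired up against the ASF hadoop-aws code. The bucket names
and the interface-endpoint URL are placeholders, and whether any of this takes
effect on EMR depends on what Amazon has backported into their fork:
{code:python}
# Sketch only: placeholder bucket names and endpoint URL.
# Whether these settings are honoured on EMR depends on whether
# HADOOP-17705 and its follow-ups are present in Amazon's fork.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-per-bucket-endpoint-sketch")
    # Route the us-west-2 bucket through the VPC interface endpoint.
    .config("spark.hadoop.fs.s3a.bucket.my-west-bucket.endpoint",
            "https://bucket.vpce-0123456789abcdef0.s3.us-west-2.vpce.amazonaws.com")
    .config("spark.hadoop.fs.s3a.bucket.my-west-bucket.endpoint.region", "us-west-2")
    # The us-east-1 bucket keeps its own regional endpoint.
    .config("spark.hadoop.fs.s3a.bucket.my-east-bucket.endpoint.region", "us-east-1")
    .getOrCreate()
)

# Reads from the west bucket should now resolve against us-west-2.
df = spark.read.parquet("s3a://my-west-bucket/path/to/data/")
df.show(5)
{code}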
> s3a endpoint per bucket configuration in pyspark is ignored
> -----------------------------------------------------------
>
> Key: HADOOP-18448
> URL: https://issues.apache.org/jira/browse/HADOOP-18448
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Environment: Amazon EMR emr-6.5.0 cluster
> Reporter: Einav Hollander
> Priority: Major
>
> I'm using an EMR emr-6.5.0 cluster in us-east-1 with EC2 instances. The cluster is
> running a Spark application using PySpark 3.2.1.
> EMR is using Hadoop distribution: Amazon 3.2.1.
> My Spark application is reading from one bucket in us-west-2 and writing to a
> bucket in us-east-1.
> Since I'm processing a large amount of data I'm paying a lot of money for the
> network transfer. In order to reduce the cost I have created a VPC interface
> endpoint to S3 in us-west-2. Inside the Spark application I'm using the AWS CLI
> to read the file names from the us-west-2 bucket, and it works through
> the S3 interface endpoint, but when I use PySpark to read the data it uses
> the us-east-1 S3 endpoint instead of the us-west-2 endpoint.
> I tried to use per-bucket configuration but it is being ignored, although I
> added it to the default configuration and to the spark-submit call.
> I tried to set the following configuration options, but they are ignored:
> '--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
> '--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<my vpc endpoint>",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<vpc gateway endpoint>",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
> '--conf', "spark.hadoop.fs.s3a.path.style.access=false"