Einav Hollander created HADOOP-18448:
----------------------------------------
Summary: s3a endpoint per bucket configuration in pyspark is ignored
Key: HADOOP-18448
URL: https://issues.apache.org/jira/browse/HADOOP-18448
Project: Hadoop Common
Issue Type: Sub-task
Components: conf
Affects Versions: 3.2.1
Reporter: Einav Hollander
I'm using an EMR emr-6.5.0 cluster in us-east-1 with EC2 instances. The cluster
runs a Spark application using PySpark 3.2.1.
EMR uses the Hadoop distribution Amazon 3.2.1.
My Spark application reads from one bucket in us-west-2 and writes to a
bucket in us-east-1.
Since I'm processing a large amount of data, I'm paying a lot of money for
network transport. To reduce the cost I created a VPC interface endpoint to S3
in us-west-2. Inside the Spark application I use the AWS CLI to list the file
names in the us-west-2 bucket, and that works through the S3 interface
endpoint; but when I use PySpark to read the data, it goes through the
us-east-1 S3 endpoint instead of the us-west-2 endpoint.
I tried to use per-bucket configuration, but it is ignored even though I added
it both to the default configuration and to the spark-submit call.
I tried to set the following configuration options, but they are ignored:
'--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
'--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
'--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<my vpc endpoint>",
'--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
'--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<vpc gateway endpoint>",
'--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
'--conf', "spark.hadoop.fs.s3a.path.style.access=false",
'--conf', "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
'--conf', "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
'--conf', "spark.eventLog.enabled=false",
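For what it's worth, the per-bucket pairs above can be generated
programmatically before the spark-submit call. A minimal sketch in plain
Python; the bucket names and endpoint URLs here are placeholders, not real
values. (Note: the fs.s3a.bucket.<bucket>.* prefix is the documented S3A
per-bucket override mechanism, but fs.s3a.endpoint.region was only added in
later Hadoop releases, 3.3.1 if I recall correctly, so on Hadoop 3.2.1 the
.endpoint.region keys may simply be unrecognized, which could be related to
this issue.)

```python
# Build the '--conf' argument pairs for spark-submit so that S3A applies a
# per-bucket endpoint override via the fs.s3a.bucket.<bucket>.* key prefix.
# Bucket names and endpoint URLs are placeholders for illustration only.

def s3a_bucket_confs(bucket, endpoint, region):
    """Return the spark-submit args overriding one bucket's S3A endpoint."""
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}"
    return [
        "--conf", f"{prefix}.endpoint={endpoint}",
        # .endpoint.region needs Hadoop 3.3.1+; ignored on older releases.
        "--conf", f"{prefix}.endpoint.region={region}",
    ]

args = (
    s3a_bucket_confs("my-west-bucket",
                     "https://vpce-example.s3.us-west-2.vpce.amazonaws.com",
                     "us-west-2")
    + s3a_bucket_confs("my-east-bucket",
                       "https://s3.us-east-1.amazonaws.com",
                       "us-east-1")
)
```

The resulting list can be spliced into the spark-submit command line right
next to the other '--conf' options shown above.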
--
This message was sent by Atlassian Jira
(v8.20.10#820010)