Einav Hollander created HADOOP-18448:
----------------------------------------

             Summary: s3a endpoint per bucket configuration in pyspark is ignored
                 Key: HADOOP-18448
                 URL: https://issues.apache.org/jira/browse/HADOOP-18448
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: conf
    Affects Versions: 3.2.1
            Reporter: Einav Hollander


I'm using an EMR emr-6.5.0 cluster in us-east-1 with EC2 instances. The cluster
runs a Spark application using PySpark 3.2.1, and EMR uses Hadoop distribution
Amazon 3.2.1.
My Spark application reads from one bucket in us-west-2 and writes to a bucket
in us-east-1.
Since I'm processing a large amount of data, I'm paying a lot for cross-region
network transfer. To reduce the cost, I created a VPC interface endpoint to S3
in us-west-2. Inside the Spark application I use the AWS CLI to read the file
names from the us-west-2 bucket, and that works through the S3 interface
endpoint (a sketch of the equivalent call is below), but when I use PySpark to
read the data it uses the us-east-1 S3 endpoint instead of the us-west-2
endpoint.
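
For reference, a minimal sketch of the listing that does work, written here
with boto3 rather than the CLI; the endpoint URL, bucket name, and prefix are
placeholders, not the real values:

import boto3

# The AWS SDK, like the CLI, honors an explicit endpoint, so this request
# travels over the VPC interface endpoint instead of the public us-east-1 one.
# Endpoint URL, bucket name, and prefix below are placeholders.
s3 = boto3.client(
    "s3",
    region_name="us-west-2",
    endpoint_url="https://<my-vpc-endpoint>",  # S3 interface endpoint
)

# List object keys under a prefix, paginating through the results.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="<us-west-2-bucket-name>", Prefix="<prefix>/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])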
I tried to use per-bucket configuration, but it is ignored even though I added
it both to the default configuration and to the spark-submit call.
I tried to set the following configurations, but they are ignored:
 '--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
 '--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<my vpc endpoint>",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<vpc gateway endpoint>",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
 '--conf', "spark.hadoop.fs.s3a.path.style.access=false",
 '--conf', "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
 '--conf', "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
 '--conf', "spark.eventLog.enabled=false"


