[ https://issues.apache.org/jira/browse/HADOOP-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Einav Hollander updated HADOOP-18448:
-------------------------------------
    Description: 
I'm using an EMR emr-6.5.0 cluster in us-east-1 with EC2 instances. The cluster is
running a Spark application using PySpark 3.2.1.
EMR is using Hadoop distribution: Amazon 3.2.1.
My Spark application reads from one bucket in us-west-2 and writes to a bucket in
us-east-1.
Since I'm processing a large amount of data, I'm paying a lot for the network
transfer. To reduce the cost I created a VPC interface endpoint for S3 in
us-west-2. Inside the Spark application I use the AWS CLI to list the file names
in the us-west-2 bucket, and that listing goes through the S3 interface endpoint,
but when I use PySpark to read the data it goes to the us-east-1 S3 endpoint
instead of the us-west-2 endpoint.
I tried to use per-bucket configuration, but it is ignored even though I added it
to both the default configuration and the spark-submit call.
These are the options I tried to set, and they are ignored (a PySpark sketch of
the same settings follows the list):
 '--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
 '--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<my vpc endpoint>",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<vpc gateway endpoint>",
 '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
 '--conf', "spark.hadoop.fs.s3a.path.style.access=false"


> s3a endpoint per bucket configuration in pyspark is ignored
> -----------------------------------------------------------
>
>                 Key: HADOOP-18448
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18448
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.1
>            Reporter: Einav Hollander
>            Priority: Major
>


