[jira] [Created] (HADOOP-17789) S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop

Arghya Saha (Jira) Fri, 02 Jul 2021 09:00:06 -0700

Arghya Saha created HADOOP-17789:
------------------------------------

             Summary: S3 read performance with Spark with Hadoop 3.3.1 is 
slower than older Hadoop
                 Key: HADOOP-17789
                 URL: https://issues.apache.org/jira/browse/HADOOP-17789
             Project: Hadoop Common
          Issue Type: Improvement
    Affects Versions: 3.3.1
            Reporter: Arghya Saha



This issue is a continuation of 
https://issues.apache.org/jira/browse/HADOOP-17755

With Spark on Hadoop 3.3.1, the input data size reported by Spark was almost 
double, and read runtime increased by around 20%, compared to Spark on Hadoop 
3.2.0 with exactly the same resources and configuration. This also happens 
with other jobs that were not affected by the read fully error reported in the 
issue linked above.

*I saw exactly the same issue with Hadoop 3.2.0 when using the workaround 
fs.s3a.readahead.range = 1G*

Further details are below:

 
|Hadoop Version|Actual size of the files (in SQL tab)|Reported size of the files (in Stages)|Time to complete the stage|fs.s3a.readahead.range|
|Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
|Hadoop 3.3.1|29.3 GiB|*58.7 GiB*|*27 min*|64K|
|Hadoop 3.2.0|29.3 GiB|*58.7 GiB*|*~27 min*|1G|
 * *Shuffle Write* is the same (95.9 GiB) in all three cases above
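
As a quick sanity check, the figures in the table can be reduced to two ratios: how much larger the reported input size is than the actual file size, and how much longer the stage took. This is a minimal sketch that uses only the numbers from the table; the variable names are illustrative.

```python
# Figures copied from the table above (sizes in GiB, stage times in minutes).
actual_gib = 29.3
reported_gib_331 = 58.7
stage_min_320 = 23
stage_min_331 = 27

# Ratio of the input size reported in Stages to the actual file size.
read_amplification = reported_gib_331 / actual_gib

# Relative increase in stage runtime going from Hadoop 3.2.0 to 3.3.1.
runtime_increase_pct = (stage_min_331 - stage_min_320) / stage_min_320 * 100

print(f"read amplification: {read_amplification:.2f}x")
print(f"stage runtime increase: {runtime_increase_pct:.1f}%")
```

The result is roughly 2.0x reported input and about a 17% longer stage, which matches "almost double" and "around 20%".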

I was expecting read performance with Hadoop 3.3.1 to improve, or at least 
match 3.2.0. Please suggest how to investigate and resolve this.

I used the default s3a config plus the settings below, running on an EKS cluster:
{code:java}
spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
spark.hadoop.fs.s3a.committer.name: magic
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
 * I did not use 
{code:java}
spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
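
One way to compare these settings side by side is to generate the `spark-submit --conf` arguments for each run from a single base configuration. The sketch below is only an illustration: the variant names and the `conf_args` helper are hypothetical, the base settings are the ones quoted in this report, and nothing here implies that any variant resolves the slowdown.

```python
# Settings quoted in this report, applied to every run.
BASE_CONF = {
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.hadoop.fs.s3a.downgrade.syncable.exceptions": "true",
}

# Hypothetical test variants; the names and grouping are illustrative only.
VARIANTS = {
    "baseline": {},
    "readahead-1G": {"spark.hadoop.fs.s3a.readahead.range": "1G"},
    "fadvise-random": {
        "spark.hadoop.fs.s3a.experimental.input.fadvise": "random",
    },
}


def conf_args(variant):
    """Return spark-submit '--conf key=value' arguments for one variant."""
    merged = {**BASE_CONF, **VARIANTS[variant]}
    args = []
    for key, value in merged.items():
        args += ["--conf", f"{key}={value}"]
    return args


for name in VARIANTS:
    print(name, " ".join(conf_args(name)))
```

Keeping the base settings in one place means each run differs only in the single property under test, so the input size and stage time can be compared directly.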

As already mentioned, I used the same Spark, the same resources, and the same 
config. The only change is from Hadoop 3.2.0 to Hadoop 3.3.1 (Spark built 
using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive 
-Phive-thriftserver -Dhadoop.version="3.3.1")



