[ https://issues.apache.org/jira/browse/HADOOP-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382824#comment-17382824 ]
Arghya Saha commented on HADOOP-17789:
--------------------------------------

Also, I have tried with the config (*spark.hadoop.fs.s3a.experimental.input.fadvise=random*) but the result is the same.

Just to clarify: with Hadoop 3.3.1 I am not adjusting *fs.s3a.readahead.range*; it is kept at the default. And yes, I am using the S3A committers, and magic works well (other than for tables with 200+ partitions and many small files, where the commit time is much higher compared with EMR - I will raise an issue shortly to understand why the magic committer cannot do the magic EMRFS does).

Coming back: I have actually had Hadoop 3.3.1 with Spark 3.1.1 deployed in our production cluster (reading/writing 5-10 TB of data every day) for the last 10 days, and I have more observations to share. For ORC, the read performance of Hadoop 3.3.1 is almost the same as Hadoop 3.2.0; it seems the optimizations made between Hadoop 3.2.0 and Hadoop 3.3.1 are balanced out by the issue I am highlighting (if it is a real issue). But for CSV the difference is noticeable - it is around 10-20% slower compared to Hadoop 3.2.0 with the same Spark version and the same configuration.

I understand there have been no changes there, but could you check whether the fix for https://issues.apache.org/jira/browse/HADOOP-16109 may have any adverse impact? The reason I am asking is that when I apply a high value of *fs.s3a.readahead.range* with Hadoop 3.2.0, the behavior is exactly the same as with Hadoop 3.3.1 (without adjusting the readahead range).

Lastly, I will share the details once the above-mentioned issue is resolved.
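To make the readahead suspicion concrete, here is a toy model in Python - my own sketch for illustration, *not* the actual S3AInputStream buffering logic - of how a large readahead window could inflate the bytes fetched from S3, under the assumption that every forward read also drags in its full readahead window:

```python
MiB = 1024 * 1024

def bytes_fetched(file_size, record_size, readahead):
    """Toy estimate: assume each record-sized read also pulls up to
    `readahead` extra bytes (capped at end of file). This is an
    assumption for illustration, not how S3AInputStream really works."""
    fetched = 0
    pos = 0
    while pos < file_size:
        fetched += min(record_size + readahead, file_size - pos)
        pos += record_size
    return fetched

# Reading a 256 MiB file in 8 MiB records:
small = bytes_fetched(256 * MiB, 8 * MiB, 64 * 1024)   # 64K readahead
large = bytes_fetched(256 * MiB, 8 * MiB, 1024 * MiB)  # 1G readahead

print(f"64K readahead: {small / (256 * MiB):.2f}x the file size")
print(f"1G  readahead: {large / (256 * MiB):.2f}x the file size")
```

The absolute ratios depend entirely on the assumed access pattern, so they will not match the 2x in the table below; the point is only that, in such a model, bytes fetched grow with the readahead window even though the bytes the application consumes stay the same.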
> S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-17789
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17789
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.3.1
>            Reporter: Arghya Saha
>            Priority: Major
>
> This issue is a continuation of https://issues.apache.org/jira/browse/HADOOP-17755
>
> The input data reported by Spark (Hadoop 3.3.1) was almost double, and the read runtime also increased (around 20%) compared to Spark (Hadoop 3.2.0) with the same exact amount of resources and the same configuration. And this is happening with other jobs as well, which were not impacted by the read-fully error stated above.
>
> *I was having the same exact issue when I was using the workaround fs.s3a.readahead.range = 1G with Hadoop 3.2.0*
>
> Below are further details:
>
> |Hadoop Version|Actual size of the files (in SQL tab)|Reported size of the files (in Stages)|Time to complete the stage|fs.s3a.readahead.range|
> |Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
> |Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|{color:#172b4d}64K{color}|
> |Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|{color:#172b4d}1G{color}|
>
> * *Shuffle Write* is the same (95.9 GiB) for all three cases above.
>
> I was expecting some improvement (or parity with 3.2.0) in read operations with Hadoop 3.3.1; please suggest how to approach and resolve this.
> I have used the default s3a config along with the below, and I am also using an EKS cluster:
> {code:java}
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
> * I did not use
> {code:java}
> spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
> And as already mentioned, I have used the same Spark, the same amount of resources, and the same config. The only change is Hadoop 3.2.0 to Hadoop 3.3.1 (built with Spark using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1")

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
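For anyone reproducing the setup quoted in the issue above, the same properties can be passed straight to spark-submit as --conf flags; a minimal sketch (the property names are the S3A keys from this thread; whether to set each one is the reader's call, not a recommendation):

```python
# Sketch: turn the S3A properties quoted in the issue into
# spark-submit --conf flags. Values mirror the thread's config.
s3a_conf = {
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.hadoop.fs.s3a.downgrade.syncable.exceptions": "true",
}

flags = [f"--conf {key}={value}" for key, value in s3a_conf.items()]
print(" \\\n  ".join(["spark-submit"] + flags))
```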