[ 
https://issues.apache.org/jira/browse/HADOOP-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388847#comment-17388847
 ] 

Steve Loughran commented on HADOOP-17789:
-----------------------------------------

bq. Can we use the latest wildfly-openssl 2.1.x?

The one shipped in Hadoop is the one tested. Anything else: you are on your own.

My recommendation: check out the Hadoop trunk source, change the version in the 
poms, rebuild and retest everything to see what S3A and ABFS do, then create a 
release build and retest in a test cluster you've created where the container 
images all use the version of native OpenSSL you intend to use. If all works, 
submit the Hadoop PR which, if taken up, means more people will test it and it 
will get supported.
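
For anyone attempting that, here is a rough sketch of the kind of change and 
rebuild involved. The property name, version and build/test flags below are 
illustrative assumptions; verify them against hadoop-project/pom.xml, 
BUILDING.txt and the hadoop-aws testing documentation on trunk.
{code:xml}
<!-- hadoop-project/pom.xml (illustrative; the actual property name may differ):
     bump the wildfly-openssl version used for S3A/ABFS TLS acceleration -->
<wildfly.openssl.version>2.1.3.Final</wildfly.openssl.version>
{code}
{code:bash}
# rebuild a distribution against the new dependency (flags as per BUILDING.txt)
mvn clean package -Pdist -DskipTests -Dtar

# rerun the S3A integration tests from hadoop-tools/hadoop-aws, with test
# credentials configured as described in the module's testing documentation
mvn verify -Dparallel-tests -Dscale
{code}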

Closing this JIRA as invalid, as it was a configuration issue.


> S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-17789
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17789
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Arghya Saha
>            Priority: Minor
>         Attachments: storediag.log
>
>
> This issue is a continuation of 
> https://issues.apache.org/jira/browse/HADOOP-17755
> The input data size reported by Spark (Hadoop 3.3.1) was almost double, and the 
> read runtime also increased (around 20%) compared to Spark (Hadoop 3.2.0) with 
> the exact same amount of resources and the same configuration. This is also 
> happening with other jobs that were not impacted by the readFully error 
> described in that issue.
> *I was hitting the exact same issue on Hadoop 3.2.0 when I was using the 
> workaround fs.s3a.readahead.range = 1G.*
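> For reference, the setting in question is the S3A option fs.s3a.readahead.range, 
> passed here through Spark's spark.hadoop.* configuration passthrough; the two 
> values compared below would look roughly like this (illustrative only):
> {code:java}
> spark.hadoop.fs.s3a.readahead.range: '64K'   # S3A default readahead
> # spark.hadoop.fs.s3a.readahead.range: '1G'  # the workaround value above
> {code}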
> Further details are below:
>  
> ||Hadoop Version||Actual size of the files (SQL tab)||Reported size of the files (Stages tab)||Time to complete the stage||fs.s3a.readahead.range||
> |Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
> |Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|64K|
> |Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|1G|
>  * *Shuffle Write* is the same (95.9 GiB) for all three cases above
> I was expecting some improvement (or at least parity with 3.2.0) in read 
> operations with Hadoop 3.3.1. Please suggest how to approach and resolve this.
> I have used the default S3A config along with the settings below, on an EKS cluster:
> {code:java}
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
>  * I did not use 
> {code:java}
> spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
> And as already mentioned, I have used the same Spark, the same amount of 
> resources, and the same config. The only change is Hadoop 3.2.0 to Hadoop 3.3.1 
> (built with Spark using ./dev/make-distribution.sh --name spark-patched --pip 
> -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").



