[
https://issues.apache.org/jira/browse/HADOOP-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-17789:
------------------------------------
Summary: S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop
(was: S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop)
> S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older
> Hadoop
> --------------------------------------------------------------------------------
>
> Key: HADOOP-17789
> URL: https://issues.apache.org/jira/browse/HADOOP-17789
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Arghya Saha
> Priority: Minor
> Attachments: storediag.log
>
>
> This issue is a continuation of
> https://issues.apache.org/jira/browse/HADOOP-17755
> The input size reported by Spark (Hadoop 3.3.1) was almost double, and the read
> runtime also increased (around 20%), compared to Spark (Hadoop 3.2.0) with the
> exact same resources and configuration. This is also happening with other jobs
> that were not impacted by the read fully error mentioned above.
> *I saw the exact same issue with Hadoop 3.2.0 when I used the workaround
> fs.s3a.readahead.range = 1G.*
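> (For reference, the workaround was passed in the same way as the rest of my Spark
> configuration shown further below, i.e. via Spark's Hadoop configuration prefix;
> this line is just a sketch of that setting, not an additional change:)
> {code:java}
> spark.hadoop.fs.s3a.readahead.range: '1G'{code}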
> Further details below:
>
> |Hadoop Version|Actual size of the files (SQL tab)|Reported size of the files (Stages tab)|Time to complete the stage|fs.s3a.readahead.range|
> |Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
> |Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|64K|
> |Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|1G|
> * *Shuffle Write* is the same (95.9 GiB) in all three cases above
> I was expecting read performance with Hadoop 3.3.1 to improve (or at least match
> 3.2.0). Please suggest how to approach and resolve this.
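> If it helps with the diagnosis, here is a minimal sketch of a check I could run
> (assuming the standard Hadoop storage-statistics API; exact counter names vary by
> release) to dump the S3A byte counters in the same JVM after a read, so the bytes
> actually pulled from S3 can be compared with the 58.7 GiB Spark shows in the
> Stages tab:
> {code:java}
> import org.apache.hadoop.fs.FileSystem;
>
> public final class S3AStatsDump {
>   private S3AStatsDump() {}
>
>   /** Call from the driver/executor JVM after the reads have run. */
>   public static void dump() {
>     // Iterate every FileSystem statistics object registered in this JVM;
>     // for S3A this includes the stream byte counters, so the bytes really
>     // fetched from S3 can be compared with the input size Spark reports.
>     FileSystem.getGlobalStorageStatistics().iterator().forEachRemaining(ss -> {
>       System.out.println("== " + ss.getName() + " ==");
>       ss.getLongStatistics().forEachRemaining(s ->
>           System.out.println("  " + s.getName() + " = " + s.getValue()));
>     });
>   }
> }{code}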
> I used the default s3a configuration apart from the settings below, running on an EKS cluster:
> {code:java}
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
> * I did not use
> {code:java}
> spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
> As already mentioned, I used the same Spark build, the same resources, and the
> same configuration; the only change is from Hadoop 3.2.0 to Hadoop 3.3.1 (built
> with Spark using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes
> -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").