Arghya Saha created HADOOP-17789:
------------------------------------
Summary: S3 read performance with Spark with Hadoop 3.3.1 is
slower than older Hadoop
Key: HADOOP-17789
URL: https://issues.apache.org/jira/browse/HADOOP-17789
Project: Hadoop Common
Issue Type: Improvement
Affects Versions: 3.3.1
Reporter: Arghya Saha
This issue is a continuation of
https://issues.apache.org/jira/browse/HADOOP-17755
The input data size reported by Spark (Hadoop 3.3.1) was almost double, and the
read runtime also increased (by around 20%), compared to Spark (Hadoop 3.2.0)
with exactly the same resources and the same configuration. This is also
happening with other jobs that were not affected by the read fully error
described in the issue above.
*I had exactly the same issue when I used the workaround
fs.s3a.readahead.range = 1G with Hadoop 3.2.0.*
Further details are below:
|Hadoop Version|Actual size of the files (in SQL tab)|Reported size of the files (in Stages)|Time to complete the stage|fs.s3a.readahead.range|
|Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
|Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|64K|
|Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|1G|
* *Shuffle Write* is the same (95.9 GiB) in all three cases above.
I was expecting read operations with Hadoop 3.3.1 to improve (or at least match
3.2.0). Please suggest how to approach and resolve this.
I used the default s3a configuration plus the settings below, running on an EKS
cluster:
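To make the comparison explicit, here is a small sketch that derives the read amplification and the runtime increase from the figures in the table. The variable names are illustrative only, and the numbers are copied from the rows for Hadoop 3.2.0 (64K) and Hadoop 3.3.1 (64K).

```python
# Figures copied from the table above (sizes in GiB, stage time in minutes).
actual_gib = 29.3      # actual size of the files (SQL tab)
reported_gib = 58.7    # size reported in Stages with Hadoop 3.3.1
baseline_min = 23      # Hadoop 3.2.0, readahead 64K
slow_min = 27          # Hadoop 3.3.1, readahead 64K

# How many times more data was reported as read than the files contain.
read_amplification = reported_gib / actual_gib

# Relative increase in stage time over the Hadoop 3.2.0 baseline.
runtime_increase_pct = (slow_min - baseline_min) / baseline_min * 100

print(f"read amplification: {read_amplification:.2f}x")   # 2.00x
print(f"runtime increase: {runtime_increase_pct:.1f}%")   # 17.4%
```

So the reported input is about twice the actual file size, and the stage takes roughly 17% longer.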
{code:java}
spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
spark.hadoop.fs.s3a.committer.name: magic
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
* I did not use
{code:java}
spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
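For anyone trying to reproduce this, the sketch below shows one way to generate the spark-submit `--conf` flags for two runs that differ only in their S3A read settings. The helper `spark_conf_args` and the `baseline`/`candidate` dictionaries are hypothetical names introduced for illustration; only the property names and the 64K value come from this report. Whether `fadvise=random` changes the reported input size is not established here.

```python
def spark_conf_args(s3a_options):
    """Turn Hadoop S3A options into spark-submit --conf arguments.

    Hypothetical helper: each key gets the spark.hadoop. prefix so that
    Spark passes it through to the Hadoop configuration.
    """
    args = []
    for key, value in sorted(s3a_options.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Run A: the readahead range used in this report.
baseline = {"fs.s3a.readahead.range": "64K"}

# Run B: same readahead, plus the fadvise policy that was NOT set in this report.
candidate = {
    "fs.s3a.readahead.range": "64K",
    "fs.s3a.experimental.input.fadvise": "random",
}

print(" ".join(spark_conf_args(baseline)))
print(" ".join(spark_conf_args(candidate)))
```

Keeping everything else identical between the two runs makes it easier to attribute any difference in the reported input size to the read policy alone.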
As already mentioned, I used the same Spark, the same amount of resources and
the same config. The only change is from Hadoop 3.2.0 to Hadoop 3.3.1 (built
with Spark using ./dev/make-distribution.sh --name spark-patched --pip
-Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").