[
https://issues.apache.org/jira/browse/HADOOP-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-17789:
------------------------------------
Summary: S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop
(was: S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop)
> S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older
> Hadoop
> --------------------------------------------------------------------------------
>
> Key: HADOOP-17789
> URL: https://issues.apache.org/jira/browse/HADOOP-17789
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Arghya Saha
> Priority: Minor
> Attachments: storediag.log
>
>
> This issue is a continuation of
> https://issues.apache.org/jira/browse/HADOOP-17755
> The input size reported by Spark (Hadoop 3.3.1) was almost double, and the read
> runtime also increased (around 20%), compared to Spark (Hadoop 3.2.0) with the
> exact same resources and configuration. This is also happening with other jobs
> that were not impacted by the read fully error mentioned above.
> *I saw the exact same issue with Hadoop 3.2.0 when I used the workaround
> fs.s3a.readahead.range = 1G.*
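> (For reference, the workaround was passed in the same way as the rest of my Spark
> configuration shown further below, i.e. via Spark's Hadoop configuration prefix;
> this line is just a sketch of that setting, not an additional change:)
> {code:java}
> spark.hadoop.fs.s3a.readahead.range: '1G'{code}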
> Further details below:
>
> |Hadoop Version|Actual size of the files (SQL tab)|Reported size of the files (Stages tab)|Time to complete the stage|fs.s3a.readahead.range|
> |Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
> |Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|64K|
> |Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|1G|
> * *Shuffle Write* is the same (95.9 GiB) in all three cases above
> I was expecting read performance with Hadoop 3.3.1 to improve (or at least match
> 3.2.0). Please suggest how to approach and resolve this.
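> If it helps with the diagnosis, here is a minimal sketch of a check I could run
> (assuming the standard Hadoop storage-statistics API; exact counter names vary by
> release) to dump the S3A byte counters in the same JVM after a read, so the bytes
> actually pulled from S3 can be compared with the 58.7 GiB Spark shows in the
> Stages tab:
> {code:java}
> import org.apache.hadoop.fs.FileSystem;
>
> public final class S3AStatsDump {
>   private S3AStatsDump() {}
>
>   /** Call from the driver/executor JVM after the reads have run. */
>   public static void dump() {
>     // Iterate every FileSystem statistics object registered in this JVM;
>     // for S3A this includes the stream byte counters, so the bytes really
>     // fetched from S3 can be compared with the input size Spark reports.
>     FileSystem.getGlobalStorageStatistics().iterator().forEachRemaining(ss -> {
>       System.out.println("== " + ss.getName() + " ==");
>       ss.getLongStatistics().forEachRemaining(s ->
>           System.out.println("  " + s.getName() + " = " + s.getValue()));
>     });
>   }
> }{code}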
> I used the default s3a configuration apart from the settings below, running on an EKS cluster:
> {code:java}
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
> * I did not use
> {code:java}
> spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
> As already mentioned, I used the same Spark build, the same resources, and the
> same configuration; the only change is from Hadoop 3.2.0 to Hadoop 3.3.1 (built
> with Spark using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes
> -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").