[ https://issues.apache.org/jira/browse/HADOOP-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382816#comment-17382816 ]
Arghya Saha commented on HADOOP-17789:
--------------------------------------

[~ste...@apache.org] Not sure what I am missing, but I am getting the exception below:

{code:java}
root@babbe50bce01:/opt/hadoop/bin# ./hadoop jar cloudstore-1.0.jar storedciag -r s3a://<bucket-with-prefix>/
Exception in thread "main" java.lang.ClassNotFoundException: storedciag
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:316)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
{code}

> S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-17789
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17789
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.3.1
>            Reporter: Arghya Saha
>            Priority: Major
>
> This issue is a continuation of https://issues.apache.org/jira/browse/HADOOP-17755
> The input data size reported by Spark (Hadoop 3.3.1) was almost double, and the read runtime also increased (around 20%), compared to Spark (Hadoop 3.2.0) with exactly the same resources and the same configuration. This is also happening with other jobs that were not impacted by the read fully error mentioned above.
> *I was hitting the same issue with Hadoop 3.2.0 when I used the workaround fs.s3a.readahead.range = 1G.*
> Further details:
>
> ||Hadoop Version||Actual size of the files (SQL tab)||Reported size of the files (Stages tab)||Time to complete the stage||fs.s3a.readahead.range||
> |Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
> |Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|{color:#172b4d}64K{color}|
> |Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|{color:#172b4d}1G{color}|
> * *Shuffle Write* is the same (95.9 GiB) in all three cases above.
> I was expecting read operations to improve (or at least match 3.2.0) with Hadoop 3.3.1; please suggest how to approach and resolve this.
> I used the default s3a configuration along with the settings below, running on an EKS cluster:
> {code:java}
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
> * I did not use
> {code:java}
> spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
> As already mentioned, I used the same Spark, the same amount of resources and the same configuration; the only change is Hadoop 3.2.0 to Hadoop 3.3.1 (built with Spark using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").
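For reference, the ClassNotFoundException in the comment above is most likely just a typo in the command: the cloudstore diagnostics entry point is spelled storediag, and in this invocation hadoop jar treats the first argument after the jar as the class to run, so the misspelling surfaces as a missing class. A corrected invocation would look like the following (the bucket placeholder and the -r flag are carried over from the original command, not verified here):

{code:java}
# run cloudstore's store diagnostics against the target bucket
./hadoop jar cloudstore-1.0.jar storediag -r s3a://<bucket-with-prefix>/
{code}

On the doubled "reported size" itself, one setting worth experimenting with is the S3A input policy already mentioned in the description. A minimal sketch of how it could be set alongside the other spark.hadoop.* properties above (the values shown are illustrative assumptions, not a recommended fix):

{code:java}
# random tends to suit columnar formats (ORC/Parquet); sequential/normal suit full-file scans
spark.hadoop.fs.s3a.experimental.input.fadvise: random
# readahead used when skipping forward inside an open stream (64K is the default)
spark.hadoop.fs.s3a.readahead.range: 64K
{code}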