Hi, I am using newAPIHadoopFile to process a large number of S3 files (around 20,000) by passing their URLs as a comma-separated string. It takes around *7 minutes* just to start the job. I am running the job on EMR 5.2.0 with Spark 2.0.2.
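For reference, the comma-separated input string is built roughly like this (the bucket name, key names, and helper class are placeholders, not my actual code):

```java
import java.util.Arrays;
import java.util.List;

public class InputPathBuilder {
    // Joins S3 object keys into the single comma-separated path string
    // that newAPIHadoopFile accepts as its input path argument.
    public static String buildInputPath(List<String> keys, String bucket) {
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            if (sb.length() > 0) sb.append(',');
            sb.append("s3a://").append(bucket).append('/').append(key);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("part-0000.bin", "part-0001.bin");
        // prints: s3a://my-bucket/part-0000.bin,s3a://my-bucket/part-0001.bin
        System.out.println(buildInputPath(keys, "my-bucket"));
    }
}
```

With ~20,000 keys this produces one very long string, which the input format then has to split and stat individually.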
Here is the code:

    Configuration conf = new Configuration();
    JavaPairRDD<Text, BytesWritable> file =
        jsc.newAPIHadoopFile(inputPath, FullFileInputFormat.class,
            Text.class, BytesWritable.class, conf);

I have experimented with different URL schemes (s3://, s3n:// and s3a://) but it did not help. Could you please suggest something to reduce this job start time?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/newAPIHadoopFile-bad-performance-tp28281.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.