Hi,

I am using newAPIHadoopFile to process a large number of S3 files (around 20
thousand) by passing their URLs as one comma-separated String. The job takes
around *7 minutes* just to start. I am running it on EMR 5.2.0 with Spark
2.0.2.

Here is the code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

Configuration conf = new Configuration();

// jsc is an existing JavaSparkContext; inputPath is the comma-separated URL list.
JavaPairRDD<Text, BytesWritable> file =
        jsc.newAPIHadoopFile(inputPath, FullFileInputFormat.class,
                Text.class, BytesWritable.class, conf);
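
For context, inputPath is just the ~20,000 object URLs joined with commas.
A minimal sketch of how it could be built (the helper name, bucket name, and
key source are illustrative, not my actual code):

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: join every object key into one comma-separated path.
// keys is the list of ~20,000 object keys, however they are obtained.
static String buildInputPath(String scheme, List<String> keys) {
    return keys.stream()
            .map(key -> scheme + "://my-bucket/" + key) // placeholder bucket
            .collect(Collectors.joining(","));
}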

I have experimented with different URL schemes (s3://, s3n://, and s3a://),
but none of them helped.
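
Concretely, each attempt only changed the scheme prefix used when joining the
URLs; a sketch reusing the hypothetical buildInputPath helper above:

// Only the scheme varied between runs; the job itself was identical.
for (String scheme : new String[] {"s3", "s3n", "s3a"}) {
    String inputPath = buildInputPath(scheme, keys);
    // ... run the same newAPIHadoopFile job with this inputPath ...
}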

Could you please suggest something to reduce the job start time?




