Hi,

This is a solved problem: try using the s3a filesystem instead of s3n and everything will be fine.
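A minimal sketch of switching the write path to s3a (this assumes the hadoop-aws jar and the AWS SDK are on the classpath; the credential values are placeholders, and in practice instance roles or credential providers are preferable to hard-coded keys):

    // Hedged sketch: point the Hadoop config at the s3a implementation.
    // The access/secret key values below are placeholders, not real credentials.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

    // Then write to an s3a:// URI instead of s3n://
    df.write.parquet("s3a://jelez/parquet-data/")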
Besides that, you might want to use coalesce(), partitionBy(), or repartition() to control how many tasks are used to write (that speeds things up quite a bit); see the sketch after this message. We had a write that used to take close to 50 minutes and now runs in under 5.

Regards,
Gourav Sengupta

On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:

> Working on a streaming job with DirectParquetOutputCommitter to S3.
> I need to use partitionBy and hence SaveMode.Append.
>
> Apparently when using SaveMode.Append, Spark automatically falls back to the
> default parquet output committer and ignores DirectParquetOutputCommitter.
>
> My problems are:
> 1. the copying to _temporary takes a lot of time
> 2. I get job failures with: java.io.FileNotFoundException: File
> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
> not exist.
>
> I have set:
> sparkConfig.set("spark.speculation", "false")
> sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
> sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>
> Any ideas? Opinions? Best practices?
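A minimal sketch of controlling write parallelism, under the assumptions that the DataFrame is called df, the partition column is named "date", and 32 is a reasonable task count for the cluster (all three are illustrative, not taken from the thread):

    import org.apache.spark.sql.SaveMode

    // Hedged sketch: repartitioning before the write controls how many
    // tasks (and hence how many files per output partition) perform the
    // actual S3 upload. Too few tasks underuses the cluster; too many
    // produces a flood of small files.
    df.repartition(32)
      .write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .parquet("s3a://jelez/parquet-data/")

If the data is already well distributed and you only want to shrink the file count without a full shuffle, coalesce(n) is the cheaper alternative to repartition(n).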