Thanks for the clarification, Gourav.
> On Mar 6, 2016, at 3:54 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi Ted,
>
> There was no idle time after I changed the path to start with s3a and then
> ensured that the number of executors writing was large. The writes start and
> complete in about 5 minutes or less.
>
> Initially the write used to take around 30 minutes to complete, and we could
> see failure messages all over the place for another 20 minutes, after which
> we killed the Jupyter application.
>
> Regards,
> Gourav Sengupta
>
>> On Sun, Mar 6, 2016 at 11:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> Gourav:
>> For the 3rd paragraph, did you mean the job seemed to be idle for about 5
>> minutes?
>>
>> Cheers
>>
>>> On Mar 6, 2016, at 3:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> This is a solved problem: try using s3a instead and everything will be fine.
>>>
>>> Besides that, you might want to use coalesce, partitionBy, or repartition
>>> in order to control how many executors are writing (that speeds things up
>>> quite a bit).
>>>
>>> We had a write that was taking close to 50 minutes which is now running in
>>> under 5 minutes.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>> On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:
>>>> Working on a streaming job with DirectParquetOutputCommitter to S3,
>>>> I need to use partitionBy and hence SaveMode.Append.
>>>>
>>>> Apparently when using SaveMode.Append, Spark automatically falls back to the
>>>> default Parquet output committer and ignores DirectParquetOutputCommitter.
>>>>
>>>> My problems are:
>>>> 1. the copying to _temporary takes a lot of time
>>>> 2. I get job failures with: java.io.FileNotFoundException: File
>>>> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
>>>> not exist.
>>>>
>>>> I have set:
>>>> sparkConfig.set("spark.speculation", "false")
>>>> sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>>>> sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>>>>
>>>> Any ideas? Opinions? Best practices?
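Pulling the advice in this thread together, a minimal sketch of the write path might look like the following (Spark 1.x API, as used at the time of the thread). The bucket name, input path, partition column, and coalesce count are all hypothetical placeholders, not values from the thread; the key points are the s3a:// scheme, disabling speculative execution, and controlling the number of writing tasks with coalesce:

```scala
// Sketch only: paths, column names, and partition counts below are
// illustrative assumptions, not taken from the original thread.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val conf = new SparkConf()
  .setAppName("parquet-to-s3")
  .set("spark.speculation", "false") // avoid speculative duplicate output tasks

val sc = new SparkContext(conf)
// Disable MapReduce-level speculation as well, as done in the thread.
sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("s3a://example-bucket/input/") // hypothetical input

df.coalesce(32)                     // control how many tasks write the output
  .write
  .mode(SaveMode.Append)            // required when using partitionBy here
  .partitionBy("date")              // hypothetical partition column
  .parquet("s3a://example-bucket/parquet-data/") // note s3a://, not s3n://
```

This requires a running Spark cluster with the s3a filesystem (hadoop-aws) on the classpath, so it is a configuration sketch rather than a standalone runnable program. Note that, per Jelez's report, SaveMode.Append causes Spark to ignore DirectParquetOutputCommitter regardless of these settings.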