Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Ah, that's right. I didn't mention it: I have 10 executors in my cluster, so when I do .coalesce(10) and save the ORC to S3 right after that, does coalescing really affect parallelism? To me it looks like no, because we went from 100 tasks executed in parallel by 10 executors to 10
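The 100-tasks-vs-10-tasks trade-off above can be sketched with simple arithmetic. This is a toy model in plain Python (no Spark), and it assumes each of the 10 executors provides exactly one task slot; real parallelism also depends on cores per executor.

```python
import math

def waves(num_partitions: int, executor_slots: int) -> int:
    """Number of sequential 'waves' of tasks needed to process all partitions."""
    return math.ceil(num_partitions / executor_slots)

# 100 partitions on 10 slots: 10 waves of 10 parallel tasks,
# each task covering ~1% of the data.
print(waves(100, 10))  # -> 10

# After coalesce(10): a single wave of 10 parallel tasks,
# but each task now covers ~10% of the data.
print(waves(10, 10))   # -> 1
```

Both plans keep all 10 executors busy at some point, but the coalesced plan packs 10x more data into each task, which matters for memory pressure and straggler risk even when the slot count matches.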

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread John Compitello
Spark is doing operations on each partition in parallel. If you decrease the number of partitions, you're potentially doing less work in parallel, depending on your cluster setup. > On May 23, 2017, at 4:23 PM, Andrii Biletskyi wrote: > No, I didn't

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
No, I didn't try to use repartition; how exactly does it impact the parallelism? In my understanding, coalesce simply "unions" multiple partitions located on the same executor, "one on top of the other", while repartition does a hash-based shuffle, decreasing the number of output partitions. So how this
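The coalesce-vs-repartition distinction described above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's actual implementation: real coalesce groups input partitions by preferred location rather than round-robin, and real repartition hashes rows across the network.

```python
def coalesce(partitions, n):
    """Narrow dependency: each output partition is a union of whole
    input partitions, so no individual record is reshuffled."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)  # whole partitions concatenated as-is
    return out

def repartition(partitions, n):
    """Wide dependency: every individual record is reassigned by hash,
    so data moves between partitions (and executors) in a shuffle."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            out[hash(record) % n].append(record)
    return out

parts = [[f"r{i}-{j}" for j in range(3)] for i in range(6)]
print(len(coalesce(parts, 2)), len(repartition(parts, 2)))  # -> 2 2
```

The key difference the sketch shows: after `coalesce`, every input partition lands whole inside one output partition; after `repartition`, records from one input partition can end up scattered across all outputs.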

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Michael Armbrust
coalesce is nice because it does not shuffle, but the consequence of avoiding a shuffle is that it will also reduce the parallelism of the preceding computation. Have you tried using repartition instead? On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi < andrii.bilets...@yahoo.com.invalid> wrote: > Hi
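The point above about coalesce reducing the parallelism of the *preceding* computation can be sketched as a toy stage planner in plain Python. The function names and the two-stage model are illustrative simplifications, not Spark internals, but they mirror the documented behavior: coalesce is a narrow dependency that folds into the upstream stage, while repartition inserts a shuffle boundary.

```python
def stage_task_counts(input_partitions, output_partitions, use_repartition):
    """Toy model of stage planning for `read -> transform -> write(n)`.

    With coalesce, the narrowing folds into the preceding stage, so the
    transform itself runs with only `output_partitions` tasks.
    With repartition, a shuffle boundary keeps the transform at full
    `input_partitions` parallelism; only the write stage shrinks.
    """
    if use_repartition:
        return [input_partitions, output_partitions]  # two stages
    return [output_partitions]                        # one narrowed stage

print(stage_task_counts(100, 10, use_repartition=False))  # -> [10]
print(stage_task_counts(100, 10, use_repartition=True))   # -> [100, 10]
```

This is why repartition can be faster end-to-end despite paying for a shuffle: the expensive transform still runs with 100-way parallelism, and only the final write uses 10 tasks.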

Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Hi all, I'm trying to understand the impact of the coalesce operation on Spark job performance. As a side note: we are using EMRFS (i.e. AWS S3) as the source and target for the job. Omitting unnecessary details, the job can be explained as: join a 200M-record DataFrame stored in ORC format on EMRFS with