Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Ah, that's right. I didn't mention it: I have 10 executors in my cluster, so when I do .coalesce(10) and save the ORC to S3 right after that, does coalescing really affect parallelism? To me it looks like no, because we went from 100 tasks executed in parallel by 10 executors to 10
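The 100-tasks-vs-10-tasks trade-off above can be sketched with simple arithmetic. This is a toy model in plain Python (no Spark), and it assumes each of the 10 executors provides exactly one task slot; real parallelism also depends on cores per executor.

```python
import math

def waves(num_partitions: int, executor_slots: int) -> int:
    """Number of sequential 'waves' of tasks needed to process all partitions."""
    return math.ceil(num_partitions / executor_slots)

# 100 partitions on 10 slots: 10 waves of 10 parallel tasks,
# each task covering ~1% of the data.
print(waves(100, 10))  # -> 10

# After coalesce(10): a single wave of 10 parallel tasks,
# but each task now covers ~10% of the data.
print(waves(10, 10))   # -> 1
```

Both plans keep all 10 executors busy at some point, but the coalesced plan packs 10x more data into each task, which matters for memory pressure and straggler risk even when the slot count matches.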

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread John Compitello
Spark is doing operations on each partition in parallel. If you decrease the number of partitions, you're potentially doing less work in parallel, depending on your cluster setup. > On May 23, 2017, at 4:23 PM, Andrii Biletskyi wrote: > No, I didn't

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
No, I didn't try to use repartition; how exactly does it impact the parallelism? In my understanding, coalesce simply "unions" multiple partitions located on the same executor, "one on top of the other", while repartition does a hash-based shuffle, decreasing the number of output partitions. So how this
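The coalesce-vs-repartition distinction described above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's actual implementation: real coalesce groups input partitions by preferred location rather than round-robin, and real repartition hashes rows across the network.

```python
def coalesce(partitions, n):
    """Narrow dependency: each output partition is a union of whole
    input partitions, so no individual record is reshuffled."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)  # whole partitions concatenated as-is
    return out

def repartition(partitions, n):
    """Wide dependency: every individual record is reassigned by hash,
    so data moves between partitions (and executors) in a shuffle."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            out[hash(record) % n].append(record)
    return out

parts = [[f"r{i}-{j}" for j in range(3)] for i in range(6)]
print(len(coalesce(parts, 2)), len(repartition(parts, 2)))  # -> 2 2
```

The key difference the sketch shows: after `coalesce`, every input partition lands whole inside one output partition; after `repartition`, records from one input partition can end up scattered across all outputs.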

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Michael Armbrust
coalesce is nice because it does not shuffle, but the consequence of avoiding a shuffle is that it will also reduce the parallelism of the preceding computation. Have you tried using repartition instead? On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi < andrii.bilets...@yahoo.com.invalid> wrote: > Hi
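The point above about coalesce reducing the parallelism of the *preceding* computation can be sketched as a toy stage planner in plain Python. The function names and the two-stage model are illustrative simplifications, not Spark internals, but they mirror the documented behavior: coalesce is a narrow dependency that folds into the upstream stage, while repartition inserts a shuffle boundary.

```python
def stage_task_counts(input_partitions, output_partitions, use_repartition):
    """Toy model of stage planning for `read -> transform -> write(n)`.

    With coalesce, the narrowing folds into the preceding stage, so the
    transform itself runs with only `output_partitions` tasks.
    With repartition, a shuffle boundary keeps the transform at full
    `input_partitions` parallelism; only the write stage shrinks.
    """
    if use_repartition:
        return [input_partitions, output_partitions]  # two stages
    return [output_partitions]                        # one narrowed stage

print(stage_task_counts(100, 10, use_repartition=False))  # -> [10]
print(stage_task_counts(100, 10, use_repartition=True))   # -> [100, 10]
```

This is why repartition can be faster end-to-end despite paying for a shuffle: the expensive transform still runs with 100-way parallelism, and only the final write uses 10 tasks.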

Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Hi all, I'm trying to understand the impact of the coalesce operation on Spark job performance. As a side note: we are using EMRFS (i.e. AWS S3) as the source and target for the job. Omitting unnecessary details, the job can be explained as: join a 200M-record DataFrame stored in ORC format on EMRFS with