From: Silvio Fiorito
Sent: Wednesday, February 3, 2021 11:05 AM
To: James Yu ; user
Subject: Re: Poor performance caused by coalesce to 1
Coalesce is reducing the parallelization of your last stage, in your case to 1
task. So, it’s natural it will give poor performance especially with large
data.
stage boundary"?
>
> Thanks
That sounds like a plan, as suggested by Sean. I have also seen that caching the
RS before coalesce provides benefits, especially for a tiny 50MB dataset.
Check the Storage tab in the Spark UI for its effect.
HTH
Mich
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
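For illustration, the cache-before-coalesce idea might look roughly like the sketch below. It assumes a PySpark DataFrame named df; the helper name write_small_result and the Parquet output format are hypothetical choices, not something stated in this thread. The count() call is one common way to force the cache to fill before the single-partition write.

```python
def write_small_result(df, out_path):
    """Cache df, materialize it, then coalesce to one partition and write.

    Assumes df is a pyspark.sql.DataFrame. The Parquet format and the
    overwrite mode are illustrative assumptions.
    """
    cached = df.cache()          # lazy: only marks the DataFrame for caching
    row_count = cached.count()   # action: computes the result and fills the cache
    try:
        # The single write task should now read cached partitions instead of
        # recomputing the whole aggregation itself.
        cached.coalesce(1).write.mode("overwrite").parquet(out_path)
    finally:
        cached.unpersist()       # release the cached blocks
    return row_count
```

After the count() has run, the cached DataFrame should show up in the Storage tab, which is a quick way to confirm the cache is actually being used.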
It could also be because coalesce can cause some upstream transformations to
run with a parallelism of 1 as well. I think (?) an OK solution is to cache
the result, then coalesce and write. Or combine the files after the fact, or
do what Silvio said.
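As a rough sketch of the "combine the files after the fact" option: let Spark write its usual part files in parallel, then concatenate them in a separate step. The helper below is hypothetical and uses only the Python standard library. It assumes the output sits on a local filesystem and is line-oriented text without header rows (for example CSV written without a header, or JSON lines); it would not be valid for Parquet or other formats with per-file metadata.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, merged_path):
    """Concatenate Spark-style part-* files from part_dir into merged_path.

    Returns the number of part files merged. Marker files such as _SUCCESS
    and hidden .crc files do not match the part-* pattern and are skipped.
    """
    # Collect the part files first, in name order, so the merged output
    # follows the partition numbering.
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream bytes without loading the file
    return len(parts)
```

For example, merge_part_files("/tmp/agg_out", "/tmp/agg.csv") would stitch the parts under the hypothetical /tmp/agg_out directory into one file.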
On Wed, Feb 3, 2021 at 12:55 PM James Yu
I had that issue too, and from what I gathered it is an expected
optimization... Try using repartition instead.
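A minimal sketch of the repartition variant, assuming a PySpark DataFrame named df. The helper name and the Parquet output format are hypothetical. As far as I understand it, repartition(1) adds a shuffle, so the stages before it keep their own parallelism and only the final write runs as a single task, at the cost of moving the data through that shuffle.

```python
def write_single_file_with_repartition(df, out_path):
    """Write df as one output file using repartition(1) instead of coalesce(1).

    Assumes df is a pyspark.sql.DataFrame. The Parquet format and the
    overwrite mode are illustrative assumptions.
    """
    # repartition(1) introduces a shuffle boundary, so the upstream
    # aggregation is not collapsed into the single write task.
    df.repartition(1).write.mode("overwrite").parquet(out_path)
```

For a small result the extra shuffle is usually cheap compared with running the whole aggregation in one task.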
Date: Wednesday, February 3, 2021 at 1:54 PM
To: user
Subject: Poor performance caused by coalesce to 1
Hi Team,
We are running into this poor performance issue and seeking your suggestion on
how to improve it:
We have a particular dataset which we aggregate from other datasets and like to
write out to one single file (because it is small enough). We found that after
a series of