Hi all,

I'm trying to understand the impact of coalesce operation on spark job
performance.

As a side note: we are using EMRFS (i.e. AWS S3) as both the source and the
target for the job.

Omitting unnecessary details, the job can be described as: join a 200M-record
DataFrame stored in ORC format on EMRFS with another 200M-record cached
DataFrame, and write the result of the join back to EMRFS. The first DF is a
set of wide rows (Spark UI shows 300 GB) and the second is relatively small
(Spark UI shows 20 GB).
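
In code the job looks roughly like this (a simplified sketch; the bucket
paths and the join key are placeholders, not the real ones):

    // Sketch of the job, details omitted; paths and join key are placeholders.
    val wideDf  = spark.read.orc("s3://my-bucket/wide-input/")           // ~200M wide rows, ~300 GB per Spark UI
    val smallDf = spark.read.orc("s3://my-bucket/small-input/").cache()  // ~200M rows, ~20 GB per Spark UI
    val joined  = wideDf.join(smallDf, Seq("id"))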

I have enough resources in my cluster to perform the job, but I don't like
the fact that the output datasource contains 200 part ORC files (as
spark.sql.shuffle.partitions defaults to 200), so before saving the ORC to
EMRFS I'm doing .coalesce(10). From the documentation, coalesce looks like a
quite harmless operation: no repartitioning etc.

But with such a setup my job fails to write the dataset in the last stage.
Right now the error is an OOM: GC overhead limit exceeded. When I change
.coalesce(10) to .coalesce(100) the job runs much faster and finishes
without errors.

So what is the impact of .coalesce in this case? And how can I concatenate
the files in place (without involving Hive) to end up with a smaller number
of bigger files? With .coalesce(100) the job generates 100 ORC Snappy-encoded
files of ~300 MB each.

Thanks,
Andrii
