Hi,
I have hit a performance issue with the Spark runner that seems to be
related to its current Combine.perKey implementation. I'll try to
summarize what I have found in the code:
- Combine.perKey uses Spark's combineByKey primitive, which is pretty
similar to the definition of CombineFn
- it holds all elements as WindowedValues and uses
Iterable<WindowedValue<Acc>> as the accumulator (each WindowedValue holds
the accumulated state for a single window)
- the update function is implemented as
1) convert the value to an Iterable<WindowedValue<Acc>>
2) merge the accumulators for each window (roughly sketched below)
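To make that structure concrete, here is a minimal, self-contained sketch
of how I read the per-window merge. WindowedAcc, merge() and the plain
String windows are my own hypothetical simplifications, not the runner's
actual classes:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for WindowedValue<Acc>: one accumulator per window.
final class WindowedAcc<W, A> {
  final W window;
  final A acc;
  WindowedAcc(W window, A acc) {
    this.window = window;
    this.acc = acc;
  }
}

public class PerWindowMergeSketch {

  // Merge two flat lists of per-window accumulators: accumulators that
  // share a window are combined, the rest are carried over unchanged.
  // Note the nested scan over the result list for every right-side element.
  static <W, A> List<WindowedAcc<W, A>> merge(
      List<WindowedAcc<W, A>> left,
      List<WindowedAcc<W, A>> right,
      BinaryOperator<A> combine) {
    List<WindowedAcc<W, A>> result = new ArrayList<>(left);
    for (WindowedAcc<W, A> r : right) {
      boolean merged = false;
      for (int i = 0; i < result.size(); i++) {
        WindowedAcc<W, A> l = result.get(i);
        if (l.window.equals(r.window)) {
          result.set(i, new WindowedAcc<>(l.window, combine.apply(l.acc, r.acc)));
          merged = true;
          break;
        }
      }
      if (!merged) {
        result.add(r);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Toy example: summing longs accumulated in two windows "w1" and "w2".
    List<WindowedAcc<String, Long>> a =
        Arrays.asList(new WindowedAcc<>("w1", 3L), new WindowedAcc<>("w2", 5L));
    List<WindowedAcc<String, Long>> b =
        Arrays.asList(new WindowedAcc<>("w1", 4L));
    for (WindowedAcc<String, Long> wa : merge(a, b, Long::sum)) {
      System.out.println(wa.window + " -> " + wa.acc); // w1 -> 7, w2 -> 5
    }
  }
}

If something along these lines happens once per input element on a hot
key, that would be consistent with the mergeCombiners and
Iterables.unmodifiableIterable frames dominating the profile below.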
The logic inside createAccumulator and mergeAccumulators is quite
non-trivial. Profiling shows that the two frames where the code spends
most of its time are:
41633930798 33.18% 4163
org.apache.beam.runners.spark.translation.SparkKeyedCombineFn.mergeCombiners
19990682441 15.93% 1999
org.apache.beam.vendor.guava.v20_0.com.google.common.collect.Iterables.unmodifiableIterable
A simple change in the code, from

PCollection<..> input = ...
input.apply(Combine.perKey(...))

to

PCollection<..> input = ...
input
    .apply(GroupByKey.create())
    .apply(Combine.groupedValues(...))

had a drastic impact on the job run time (minutes, as opposed to hours
after which the first job still had not finished).
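For completeness, this is roughly what the two variants look like as a
runnable pipeline. Sum.longsPerKey()/Sum.ofLongs() and the toy input are
placeholders for my actual CombineFn and data, and the class and step
names are made up:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CombineVariants {
  public static void main(String[] args) {
    // Pass e.g. --runner=SparkRunner (with beam-runners-spark on the
    // classpath) to run this on the Spark runner.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<KV<String, Long>> input =
        p.apply(Create.of(KV.of("a", 1L), KV.of("a", 2L), KV.of("b", 3L)));

    // Variant 1: Combine.perKey -- the slow path in my job.
    PCollection<KV<String, Long>> viaCombinePerKey =
        input.apply("SumPerKey", Sum.longsPerKey());

    // Variant 2: the workaround -- group first, then combine the grouped values.
    PCollection<KV<String, Long>> viaGroupedValues =
        input
            .apply("Group", GroupByKey.create())
            .apply("SumGrouped", Combine.groupedValues(Sum.ofLongs()));

    p.run().waitUntilFinish();
  }
}

Both variants should of course produce the same sums; only the runtime
differed for me.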
I think I understand why the current logic is implemented the way it is:
it has to be able to deal with merging windows. But the consequence seems
to be a very inefficient implementation.
Has anyone seen similar behavior? Does my analysis of the problem seem
correct?
Jan