Re: Parallelism in Combine.GroupedValues

Luke Cwik Tue, 12 May 2020 09:34:24 -0700

There isn't just one accumulator. There are multiple accumulators that are
used to support a parallel combine. Feel free to open up a PR to help
improve the javadoc.


Yes, these combines are converted to a lifted combine when the runner is
able to do so.

On Tue, May 12, 2020 at 9:14 AM rahul patwari <[email protected]>
wrote:

> Hi Luke,
>
> I should have been more clear with my question. Sorry, my bad.
>
> I wanted to ask: How can combine happen parallelly by using only *one
> accumulator instance*?
>
> It has been explicitly specified in CombineFn.apply()[4] that
> mergeAccumulators() will not be called. A single accumulator instance is
> created and used to combine all the values associated with a single key.
> As Combine.GroupedValues use CombineFn.apply(), the parallelism for
> combining the values of a key will only be one.
> And as Combine.perKey() uses Combine.GroupedValues to combine the values
> of a key, the parallelism of this combine will also be limited to one.
>
> Do you mean to say that these types of Combine will be translated to
> Lifted Combine(assuming no side inputs are provided) depending on the
> runner?
>
> Regards,
> Rahul
>
> [4]:
> https://github.com/apache/beam/blob/334682d4a8ac5e1ebd298ba3b8020a9161884927/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L441
>
> On Tue, May 12, 2020 at 8:35 PM Luke Cwik <[email protected]> wrote:
>
>> There is more than one instance of an accumulator being used and then
>> those accumulators are merged using mergeAccumulators method.
>>
>> Two examples of when combining happens in parallel is when the
>> withFewKeys hint is used on the combiner or when there is partial
>> combining[1] happening on the mapper side before the grouping operation.
>>
>> 1: https://s.apache.org/beam-runner-api-combine-model
>>
>> On Tue, May 12, 2020 at 7:05 AM rahul patwari <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> In the Javadoc for Combine.GroupedValues[1], it has been described that 
>>> *combining
>>> the values associated with a single key can happen in parallel*.
>>> The logic to combine values associated with a key can be provided by
>>> CombineFnWithContext (or) CombineFn.
>>> Both CombineFnWithContext.apply()[2] and CombineFn.apply()[3] uses a
>>> single accumulator to combine the values.
>>>
>>> My understanding is that the parallelism in Combine PTransform will be
>>> determined by the no. of accumulators. But, the Javadoc describes that
>>> combining is done in parallel even though the no. of accumulators used to
>>> combine is one.
>>>
>>> How can combine happen parallelly by using only one accumulator?
>>>
>>> Regards,
>>> Rahul
>>>
>>> [1]:
>>> https://github.com/apache/beam/blob/334682d4a8ac5e1ebd298ba3b8020a9161884927/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L2075-L2078
>>> [2]:
>>> https://github.com/apache/beam/blob/53e5cee254023152e77a3fc46564642dc9b6b506/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/CombineWithContext.java#L117
>>> [3]:
>>> https://github.com/apache/beam/blob/334682d4a8ac5e1ebd298ba3b8020a9161884927/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L443
>>>
>>

Re: Parallelism in Combine.GroupedValues

Reply via email to