Re: GroupByKey with sorted values within key

Lukasz Cwik Wed, 30 May 2018 07:52:31 -0700

Each runner can choose to override the SortValues PTransform with its own
internal implementation. For example, the Spark runner overrides global
combine[1] during pipeline translation. If it detected the SortValues
PTransform during translation, it could replace it with something that uses
repartitionAndSortWithinPartitions.

GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific
use case. Users should rely on SortValues as it is the public
implementation for sorting.

1:
https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200

As a side note, it's uncommon to need all values sorted. Usually the top
100 suffice, and a top-N can be implemented much more efficiently with a
combiner than with a full sort.
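
To illustrate why a combiner is cheaper here, below is a minimal sketch in
plain Java (no Beam dependency) of a top-N accumulator. The names TopNSketch,
Accumulator, add, merge, extract and topN are hypothetical and chosen only for
this example; they are not Beam's API. The idea is the same as in a combiner:
each accumulator keeps at most N values in a min-heap, and partial
accumulators can be merged, so no worker ever has to hold or sort all values
for a key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a top-N combiner; not Beam's actual implementation.
class TopNSketch {

    // Keeps the n largest values seen so far in a min-heap of size <= n.
    static final class Accumulator {
        private final int n;
        private final PriorityQueue<Integer> heap = new PriorityQueue<>();

        Accumulator(int n) {
            this.n = n;
        }

        // Add one input value: O(log n) instead of sorting everything.
        void add(int value) {
            if (n <= 0) {
                return;
            }
            if (heap.size() < n) {
                heap.add(value);
            } else if (value > heap.peek()) {
                heap.poll();      // drop the smallest retained value
                heap.add(value);
            }
        }

        // Merge a partial result computed elsewhere (e.g. on another worker).
        Accumulator merge(Accumulator other) {
            for (int v : other.heap) {
                add(v);
            }
            return this;
        }

        // Produce the final output, largest first.
        List<Integer> extract() {
            List<Integer> out = new ArrayList<>(heap);
            out.sort(Collections.reverseOrder());
            return out;
        }
    }

    // Convenience helper: top n of a single collection of values.
    static List<Integer> topN(Iterable<Integer> values, int n) {
        Accumulator acc = new Accumulator(n);
        for (int v : values) {
            acc.add(v);
        }
        return acc.extract();
    }

    public static void main(String[] args) {
        System.out.println(topN(List.of(5, 1, 9, 3, 7), 3));
    }
}
```

Memory per key stays bounded by N regardless of how many values the key has,
which is what makes this cheaper than sorting the full value list. In a real
pipeline the Beam Java SDK's Top transform would be the natural place to
start rather than a hand-written accumulator like this one.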

On Wed, May 30, 2018 at 3:38 AM <marek-simu...@seznam.cz> wrote:

> Hi,
> I have a question. In dsl-euphoria, I am trying to translate
> “GroupByKey with sorted values within key” to Beam. I am aware of the Java
> SDK extension SortValues, but it doesn’t provide a sufficient abstraction
> for runners.
>
> I noticed that DataflowRunner translates batch GroupByKey to
> GroupByKeyAndSortValuesOnly. Is it being considered to have this in Beam
> core, so that, for example, SparkRunner could translate “GroupByKey with
> sorted values within key” using its internals, such as
> repartitionAndSortWithinPartitions?
>
> Thank you.
> Marek Simunek
>
