Re: GroupByKey with sorted values within key

Kenneth Knowles Wed, 30 May 2018 09:09:49 -0700

The sorting by bytes is a deliberate limitation of this particular
approach. It basically assumes you are using bytes-based shuffle under the
hood, so invoking a language-specific comparator would be something new. I
know +Ben had some ideas about this.


Kenn

On Wed, May 30, 2018 at 8:53 AM David Morávek <[email protected]>
wrote:

> Thanks for pointing us the right direction. We'll try to prototype custom
> translation for Spark runner within next sprint. In order to do so, I have
> few questions:
>
> 1) Should we move SortValues tranform to beam-sdks-java-core or just add
> it as spark runner dependency?
> 2) I think we should try to make SortValues more flexible by letting user
> to provide custom value comparator, sorting lexicographically by secondary
> key may be painful in some use cases. What do you think?
>
> side note:
> I agree that usually top n values, that fit in memory are sufficient and
> we can combine them using PQ, but in practice we still have pipelines that
> need to do top N selection over data that do not fit in memory for a single
> key.
>
> D.
>
> On Wed, May 30, 2018 at 5:28 PM, Lukasz Cwik <[email protected]> wrote:
>
>> SortValues does not have a defined & documented URN yet. Once a Runner is
>> providing such an override, it will happen. No runner publicly provides one
>> to my knowledge.
>>
>> On Wed, May 30, 2018 at 8:08 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> I can see a few usability issues here. Totally agree w/ Luke, just
>>> noting:
>>>
>>>  - The naming is slightly misleading because SortValues is actually
>>> already GBK+SortValues.
>>>  - It also makes things look less supported when they are in the
>>> extensions/ folder. I'd say we should have a better place to put such a
>>> library if it is the official public implementation. The word "extensions"
>>> doesn't seem particularly accurate or meaningful to me.
>>>
>>> Q: Does SortValues have a defined & documented URN yet?
>>>
>>> Kenn
>>>
>>> On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Each runner can choose to override the SortValues PTransform with their
>>>> own internal offering. For example Spark overrides global combine[1] during
>>>> pipeline translation. If Spark detected the SortValues PTransform during
>>>> translation, it could override the offering with something that used
>>>> repartitionAndSortWithinPartitions.
>>>>
>>>> GroupByKeyAndSortValuesOnly inside Dataflow exists to support a
>>>> specific use case. Users should rely on SortValues as it is the public
>>>> implementation for sorting.
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
>>>>
>>>> As a side note, its uncommon where you need to sort all values, usually
>>>> top 100 suffices and can be implemented much more efficiently with a
>>>> combiner when compared to sorting.
>>>>
>>>> On Wed, May 30, 2018 at 3:38 AM <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>  I have question I am trying to do translation in dsl-euphoria for
>>>>> “GroupByKey with sorted values within key” to Beam. I am aware of java sdk
>>>>> extensions SortValues, but it doesn’t have sufficient abstraction for
>>>>> runners.
>>>>>
>>>>> I noticed that in DataflowRunner there is translation of batch
>>>>> GroupByKey to GroupByKeyAndSortValuesOnly but is it considered to have it
>>>>> in beam core so for example SparkRunner could translate “GroupByKey with
>>>>> sorted values within key” with their internals such as
>>>>> repartitionAndSortWithinPartitions.
>>>>>
>>>>> Thank you.
>>>>> Marek Simunek
>>>>>
>>>>
>

Re: GroupByKey with sorted values within key

Reply via email to