Hi all.

Thinking about why I needed such a function: in my work I had to
normalize the identity of instances that carry multiple identifiers. In
other words, it is a name-matching (identity resolution) task.

My first idea was to do this by customizing the "equals" method of the
object used as the key of "GroupByKey".

However, that turned out to be difficult, as discussed here:

https://stackoverflow.com/questions/55413635/custom-key-for-grouping-in-dataflow/55447135?noredirect=1#comment97669360_55447135

Instead, I thought it could be realized by letting each GroupByKey handle
only one key, and chaining as many of these steps as there are keys (3 in
my example).

The serial processing is a chain of multiple ParDos and GroupByKeys.
Duplicates may be produced along the way, so a combine step is performed
to remove them.
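
To make the idea above concrete, here is a minimal sketch in plain Python
(no Beam API is used). It is not the implementation from BEAM-7358; the
function name group_by_any_key and the "smallest label wins" rule are
assumptions made only for illustration. Each key position gets its own
group-by stage, and the stages are repeated until the labels stop
changing, because a single pass over the keys is not always enough to
join records that are linked only indirectly.

```python
from collections import defaultdict


def group_by_any_key(records):
    """Group records that share a value at any key position.

    `records` is a list of equal-length tuples of keys; None means "no key".
    Returns a list of sets of record indices (hypothetical helper).
    """
    labels = list(range(len(records)))  # every record starts in its own group
    num_keys = len(records[0]) if records else 0

    changed = True
    while changed:  # repeat the chain of stages until nothing changes
        changed = False
        for pos in range(num_keys):  # one group-by stage per key position
            buckets = defaultdict(list)
            for idx, rec in enumerate(records):
                if rec[pos] is not None:
                    buckets[rec[pos]].append(idx)
            # Within each bucket, every member adopts the smallest label.
            for members in buckets.values():
                smallest = min(labels[i] for i in members)
                for i in members:
                    if labels[i] != smallest:
                        labels[i] = smallest
                        changed = True

    groups = defaultdict(set)
    for idx, label in enumerate(labels):
        groups[label].add(idx)
    return sorted(groups.values(), key=min)


records = [
    ("a", "x", "1"),
    ("b", "x", "2"),  # shares the second key with record 0
    ("b", "y", "3"),  # shares the first key with record 1
    ("c", "z", "4"),  # shares nothing
]
result = group_by_any_key(records)
```

In this small example, records 0, 1 and 2 end up in one group and record 3
stays alone. The number of repetitions needed depends on the data, which
is why a fixed chain of stages may or may not be sufficient.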

When I tried it with my local data set, the amount of computation appeared
to grow as O(log n).


> like it needs a transitive closure


It seems that this task can be generalized as a graph problem, but I was
thinking about how to compute it with Beam primitives.
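
For comparison, the graph view can be sketched on a single machine with a
union-find structure: records are nodes, and two records are connected
when they share a value at the same key position. This is only an
illustration in plain Python, not something Beam provides; the name
connected_groups is hypothetical, and the sketch assumes all records fit
in memory.

```python
def connected_groups(records):
    """Return connected components of records linked by any shared key.

    `records` is a list of tuples of keys; None means "no key".
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (key position, key value) -> first record index seen
    for idx, rec in enumerate(records):
        for pos, value in enumerate(rec):
            if value is None:
                continue
            other = first_seen.setdefault((pos, value), idx)
            union(idx, other)

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), set()).add(idx)
    return sorted(groups.values(), key=min)


components = connected_groups([
    ("a", "x", "1"),
    ("b", "x", "2"),
    ("b", "y", "3"),
    ("c", "z", "4"),
])
```

Here the first three records form one component and the last record forms
its own. The hard part in a distributed pipeline is that this structure
relies on shared mutable state, which is what makes expressing it with
grouping stages alone non-trivial.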


> If that's the case then, what does the integer do when creating the
> GroupByMultiKey?

In my example, the integer values (represented as Strings) are the values
to be grouped.


> Even though we don't support iteration, one could have a known upperbound

Is there an upper limit on the number of steps that make up a pipeline? I
did not know that.
In my approach, the number of steps grows in proportion to the number of
keys (3 in my example), so it will cause problems if there are many keys.

> This looks like a really specialized use case

Certainly, this use case may be a little unusual.
There were multiple conditions for identity, and if any one of them
matches, the records are assumed to be the same instance. I needed to
perform this processing on a large amount of data.

> Unfortunately it is also not likely to implement in a scalable way using
> Beam primitives

Well, I had not thought about generalizing this problem as a graph problem.
However, when I measured it in my work, the computational complexity was
"O(log n)". This may be a result that holds only under special conditions...




On Sat, Jun 8, 2019 at 1:48 Lukasz Cwik <[email protected]> wrote:

> Even though we don't support iteration, one could have a known upperbound
> and "unroll" the loop to a fixed number of iterations statically before the
> pipeline is run but I agree with Eugene on his other points.
>
> On Fri, Jun 7, 2019 at 3:59 AM Robert Burke <[email protected]> wrote:
>
>> I'm not sure I understand the desired properties of GroupByMultiKey.
>>
>> Offhand, am I right interpreting GroupByMultiKey as essentially forming a
>> graph of the keys based on the MultiKeys nodes, and the number of resulting
>> iterables is based on the components of the graph.
>>
>> If that's the case then, what does the integer do when creating the
>> GroupByMultiKey?
>>
>> In the example, it seems to be saying "I'd like 3 groups" but wouldn't
>> that be a property of the implicit connected graphs of MultiKeys?
>>
>> Thank you very much!
>>
>>
>> On Fri, Jun 7, 2019, 10:14 AM Jan Lukavský <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> that sounds interesting, but it seems to be computationally intensive
>>> and might not be well scalable, if I understand it correctly. It looks
>>> like it needs a transitive closure, am I right?
>>>
>>>   Jan
>>>
>>> On 6/7/19 11:17 AM, i.am.moai wrote:
>>> > Hello everyone, nice to meet you
>>> >
>>> > I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and
>>> > python as my favorite language .
>>> >
>>> > I have no experience with OSS development, but as I use DataFlow at
>>> > work, I want to contribute to the development of Beam.
>>> >
>>> > In fact, there is a feature I want to develop, and now I have the
>>> > source code on my local PC.
>>> >
>>> > The feature I want to create is an extension of GroupBy to a multiple
>>> > key, which realizes more complex grouping.
>>> >
>>> > https://issues.apache.org/jira/browse/BEAM-7358
>>> >
>>> > Everyone, could you give me an opinion on this intent?
>>> >
>>>
>>
