Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

Mikhail Gryzykhin Tue, 17 Dec 2019 01:45:18 -0800

Replies inline.

On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <je...@seznam.cz> wrote:


> Hi,
>
> I actually thought that the proposal refers to Dataflow only. If this is
> supposed to be general, can we remove the Dataflow/Windmill specific parts
> and replace them with generic ones?
>
I'll look into rephrasing the doc so that Dataflow/Windmill serves only as an
example.

> I'd have two more questions:
>
>  a) the proposal is named "Slowly changing", why is the rate of change
> essential to the proposal? Once running on event time, that should not
> matter, or what am I missing?
>
The proposal suggests taking a full snapshot of the data on every re-read.
This is generally expensive, so setting the re-read interval too short might
cause issues. Otherwise, the rate of change is not essential.
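
A rough illustration of the trade-off, in plain Python rather than the Beam
API. This is only a sketch: `take_snapshots`, `enrich`, `read_table` and the
table contents are made-up names and data, not anything from the design doc.
Each refresh copies the whole table, and each event is enriched from the
latest snapshot taken at or before its event time.

```python
from bisect import bisect_right
from typing import Callable, Dict, List, Optional, Tuple

Snapshot = Dict[str, str]


def take_snapshots(
    read_table: Callable[[int], Snapshot], start: int, end: int, interval: int
) -> List[Tuple[int, Snapshot]]:
    """Re-read the full table every `interval` seconds of event time."""
    # One full copy per refresh: halving `interval` doubles the read cost.
    return [(t, dict(read_table(t))) for t in range(start, end, interval)]


def enrich(
    events: List[Tuple[int, str]], snapshots: List[Tuple[int, Snapshot]]
) -> List[Tuple[int, str, Optional[str]]]:
    """Join each (timestamp, key) event with the snapshot covering its timestamp."""
    refresh_times = [t for t, _ in snapshots]
    out = []
    for ts, key in events:
        # Index of the latest snapshot taken at or before the event time.
        idx = bisect_right(refresh_times, ts) - 1
        if idx < 0:
            continue  # no snapshot exists yet for this event
        _, table = snapshots[idx]
        out.append((ts, key, table.get(key)))
    return out


def read_table(t: int) -> Snapshot:
    # Hypothetical dimension table: the value for "user1" changes at t=60.
    return {"user1": "bronze" if t < 60 else "gold", "user2": "silver"}


snapshots = take_snapshots(read_table, start=0, end=120, interval=30)
events = [(10, "user1"), (65, "user1"), (95, "user2")]
enriched = enrich(events, snapshots)
```

With a 30-second interval this takes four full snapshots over 120 seconds; a
1-second interval would take 120, which is where the cost concern comes from.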

>  b) The description says: 'User wants to solve a stream enrichment
> problem. In brief request sounds like: ”I want to enrich each event in this
> stream by corresponding data from given table.”'. That is understandable,
> but would it be better to enable the user to express this intent directly
> (via Join operation)? The actual implementation might be runner (and
> input!) specific. The analogy is that when doing group-by-key operation,
> runner can choose hash grouping or sort-merge grouping, but that is not
> (directly) expressed in user code. I'm not saying that we should not have
> low-level transforms, just asking if it would be better to leave this
> decision to the runner (at least in some cases). It might be the case that
> we want to make core SDK as low level as possible (and as reasonable), I
> just want to make sure that that is really the intent.
>
The idea is to add a basic operation with as small a change as possible to
the current API.
The ultimate goal is to have a Join/GBK operator that chooses the proper
strategy. However, I don't think we yet have the proper tools, or a clear
view of how to choose the best strategy.

> Thanks for the proposal!
>
> Jan
> On 12/17/19 12:01 AM, Kenneth Knowles wrote:
>
> I want to highlight that this design works for definitely more runners
> than just Dataflow. I see two pieces of it that I want to bring onto the
> thread:
>
> 1. A new kind of "unbounded source" which is a periodic refresh of a
> bounded source, and use that as a side input. Each main input element has a
> window that maps to a specific refresh of the side input.
> 2. Distributed map side inputs: supporting very large lookup tables, but
> with consistency challenges. Even the part about "windmill API" probably
> applies to other runners
>
> So I hope the title and "Objective" section do not cause people to stop
> reading.
>
> Kenn
>
> On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <mig...@google.com>
> wrote:
>
>> +some people explicitly
>>
>> Can you please check on the doc and comment if it looks fine?
>>
>> Thank you,
>> --Mikhail
>>
>> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <mig...@google.com>
>> wrote:
>>
>>> "Good news, everyone-"
>>> ―Farnsworth
>>>
>>> Hi everyone,
>>>
>>> Recently, I was looking into relaxing limitations on side inputs in
>>> Dataflow runner. As part of it, I came up with design proposal for
>>> standardizing slowly changing dimensions use case in Beam and relevant
>>> changes to add support for distributed map side inputs.
>>>
>>> Please review and comment on design doc.
>>> <https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg>
>>>  [1]
>>>
>>> Thank you,
>>> Mikhail.
>>>
>>> -----
>>>
>>> [1]
>>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
>>>
>>>
