I've narrowed down the topic. The doc no longer includes the Dataflow-specific part and is general to all runners. Please visit
<https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg>.
Changes:
* Changed the title.
* Narrowed the topic to slowly changing dimensions support only. This
should simplify discussion and make it easier to come to a conclusion.

Looking for comments on:
* The API for the new feature. Currently there's a single comment with an
alternative approach.
* General idea review.

Regards,
--Mikhail

On Wed, Dec 18, 2019 at 3:14 PM Kenneth Knowles <k...@apache.org> wrote:

> I do think that the implementation concerns around larger side inputs
> are relevant to most runners. Ideally there would be no model change
> necessary. Triggers are harder and bring in consistency concerns, which
> are even more likely to be relevant to all runners.
>
> Kenn
>
> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik <lc...@google.com> wrote:
>
>> Most of the doc is about how to support distributed side inputs in
>> Dataflow. It doesn't really cover how the Beam model's triggers
>> (accumulating, discarding, retraction) affect what the "contents" of a
>> PCollection are at a point in time, nor how this proposal for a limited
>> set of side input shapes can work to support larger side inputs in
>> Dataflow.
>>
>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Mikhail,
>>>
>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>
>>> inline
>>>
>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi,
>>>>
>>>> I actually thought that the proposal refers to Dataflow only. If this
>>>> is supposed to be general, can we remove the Dataflow/Windmill-specific
>>>> parts and replace them with generic ones?
>>>
>>> I'll look into rephrasing the doc to keep Dataflow/Windmill as an
>>> example.
>>>
>>> Cool, thanks!
>>>
>>>> I'd have two more questions:
>>>>
>>>> a) The proposal is named "Slowly changing". Why is the rate of change
>>>> essential to the proposal? Once running on event time, that should
>>>> not matter, or what am I missing?
>>>
>>> Within this proposal, it is suggested to take a full snapshot of the
>>> data on every re-read. This is generally expensive, and setting the
>>> refresh event to a short interval might cause issues. Otherwise it is
>>> not essential.
>>>
>>> Understood. This relates to table-stream duality, where the
>>> requirements might relax once you don't have to convert the table to a
>>> stream by re-reading it, but can instead retrieve updates as you go
>>> (an example would be reading directly from Kafka or any other "commit
>>> log" abstraction).
>>>
>>>> b) The description says: 'User wants to solve a stream enrichment
>>>> problem. In brief, the request sounds like: "I want to enrich each
>>>> event in this stream with the corresponding data from a given
>>>> table."' That is understandable, but would it be better to let the
>>>> user express this intent directly (via a Join operation)? The actual
>>>> implementation might be runner (and input!) specific. The analogy is
>>>> that when doing a group-by-key operation, the runner can choose hash
>>>> grouping or sort-merge grouping, but that is not (directly) expressed
>>>> in user code. I'm not saying that we should not have low-level
>>>> transforms, just asking if it would be better to leave this decision
>>>> to the runner (at least in some cases). It might be the case that we
>>>> want to make the core SDK as low level as possible (and as
>>>> reasonable); I just want to make sure that that is really the intent.
>>>
>>> The idea is to add a basic operation with as small a change to the
>>> current API as possible. The ultimate goal is to have a Join/GBK
>>> operator that will choose the proper strategy. However, I don't think
>>> we yet have the proper tools, or a clear enough picture, to choose the
>>> best strategy.
>>>
>>> OK, cool. That is where I would find it very useful to have some sort
>>> of "goals" that we are targeting. I agree that there are some pieces
>>> missing in the puzzle as of now, but it would be good to know what
>>> these pieces are and what needs to be done to fulfill our goals. That
>>> is probably not related to the discussion of this proposal, though,
>>> but more to the concept of a BIP or similar.
>>>
>>> Thanks for the explanation.
>>>
>>>> Thanks for the proposal!
>>>>
>>>> Jan
>>>>
>>>> On 12/17/19 12:01 AM, Kenneth Knowles wrote:
>>>>
>>>> I want to highlight that this design definitely works for more
>>>> runners than just Dataflow. I see two pieces of it that I want to
>>>> bring onto the thread:
>>>>
>>>> 1. A new kind of "unbounded source" which is a periodic refresh of a
>>>> bounded source, used as a side input. Each main input element has a
>>>> window that maps to a specific refresh of the side input.
>>>> 2. Distributed map side inputs: supporting very large lookup tables,
>>>> but with consistency challenges. Even the part about the "windmill
>>>> API" probably applies to other runners.
>>>>
>>>> So I hope the title and "Objective" section do not cause people to
>>>> stop reading.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <mig...@google.com>
>>>> wrote:
>>>>
>>>>> +some people explicitly
>>>>>
>>>>> Can you please check on the doc and comment if it looks fine?
>>>>>
>>>>> Thank you,
>>>>> --Mikhail
>>>>>
>>>>> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <mig...@google.com>
>>>>> wrote:
>>>>>
>>>>>> "Good news, everyone-"
>>>>>> ―Farnsworth
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Recently, I was looking into relaxing the limitations on side
>>>>>> inputs in the Dataflow runner. As part of that, I came up with a
>>>>>> design proposal for standardizing the slowly changing dimensions
>>>>>> use case in Beam, along with the changes needed to add support for
>>>>>> distributed map side inputs.
>>>>>>
>>>>>> Please review and comment on the design doc [1].
>>>>>>
>>>>>> Thank you,
>>>>>> Mikhail.
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
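
To make the pattern concrete, below is a minimal sketch of Kenn's point
(1) -- a periodic refresh of a bounded source used as a side input --
written against the existing Beam Java SDK (the "slowly updating global
window side inputs" pattern) rather than the API the doc proposes.
readTableSnapshot(), the tick interval, and the key scheme are
hypothetical placeholders; a real pipeline would read the dimension table
through an IO connector.

import java.util.HashMap;
import java.util.Map;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class SlowlyChangingSideInputSketch {

  // Hypothetical stand-in for re-reading the whole dimension table
  // (in practice a JDBC/BigQuery/file read). The full re-read on every
  // tick is the cost discussed above.
  static Map<String, String> readTableSnapshot() {
    Map<String, String> snapshot = new HashMap<>();
    snapshot.put("k1", "v1");
    snapshot.put("k2", "v2");
    return snapshot;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // A periodic "tick" drives a refresh; each trigger firing replaces
    // the singleton view with the latest snapshot of the table.
    PCollectionView<Map<String, String>> dimension =
        p.apply("Ticks",
                GenerateSequence.from(0).withRate(1, Duration.standardMinutes(10)))
            .apply(Window.<Long>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
            .apply("ReadSnapshot", ParDo.of(new DoFn<Long, Map<String, String>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                c.output(readTableSnapshot());
              }
            }))
            .apply(View.asSingleton());

    // Hypothetical unbounded main input, standing in for the event stream.
    PCollection<String> events =
        p.apply("Events",
                GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((Long n) -> "k" + (n % 2 + 1)));

    // Enrichment: each event looks up the most recently fired snapshot.
    events.apply("Enrich", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        Map<String, String> dim = c.sideInput(dimension);
        c.output(c.element() + " -> " + dim.getOrDefault(c.element(), "?"));
      }
    }).withSideInputs(dimension));

    p.run();
  }
}

Note what this sketch does not give you, which is what the proposal is
about: the table is re-read wholesale on every tick, the snapshot must
fit in memory as a singleton view (no distributed map side input), and
which snapshot a given main-input element observes depends on trigger
firing order -- the consistency concern raised in the thread above.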