I've narrowed down the topic. The doc no longer includes the Dataflow-specific part and is general to all runners. Please visit
<https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg>.
Changes:
* Changed the title.
* Narrowed the topic to slowly changing dimensions support only. This
should simplify discussion and make it easier to come to a conclusion.

Looking for comments on:
* The API for the new feature. Currently there's a single comment with an
alternative approach.
* General idea review.

Regards,
--Mikhail

On Wed, Dec 18, 2019 at 3:14 PM Kenneth Knowles <k...@apache.org> wrote:

> I do think that the implementation concerns around larger side inputs
> are relevant to most runners. Ideally there would be no model change
> necessary. Triggers are harder and bring in consistency concerns, which
> are even more likely to be relevant to all runners.
>
> Kenn
>
> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik <lc...@google.com> wrote:
>
>> Most of the doc is about how to support distributed side inputs in
>> Dataflow. It doesn't really cover how the Beam model's triggers
>> (accumulating, discarding, retraction) affect what the "contents" of a
>> PCollection are at a point in time, nor how this proposal for a limited
>> set of side input shapes can work to support larger side inputs in
>> Dataflow.
>>
>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Mikhail,
>>>
>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>
>>> inline
>>>
>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi,
>>>>
>>>> I actually thought that the proposal refers to Dataflow only. If this
>>>> is supposed to be general, can we remove the Dataflow/Windmill-specific
>>>> parts and replace them with generic ones?
>>>
>>> I'll look into rephrasing the doc to keep Dataflow/Windmill as an
>>> example.
>>>
>>> Cool, thanks!
>>>
>>>> I'd have two more questions:
>>>>
>>>> a) The proposal is named "Slowly changing". Why is the rate of change
>>>> essential to the proposal? Once running on event time, that should
>>>> not matter, or what am I missing?
>>>
>>> Within this proposal, it is suggested to take a full snapshot of the
>>> data on every re-read. This is generally expensive, and setting the
>>> refresh event to a short interval might cause issues. Otherwise it is
>>> not essential.
>>>
>>> Understood. This relates to table-stream duality, where the
>>> requirements might relax once you don't have to convert the table to a
>>> stream by re-reading it, but can instead retrieve updates as you go
>>> (an example would be reading directly from Kafka or any other "commit
>>> log" abstraction).
>>>
>>>> b) The description says: 'User wants to solve a stream enrichment
>>>> problem. In brief, the request sounds like: "I want to enrich each
>>>> event in this stream with the corresponding data from a given
>>>> table."' That is understandable, but would it be better to let the
>>>> user express this intent directly (via a Join operation)? The actual
>>>> implementation might be runner (and input!) specific. The analogy is
>>>> that when doing a group-by-key operation, the runner can choose hash
>>>> grouping or sort-merge grouping, but that is not (directly) expressed
>>>> in user code. I'm not saying that we should not have low-level
>>>> transforms, just asking if it would be better to leave this decision
>>>> to the runner (at least in some cases). It might be the case that we
>>>> want to make the core SDK as low level as possible (and as
>>>> reasonable); I just want to make sure that that is really the intent.
>>>
>>> The idea is to add a basic operation with as small a change to the
>>> current API as possible. The ultimate goal is to have a Join/GBK
>>> operator that will choose the proper strategy. However, I don't think
>>> we yet have the proper tools, or a clear enough picture, to choose the
>>> best strategy.
>>>
>>> OK, cool. That is where I would find it very useful to have some sort
>>> of "goals" that we are targeting. I agree that there are some pieces
>>> missing in the puzzle as of now, but it would be good to know what
>>> these pieces are and what needs to be done to fulfill our goals. That
>>> is probably not related to the discussion of this proposal, though,
>>> but more to the concept of a BIP or similar.
>>>
>>> Thanks for the explanation.
>>>
>>>> Thanks for the proposal!
>>>>
>>>> Jan
>>>>
>>>> On 12/17/19 12:01 AM, Kenneth Knowles wrote:
>>>>
>>>> I want to highlight that this design definitely works for more
>>>> runners than just Dataflow. I see two pieces of it that I want to
>>>> bring onto the thread:
>>>>
>>>> 1. A new kind of "unbounded source" which is a periodic refresh of a
>>>> bounded source, used as a side input. Each main input element has a
>>>> window that maps to a specific refresh of the side input.
>>>> 2. Distributed map side inputs: supporting very large lookup tables,
>>>> but with consistency challenges. Even the part about the "windmill
>>>> API" probably applies to other runners.
>>>>
>>>> So I hope the title and "Objective" section do not cause people to
>>>> stop reading.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <mig...@google.com>
>>>> wrote:
>>>>
>>>>> +some people explicitly
>>>>>
>>>>> Can you please check on the doc and comment if it looks fine?
>>>>>
>>>>> Thank you,
>>>>> --Mikhail
>>>>>
>>>>> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <mig...@google.com>
>>>>> wrote:
>>>>>
>>>>>> "Good news, everyone-"
>>>>>> ―Farnsworth
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Recently, I was looking into relaxing the limitations on side
>>>>>> inputs in the Dataflow runner. As part of that, I came up with a
>>>>>> design proposal for standardizing the slowly changing dimensions
>>>>>> use case in Beam, along with the changes needed to add support for
>>>>>> distributed map side inputs.
>>>>>>
>>>>>> Please review and comment on the design doc [1].
>>>>>>
>>>>>> Thank you,
>>>>>> Mikhail.
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
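
To make the pattern concrete, below is a minimal sketch of Kenn's point
(1) -- a periodic refresh of a bounded source used as a side input --
written against the existing Beam Java SDK (the "slowly updating global
window side inputs" pattern) rather than the API the doc proposes.
readTableSnapshot(), the tick interval, and the key scheme are
hypothetical placeholders; a real pipeline would read the dimension table
through an IO connector.

import java.util.HashMap;
import java.util.Map;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class SlowlyChangingSideInputSketch {

  // Hypothetical stand-in for re-reading the whole dimension table
  // (in practice a JDBC/BigQuery/file read). The full re-read on every
  // tick is the cost discussed above.
  static Map<String, String> readTableSnapshot() {
    Map<String, String> snapshot = new HashMap<>();
    snapshot.put("k1", "v1");
    snapshot.put("k2", "v2");
    return snapshot;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // A periodic "tick" drives a refresh; each trigger firing replaces
    // the singleton view with the latest snapshot of the table.
    PCollectionView<Map<String, String>> dimension =
        p.apply("Ticks",
                GenerateSequence.from(0).withRate(1, Duration.standardMinutes(10)))
            .apply(Window.<Long>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
            .apply("ReadSnapshot", ParDo.of(new DoFn<Long, Map<String, String>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                c.output(readTableSnapshot());
              }
            }))
            .apply(View.asSingleton());

    // Hypothetical unbounded main input, standing in for the event stream.
    PCollection<String> events =
        p.apply("Events",
                GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((Long n) -> "k" + (n % 2 + 1)));

    // Enrichment: each event looks up the most recently fired snapshot.
    events.apply("Enrich", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        Map<String, String> dim = c.sideInput(dimension);
        c.output(c.element() + " -> " + dim.getOrDefault(c.element(), "?"));
      }
    }).withSideInputs(dimension));

    p.run();
  }
}

Note what this sketch does not give you, which is what the proposal is
about: the table is re-read wholesale on every tick, the snapshot must
fit in memory as a singleton view (no distributed map side input), and
which snapshot a given main-input element observes depends on trigger
firing order -- the consistency concern raised in the thread above.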