inline On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <je...@seznam.cz> wrote:
> Hi, > > I actually thought that the proposal refers to Dataflow only. If this is > supposed to be general, can we remove the Dataflow/Windmill specific parts > and replace them with generic ones? > I'll look into rephrasing doc to keep Dataflow/Windmill as example. > I'd have two more questions: > > a) the proposal is named "Slowly changing", why is the rate of change > essential to the proposal? Once running on event time, that should not > matter, or what am I missing? > Within this proposal, it is suggested to make a full snapshot of data on every re-read. This is generally expensive and setting time event to short interval might cause issues. Otherwise it is not essential. > b) The description says: 'User wants to solve a stream enrichment > problem. In brief request sounds like: ”I want to enrich each event in this > stream by corresponding data from given table.”'. That is understandable, > but would it be better to enable the user to express this intent directly > (via Join operation)? The actual implementation might be runner (and > input!) specific. The analogy is that when doing group-by-key operation, > runner can choose hash grouping or sort-merge grouping, but that is not > (directly) expressed in user code. I'm not saying that we should not have > low-level transforms, just asking if it would be better to leave this > decision to the runner (at least in some cases). It might be the case that > we want to make core SDK as low level as possible (and as reasonable), I > just want to make sure that that is really the intent. > The idea is to add basic operation with as small change as possible for current API. Ultimate goal is to have a Join/GBK operator that will choose proper strategy. However, I don't think that we have proper tools and view of how to choose best strategy at hand as of yet. > Thanks for the proposal! > > Jan > On 12/17/19 12:01 AM, Kenneth Knowles wrote: > > I want to highlight that this design works for definitely more runners > than just Dataflow. I see two pieces of it that I want to bring onto the > thread: > > 1. A new kind of "unbounded source" which is a periodic refresh of a > bounded source, and use that as a side input. Each main input element has a > window that maps to a specific refresh of the side input. > 2. Distributed map side inputs: supporting very large lookup tables, but > with consistency challenges. Even the part about "windmill API" probably > applies to other runners > > So I hope the title and "Objective" section do not cause people to stop > reading. > > Kenn > > On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <mig...@google.com> > wrote: > >> +some people explicitly >> >> Can you please check on the doc and comment if it looks fine? >> >> Thank you, >> --Mikhail >> >> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <mig...@google.com> >> wrote: >> >>> "Good news, everyone-" >>> ―Farnsworth >>> >>> Hi everyone, >>> >>> Recently, I was looking into relaxing limitations on side inputs in >>> Dataflow runner. As part of it, I came up with design proposal for >>> standardizing slowly changing dimensions use case in Beam and relevant >>> changes to add support for distributed map side inputs. >>> >>> Please review and comment on design doc. >>> <https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg> >>> [1] >>> >>> Thank you, >>> Mikhail. >>> >>> ----- >>> >>> [1] >>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg >>> >>>