On Fri, Nov 8, 2019 at 9:23 AM Luke Cwik <[email protected]> wrote: > > > On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <[email protected]> wrote: > >> >> >> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <[email protected]> wrote: >> >>> I did suggest one other alternative on Jincheng's PR[1] which was to >>> allow windowless values to be sent across the gRPC port. The SDK would then >>> be responsible for ensuring that the execution didn't access any properties >>> that required knowledge of the timestamp, pane or window. This is different >>> then adding the ValueOnlyWindowedValueCoder as a model coder because it >>> allows SDKs to pass around raw values to functions without any windowing >>> overhead which could be useful for things like the side input window >>> mapping or window merging functions we have. >>> >> >> When you say "pass around" what does it mean? If it is over the wire, >> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean >> the overhead of having the layer of boxing of WindowedValue? I would assume >> all non-value components of the WindowedValue from >> ValueOnlyWindowedValueCoder are pointers to a single shared immutable >> instance carried by the coder instance. >> > > I was referring to the layer of boxing of WindowedValue. My concern wasn't > the performance overhead of passing around a wrapper object but the > cognitive overhead of understanding why everything needs to be wrapped in a > windowed value. Since you have been working on SQL for some time, this > would be analogous to executing a UDF and making all the machinery around > it take WindowedValue<T> instead of T. > > >> I think raw values can already be passed to functions, no? The main thing >> is that elements in a PCollection always have a window, timestamp, and >> paneinfo. Not all values are elements. Is there a specific case you have in >> mind? I would not expect WindowMappingFn or window merging fn to be passing >> "elements" but just values of the appropriate type for the function. >> > > This is about the machinery around WindowMappingFn/WindowMergingFn. For > example the implementation around WindowMappingFn takes a > WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn > and then takes the result and wraps it in a WindowedValue<Window> and > returns that to the runner. >
I'm not familiar with this, but it sounds like it should not be necessary and is an implementation detail. Is there a model change necessary to avoid the unboxing/boxing? I would be surprised. Kenn > > >> >> >>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <[email protected]> >>> wrote: >>> >>>> I think there is some misunderstanding about what is meant by option >>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder >>>> whose window/timestamp/paneinfo coders are parameterized to be >>>> constant coders, but a WindowedValueCoder whose >>>> window/timestamp/paneinfo values are specified as constants in the >>>> coder. >>>> >>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized >>>> by a window, timestamp, and pane info instance >>>> >>>> The existing ValueOnlyWindowedValueCoder is literally >>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP, >>>> PaneInfo.NO_FIRING). Note in particular that using the existing >>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane >>>> info if it is use for the result of a GBK, which I think is the loss >>>> of consistency referred to here. >>>> >>> >> Yes, this is exactly what I am proposing and sounds like what "approach >> 2" is. I think this approach "just works". It is simpler and more efficient >> than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder, >> PaneInfoCoder)" which would require some overhead even if WindowCoder, >> TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes. >> >> Kenn >> >> >> >>> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <[email protected]> >>>> wrote: >>>> > >>>> > Thanks for your feedback and the valuable comments, Kenn & Robert! >>>> > >>>> > I think your comments are more comprehensive and enlighten me a lot. >>>> The two proposals which I mentioned above are to reuse the existing coder >>>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your >>>> comments, I think we can further abstract 'FullWindowedValueCoder' and >>>> 'ValueOnlyWindowedValueCoder', that is, we can rename >>>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of >>>> window/timestamp/pane configurable. Then we can remove >>>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders >>>> for window/timestamp/pane. >>>> > >>>> > I have replied your comments on the doc, and quick feedback as >>>> following: >>>> > >>>> > Regarding to "Approach 2: probably no SDK harness work / compatible >>>> with existing Beam model so no risk of introducing inconsistency",if we >>>> "just puts default window/timestamp/pane info on elements" and don't change >>>> the original coder, then the performance is not optimized. If we want to >>>> get the best performance, then the default coder of Window/timestamp/pane >>>> should be constant coder. In this case the SDK harnesses need to be aware >>>> of the constant coder and there will be some development work in the SDK >>>> harness. Besides, the SDK harness also needs to make the coders for >>>> window/timestamp/pane configurable and this will introduce some related >>>> changes, such as updating WindowedValueCoder._get_component_coders, etc. >>>> > >>>> > Regarding to "Approach 1: option a: if the SDK harness has to >>>> understand 'values without windows' then very large changes and high risk >>>> of introducing inconsistency (I eliminated many of these inconsistencies)", >>>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and >>>> all the SDK harness should be aware of this coder. There is no much changes >>>> actually. >>>> > >>>> > Please feel free to correct me if there is anyting incorrect. :) >>>> > >>>> > Besides, I'm not quite clear about the consistency issues you meant >>>> here. Could you please give me some hints about this? >>>> > >>>> > Best, >>>> > Jincheng >>>> > >>>> > Robert Bradshaw <[email protected]> 于2019年11月7日周四 上午3:38写道: >>>> >> >>>> >> Yes, the portability framework is designed to support this, and >>>> >> possibly even more efficient transfers of data than >>>> element-by-element >>>> >> as per the wire coder specified in the IO port operators. I left some >>>> >> comments on the doc as well, and would also prefer approach 2. >>>> >> >>>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <[email protected]> >>>> wrote: >>>> >> > >>>> >> > I think the portability framework is designed for this. The runner >>>> controls the coder on the grpc ports and the runner controls the process >>>> bundle descriptor. >>>> >> > >>>> >> > I commented on the doc. I think what is missing is analysis of >>>> scope of SDK harness changes and risk to model consistency >>>> >> > >>>> >> > Approach 2: probably no SDK harness work / compatible with >>>> existing Beam model so no risk of introducing inconsistency >>>> >> > >>>> >> > Approach 1: what are all the details? >>>> >> > option a: if the SDK harness has to understand "values >>>> without windows" then very large changes and high risk of introducing >>>> inconsistency (I eliminated many of these inconsistencies) >>>> >> > option b: if the coder just puts default >>>> window/timestamp/pane info on elements, then it is the same as approach 2, >>>> no work / no risk >>>> >> > >>>> >> > Kenn >>>> >> > >>>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun < >>>> [email protected]> wrote: >>>> >> >> >>>> >> >> Hi all, >>>> >> >> >>>> >> >> I am trying to make some improvements of portability framework to >>>> make it usable in other projects. However, we find that the coder between >>>> runner and harness can only be FullWindowedValueCoder. This means each time >>>> when sending a WindowedValue, we have to encode/decode timestamp, windows >>>> and pan infos. In some circumstances(such as using the portability >>>> framework in Flink), only values are needed between runner and harness. So, >>>> it would be nice if we can configure the coder and avoid redundant encoding >>>> and decoding between runner and harness to improve the performance. >>>> >> >> >>>> >> >> There are two approaches to solve this issue: >>>> >> >> >>>> >> >> Approach 1: Support ValueOnlyWindowedValueCoder between >>>> runner and harness. >>>> >> >> Approach 2: Add a "constant" window coder that embeds all >>>> the windowing information as part of the coder that should be used to wrap >>>> the value during decoding. >>>> >> >> >>>> >> >> More details can be found here [1]. >>>> >> >> >>>> >> >> As of the shortcomings of “Approach 2” which still need to >>>> encode/decode timestamp and pane infos, we tend to choose “Approach 1” >>>> which brings better performance and is more thorough. >>>> >> >> >>>> >> >> Welcome any feedback :) >>>> >> >> >>>> >> >> Best, >>>> >> >> Jincheng >>>> >> >> >>>> >> >> [1] >>>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing >>>> >> >> >>>> >>>
