https://github.com/apache/beam/pull/34051 ready for comments.
On Fri, Feb 21, 2025 at 12:59 PM Robert Bradshaw <rober...@google.com> wrote:

> Sounds like consensus. (I do think option 2 might be useful for other
> contextful params, but that can be deferred.)
>
> I'll put together a PR.
>
> On Fri, Feb 21, 2025 at 11:41 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> +1 to option 1
>>
>> On Fri, Feb 21, 2025 at 11:06 AM XQ Hu via dev <dev@beam.apache.org>
>> wrote:
>>
>>> +1 to ExtractWindowingInfo
>>>
>>> On Fri, Feb 21, 2025 at 10:55 AM Danny McCormick via dev
>>> <dev@beam.apache.org> wrote:
>>>
>>>> +1 to `ReifyWindowingInfo` (or maybe `ExtractWindowingInfo` or
>>>> `GetWindowing` is a little more understandable to the average user). I
>>>> definitely prefer something which doesn't require extending the set of
>>>> concepts/advanced usages we're exposing through Yaml, especially for a
>>>> feature that I think will not be heavily used (but if you need it, you
>>>> need it).
>>>>
>>>> As a rule, I think we should prefer a simple base language here with
>>>> higher level capabilities available through transforms when possible. It
>>>> will be a little more verbose, but more readable/searchable/learnable, and
>>>> it will preserve the base simplicity for the bulk of use cases.
>>>>
>>>> Thanks,
>>>> Danny
>>>>
>>>> On Thu, Feb 20, 2025 at 3:21 PM Robert Bradshaw via dev
>>>> <dev@beam.apache.org> wrote:
>>>>
>>>>> Currently our YAML API supports basic streaming, including setting
>>>>> windowing for aggregations, but there's no way to retrieve the
>>>>> windowing/timestamp metadata (short of stepping out of YAML proper and
>>>>> using Python, Java, etc. DoFn). It would probably be quite useful to
>>>>> have a more native way of getting this.
>>>>>
>>>>> One option would be to add a built-in transform to extract this
>>>>> information, e.g. something like
>>>>>
>>>>> - type: ReifyWindowingInfo
>>>>>   config:
>>>>>     new_field1: timestamp
>>>>>     new_field2: window
>>>>>     new_field3: window.end
>>>>>     ...
>>>>>
>>>>> The possible values on the RHS of the map would be a fixed list;
>>>>> supporting things like window.end or pane_info.index would be desirable
>>>>> as their types are schema-compatible (unlike a raw Window or PaneInfo
>>>>> object). One could then use this information in downstream transforms.
>>>>>
>>>>> A second option would be to enhance MapToFields to make this
>>>>> information available. Currently this transform looks like
>>>>>
>>>>> - type: MapToFields
>>>>>   config:
>>>>>     language: python  # java is also supported; javascript, etc. conceivable
>>>>>     fields:
>>>>>       output_field1: input_field + another_input_field
>>>>>       output_field2:
>>>>>         callable: |
>>>>>           def my_inline_function(row):
>>>>>             return row.input_field + row.another_input_field
>>>>>       ...
>>>>>
>>>>> The first case, called the "expression" case, is syntactic sugar that
>>>>> roughly reifies all[1] input fields as locals and translates to the
>>>>> second.
>>>>>
>>>>> For the second case, one could treat this similarly to the process
>>>>> method of a DoFn and allow additional annotated arguments (e.g.
>>>>> DoFn.TimestampParam in Python, @Timestamp annotation for Java). We would
>>>>> detect and propagate this up to the generated DoFn.
>>>>>
>>>>> We could consider supporting the "expression" case via some magic
>>>>> variables (or a special namespace) or require the second form for this
>>>>> capability.
>>>>>
>>>>> We could, of course, offer both options as well.
>>>>>
>>>>> Anyone have any opinions or other ideas here?
>>>>>
>>>>> - Robert
>>>>>
>>>>> [1] As an optimization we only capture those locals that appear
>>>>> textually in the body of the expression.
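
For anyone skimming the thread, here is a rough plain-Python sketch of how option 1 (a config mapping new field names to a fixed list of windowing properties) might behave. It does not use Beam and is not the code in the PR: the `Window`, `PaneInfo`, and `WindowedValue` dataclasses, the `ACCESSORS` table, and `reify_windowing_info` are hypothetical stand-ins invented for illustration, and the sketch assumes only schema-compatible leaves such as `window.end` are allowed on the right-hand side.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Beam's windowing metadata (not Beam classes).
@dataclass(frozen=True)
class Window:
    start: float
    end: float


@dataclass(frozen=True)
class PaneInfo:
    index: int


@dataclass(frozen=True)
class WindowedValue:
    value: dict        # the element's fields
    timestamp: float   # event timestamp
    window: Window
    pane_info: PaneInfo


# The "fixed list" of allowed right-hand sides, each with its accessor.
# Assumption of this sketch: only schema-compatible leaves are offered.
ACCESSORS = {
    "timestamp": lambda wv: wv.timestamp,
    "window.start": lambda wv: wv.window.start,
    "window.end": lambda wv: wv.window.end,
    "pane_info.index": lambda wv: wv.pane_info.index,
}


def reify_windowing_info(wv, config):
    """Return a copy of wv.value with the requested metadata fields added."""
    unknown = sorted(set(config.values()) - set(ACCESSORS))
    if unknown:
        raise ValueError(f"Unsupported windowing info: {unknown}")
    row = dict(wv.value)
    for new_field, spec in config.items():
        if new_field in row:
            raise ValueError(f"Field {new_field!r} already exists")
        row[new_field] = ACCESSORS[spec](wv)
    return row


if __name__ == "__main__":
    element = WindowedValue(
        value={"word": "a"},
        timestamp=12.5,
        window=Window(start=10.0, end=20.0),
        pane_info=PaneInfo(index=0),
    )
    config = {"ts": "timestamp", "w_end": "window.end"}
    print(reify_windowing_info(element, config))
```

Rejecting unknown right-hand sides up front mirrors the "fixed list" idea from the proposal: a typo in the config fails loudly instead of silently producing a missing field.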