Re: Best way to expose windowing information in Beam YAML

Robert Bradshaw via dev Fri, 21 Feb 2025 14:01:18 -0800

https://github.com/apache/beam/pull/34051 ready for comments.


On Fri, Feb 21, 2025 at 12:59 PM Robert Bradshaw <rober...@google.com>
wrote:

> Sounds like consensus. (I do think option 2 might be useful for other
> contextful params, but that can be deferred.)
>
> I'll put together a PR.
>
> On Fri, Feb 21, 2025 at 11:41 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> +1 to option 1
>>
>>
>>
>> On Fri, Feb 21, 2025 at 11:06 AM XQ Hu via dev <dev@beam.apache.org>
>> wrote:
>>
>>> +1 to ExtractWindowingInfo
>>>
>>> On Fri, Feb 21, 2025 at 10:55 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> +1 to `ReifyWindowingInfo` (or maybe `ExtractWindowingInfo` or
>>>> `GetWindowing` is a little more understandable to the average user). I
>>>> definitely prefer something which doesn't require extending the set of
>>>> concepts/advanced usages we're exposing through Yaml, especially for a
>>>> feature that I think will not be heavily used (but if you need it, you need
>>>> it).
>>>>
>>>> As a rule, I think we should prefer a simple base language here with
>>>> higher level capabilities available through transforms when possible. It
>>>> will be a little more verbose, but more readable/searchable/learnable, and
>>>> it will preserve the base simplicity for the bulk of use cases.
>>>>
>>>> Thanks,
>>>> Danny
>>>>
>>>> On Thu, Feb 20, 2025 at 3:21 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Currently our YAML API supports basic streaming, including setting
>>>>> windowing for aggregations, but there's no way to retrieve the
>>>>> windowing/timestamp metadata (short of stepping out of YAML proper and
>>>>> writing a DoFn in Python, Java, etc.). It would probably be quite useful
>>>>> to have a more native way of getting this.
>>>>>
>>>>> One option would be to add a built-in transform to extract this
>>>>> information, e.g. something like
>>>>>
>>>>> - type: ReifyWindowingInfo
>>>>>   config:
>>>>>     new_field1: timestamp
>>>>>     new_field2: window
>>>>>     new_field3: window.end
>>>>>     ...
>>>>>
>>>>> The possible values on the RHS of the map would be a fixed list;
>>>>> supporting things like window.end or pane_info.index would be desirable as
>>>>> their types are schema-compatible (unlike a raw Window or PaneInfo
>>>>> object). One could then use this information in downstream transforms.
>>>>>
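A minimal sketch of what the first option amounts to, written as plain Python rather than against the Beam API. The `Window` and `PaneInfo` classes are hypothetical stand-ins, and the exact contents of the fixed list (`ALLOWED_SOURCES`) are an assumption made for illustration; a bare `window` is left out here only because a raw window object is not schema-compatible.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for window and pane metadata; these are not Beam classes.
@dataclass(frozen=True)
class Window:
    start: float
    end: float


@dataclass(frozen=True)
class PaneInfo:
    index: int
    is_first: bool
    is_last: bool


# Assumed fixed list of values allowed on the right-hand side of the config map.
# Each entry returns a primitive, schema-compatible value.
ALLOWED_SOURCES = {
    "timestamp": lambda timestamp, window, pane_info: timestamp,
    "window.start": lambda timestamp, window, pane_info: window.start,
    "window.end": lambda timestamp, window, pane_info: window.end,
    "pane_info.index": lambda timestamp, window, pane_info: pane_info.index,
}


def reify_windowing_info(row, config, timestamp, window, pane_info):
    """Return a copy of row with one new field per config entry."""
    unknown = sorted(set(config.values()) - set(ALLOWED_SOURCES))
    if unknown:
        raise ValueError(f"Unsupported windowing values: {unknown}")
    out = dict(row)
    for new_field, source in config.items():
        out[new_field] = ALLOWED_SOURCES[source](timestamp, window, pane_info)
    return out
```

Rejecting unknown values up front keeps the set of accepted names small and explicit, which seems to be the point of a fixed list.
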
>>>>> A second option would be to enhance MapToFields to make this
>>>>> information available. Currently this transform looks like
>>>>>
>>>>> - type: MapToFields
>>>>>   config:
>>>>>     language: python  # java is also supported; javascript, etc. conceivable
>>>>>     fields:
>>>>>       output_field1: input_field + another_input_field
>>>>>       output_field2:
>>>>>         callable: |
>>>>>             def my_inline_function(row):
>>>>>                 return row.input_field + row.another_input_field
>>>>>         ...
>>>>>
>>>>> The first case, called the "expression" case, is syntactic sugar that
>>>>> roughly reifies all[1] input fields as locals and translates to the
>>>>> second.
>>>>>
>>>>> For the second case, one could treat this similarly to the process
>>>>> method of a DoFn and allow additional annotated arguments (e.g.
>>>>> DoFn.TimestampParam in Python, @Timestamp annotation for Java). We would
>>>>> detect and propagate this up to the generated DoFn.
>>>>>
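One way the "detect and propagate" step for callables could work is sketched below in plain Python. The sentinel objects `TIMESTAMP_PARAM` and `WINDOW_PARAM` and both helper functions are hypothetical stand-ins invented for this illustration; they only mimic the default-value style of DoFn parameters and are not Beam APIs.

```python
import inspect


class _Param:
    """Hypothetical sentinel marking an argument that wants context metadata."""

    def __init__(self, name):
        self.name = name


TIMESTAMP_PARAM = _Param("timestamp")
WINDOW_PARAM = _Param("window")


def requested_params(fn):
    """Map argument name -> metadata key for each sentinel default in fn's signature."""
    return {
        arg_name: param.default.name
        for arg_name, param in inspect.signature(fn).parameters.items()
        if isinstance(param.default, _Param)
    }


def call_with_context(fn, row, context):
    """Call fn(row, ...), filling in only the metadata that fn asked for."""
    kwargs = {arg: context[key] for arg, key in requested_params(fn).items()}
    return fn(row, **kwargs)


# Example user callable: seconds between element timestamp and a row field.
def lateness(row, ts=TIMESTAMP_PARAM):
    return ts - row["event_time"]
```

A callable that declares no sentinel defaults is called with the row alone, so existing callables would keep working unchanged under this scheme.
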
>>>>> We could consider supporting the "expression" case via some magic
>>>>> variables (or a special namespace) or require the second form for this
>>>>> capability.
>>>>>
>>>>> We could, of course, offer both options as well.
>>>>>
>>>>> Anyone have any opinions or other ideas here?
>>>>>
>>>>> - Robert
>>>>>
>>>>>
>>>>>
>>>>> [1] As an optimization we only capture those locals that appear
>>>>> textually in the body of the expression.
>>>>>
>>>>
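The capture optimization described in footnote [1] of the quoted message can be illustrated with Python's standard `ast` module. This is a sketch under stated assumptions, not the actual Beam YAML implementation: `referenced_names` and `evaluate_expression` are hypothetical helpers, and rows are modeled as plain dicts.

```python
import ast


def referenced_names(expression):
    """Return every bare identifier that appears in a Python expression."""
    tree = ast.parse(expression, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def evaluate_expression(expression, row):
    """Evaluate expression with only the row fields it actually mentions."""
    captured = {name: row[name] for name in referenced_names(expression) & set(row)}
    # Captured fields are passed as globals so nested scopes can also see them.
    return eval(expression, dict(captured))
```

For the expression `input_field + another_input_field`, only those two fields are copied out of the row, no matter how many other fields it has.
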
