@RequiresTimeSortedInput adoption by runners

Jan Lukavský Thu, 18 Jan 2024 08:15:13 -0800

Hi,

recently I came across the fact that most runners do not support the @RequiresTimeSortedInput annotation for sorting per-key data by event timestamp [1]. The runners that do support it seem to be Direct Java, Flink, and Dataflow batch (where it is a no-op). The annotation has use cases in time-series data processing, in transaction processing, and more. Though it is absolutely possible to implement the time-sorting manually (e.g. [2]), this is efficient only in streaming mode; in batch mode the runner typically wants to leverage the internal sort-grouping it already does.
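
To make "manual time-sorting" concrete, here is a minimal sketch in plain Java. It is not Beam code: TimeSortedBuffer and its methods are hypothetical names used only for illustration. It assumes the buffer holds the elements of a single key and that the watermark means no element with an earlier timestamp will arrive later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Hypothetical helper (not part of Beam): buffers the elements of one key
 * and releases them in ascending event-timestamp order.
 */
final class TimeSortedBuffer<T> {

  /** A value paired with its event timestamp in milliseconds. */
  private static final class Timestamped<T> {
    final long timestamp;
    final T value;

    Timestamped(long timestamp, T value) {
      this.timestamp = timestamp;
      this.value = value;
    }
  }

  // Min-heap ordered by event timestamp. The relative order of elements
  // with equal timestamps is unspecified.
  private final PriorityQueue<Timestamped<T>> buffer =
      new PriorityQueue<>(Comparator.comparingLong((Timestamped<T> e) -> e.timestamp));

  /** Buffers an element that may have arrived out of order. */
  void add(long timestamp, T value) {
    buffer.add(new Timestamped<>(timestamp, value));
  }

  /**
   * Releases every buffered element whose timestamp is strictly before the
   * watermark, in ascending timestamp order. Assumption: no element with a
   * timestamp earlier than the watermark can still arrive.
   */
  List<T> advanceWatermark(long watermark) {
    List<T> ready = new ArrayList<>();
    while (!buffer.isEmpty() && buffer.peek().timestamp < watermark) {
      ready.add(buffer.poll().value);
    }
    return ready;
  }
}
```

In a pipeline, the buffer would live in per-key state and be flushed from an event-time timer. That buffering is the cost which a batch runner can avoid by reusing the sort it already performs during grouping.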

The original idea was to implement this annotation inside StatefulDoFnRunner, on the assumption that it would be used by the majority of runners. It turns out that this is not the case. The question now is: should we use an alternative place to implement the annotation (e.g. Pipeline expansion, or DoFnInvoker) so that more runners can benefit from it automatically (at least for the streaming case; the batch case needs to be implemented manually)? Does the community find the annotation useful? I'm linking a rather old (and long :)) thread that preceded the introduction of the annotation [3] for more context.

I sense the current adoption of the annotation by runners makes it somewhat useless.


Looking forward to any comments on this.

Best,

 Jan

[1] https://beam.apache.org/releases/javadoc/2.53.0/org/apache/beam/sdk/transforms/DoFn.RequiresTimeSortedInput.html

[2] https://cloud.google.com/spanner/docs/change-streams/use-dataflow#order-by-key


[3] https://lists.apache.org/thread/bkl9kk8l44xw2sw08s7m54k1wsc3n4tn
