junaiddshaukat opened a new issue, #39135:
URL: https://github.com/apache/beam/issues/39135
Tracking issue: #18479
Follows: #39051 (KStreamsPayload serde, merged)
## Summary
First GroupByKey — the simplest version agreed with @je-ik: GlobalWindow +
default trigger, no allowed lateness. This is the runner's first stateful,
shuffle-bearing transform and the gateway to the first @ValidatesRunner test.
## Design (agreed with @je-ik)
- Repartition is done by Kafka Streams: extract the Beam GBK key, set it as
the
Kafka record key, and shuffle through an internal repartition topic — so
it is
multi-instance. Uses the KStreamsPayload serde from #39051.
- Values are buffered in a Kafka Streams state store (so they survive to
fire).
- Each key fires once when the watermark reaches TIMESTAMP_MAX
(GlobalWindow's
end), emitting KV<K, Iterable<V>>.
## Scope (this PR)
- Register beam:transform:group_by_key:v1 as a runner-native primitive
(knownUrns + IsKafkaStreamsNativeTransform, same as Redistribute).
- GroupByKeyTranslator: re-key by the encoded Beam key; repartition through
an
internal topic (addSink keyed by the Beam key + addSource back) using
KStreamsPayloadSerde; add a per-key state store.
- GroupByKeyProcessor: buffer WindowedValue<KV<K,V>> per key in the state
store;
on the terminal watermark, emit KV<K, Iterable<V>> per key; fan the
watermark
out to all repartition partitions so every task fires.
- GlobalWindow + default trigger, no allowed lateness.
## Out of scope (later slices)
- Fixed/sliding/session windows, custom triggers, allowed lateness — these
will
reuse runner-core GroupAlsoByWindow (SystemReduceFn) like Flink/Spark do,
and
need the state + timers work.
- Combine (this is GroupByKey only).
## Testing
- TopologyTestDriver E2E: Impulse -> produce KVs -> GroupByKey -> assert
grouped
output.
- First @ValidatesRunner test on the GroupByKey category (global window) with
PAssert. (If wiring @ValidatesRunner into the gradle config turns out
large,
it may split into a follow-up; the GBK itself lands here.)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]