junaiddshaukat opened a new issue, #39135:
URL: https://github.com/apache/beam/issues/39135

   Tracking issue: #18479
   Follows: #39051 (KStreamsPayload serde, merged)
   
   ## Summary
   First GroupByKey — the simplest version agreed with @je-ik: GlobalWindow + 
default trigger, no allowed lateness. This is the runner's first stateful, 
shuffle-bearing transform and the gateway to the first @ValidatesRunner test.
   
   ## Design (agreed with @je-ik)
   - Repartition is done by Kafka Streams: extract the Beam GBK key, set it as 
the
     Kafka record key, and shuffle through an internal repartition topic — so 
it is
     multi-instance. Uses the KStreamsPayload serde from #39051.
   - Values are buffered in a Kafka Streams state store (so they survive to 
fire).
   - Each key fires once when the watermark reaches TIMESTAMP_MAX 
(GlobalWindow's
     end), emitting KV<K, Iterable<V>>.
   
   ## Scope (this PR)
   - Register beam:transform:group_by_key:v1 as a runner-native primitive
     (knownUrns + IsKafkaStreamsNativeTransform, same as Redistribute).
   - GroupByKeyTranslator: re-key by the encoded Beam key; repartition through 
an
     internal topic (addSink keyed by the Beam key + addSource back) using
     KStreamsPayloadSerde; add a per-key state store.
   - GroupByKeyProcessor: buffer WindowedValue<KV<K,V>> per key in the state 
store;
     on the terminal watermark, emit KV<K, Iterable<V>> per key; fan the 
watermark
     out to all repartition partitions so every task fires.
   - GlobalWindow + default trigger, no allowed lateness.
   
   ## Out of scope (later slices)
   - Fixed/sliding/session windows, custom triggers, allowed lateness — these 
will
     reuse runner-core GroupAlsoByWindow (SystemReduceFn) like Flink/Spark do, 
and
     need the state + timers work.
   - Combine (this is GroupByKey only).
   
   ## Testing
   - TopologyTestDriver E2E: Impulse -> produce KVs -> GroupByKey -> assert 
grouped
     output.
   - First @ValidatesRunner test on the GroupByKey category (global window) with
     PAssert. (If wiring @ValidatesRunner into the gradle config turns out 
large,
     it may split into a follow-up; the GBK itself lands here.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to