That's exciting.

I would suggest looking into implementing a portable runner, so that you
get cross-language pipelines and the ability to execute Python and Go
pipelines. https://s.apache.org/beam-fn-api and the Flink or Samza runner
implementations would be good starting points.



On Fri, Sep 25, 2020 at 11:35 AM Pablo Estrada <[email protected]> wrote:

> Hi Kyle,
> this is very cool. For others, here's the link to compare Kyle's branch to
> Beam's master[1].
>
> I don't have a lot of experience writing runners, so someone else should
> look as well, but I'll try to take a look at your branch in the next few
> days, because I think it's pretty cool : )
>
>
> [1]
> https://github.com/apache/beam/compare/master...kyle-winkelman:kafka-streams
>
> On Thu, Sep 24, 2020 at 9:53 AM Kyle Winkelman <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>>
>>
>> Jira:
>>
>> https://issues.apache.org/jira/browse/BEAM-2466
>>
>>
>>
>> My Branch:
>>
>> https://github.com/kyle-winkelman/beam/tree/kafka-streams
>>
>>
>>
>> I have taken an initial pass at creating a Kafka Streams Runner. It
>> passes nearly all of the @ValidatesRunner tests. I am hoping to find
>> someone who has experience writing a Runner to take a look at it and give
>> me some feedback before I open a PR.
>>
>>
>>
>> I am using the Kafka Streams DSL
>> <https://kafka.apache.org/26/documentation/streams/developer-guide/dsl-api.html>,
>> so a PCollection is equivalent to a KStream
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/KStream.html>
>> .
>>
>>
>>
>> ParDo:
>>
>>   - implements Transformer
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/Transformer.html>
>>  to KStream#transform
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/KStream.html#transform-org.apache.kafka.streams.kstream.TransformerSupplier-org.apache.kafka.streams.kstream.Named-java.lang.String...->
>>
>>   - outputs KStream<TupleTag<?>, WindowedValue<?>> then uses a
>> KStream#branch
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/KStream.html#branch-org.apache.kafka.streams.kstream.Predicate...->
>>  to
>> handle ParDo.MultiOutput
>>
>>   - schedules a Punctuator
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/processor/Punctuator.html>
>>  to
>> periodically start/finish bundles
>>
>>
>>
>> GroupByKey:
>>
>>   - KStream#repartition
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/KStream.html#repartition-->
>>  the
>> data to groupOnlyByKey
>>
>>   - groupAlsoByWindow similarly to ParDo, runs ReduceFn instead of DoFn
>>
>>
>>
>> Flatten:
>>
>>   - KStream#merge
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/KStream.html#merge-org.apache.kafka.streams.kstream.KStream->
>>
>>
>>
>> Combine:
>>
>>   - not as composite, but as primitive GroupByKey/ParDo
>>
>>
>>
>> Composite Transforms:
>>
>>   - inlining (I believe)
>>
>>
>>
>> Side Inputs:
>>
>>   - write data to a topic with the key being the
>> StateNamespace.stringKey()
>>
>>   - read topic into a GlobalKTable
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/kstream/GlobalKTable.html>
>>  and
>> materialize it into a KeyValueStore
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/state/KeyValueStore.html>
>>
>>   - in ParDo, access the KeyValueStore, via the ProcessorContext
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/processor/ProcessorContext.html>,
>> and use the WindowFn to get the Side Input Window and its
>> StateNamespace.stringKey() to look up the value
>>
>>
>>
>> Source API:
>>
>>   - overridden by SplittableDoFn, via
>> Read.SPLITTABLE_DOFN_PREFERRED_RUNNERS
>>
>>
>>
>> Impulse:
>>
>>   - if it does not already exist, create a topic in Kafka and use a KafkaProducer
>> <https://kafka.apache.org/26/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html>
>>  to
>> generate the initial record
>>
>>   - generate a KStream from the above topic
>>
>>
>>
>> SplittableDoFn:
>>
>>   - overridden by SplittableParDo/SplittableParDoViaKeyedWorkItems
>>
>>   - extends Stateful Processing, runs ProcessFn instead of DoFn and
>> Punctuator fires timers as expected by the ProcessFn
>>
>>   - write watermarks to a separate topic to be aggregated and read into a
>> GlobalKTable and materialize it into a KeyValueStore
>>
>>
>>
>> Stateful Processing:
>>
>>   - StateInternals and TimerInternals are materialized in KeyValueStores
>>
>>   - extends ParDo, with statefulDoFnRunner; the Punctuator advances
>> TimerInternals (via the watermarks KeyValueStore) and fires timers
>>
>>
>> Thanks,
>>
>> Kyle Winkelman
>>
>
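
For anyone following along, the multi-output routing described under ParDo
in the quoted message (tag every element, then split the stream by tag) can
be pictured with a small dependency-free sketch. It uses neither Kafka
Streams nor Beam: Tagged, branch, and hasTag are hypothetical stand-ins, and
the only behaviour assumed from KStream#branch is that each record goes to
the first predicate it matches and is dropped if it matches none. It is not
taken from Kyle's branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class BranchSketch {
    // Hypothetical stand-in for one element of KStream<TupleTag<?>, WindowedValue<?>>.
    static final class Tagged {
        final String tag;
        final String value;

        Tagged(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    // Sends each element to the first predicate it matches, in order.
    // Elements that match no predicate are dropped.
    @SafeVarargs
    static List<List<Tagged>> branch(List<Tagged> input, Predicate<Tagged>... predicates) {
        List<List<Tagged>> outputs = new ArrayList<>();
        for (int i = 0; i < predicates.length; i++) {
            outputs.add(new ArrayList<>());
        }
        for (Tagged element : input) {
            for (int i = 0; i < predicates.length; i++) {
                if (predicates[i].test(element)) {
                    outputs.get(i).add(element);
                    break; // first match wins
                }
            }
        }
        return outputs;
    }

    // One predicate per output tag of the (hypothetical) multi-output ParDo.
    static Predicate<Tagged> hasTag(String tag) {
        return element -> element.tag.equals(tag);
    }

    public static void main(String[] args) {
        List<Tagged> parDoOutput = Arrays.asList(
                new Tagged("main", "a"),
                new Tagged("side", "b"),
                new Tagged("main", "c"));
        List<List<Tagged>> outputs = branch(parDoOutput, hasTag("main"), hasTag("side"));
        System.out.println("main output size: " + outputs.get(0).size());
        System.out.println("side output size: " + outputs.get(1).size());
    }
}
```

In the real runner the predicates would presumably compare against each
TupleTag of the ParDo, giving one downstream KStream per output.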
