This is an automated email from the ASF dual-hosted git repository.
aloalt pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-wayang-website.git
The following commit(s) were added to refs/heads/main by this push:
new 22a510d8 Added part 3 of the blog series : Apache Wayang meets Apache
Kafka (#39)
22a510d8 is described below
commit 22a510d8bf5acc4d798583e8a8400a54027a193c
Author: Mirko Kämpf <[email protected]>
AuthorDate: Sun Mar 10 11:00:32 2024 +0100
Added part 3 of the blog series : Apache Wayang meets Apache Kafka (#39)
* added blog article : Apache Wayang meets Apache Kafka
* clean up
* changed : into - :)
* Added part 3 - Apache Kafka meets Wayang.
* Update kafka-meets-wayang-3.md
edit slug (needs to be a valid URL) and removed 2nd title
---------
Co-authored-by: Alexander Alten <[email protected]>
Co-authored-by: Alexander Alten <[email protected]>
---
blog/authors.yml | 5 +-
blog/kafka-meets-wayang-1.md | 68 +++++++++++++++++++++++++
blog/kafka-meets-wayang-2.md | 95 +++++++++++++++++++++++++++++++++++
blog/kafka-meets-wayang-3.md | 116 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 283 insertions(+), 1 deletion(-)
diff --git a/blog/authors.yml b/blog/authors.yml
index 9b97898b..c3a1212c 100644
--- a/blog/authors.yml
+++ b/blog/authors.yml
@@ -5,6 +5,9 @@ alo.alt:
image_url: https://avatars.githubusercontent.com/u/1323575?v=4
kamir:
name: Mirko Kämpf
+ title: Apache Commiter
+ url: https://github.com/kamir
+ image_url: https://avatars.githubusercontent.com/u/1241122?v=4
title: Apache Committer
url: https://github.com/kamir
image_url: https://avatars.githubusercontent.com/u/1241122?v=4
@@ -12,4 +15,4 @@ zkaoudi:
name: Zoi Kaoudi
title: PPMC Apache Wayang
url: https://github.com/zkaoudi
- image_url: https://avatars.githubusercontent.com/zkaoudi
+ image_url: https://avatars.githubusercontent.com/zkaoudi
\ No newline at end of file
diff --git a/blog/kafka-meets-wayang-1.md b/blog/kafka-meets-wayang-1.md
new file mode 100644
index 00000000..9e6b1df3
--- /dev/null
+++ b/blog/kafka-meets-wayang-1.md
@@ -0,0 +1,68 @@
+---
+slug: Apache Kafka meets Apache Wayang
+title: Apache Kafka meets Apache Wayang - Part 1
+authors: kamir
+tags: [wayang, kafka, cross organization data collaboration]
+---
+
+# Apache Wayang meets Apache Kafka - Part 1
+
+## Intro
+
+This article is the first of a four-part series about federated data analysis using Apache Wayang.
+The first article introduces a typical data collaboration scenario which will emerge in our digital future.
+
+In parts two and three we share a summary of our Apache Kafka client implementation for Apache Wayang.
+We started with the Java platform (part 2); the Apache Spark implementation follows (work in progress) in part three.
+
+The use case behind this work is an imaginary data collaboration scenario.
+We already see this example, and the demand for a solution, in many places.
+For us this is motivation enough to propose a solution.
+It would also allow more local data processing: businesses could stop moving data around the world and instead care about data locality while exposing and sharing specific information with others through data federation.
+This reduces the complexity and cost of data management dramatically.
+
+For this purpose, we illustrate a cross-organizational data sharing scenario from the finance sector below.
+This analysis pattern will also be relevant in the context of data analysis along supply chains, another typical example where data from many stakeholders is needed together but, for good reasons, never managed in one place.
+
+Data federation can help us to unlock the hidden value of all those isolated
data lakes.
+
+
+## A cross-organizational data sharing scenario
+Our goal is the implementation of a cross-organizational, decentralized data processing scenario, in which protected local data is processed in combination with public data from public sources in a collaborative manner.
+Instead of copying all data into a central data lake or a central data platform, we decided to use federated analytics.
+Apache Wayang is the tool we work with.
+In our case, the public data is hosted on publicly available websites or data pods.
+A client can use the HTTP(S) protocol to read the data, which is provided in a well-defined format.
+For simplicity we decided to use the CSV format.
+When we look into the data of each participant, we see a different perspective.
+
+Our processing procedure should calculate a particular metric on the _local data_ of each participant.
+An example of such a metric is the average spending of all users on a particular product category per month.
+This metric can vary from partner to partner; hence, we want to calculate a peer-group comparison so that each partner can see its own metric compared with a global average calculated from contributions by all partners.
+Such a process requires global averaging and local averaging.
+Due to governance constraints, we can't bring all raw data together in one place.
+
+Instead, we want to use Apache Wayang for this purpose.
+We simplify the procedure and split it into two phases.
+Phase one is the process that allows each participant to calculate the local metrics.
+This requires only local data. The second phase requires data from all collaborating partners.
+The monthly sum and count values per partner and category are needed in one place by all other parties.
+Hence, the algorithm of the first phase stores the local results locally, and
the contributions to the global results in an externally accessible Kafka
topic.
+We assume this is done by each of the partners.
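+To make the aggregation idea concrete, here is a minimal sketch in plain Java; the partner names and the CSV layout `partner,category,month,sum,count` are purely illustrative assumptions. It shows how per-partner contributions can be combined into a global average without ever exchanging raw transaction data:
+
+```
+// Hypothetical contribution records, one per partner, shared via the public Kafka topic.
+// Assumed layout: partner,category,month,sum,count
+List<String> contributions = List.of(
+        "partnerA,groceries,2024-01,120000.0,400",
+        "partnerB,groceries,2024-01,90000.0,250",
+        "partnerC,groceries,2024-01,45000.0,150");
+
+double globalSum = 0.0;
+long globalCount = 0L;
+for (String line : contributions) {
+    String[] fields = line.split(",");
+    globalSum += Double.parseDouble(fields[3]);   // monthly sum per partner
+    globalCount += Long.parseLong(fields[4]);     // monthly count per partner
+}
+
+// The global average is computed only from the shared (sum, count) pairs;
+// each partner compares its local average (localSum / localCount) against it.
+double globalAverage = globalSum / globalCount;
+```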
+
+Now we have a scenario in which an Apache Wayang process must be able to read data from multiple Apache Kafka topics on multiple Apache Kafka clusters, but finally writes into a single Kafka topic, which can then be accessed by all participating clients.
+
+
+
+The illustration shows the data flows in such a scenario.
+Jobs with a red border are executed by the participants in isolation within their own data processing environments.
+But they share some of the data, using publicly accessible Kafka topics, marked by A. Job 4 is the Apache Wayang job in our focus: here we intend to read data from three different source systems, and write results into a fourth system (marked as B), which can be accessed by all participants again.
+
+With this in mind, we want to implement an Apache Wayang application that implements the illustrated *Job 4*.
+Since, as of today, there is no _KafkaSource_ or _KafkaSink_ available in Apache Wayang, an implementation of both will be our first step.
+Our assumption is that, in the beginning, there won't be much data.
+
+Apache Spark is not yet required to cope with the load, but we expect that, in the future, a single Java application will not be able to handle our workload.
+Hence, we want to utilize the Apache Wayang abstraction over multiple processing platforms, starting with Java.
+Later, we want to switch to Apache Spark.
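+To illustrate the platform abstraction we want to rely on, here is a minimal setup sketch, modeled on the examples in the Wayang code repository. (The Kafka source and sink operators themselves do not exist yet at this point; they are built in parts two and three.)
+
+```
+// Plugins live in org.apache.wayang.java.Java and org.apache.wayang.spark.Spark;
+// the builder is org.apache.wayang.api.JavaPlanBuilder.
+// Start with the Java platform; later we only want to swap the plugin,
+// e.g. to Spark.basicPlugin(), while the plan itself stays untouched.
+WayangContext wayangContext = new WayangContext().with(Java.basicPlugin());
+
+// The plan (sources, transformations, sinks) is then defined via the JavaPlanBuilder.
+JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext);
+```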
+
diff --git a/blog/kafka-meets-wayang-2.md b/blog/kafka-meets-wayang-2.md
new file mode 100644
index 00000000..bfd2abbd
--- /dev/null
+++ b/blog/kafka-meets-wayang-2.md
@@ -0,0 +1,95 @@
+---
+slug: Apache Kafka meets Apache Wayang
+title: Apache Kafka meets Apache Wayang - Part 2
+authors: kamir
+tags: [wayang, kafka, cross organization data collaboration]
+---
+
+# Apache Wayang meets Apache Kafka - Part 2
+
+In the second part of the article series we describe the implementation of the Kafka Source and Kafka Sink components for Apache Wayang.
+We look into the read and write paths for our data items, called _DataQuanta_.
+
+## Apache Wayang’s Read & Write Path for Kafka topics
+
+To describe the read and write paths for data in the context of our Apache Wayang code example, the primary classes and interfaces we need to understand are as follows:
+
+**WayangContext:** This class is essential for initializing the Wayang
processing environment.
+It allows you to configure the execution environment and register plugins that
define which platforms Wayang can use for data processing tasks, such as
_Java.basicPlugin()_ for local Java execution.
+
+**JavaPlanBuilder:** This class is used to build and define the data
processing pipeline (or plan) in Wayang.
+It provides a fluent API to specify the operations to be performed on the
data, from reading the input to processing it and writing the output.
+
+### Read Path
+The read path describes how data is ingested from a source into the Wayang
processing pipeline:
+
+_Reading from Kafka Topic:_ The method _readKafkaTopic(topicName)_ is used to
ingest data from a specified Kafka topic.
+This is the starting point of the data processing pipeline, where topicName
represents the name of the Kafka topic from which data is read.
+
+_Data Tokenization and Preparation:_ Once the data is read from Kafka, it
undergoes several transformations such as Splitting, Filtering, and Mapping.
+What follows are the procedures known as Reducing, Grouping, Co-Grouping, and
Counting.
+
+### Write Path
+_Writing to Kafka Topic:_ The final step in the pipeline involves writing the
processed data back to a Kafka topic using _.writeKafkaTopic(...)_.
+This method takes parameters that specify the target Kafka topic, a
serialization function to format the data as strings, and additional
configuration for load profile estimation, which optimizes the writing process.
+
+This read-write path provides a comprehensive flow of data: from ingestion out of Kafka, through various processing steps, and finally back to Kafka.
+It showcases a full cycle of data processing within Apache Wayang's abstracted environment and is implemented in our example program, shown in *listing 1*.
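+Since the full listing is not reproduced here, the following sketch gives an impression of what such a program can look like. Its shape follows the WordCount examples in the Wayang repository; the topic names are placeholders, and the exact parameter list of _writeKafkaTopic(...)_ is only abbreviated in a comment.
+
+```
+WayangContext wayangContext = new WayangContext().with(Java.basicPlugin());
+JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext)
+        .withJobName("kafka-read-write-path");
+
+// Read path: ingest records from a Kafka topic (placeholder topic name).
+Collection<Tuple2<String, Integer>> counts = planBuilder
+        .readKafkaTopic("partner-contributions").withName("read from Kafka")
+        // Tokenization and preparation: split, filter, map.
+        .flatMap(line -> Arrays.asList(line.split(","))).withName("split records")
+        .filter(token -> !token.isEmpty()).withName("drop empty fields")
+        .map(token -> new Tuple2<>(token, 1)).withName("attach counter")
+        // Reducing / grouping / counting.
+        .reduceByKey(
+                Tuple2::getField0,
+                (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
+        .withName("count per key")
+        .collect();
+
+// Write path (abbreviated): send serialized results to the target topic, e.g.
+// .writeKafkaTopic("global-results", tuple -> tuple.toString(), /* load profile estimators */ ...);
+```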
+
+## Implementation of Input- and Output Operators
+The next section shows how a new pair of operators can be implemented to extend Apache Wayang's capabilities on the input and output side.
+We created the Kafka Source and Kafka Sink components so that our cross-organizational data collaboration scenario can be implemented using data streaming infrastructure.
+
+**Level 1 – Wayang execution plan with abstract operators**
+
+The implementation of our Kafka Source and Kafka Sink components for Apache Wayang requires new methods and classes on three layers.
+First of all, in the API package.
+Here we use the JavaPlanBuilder to expose the function for selecting a Kafka topic as the source to be used by the client.
+The class _JavaPlanBuilder_ in package _org.apache.wayang.api_ in the project *wayang-api/wayang-api-scala-java* exposes our new functionality to our external client.
+An instance of the JavaPlanBuilder is used to define the data processing pipeline.
+We use its _readKafkaTopic()_ method, which specifies the source Kafka topic to read from; for the write path we use the _writeKafkaTopic()_ method.
+Both methods only trigger activities in the background.
+
+For the output side, we use the _DataQuantaBuilder_ class, which offers an
implementation of the writeKafkaTopic function.
+This function is designed to send processed data, referred to as DataQuanta,
to a specified Kafka topic.
+Essentially, it marks the final step in a data processing sequence constructed
using the Apache Wayang framework.
+
+In the DataQuanta class we implemented the methods writeKafkaTopic and writeKafkaTopicJava, which use the KafkaTopicSink class.
+In this API layer we use the Scala programming language, but we utilize the Java classes implemented in the layer below.
+
+**Level 2 – Wiring between Platform Abstraction and Implementation**
+
+The second layer builds the bridge between the WayangContext and the PlanBuilders, which work together with DataQuanta and the DataQuantaBuilder.
+
+Also, the mapping between the abstract components and the specific implementations is defined in this layer.
+
+Therefore, the mappings package has a class _Mappings_ in which all relevant
input and output operators are listed.
+We use it to register the KafkaSourceMapping and a KafkaSinkMapping for the
particular platform, Java in our case.
+These classes allow the Apache Wayang framework to use the Java implementation
of the KafkaTopicSource component (and KafkaTopicSink respectively).
+While the Wayang execution plan uses the higher abstractions, here on the
“platform level” we have to link the specific implementation for the target
platform.
+In our case this leads to a Java program running on a JVM which is set up by
the Apache Wayang framework using the logical components of the execution plan.
+
+Those mappings link the real implementations of our operators to the ones used in an execution plan.
+The JavaKafkaTopicSource and the JavaKafkaTopicSink extend KafkaTopicSource and KafkaTopicSink so that the lower-level implementations of those classes become available within Wayang's Java platform context.
+
+In this layer, the KafkaConsumer class and the KafkaProducer class are used, but both are configured and instantiated in the next layer underneath.
+All this is done in the project *wayang-platforms/wayang-java*.
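+To illustrate how such a mapping looks, here is a condensed sketch of a source mapping for the Java platform, modeled on the existing mappings in *wayang-platforms/wayang-java*; the constructor details of the Kafka operator classes are simplified and may differ from the actual code.
+
+```
+public class KafkaTopicSourceMapping implements Mapping {
+
+    @Override
+    public Collection<PlanTransformation> getTransformations() {
+        return Collections.singleton(new PlanTransformation(
+                this.createSubplanPattern(),
+                this.createReplacementSubplanFactory(),
+                JavaPlatform.getInstance()));          // the platform this mapping targets
+    }
+
+    private SubplanPattern createSubplanPattern() {
+        // Match the abstract KafkaTopicSource operator in the Wayang plan ...
+        final OperatorPattern<KafkaTopicSource> pattern = new OperatorPattern<>(
+                "source", new KafkaTopicSource((String) null), false);
+        return SubplanPattern.createSingleton(pattern);
+    }
+
+    private ReplacementSubplanFactory createReplacementSubplanFactory() {
+        // ... and replace it with the Java platform implementation.
+        return new ReplacementSubplanFactory.OfSingleOperators<KafkaTopicSource>(
+                (matchedOperator, epoch) -> new JavaKafkaTopicSource(matchedOperator).at(epoch));
+    }
+}
+```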
+
+**Layer 3 – Input/Output Connector Layer**
+
+The _KafkaTopicSource_ and _KafkaTopicSink_ classes build the third layer of our implementation.
+Both are implemented in the Java programming language.
+In this layer, the real Kafka client logic is defined.
+Details about consumers and producers, client configuration, and schema handling have to be dealt with here.
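+To give an impression of what this connector layer boils down to, here is a reduced sketch of the read and write side using the standard Apache Kafka Java client (configuration values and schema handling are omitted; the method names are illustrative, not the actual class layout):
+
+```
+// Read side: what KafkaTopicSource essentially has to do.
+static void consume(String topic, Properties props) {
+    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
+        consumer.subscribe(List.of(topic));
+        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
+        for (ConsumerRecord<String, String> record : records) {
+            // Each record value becomes one data quantum (a String) in the Wayang plan.
+            System.out.println(record.value());
+        }
+    }
+}
+
+// Write side: what KafkaTopicSink essentially has to do.
+static void produce(String topic, Properties props, Iterable<String> lines) {
+    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
+        for (String line : lines) {
+            producer.send(new ProducerRecord<>(topic, line));
+        }
+        producer.flush();
+    }
+}
+```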
+
+## Summary
+Both classes in the third layer implement the Kafka client logic which is needed by the Wayang execution plan when external data flows are established.
+The layer above handles the mapping of the components at startup time.
+All this wiring is needed to keep Wayang open and flexible, so that multiple external systems can be used in a variety of combinations, across multiple target platforms.
+
+## Outlook
+The next part of the article series will cover the creation of a Kafka Source and Sink component for the Apache Spark platform, which allows our use case to scale.
+Finally, in part four we bring all the puzzle pieces together and show the full implementation of the multi-organizational data collaboration use case.
+
+
+
+
diff --git a/blog/kafka-meets-wayang-3.md b/blog/kafka-meets-wayang-3.md
new file mode 100644
index 00000000..8e90b5fc
--- /dev/null
+++ b/blog/kafka-meets-wayang-3.md
@@ -0,0 +1,116 @@
+---
+slug: kafka-meets-wayang-2
+title: Apache Kafka meets Apache Wayang - Part 3
+authors: kamir
+tags: [wayang, kafka, spark, cross organization data collaboration]
+---
+
+The third part of this article series is an activity log.
+Motivated by the learnings from last time, I started implementing a Kafka Source component and a Kafka Sink component for the Apache Spark platform in Apache Wayang.
+In our previous article we shared the results of the work on the first Apache Kafka integration using the Java platform.
+
+Let's see how it goes this time with Apache Spark.
+
+## The goal of this implementation
+
+We want to process data from Apache Kafka topics, which are hosted on Confluent Cloud.
+In our example scenario, the data is available in multiple clusters, in different regions, and owned by different organizations.
+
+We assume that the operator of our job has been granted appropriate permissions, and that the topic owner has already provided the configuration properties, including access coordinates and credentials.
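+For a Kafka topic hosted on Confluent Cloud, such configuration properties typically look like the following (standard Kafka client settings; the endpoint, API key, and secret are placeholders provided by the topic owner):
+
+```
+Properties props = new Properties();
+// Access coordinates and credentials (placeholder values).
+props.put("bootstrap.servers", "pkc-xxxxx.europe-west1.gcp.confluent.cloud:9092");
+props.put("security.protocol", "SASL_SSL");
+props.put("sasl.mechanism", "PLAIN");
+props.put("sasl.jaas.config",
+        "org.apache.kafka.common.security.plain.PlainLoginModule required "
+        + "username=\"<API_KEY>\" password=\"<API_SECRET>\";");
+props.put("group.id", "wayang-job-4");
+props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
+props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
+```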
+
+
+
+This illustration has already been introduced in part one.
+We focus on **Job 4** in the image and start to implement it.
+This time we expect the processing load to be higher, so we want to utilize the scalability of Apache Spark.
+
+Again, we start with a **WayangContext**, as shown by examples in the Wayang
code repository.
+
+```
+WayangContext wayangContext = new WayangContext().with(Spark.basicPlugin());
+```
+We simply switched the backend system to Apache Spark by using the _WayangContext_ with _Spark.basicPlugin()_.
+The **JavaPlanBuilder** and all other logic of our example job won't be touched.
+
+In order to make this work, we will now implement the Mappings and the Operators for the Apache Spark platform module.
+
+## Implementation of Input- and Output Operators
+
+We reuse the Kafka Source and Kafka Sink components which have been created
for the JavaKafkaSource and JavaKafkaSink.
+Hence we work with Wayang's Java API.
+
+**Level 1 – Wayang execution plan with abstract operators**
+
+Since the _JavaPlanBuilder_ already exposes the function for selecting a Kafka topic as the source,
+and the _DataQuantaBuilder_ class exposes the _writeKafkaTopic_ function, we can move on quickly.
+
+Remember, in this API layer we use the Scala programming language, but we
utilize the Java classes, implemented in the layer below.
+
+**Level 2 – Wiring between Platform Abstraction and Implementation**
+
+As in the case with the Java Platform, in the second layer we build a bridge
between the WayangContext and the PlanBuilders, which work together with
DataQuanta and the DataQuantaBuilder.
+
+We must provide the mapping between the abstract components and the specific
implementations in this layer.
+
+Therefore, the mappings package in project **wayang-platforms/wayang-spark**
has a class _Mappings_ in which
+our _KafkaTopicSinkMapping_ and _KafkaTopicSourceMapping_ will be registered.
+
+Again, these classes allow the Apache Wayang framework to use the Java
implementation of the KafkaTopicSource component (and KafkaTopicSink
respectively).
+
+While the Wayang execution plan uses the higher abstractions, here on the
“platform level” we have to link the specific implementation for the target
platform.
+In this case this leads to an Apache Spark job, running on a Spark cluster
which is set up by the Apache Wayang framework using the logical components of
the execution plan, and the Apache Spark configuration provided at runtime.
+
+A mapping links an operator implementation to the abstraction used in an execution plan.
+We define two new mappings for our purpose, namely KafkaTopicSourceMapping and KafkaTopicSinkMapping; both could be reused from the last round.
+
+For the Spark platform we simply replace the occurrences of _JavaPlatform_ with _SparkPlatform_.
+
+Furthermore, we create an implementation of the _SparkKafkaTopicSource_ and
_SparkKafkaTopicSink_.
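+In code, the change is small: the mappings keep the structure sketched in part two, and only the target platform of the _PlanTransformation_ and the replacement operator are exchanged (simplified sketch, details may differ):
+
+```
+// In the Spark module's KafkaTopicSourceMapping (KafkaTopicSinkMapping is analogous):
+@Override
+public Collection<PlanTransformation> getTransformations() {
+    return Collections.singleton(new PlanTransformation(
+            this.createSubplanPattern(),
+            new ReplacementSubplanFactory.OfSingleOperators<KafkaTopicSource>(
+                    (matchedOperator, epoch) -> new SparkKafkaTopicSource(matchedOperator).at(epoch)),
+            SparkPlatform.getInstance()));   // was: JavaPlatform.getInstance()
+}
+```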
+
+**Layer 3 – Input/Output Connector Layer**
+
+Let's quickly recap: how does Apache Spark interact with Apache Kafka?
+
+There is already an integration which gives us a Dataset using the Spark SQL framework.
+For Spark Streaming, there is also a Kafka integration using the _SparkSession_'s _readStream()_ function.
+Kafka client properties are provided as key-value pairs _k_ and _v_ by using the _option( k, v )_ function.
+For writing into a topic, we can use the _writeStream()_ function.
+But at first glance, it does not seem to be the best fit.
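+For reference, this is roughly what the Structured Streaming route would look like (standard Spark APIs, not Wayang code; `spark` is an existing _SparkSession_, and the broker and topic names are placeholders):
+
+```
+// Read side: subscribe to a topic as a streaming Dataset.
+Dataset<Row> input = spark.readStream()
+        .format("kafka")
+        .option("kafka.bootstrap.servers", "broker-1:9092")
+        .option("subscribe", "partner-contributions")
+        .load();
+
+// Write side: stream results back into another topic (key and value must be strings or bytes).
+StreamingQuery query = input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
+        .writeStream()
+        .format("kafka")
+        .option("kafka.bootstrap.servers", "broker-1:9092")
+        .option("topic", "global-results")
+        .option("checkpointLocation", "/tmp/wayang-checkpoints")
+        .start();
+```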
+
+Another approach is possible.
+We can use simple RDDs to process data previously consumed from Apache Kafka.
+This is a lower-level approach compared to using Datasets with Spark Structured Streaming,
+and it typically involves using the Kafka RDD API provided by Spark.
+
+This approach is less common with newer versions of Spark, as Structured
Streaming provides a higher-level abstraction that simplifies stream
processing.
+However, we might need that approach for the integration with Apache Wayang.
+
+For now, we will focus on the lower-level approach and plan to consume data from Kafka using a Kafka client, and then
+we parallelize the records into an RDD.
+
+This allows us to reuse the _KafkaTopicSource_ and _KafkaTopicSink_ classes we built last time.
+Those were made specifically for a simple, non-parallel Java program, using one consumer and one producer.
+
+The selected approach does not yet fully take advantage of Spark's parallelism at load time.
+For higher loads, and especially for stream processing, we would have to investigate another approach, using a _SparkStreamingContext_, but this is out of scope for now.
+
+Since we can't reuse the _JavaKafkaTopicSource_ and _JavaKafkaTopicSink_, we instead implement _SparkKafkaTopicSource_ and _SparkKafkaTopicSink_ based on the existing _SparkTextFileSource_ and _SparkTextFileSink_, which both carry all the needed RDD-specific logic.
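+A condensed sketch of this approach, reusing the plain Kafka client logic from our connector layer and handing the consumed records over to Spark (the helper methods are illustrative, not the actual class layout):
+
+```
+// Consume a batch of records with a plain Kafka client, as our KafkaTopicSource does ...
+static List<String> consumeBatch(String topic, Properties props) {
+    List<String> lines = new ArrayList<>();
+    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
+        consumer.subscribe(List.of(topic));
+        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
+            lines.add(record.value());
+        }
+    }
+    return lines;
+}
+
+// ... and parallelize them into an RDD for the downstream Spark operators.
+static JavaRDD<String> toRdd(JavaSparkContext sparkContext, List<String> lines) {
+    return sparkContext.parallelize(lines);
+}
+```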
+
+## Summary
+As expected, the integration of Apache Spark with Apache Wayang was no magic, thanks to the fluent API design and the well-structured architecture of Apache Wayang.
+We could easily follow the pattern we worked out in the previous exercise.
+
+But plenty of even more interesting work will follow next.
+More testing, more serialization schemes, Kafka Schema Registry support, and full parallelization should follow.
+
+The code has been submitted to the Apache Wayang repository.
+
+
+## Outlook
+The next part of the article series will cover the real-world example as described in image 1.
+We will show how analysts and developers can use the Apache Kafka integration for Apache Wayang to solve cross-organizational collaboration issues.
+Therefore, we will bring all the puzzle pieces together and show the full implementation of the multi-organizational data collaboration use case.
+
+
+
+