This is an automated email from the ASF dual-hosted git repository.

kamir pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-wayang-website.git

commit 8304374f9313baf0aa30c5c43a5f8c8a6eafb95e
Merge: 27b1f9ce fb7e7784
Author: Mirko Kämpf <[email protected]>
AuthorDate: Sat Sep 21 09:26:48 2024 +0200

    Merge branch 'main' into main

 blog/kafka-meets-wayang-1.md |  19 +++---
 blog/kafka-meets-wayang-2.md |  52 +++++++--------
 blog/kafka-meets-wayang-3.md | 146 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 185 insertions(+), 32 deletions(-)

diff --cc blog/kafka-meets-wayang-1.md
index 1a005fbc,9e6b1df3..cf66d468
--- a/blog/kafka-meets-wayang-1.md
+++ b/blog/kafka-meets-wayang-1.md
@@@ -16,7 -16,7 +16,9 @@@ In part two and three we will share a s
  We started with the Java Platform (part 2) and the Apache Spark 
implementation follows (W.I.P.) in part three.
  
  The use case behind this work is an imaginary data collaboration scenario.
 -We already see this example, and the demand for a solution, in many places.
++
 +We already see this example, and the demand for a solution, in many places.
++
  For us this is motivation enough to propose a solution.
  This would also allow us to do more local data processing: instead of moving data around the world, businesses could focus on data locality while exposing and sharing specific information with others through data federation.
  This dramatically reduces the complexity and cost of data management.
@@@ -28,41 -28,41 +30,44 @@@ Data federation can help us to unlock t
  
  
  ## A cross organizational data sharing scenario
 -Our goal is the implementation of a cross-organizational, decentralized data processing scenario, in which protected local data should be processed in combination with public data from public sources in a collaborative manner. 
 -Instead of copying all data into a central data lake or a central data platform, we decided to use federated analytics. 
 -Apache Wayang is the tool we work with. 
 -In our case, the public data is hosted on publicly available websites or data pods. 
 -A client can use the HTTP(S) protocol to read the data, which is given in a well-defined format. 
 -For simplicity, we decided to use the CSV format. 
++
 +Our goal is the implementation of a cross-organizational, decentralized data processing scenario, in which protected local data should be processed in combination with public data from public sources in a collaborative manner.
 +Instead of copying all data into a central data lake or a central data platform, we decided to use federated analytics.
 +Apache Wayang is the tool we work with.
 +In our case, the public data is hosted on publicly available websites or data pods.
 +A client can use the HTTP(S) protocol to read the data, which is given in a well-defined format.
 +For simplicity, we decided to use the CSV format.
  When we look into the data of each participant, we have a different perspective.
  
 -Our processing procedure should calculate a particular metric on the _local 
data_ of each participant. 
 -An example of such a metric is the average spending of all users on a 
particular product category per month. 
 -This can vary from partner to partner; hence, we want to be able to calculate a peer-group comparison so that each partner can see its own metric compared with a global average calculated from contributions by all partners. 
 -Such a process requires global averaging and local averaging. 
 +Our processing procedure should calculate a particular metric on the _local 
data_ of each participant.
 +An example of such a metric is the average spending of all users on a 
particular product category per month.
 +This can vary from partner to partner; hence, we want to be able to calculate a peer-group comparison so that each partner can see its own metric compared with a global average calculated from contributions by all partners.
 +Such a process requires global averaging and local averaging.
  And due to governance constraints, we can’t bring all raw data together in 
one place.
  
 -Instead, we want to use Apache Wayang for this purpose. 
 -We simplify the procedure and split it into two phases. 
 -Phase one is the process, which allows each participant to calculate the 
local metrics. 
 -This requires only local data. The second phase requires data from all 
collaborating partners. 
 -The monthly sum and counter values per partner and category are needed in one 
place by all other parties. 
 -Hence, the algorithm of the first phase stores the local results locally, and 
the contributions to the global results in an externally accessible Kafka 
topic. 
 -We assume this is done by each of the partners. 
 +Instead, we want to use Apache Wayang for this purpose.
 +We simplify the procedure and split it into two phases.
 +Phase one is the process, which allows each participant to calculate the 
local metrics.
 +This requires only local data. The second phase requires data from all 
collaborating partners.
 +The monthly sum and counter values per partner and category are needed in one 
place by all other parties.
 +Hence, the algorithm of the first phase stores the local results locally, and 
the contributions to the global results in an externally accessible Kafka topic.
 +We assume this is done by each of the partners.
 +
+ 
  Now we have a scenario in which an Apache Wayang process must be able to read data from multiple Apache Kafka topics on multiple Apache Kafka clusters, but ultimately write into a single Kafka topic, which can then be accessed by all the participating clients.
  
  ![images/image-1.png](images/image-1.png)
  
- The illustration shows the data flows in such a scenario.
- Jobs with a red border are executed by the participants in isolation within their own data processing environments.
+ The illustration shows the data flows in such a scenario. 
+ Jobs with a red border are executed by the participants in isolation within their own data processing environments. 
  But they share some of the data, using publicly accessible Kafka topics, marked by A. Job 4 is the Apache Wayang job in our focus: here we intend to read data from 3 different source systems, and write results into a fourth system (marked as B), which can be accessed by all participants again.
  
- With this in mind we want to implement an Apache Wayang application that realizes the illustrated *Job 4*.
- Since, as of today, there is no _KafkaSource_ or _KafkaSink_ available in Apache Wayang, an implementation of both will be our first step.
- Our assumption is that, in the beginning, there won’t be much data.
+ With this in mind we want to implement an Apache Wayang application that realizes the illustrated *Job 4*. 
+ Since, as of today, there is no _KafkaSource_ or _KafkaSink_ available in Apache Wayang, an implementation of both will be our first step. 
+ Our assumption is that, in the beginning, there won’t be much data. 
+ 
+ Apache Spark is not required to cope with the load, but we expect that, in the future, a single Java application will not be able to handle our workload. 
+ Hence, we want to utilize the Apache Wayang abstraction over multiple processing platforms, starting with Java. 
 +
- Apache Spark is not required to cope with the load, but we expect that, in the future, a single Java application will not be able to handle our workload.
- Hence, we want to utilize the Apache Wayang abstraction over multiple processing platforms, starting with Java.
  Later, we want to switch to Apache Spark.
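
The two-phase procedure described in this post (local monthly sums and counts per category, then a global average combined from all partners' contributions) can be sketched in plain Java. This is only an illustration of the arithmetic under the stated constraints, not the actual Wayang job; all class, record, and method names here are hypothetical, and the Kafka transport is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the two-phase metric computation. Phase one runs locally at each
 * partner and emits only sums and counts; phase two merges the contributions
 * of all partners into a global average per category and month.
 */
public class FederatedMetricSketch {

    /** A local purchase record: category, month (e.g. "2024-03"), amount. */
    record Purchase(String category, String month, double amount) {}

    /** The contribution a partner would publish to the shared topic. */
    record Contribution(String category, String month, double sum, long count) {}

    /** Phase one: local aggregation; raw records never leave the partner. */
    static Map<String, Contribution> localContributions(List<Purchase> purchases) {
        Map<String, Contribution> result = new HashMap<>();
        for (Purchase p : purchases) {
            String key = p.category() + "/" + p.month();
            result.merge(key,
                new Contribution(p.category(), p.month(), p.amount(), 1L),
                (a, b) -> new Contribution(a.category(), a.month(),
                                           a.sum() + b.sum(), a.count() + b.count()));
        }
        return result;
    }

    /** Phase two: merge all partners' contributions into global averages. */
    static Map<String, Double> globalAverages(List<Map<String, Contribution>> allPartners) {
        Map<String, double[]> totals = new HashMap<>(); // key -> [sum, count]
        for (Map<String, Contribution> partner : allPartners) {
            for (Contribution c : partner.values()) {
                double[] t = totals.computeIfAbsent(
                        c.category() + "/" + c.month(), k -> new double[2]);
                t[0] += c.sum();
                t[1] += c.count();
            }
        }
        Map<String, Double> averages = new HashMap<>();
        totals.forEach((key, t) -> averages.put(key, t[0] / t[1]));
        return averages;
    }
}
```

Because only `sum` and `count` values are exchanged, each partner can compare its own local average against the global one without any raw data leaving its environment.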
  
diff --cc blog/kafka-meets-wayang-2.md
index 60ef401e,bfd2abbd..23182487
--- a/blog/kafka-meets-wayang-2.md
+++ b/blog/kafka-meets-wayang-2.md
@@@ -62,10 -62,10 +62,11 @@@ The second layer builds the bridge betw
  
  Also, the mapping between the abstract components and the specific implementations is defined in this layer.
  
- Therefore, the mappings package has a class _Mappings_ in which all relevant input and output operators are listed.
- We use it to register a KafkaSourceMapping and a KafkaSinkMapping for the particular platform, Java in our case.
- These classes allow the Apache Wayang framework to use the Java implementation of the KafkaTopicSource component (and KafkaTopicSink, respectively).
- While the Wayang execution plan uses the higher abstractions, here on the “platform level” we have to link the specific implementation for the target platform.
+ Therefore, the mappings package has a class _Mappings_ in which all relevant input and output operators are listed. 
+ We use it to register a KafkaSourceMapping and a KafkaSinkMapping for the particular platform, Java in our case. 
+ These classes allow the Apache Wayang framework to use the Java implementation of the KafkaTopicSource component (and KafkaTopicSink, respectively). 
+ While the Wayang execution plan uses the higher abstractions, here on the “platform level” we have to link the specific implementation for the target platform. 
++
  In our case this leads to a Java program running on a JVM which is set up by 
the Apache Wayang framework using the logical components of the execution plan.
  
  Those mappings link the real implementations of our operators to the ones used in an execution plan.
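
The registry idea behind such a mappings layer can be illustrated with a few lines of plain Java. Note that Wayang's real Mapping API is richer than this; the sketch below only shows the pattern of resolving an abstract plan operator to a platform-specific implementation, and all class names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Illustration of the mapping pattern only: abstract operator names are
 * linked to factories that produce platform-specific implementations.
 */
public class MappingsSketch {

    /** Marker for a platform-specific, executable operator. */
    interface ExecutionOperator { String describe(); }

    /** Hypothetical Java-platform implementation of a Kafka source. */
    static class JavaKafkaTopicSource implements ExecutionOperator {
        public String describe() { return "reads records from a Kafka topic on the Java platform"; }
    }

    /** Hypothetical Java-platform implementation of a Kafka sink. */
    static class JavaKafkaTopicSink implements ExecutionOperator {
        public String describe() { return "writes records to a Kafka topic on the Java platform"; }
    }

    private final Map<String, Supplier<ExecutionOperator>> mappings = new HashMap<>();

    /** Register the platform implementation for an abstract operator. */
    void register(String abstractOperator, Supplier<ExecutionOperator> factory) {
        mappings.put(abstractOperator, factory);
    }

    /** Resolve an abstract operator of the plan to its implementation. */
    ExecutionOperator resolve(String abstractOperator) {
        Supplier<ExecutionOperator> factory = mappings.get(abstractOperator);
        if (factory == null) {
            throw new IllegalArgumentException("No mapping for " + abstractOperator);
        }
        return factory.get();
    }
}
```

Keeping the plan-level operators decoupled from their implementations like this is what lets the same logical plan later run on Apache Spark instead of Java: only the registered factories change.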
@@@ -87,7 -87,7 +88,8 @@@ The layer above handles the mapping of 
  All this wiring is needed to keep Wayang open and flexible, so that multiple external systems can be used in a variety of combinations, across multiple target platforms.
  
  ## Outlook
- The next part of the article series will cover the creation of a Kafka Source and Sink component for the Apache Spark platform, which allows our use case to scale.
+ The next part of the article series will cover the creation of a Kafka Source and Sink component for the Apache Spark platform, which allows our use case to scale. 
++
  Finally, in part four, we bring all the puzzle pieces together and show the full implementation of the multi-organizational data collaboration use case.
  
  
