[hudi] branch asf-site updated: Travis CI build asf-site

vinoth Wed, 21 Oct 2020 19:17:59 -0700

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 5cf318e  Travis CI build asf-site
5cf318e is described below

commit 5cf318e823e82325adf42287772a7decfc868e8d
Author: CI <[email protected]>
AuthorDate: Thu Oct 22 02:17:16 2020 +0000

    Travis CI build asf-site
---
 content/activity.html                              |  24 ++
 .../assets/images/blog/hudi-meets-flink/image1.png | Bin 0 -> 161393 bytes
 .../assets/images/blog/hudi-meets-flink/image2.png | Bin 0 -> 123298 bytes
 .../assets/images/blog/hudi-meets-flink/image3.png | Bin 0 -> 175346 bytes
 content/assets/js/lunr/lunr-store.js               |   5 +
 content/blog.html                                  |  24 ++
 .../blog/apache-hudi-meets-apache-flink/index.html | 435 +++++++++++++++++++++
 content/cn/activity.html                           |  24 ++
 content/sitemap.xml                                |   4 +
 9 files changed, 516 insertions(+)

diff --git a/content/activity.html b/content/activity.html
index ea50e63..08f7582 100644
--- a/content/activity.html
+++ b/content/activity.html
@@ -191,6 +191,30 @@
     
     <h2 class="archive__item-title" itemprop="headline">
       
+        <a href="/blog/apache-hudi-meets-apache-flink/" rel="permalink">Apache 
Hudi meets Apache Flink
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~mathieu";>Xianghu Wang</a> 
posted on <time datetime="2020-10-15">October 15, 2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">The design and 
latest progress of the integration of Apache Hudi and Apache Flink.
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork";>
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
         <a href="/blog/ingest-multiple-tables-using-hudi/" 
rel="permalink">Ingest multiple tables using Hudi
 </a>
       
diff --git a/content/assets/images/blog/hudi-meets-flink/image1.png 
b/content/assets/images/blog/hudi-meets-flink/image1.png
new file mode 100644
index 0000000..380fdea
Binary files /dev/null and 
b/content/assets/images/blog/hudi-meets-flink/image1.png differ
diff --git a/content/assets/images/blog/hudi-meets-flink/image2.png 
b/content/assets/images/blog/hudi-meets-flink/image2.png
new file mode 100644
index 0000000..cd101e4
Binary files /dev/null and 
b/content/assets/images/blog/hudi-meets-flink/image2.png differ
diff --git a/content/assets/images/blog/hudi-meets-flink/image3.png 
b/content/assets/images/blog/hudi-meets-flink/image3.png
new file mode 100644
index 0000000..c11bab5
Binary files /dev/null and 
b/content/assets/images/blog/hudi-meets-flink/image3.png differ
diff --git a/content/assets/js/lunr/lunr-store.js 
b/content/assets/js/lunr/lunr-store.js
index bc1859c..41afb36 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -1188,4 +1188,9 @@ var store = [{
         "excerpt":"When building a change data capture pipeline for already 
existing or newly created relational databases, one of the most common problems 
that one faces is simplifying the onboarding process for multiple tables. 
Ingesting multiple tables to Hudi dataset at a single go is now possible using 
HoodieMultiTableDeltaStreamer class which is...","categories": ["blog"],
         "tags": [],
         "url": 
"https://hudi.apache.org/blog/ingest-multiple-tables-using-hudi/";,
+        "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
+        "title": "Apache Hudi meets Apache Flink",
+        "excerpt":"Apache Hudi (Hudi for short) is a data lake framework 
created at Uber. Hudi joined the Apache incubator for incubation in January 
2019, and was promoted to the top Apache project in May 2020. It is one of the 
most popular data lake frameworks. 1. Why decouple Hudi has 
been...","categories": ["blog"],
+        "tags": [],
+        "url": "https://hudi.apache.org/blog/apache-hudi-meets-apache-flink/";,
         "teaser":"https://hudi.apache.org/assets/images/500x300.png"},]
diff --git a/content/blog.html b/content/blog.html
index 2bdd4a0..9d28e6b 100644
--- a/content/blog.html
+++ b/content/blog.html
@@ -189,6 +189,30 @@
     
     <h2 class="archive__item-title" itemprop="headline">
       
+        <a href="/blog/apache-hudi-meets-apache-flink/" rel="permalink">Apache 
Hudi meets Apache Flink
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~mathieu";>Xianghu Wang</a> 
posted on <time datetime="2020-10-15">October 15, 2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">The design and 
latest progress of the integration of Apache Hudi and Apache Flink.
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork";>
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
         <a href="/blog/ingest-multiple-tables-using-hudi/" 
rel="permalink">Ingest multiple tables using Hudi
 </a>
       
diff --git a/content/blog/apache-hudi-meets-apache-flink/index.html 
b/content/blog/apache-hudi-meets-apache-flink/index.html
new file mode 100644
index 0000000..6c167b0
--- /dev/null
+++ b/content/blog/apache-hudi-meets-apache-flink/index.html
@@ -0,0 +1,435 @@
+<!doctype html>
+<html lang="en" class="no-js">
+  <head>
+    <meta charset="utf-8">
+
+<!-- begin _includes/seo.html --><title>Apache Hudi meets Apache Flink - 
Apache Hudi</title>
+<meta name="description" content="The design and latest progress of the 
integration of Apache Hudi and Apache Flink.">
+
+<meta property="og:type" content="article">
+<meta property="og:locale" content="en_US">
+<meta property="og:site_name" content="">
+<meta property="og:title" content="Apache Hudi meets Apache Flink">
+<meta property="og:url" 
content="https://hudi.apache.org/blog/apache-hudi-meets-apache-flink/";>
+
+
+  <meta property="og:description" content="The design and latest progress of 
the integration of Apache Hudi and Apache Flink.">
+
+
+
+
+
+
+
+
+
+
+
+<!-- end _includes/seo.html -->
+
+
+<!--<link href="/feed.xml" type="application/atom+xml" rel="alternate" title=" 
Feed">-->
+
+<!-- https://t.co/dKP3o1e -->
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+
+<script>
+  document.documentElement.className = 
document.documentElement.className.replace(/\bno-js\b/g, '') + ' js ';
+</script>
+
+<!-- For all browsers -->
+<link rel="stylesheet" href="/assets/css/main.css">
+
+<!--[if IE]>
+  <style>
+    /* old IE unsupported flexbox fixes */
+    .greedy-nav .site-title {
+      padding-right: 3em;
+    }
+    .greedy-nav button {
+      position: absolute;
+      top: 0;
+      right: 0;
+      height: 100%;
+    }
+  </style>
+<![endif]-->
+
+
+
+<link rel="icon" type="image/x-icon" href="/assets/images/favicon.ico">
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+<script src="/assets/js/jquery.min.js"></script>
+
+    
+<script src="/assets/js/main.min.js"></script>
+
+  </head>
+
+  <body class="layout--single">
+    <!--[if lt IE 9]>
+<div class="notice--danger align-center" style="margin: 0;">You are using an 
<strong>outdated</strong> browser. Please <a 
href="https://browsehappy.com/";>upgrade your browser</a> to improve your 
experience.</div>
+<![endif]-->
+
+    <div class="masthead">
+  <div class="masthead__inner-wrap" id="masthead__inner-wrap">
+    <div class="masthead__menu">
+      <nav id="site-nav" class="greedy-nav">
+        
+          <a class="site-logo" href="/">
+              <div style="width: 150px; height: 40px">
+              </div>
+          </a>
+        
+        <a class="site-title" href="/">
+          
+        </a>
+        <ul class="visible-links"><li class="masthead__menu-item">
+              <a href="/docs/quick-start-guide.html" target="_self" 
>Documentation</a>
+            </li><li class="masthead__menu-item">
+              <a href="/community.html" target="_self" >Community</a>
+            </li><li class="masthead__menu-item">
+              <a href="/blog.html" target="_self" >Blog</a>
+            </li><li class="masthead__menu-item">
+              <a href="https://cwiki.apache.org/confluence/display/HUDI/FAQ"; 
target="_blank" >FAQ</a>
+            </li><li class="masthead__menu-item">
+              <a href="/releases.html" target="_self" >Releases</a>
+            </li></ul>
+        <button class="greedy-nav__toggle hidden" type="button">
+          <span class="visually-hidden">Toggle menu</span>
+          <div class="navicon"></div>
+        </button>
+        <ul class="hidden-links hidden"></ul>
+      </nav>
+    </div>
+  </div>
+</div>
+<!--
+<p class="notice--warning" style="margin: 0 !important; text-align: center 
!important;"><strong>Note:</strong> This site is work in progress, if you 
notice any issues, please <a target="_blank" 
href="https://github.com/apache/hudi/issues";>Report on Issue</a>.
+  Click <a href="/"> here</a> back to old site.</p>
+-->
+
+    <div class="initial-content">
+      <div id="main" role="main">
+  
+
+  <div class="sidebar sticky">
+
+  
+    <div itemscope itemtype="https://schema.org/Person";>
+
+  <div class="author__content">
+    
+      <h3 class="author__name" itemprop="name">Quick Links</h3>
+    
+    
+      <div class="author__bio" itemprop="description">
+        <p>Hudi <em>ingests</em> &amp; <em>manages</em> storage of large 
analytical datasets over DFS.</p>
+
+      </div>
+    
+  </div>
+
+  <div class="author__urls-wrapper">
+    <ul class="author__urls social-icons">
+      
+        
+          <li><a href="/docs/quick-start-guide" target="_self" rel="nofollow 
noopener noreferrer"><i class="fa fa-book" aria-hidden="true"></i> 
Documentation</a></li>
+
+          
+        
+          <li><a href="https://cwiki.apache.org/confluence/display/HUDI"; 
target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-wikipedia-w" 
aria-hidden="true"></i> Technical Wiki</a></li>
+
+          
+        
+          <li><a href="/contributing" target="_self" rel="nofollow noopener 
noreferrer"><i class="fa fa-thumbs-o-up" aria-hidden="true"></i> Contribution 
Guide</a></li>
+
+          
+        
+          <li><a 
href="https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE";
 target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-slack" 
aria-hidden="true"></i> Join on Slack</a></li>
+
+          
+        
+          <li><a href="https://github.com/apache/hudi"; target="_blank" 
rel="nofollow noopener noreferrer"><i class="fa fa-github" 
aria-hidden="true"></i> Fork on GitHub</a></li>
+
+          
+        
+          <li><a href="https://issues.apache.org/jira/projects/HUDI/summary"; 
target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-navicon" 
aria-hidden="true"></i> Report Issues</a></li>
+
+          
+        
+          <li><a href="/security" target="_self" rel="nofollow noopener 
noreferrer"><i class="fa fa-navicon" aria-hidden="true"></i> Report Security 
Issues</a></li>
+
+          
+        
+      
+    </ul>
+  </div>
+</div>
+
+  
+
+  
+  </div>
+
+
+  <article class="page" itemscope itemtype="https://schema.org/CreativeWork";>
+    <!-- Look the author details up from the site config. -->
+    
+
+    <div class="page__inner-wrap">
+      
+        <header>
+          <h1 id="page-title" class="page__title" itemprop="headline">Apache 
Hudi meets Apache Flink
+</h1>
+          <!-- Output author details if some exist. -->
+          <div class="page__author"><a 
href="https://cwiki.apache.org/confluence/display/~mathieu";>Xianghu Wang</a> 
posted on <time datetime="2020-10-15">October 15, 2020</time></span>
+        </header>
+      
+
+      <section class="page__content" itemprop="text">
+        
+          <style>
+            .page {
+              padding-right: 0 !important;
+            }
+          </style>
+        
+        <p>Apache Hudi (Hudi for short) is a data lake framework created at 
Uber. Hudi joined the Apache incubator for incubation in January 2019, and was 
promoted to the top Apache project in May 2020. It is one of the most popular 
data lake frameworks.</p>
+
+<h2 id="1-why-decouple">1. Why decouple</h2>
+
+<p>Hudi has been using Spark as its data processing engine since its birth. If 
users want to use Hudi as their data lake framework, they must introduce Spark 
into their platform technology stack. 
+A few years ago, using Spark as a big data processing engine can be said to be 
very common or even natural. Since Spark can either perform batch processing or 
use micro-batch to simulate streaming, one engine solves both streaming and 
batch problems. 
+However, in recent years, with the development of big data technology, Flink, 
which is also a big data processing engine, has gradually entered people’s 
vision and has occupied a certain market in the field of computing engines. 
+In the big data technology community, forums and other territories, the voice 
of whether Hudi supports Flink has gradually appeared and has become more 
frequent. Therefore, it is a valuable thing to make Hudi support the Flink 
engine, and the first step of integrating the Flink engine is that Hudi and 
Spark are decoupled.</p>
+
+<p>In addition, looking at the mature, active, and viable frameworks in the 
big data, all frameworks are elegant in design and can be integrated with other 
frameworks and leverage each other’s expertise. 
+Therefore, decoupling Hudi from Spark and turning it into an 
engine-independent data lake framework will undoubtedly create more 
possibilities for the integration of Hudi and other components, allowing Hudi 
to better integrate into the big data ecosystem.</p>
+
+<h2 id="2-challenges">2. Challenges</h2>
+
+<p>Hudi’s internal use of Spark API is as common as our usual development and 
use of List. Since the data source reads the data, and finally writes the data 
to the table, Spark RDD is used as the main data structure everywhere, and even 
ordinary tools are implemented using the Spark API. 
+It can be said that Hudi is a universal data lake framework implemented by 
Spark. Hudi also leverages deep Spark functionality like custom partitioning, 
in-memory caching to implement indexing and file sizing using workload 
heuristics. 
+For some of these, Flink offers better out-of-box support (e.g using Flink’s 
state store for indexing) and can in fact, make Hudi approach real-time 
latencies more and more.</p>
+
+<p>In addition, the primary engine integrated after this decoupling is Flink. 
Flink and Spark differ greatly in core abstraction. Spark believes that data is 
bounded, and its core abstraction is a limited set of data. 
+Flink believes that the essence of data is a stream, and its core abstract 
DataStream contains various operations on data. Hudi has a streaming first 
design (record level updates, record level streams), that arguably fit the 
Flink model more naturally. 
+At the same time, there are multiple RDDs operating at the same time in Hudi, 
and the processing result of one RDD is combined with another RDD. 
+This difference in abstraction and the reuse of intermediate results during 
implementation make it difficult for Hudi to use a unified API to operate both 
RDD and DataStream in terms of decoupling abstraction.</p>
+
+<h2 id="3-decoupling-spark">3. Decoupling Spark</h2>
+<p>In theory, Hudi uses Spark as its computing engine to use Spark’s 
distributed computing power and RDD’s rich operator capabilities. Apart from 
distributed computing power, Hudi uses RDD more as a data structure, and RDD is 
essentially a bounded data set. 
+Therefore, it is theoretically feasible to replace RDD with List (of course, 
it may sacrifice performance/scale). In order to ensure the performance and 
stability of the Hudi Spark version as much as possible. We can keep setting 
the bounded data set as the basic operation unit. 
+Hudi’s main operation API remains unchanged, and RDD is extracted as a generic 
type. The Spark engine implementation still uses RDD, and other engines use 
List or other bounded  data set according to the actual situation.</p>
+
+<h3 id="decoupling-principle">Decoupling principle</h3>
+<p>1) Unified generics. The input records <code 
class="highlighter-rouge">JavaRDD&lt;HoodieRecord&gt;</code>, key of input 
records <code class="highlighter-rouge">JavaRDD&lt;HoodieKey&gt;</code>, and 
result of write operations <code 
class="highlighter-rouge">JavaRDD&lt;WriteStatus&gt;</code> used by the Spark 
API use generic <code class="highlighter-rouge">I,K,O</code> instead;</p>
+
+<p>2) De-sparkization. All APIs of the abstraction layer must have nothing to 
do with Spark. Involving specific operations that are difficult to implement in 
the abstract layer, rewrite them as abstract methods and introduce Spark 
subclasses.</p>
+
+<p>For example: Hudi uses the <code 
class="highlighter-rouge">JavaSparkContext#map()</code> method in many places. 
To de-spark, you need to hide the <code 
class="highlighter-rouge">JavaSparkContext</code>. For this problem, we 
introduced the <code class="highlighter-rouge">HoodieEngineContext#map()</code> 
method, which will block the specific implementation details of <code 
class="highlighter-rouge">map</code>, so as to achieve de-sparkization in 
abstraction.</p>
+
+<p>3) Minimize changes to the abstraction layer to ensure the original 
function and performance of Hudi;</p>
+
+<p>4) Replace the <code class="highlighter-rouge">JavaSparkContext</code> with 
the <code class="highlighter-rouge">HoodieEngineContext</code> abstract class 
to provide the running environment context.</p>
+
+<p>In addition, some of the core algorithms in Hudi, like <a 
href="https://github.com/apache/hudi/pull/1756";>rollback</a>, has been redone 
without the need for computing a workload profile ahead of time, which used to 
rely on Spark caching.</p>
+
+<h2 id="4-flink-integration-design">4. Flink integration design</h2>
+<p>Hudi’s write operation is batch processing in nature, and the continuous 
mode of <code class="highlighter-rouge">DeltaStreamer</code> is realized by 
looping batch processing. In order to use a unified API, when Hudi integrates 
Flink, we choose to collect a batch of data before processing, and finally 
submit it in a unified manner (here we use List to collect data in Flink).
+In Hudi terminology, we will stream data for a given commit, but only publish 
the commits every so often, making it practical to scale storage on cloud 
storage and also tunable.</p>
+
+<p>The easiest way to think of batch operation is to use a time window. 
However, when using a window, when there is no data flowing in a window, there 
will be no output data, and it is difficult for the Flink sink to judge whether 
all the data from a given batch has been processed. 
+Therefore, we use Flink’s checkpoint mechanism to collect batches. The data 
between every two barriers is a batch. When there is no data in a subtask, the 
mock result data is made up. 
+In this way, on the sink side, when each subtask has result data issued, it 
can be considered that a batch of data has been processed and the commit can be 
executed.</p>
+
+<p>The DAG is as follows:</p>
+
+<p><img src="/assets/images/blog/hudi-meets-flink/image1.png" alt="dualism" 
/></p>
+
+<ul>
+  <li><strong>Source:</strong> receives Kafka data and converts it into <code 
class="highlighter-rouge">List&lt;HoodieRecord&gt;</code>;</li>
+  <li><strong>InstantGeneratorOperator:</strong> generates a globally unique 
instant. When the previous instant is not completed or the current batch has no 
data, no new instant is created;</li>
+  <li><strong>KeyBy partitionPath:</strong> partitions according to <code 
class="highlighter-rouge">partitionPath</code> to avoid multiple subtasks from 
writing the same partition;</li>
+  <li><strong>WriteProcessOperator:</strong> performs a write operation. When 
there is no data in the current partition, it sends empty result data to the 
downstream to make up the number;</li>
+  <li><strong>CommitSink:</strong> receives the calculation results of the 
upstream task. When receiving the parallelism results, it is considered that 
all the upstream subtasks are completed and the commit is executed.</li>
+</ul>
+
+<p>Note:
+<code class="highlighter-rouge">InstantGeneratorOperator</code> and <code 
class="highlighter-rouge">WriteProcessOperator</code> are both custom Flink 
operators. <code class="highlighter-rouge">InstantGeneratorOperator</code> will 
block checking the state of the previous instant to ensure that there is only 
one instant in the global (or requested) state.
+<code class="highlighter-rouge">WriteProcessOperator</code> is the actual 
execution Where a write operation is performed, the write operation is 
triggered at checkpoint.</p>
+
+<h3 id="41-index-design-based-on-flink-state">4.1 Index design based on Flink 
State</h3>
+
+<p>Stateful computing is one of the highlights of the Flink engine. Compared 
with using external storage, using Flink’s built-in <code 
class="highlighter-rouge">State</code> can significantly improve the 
performance of Flink applications. 
+Therefore, it would be a good choice to implement a Hudi index based on 
Flink’s State.</p>
+
+<p>The core of the Hudi index is to maintain the mapping of the Hudi key <code 
class="highlighter-rouge">HoodieKey</code> and the location of the Hudi data 
<code class="highlighter-rouge">HoodieRecordLocation</code>. 
+Therefore, based on the current design, we can simply maintain a <code 
class="highlighter-rouge">MapState&lt;HoodieKey, 
HoodieRecordLocation&gt;</code> in Flink UDF to map the <code 
class="highlighter-rouge">HoodieKey</code> and <code 
class="highlighter-rouge">HoodieRecordLocation</code>, and leave the fault 
tolerance and persistence of State to the Flink framework.</p>
+
+<p><img src="/assets/images/blog/hudi-meets-flink/image2.png" alt="dualism" 
/></p>
+
+<h2 id="5-implementation-examples">5. Implementation examples</h2>
+<h3 id="1-hoodietable">1) HoodieTable</h3>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>/**
+  * Abstract implementation of a HoodieTable.
+  *
+  * @param &lt;T&gt; Sub type of HoodieRecordPayload
+  * @param &lt;I&gt; Type of inputs
+  * @param &lt;K&gt; Type of keys
+  * @param &lt;O&gt; Type of outputs
+  */
+public abstract class HoodieTable&lt;T extends HoodieRecordPayload, I, K, 
O&gt; implements Serializable {
+
+   protected final HoodieWriteConfig config;
+   protected final HoodieTableMetaClient metaClient;
+   protected final HoodieIndex&lt;T, I, K, O&gt; index;
+
+   public abstract HoodieWriteMetadata&lt;O&gt; upsert(HoodieEngineContext 
context, String instantTime,
+       I records);
+
+   public abstract HoodieWriteMetadata&lt;O&gt; insert(HoodieEngineContext 
context, String instantTime,
+       I records);
+
+   public abstract HoodieWriteMetadata&lt;O&gt; bulkInsert(HoodieEngineContext 
context, String instantTime,
+       I records, Option&lt;BulkInsertPartitioner&lt;I&gt;&gt; 
bulkInsertPartitioner);
+
+   ...
+}
+</code></pre></div></div>
+
+<p><code class="highlighter-rouge">HoodieTable</code> is one of the core 
abstractions of Hudi, which defines operations such as <code 
class="highlighter-rouge">insert</code>, <code 
class="highlighter-rouge">upsert</code>, and <code 
class="highlighter-rouge">bulkInsert</code> supported by the table. 
+Take <code class="highlighter-rouge">upsert</code> as an example, the input 
data is changed from the original <code 
class="highlighter-rouge">JavaRDD&lt;HoodieRecord&gt; inputRdds</code> to <code 
class="highlighter-rouge">I records</code>, and the runtime <code 
class="highlighter-rouge">JavaSparkContext jsc</code> is changed to <code 
class="highlighter-rouge">HoodieEngineContext context</code>.</p>
+
+<p>From the class annotations, we can see that <code 
class="highlighter-rouge">T, I, K, O</code> represents the load data type, 
input data type, primary key type and output data type of Hudi operation 
respectively. 
+These generics will run through the entire abstraction layer.</p>
+
+<h3 id="2-hoodieenginecontext">2) HoodieEngineContext</h3>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>/**
+ * Base class contains the context information needed by the engine at 
runtime. It will be extended by different
+ * engine implementation if needed.
+ */
+public abstract class HoodieEngineContext {
+
+  public abstract &lt;I, O&gt; List&lt;O&gt; map(List&lt;I&gt; data, 
SerializableFunction&lt;I, O&gt; func, int parallelism);
+
+  public abstract &lt;I, O&gt; List&lt;O&gt; flatMap(List&lt;I&gt; data, 
SerializableFunction&lt;I, Stream&lt;O&gt;&gt; func, int parallelism);
+
+  public abstract &lt;I&gt; void foreach(List&lt;I&gt; data, 
SerializableConsumer&lt;I&gt; consumer, int parallelism);
+
+  ......
+}
+</code></pre></div></div>
+
+<p><code class="highlighter-rouge">HoodieEngineContext</code> plays the role 
of <code class="highlighter-rouge">JavaSparkContext</code>, it not only 
provides all the information that <code 
class="highlighter-rouge">JavaSparkContext</code> can provide, 
+but also encapsulates many methods such as <code 
class="highlighter-rouge">map</code>, <code 
class="highlighter-rouge">flatMap</code>, <code 
class="highlighter-rouge">foreach</code>, and hides The specific implementation 
of <code class="highlighter-rouge">JavaSparkContext#map()</code>,<code 
class="highlighter-rouge">JavaSparkContext#flatMap()</code>, <code 
class="highlighter-rouge">JavaSparkContext#foreach()</code> and other 
methods.</p>
+
+<p>Take the <code class="highlighter-rouge">map</code> method as an example. 
In the Spark implementation class <code 
class="highlighter-rouge">HoodieSparkEngineContext</code>, the <code 
class="highlighter-rouge">map</code> method is as follows:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>  @Override
+  public &lt;I, O&gt; List&lt;O&gt; map(List&lt;I&gt; data, 
SerializableFunction&lt;I, O&gt; func, int parallelism) {
+    return javaSparkContext.parallelize(data, 
parallelism).map(func::apply).collect();
+  }
+</code></pre></div></div>
+
+<p>In the engine that operates List, the implementation can be as follows 
(different methods need to pay attention to thread-safety issues, use <code 
class="highlighter-rouge">parallel()</code> with caution):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>  @Override
+  public &lt;I, O&gt; List&lt;O&gt; map(List&lt;I&gt; data, 
SerializableFunction&lt;I, O&gt; func, int parallelism) {
+    return 
data.stream().parallel().map(func::apply).collect(Collectors.toList());
+  }
+</code></pre></div></div>
+
+<p>Note:
+The exception thrown in the map function can be solved by wrapping <code 
class="highlighter-rouge">SerializableFunction&lt;I, O&gt; func</code>.</p>
+
+<p>Here is a brief introduction to <code 
class="highlighter-rouge">SerializableFunction</code>:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>@FunctionalInterface
+public interface SerializableFunction&lt;I, O&gt; extends Serializable {
+  O apply(I v1) throws Exception;
+}
+</code></pre></div></div>
+
+<p>This method is actually a variant of <code 
class="highlighter-rouge">java.util.function.Function</code>. The difference 
from <code class="highlighter-rouge">java.util.function.Function</code> is that 
<code class="highlighter-rouge">SerializableFunction</code> can be serialized 
and can throw exceptions. 
+This function is introduced because the input parameters that the <code 
class="highlighter-rouge">JavaSparkContext#map()</code> function can receive 
must be serializable. 
+At the same time, there are many exceptions that need to be thrown in the 
logic of Hudi, and the code for <code 
class="highlighter-rouge">try-catch</code> in the Lambda expression will be 
omitted It is bloated and not very elegant.</p>
+
+<h2 id="6-current-progress-and-follow-up-plan">6. Current progress and 
follow-up plan</h2>
+
+<h3 id="61-working-time-axis">6.1 Working time axis</h3>
+
+<p><img src="/assets/images/blog/hudi-meets-flink/image3.png" alt="dualism" 
/></p>
+
+<p><a href="https://www.t3go.cn/";>T3go</a>
+<a href="https://cn.aliyun.com/";>Aliyun</a>
+<a href="https://www.sf-express.com/cn/sc/";>SF-express</a></p>
+
+<h3 id="62-follow-up-plan">6.2 Follow-up plan</h3>
+
+<h4 id="1-promote-the-integration-of-hudi-and-flink">1) Promote the 
integration of Hudi and Flink</h4>
+
+<p>Push the integration of Flink and Hudi to the community as soon as 
possible. In the initial stage, this feature may only support Kafka data 
sources.</p>
+
+<h4 id="2-performance-optimization">2) Performance optimization</h4>
+
+<p>In order to ensure the stability and performance of the Hudi-Spark version, 
the decoupling did not take too much into consideration the possible 
performance problems of the Flink version.</p>
+
+<h4 id="3-flink-connector-hudi-like-third-party-package-development">3) 
flink-connector-hudi like third-party package development</h4>
+
+<p>Make the binding of Hudi-Flink into a third-party package. Users can this 
third-party package to read/write from/to Hudi with Flink.</p>
+
+      </section>
+
+      <a href="#masthead__inner-wrap" class="back-to-top">Back to top 
&uarr;</a>
+
+
+      
+
+    </div>
+
+  </article>
+
+</div>
+
+    </div>
+
+    <div class="page__footer">
+      <footer>
+        
+<div class="row">
+  <div class="col-lg-12 footer">
+    <p>
+      <table class="table-apache-info">
+        <tr>
+          <td>
+            <a class="footer-link-img" href="https://apache.org";>
+              <img width="250px" src="/assets/images/asf_logo.svg" alt="The 
Apache Software Foundation">
+            </a>
+          </td>
+          <td>
+            <a style="float: right" 
href="https://www.apache.org/events/current-event.html";>
+              <img 
src="https://www.apache.org/events/current-event-234x60.png"; />
+            </a>
+          </td>
+        </tr>
+      </table>
+    </p>
+    <p>
+      <a href="https://www.apache.org/licenses/";>License</a> | <a 
href="https://www.apache.org/security/";>Security</a> | <a 
href="https://www.apache.org/foundation/thanks.html";>Thanks</a> | <a 
href="https://www.apache.org/foundation/sponsorship.html";>Sponsorship</a>
+    </p>
+    <p>
+      Copyright &copy; <span id="copyright-year">2019</span> <a 
href="https://apache.org";>The Apache Software Foundation</a>, Licensed under 
the <a href="https://www.apache.org/licenses/LICENSE-2.0";> Apache License, 
Version 2.0</a>.
+      Hudi, Apache and the Apache feather logo are trademarks of The Apache 
Software Foundation. <a href="/docs/privacy">Privacy Policy</a>
+    </p>
+  </div>
+</div>
+      </footer>
+    </div>
+
+
+  </body>
+</html>
\ No newline at end of file
diff --git a/content/cn/activity.html b/content/cn/activity.html
index 21ea363..80134b2 100644
--- a/content/cn/activity.html
+++ b/content/cn/activity.html
@@ -191,6 +191,30 @@
     
     <h2 class="archive__item-title" itemprop="headline">
       
+        <a href="/blog/apache-hudi-meets-apache-flink/" rel="permalink">Apache 
Hudi meets Apache Flink
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~mathieu";>Xianghu Wang</a> 
posted on <time datetime="2020-10-15">October 15, 2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">The design and 
latest progress of the integration of Apache Hudi and Apache Flink.
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork";>
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
         <a href="/blog/ingest-multiple-tables-using-hudi/" 
rel="permalink">Ingest multiple tables using Hudi
 </a>
       
diff --git a/content/sitemap.xml b/content/sitemap.xml
index 53b9400..5b1411f 100644
--- a/content/sitemap.xml
+++ b/content/sitemap.xml
@@ -953,6 +953,10 @@
 <lastmod>2020-08-22T00:00:00-04:00</lastmod>
 </url>
 <url>
+<loc>https://hudi.apache.org/blog/apache-hudi-meets-apache-flink/</loc>
+<lastmod>2020-10-15T00:00:00-04:00</lastmod>
+</url>
+<url>
 <loc>https://hudi.apache.org/cn/activity</loc>
 <lastmod>2019-12-30T14:59:57-05:00</lastmod>
 </url>

[hudi] branch asf-site updated: Travis CI build asf-site

Reply via email to