This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi-site.git

commit 97b3106520c489612dd2187eb9ce4796d5f5c49f
Author: Vinoth Chandar <[email protected]>
AuthorDate: Sat Mar 9 13:18:07 2019 -0800

    Refreshing site content
---
 content/.gitignore                                 |   1 -
 content/404.html                                   |   6 +-
 content/admin_guide.html                           |  50 ++-
 content/community.html                             |  11 +-
 content/comparison.html                            |  15 +-
 content/concepts.html                              |   8 +-
 content/configurations.html                        | 489 ++++++++++++++-------
 content/contributing.html                          |   8 +-
 content/css/customstyles.css                       |   4 +-
 content/css/theme-blue.css                         |   2 +-
 content/feed.xml                                   |   6 +-
 content/gcs_hoodie.html                            |  16 +-
 ...ommit_duration.png => hudi_commit_duration.png} | Bin
 .../{hoodie_intro_1.png => hudi_intro_1.png}       | Bin
 ...ie_log_format_v2.png => hudi_log_format_v2.png} | Bin
 ...uery_perf_hive.png => hudi_query_perf_hive.png} | Bin
 ..._perf_presto.png => hudi_query_perf_presto.png} | Bin
 ...ry_perf_spark.png => hudi_query_perf_spark.png} | Bin
 .../{hoodie_upsert_dag.png => hudi_upsert_dag.png} | Bin
 ...odie_upsert_perf1.png => hudi_upsert_perf1.png} | Bin
 ...odie_upsert_perf2.png => hudi_upsert_perf2.png} | Bin
 content/implementation.html                        |  22 +-
 content/incremental_processing.html                |  36 +-
 content/index.html                                 |  10 +-
 content/js/mydoc_scroll.html                       |   6 +-
 content/migration_guide.html                       |  15 +-
 content/news.html                                  |   8 +-
 content/news_archive.html                          |   6 +-
 content/powered_by.html                            |   7 +-
 content/privacy.html                               |   6 +-
 content/quickstart.html                            |  26 +-
 content/s3_hoodie.html                             |  19 +-
 content/search.json                                |  40 +-
 content/sql_queries.html                           |   8 +-
 content/strata-talk.html                           |   6 +-
 content/use_cases.html                             |  19 +-
 36 files changed, 558 insertions(+), 292 deletions(-)

diff --git a/content/.gitignore b/content/.gitignore
deleted file mode 100644
index e43b0f9..0000000
--- a/content/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-.DS_Store
diff --git a/content/404.html b/content/404.html
index 9491810..fedef9b 100644
--- a/content/404.html
+++ b/content/404.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" ">
+<meta name="keywords" content="">
 <title>Page Not Found | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/admin_guide.html b/content/admin_guide.html
index 470a219..3625cee 100644
--- a/content/admin_guide.html
+++ b/content/admin_guide.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="This section offers an overview of tools 
available to operate an ecosystem of Hudi datasets">
-<meta name="keywords" content=" admin">
+<meta name="keywords" content="hudi, administration, operation, devops">
 <title>Admin Guide | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -355,11 +359,11 @@
 
 <h2 id="admin-cli">Admin CLI</h2>
 
-<p>Once hoodie has been built via <code class="highlighter-rouge">mvn clean 
install -DskipTests</code>, the shell can be fired by via  <code 
class="highlighter-rouge">cd hoodie-cli &amp;&amp; ./hoodie-cli.sh</code>.
-A hoodie dataset resides on HDFS, in a location referred to as the 
<strong>basePath</strong> and we would need this location in order to connect 
to a Hoodie dataset.
-Hoodie library effectively manages this HDFS dataset internally, using .hoodie 
subfolder to track all metadata</p>
+<p>Once Hudi has been built, the shell can be fired via <code class="highlighter-rouge">cd hoodie-cli &amp;&amp; ./hoodie-cli.sh</code>.
+A Hudi dataset resides on DFS, in a location referred to as the <strong>basePath</strong>, and we need this location in order to connect to a Hudi dataset.
+The Hudi library effectively manages this dataset internally, using the .hoodie subfolder to track all metadata.</p>
 
-<p>To initialize a hoodie table, use the following command.</p>
+<p>To initialize a Hudi table, use the following command.</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>18/09/06 15:56:52 
INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 
'javax.inject.Inject' annotation found and supported for autowiring
 ============================================
@@ -380,7 +384,7 @@ hoodie-&gt;create --path /user/hive/warehouse/table1 
--tableName hoodie_table_1
 </code></pre>
 </div>
 
-<p>To see the description of hoodie table, use the command:</p>
+<p>To see the description of a Hudi table, use the command:</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>
 hoodie:hoodie_table_1-&gt;desc
@@ -398,7 +402,7 @@ hoodie:hoodie_table_1-&gt;desc
 </code></pre>
 </div>
 
-<p>Following is a sample command to connect to a Hoodie dataset contains uber 
trips.</p>
+<p>Following is a sample command to connect to a Hudi dataset containing Uber trips.</p>
 
 <div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:trips-&gt;connect --path /app/uber/trips
 
@@ -447,7 +451,7 @@ hoodie:trips-&gt;
 
 <h4 id="inspecting-commits">Inspecting Commits</h4>
 
-<p>The task of upserting or inserting a batch of incoming records is known as 
a <strong>commit</strong> in Hoodie. A commit provides basic atomicity 
guarantees such that only commited data is available for querying.
+<p>The task of upserting or inserting a batch of incoming records is known as a <strong>commit</strong> in Hudi. A commit provides basic atomicity guarantees such that only committed data is available for querying.
 Each commit has a monotonically increasing string/number called the 
<strong>commit number</strong>. Typically, this is the time at which we started 
the commit.</p>
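Commit numbers like the `20161005225920` inflight file shown further below are timestamp strings; a minimal Java sketch (assuming the yyyyMMddHHmmss layout, with illustrative class/method names) of why their lexicographic order matches the time order:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch (assumption): commit numbers are yyyyMMddHHmmss timestamp strings,
// so plain string comparison yields the monotonically increasing ordering.
public class CommitTimes {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Format a commit start time into a commit number string.
    public static String commitNumber(LocalDateTime start) {
        return start.format(FMT);
    }

    // Fixed-width digits: lexicographic order == chronological order.
    public static boolean isLater(String a, String b) {
        return a.compareTo(b) > 0;
    }

    public static void main(String[] args) {
        String c1 = commitNumber(LocalDateTime.of(2016, 10, 5, 22, 59, 20));
        String c2 = commitNumber(LocalDateTime.of(2016, 10, 5, 23, 18, 0));
        System.out.println(c1);              // 20161005225920
        System.out.println(isLater(c2, c1)); // true
    }
}
```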
 
 <p>To view some basic information about the last 10 commits,</p>
@@ -464,7 +468,7 @@ hoodie:trips-&gt;
 </code></pre>
 </div>
 
-<p>At the start of each write, Hoodie also writes a .inflight commit to the 
.hoodie folder. You can use the timestamp there to estimate how long the commit 
has been inflight</p>
+<p>At the start of each write, Hudi also writes a .inflight commit to the 
.hoodie folder. You can use the timestamp there to estimate how long the commit 
has been inflight</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>$ hdfs dfs -ls 
/app/uber/trips/.hoodie/*.inflight
 -rw-r--r--   3 vinoth supergroup     321984 2016-10-05 23:18 
/app/uber/trips/.hoodie/20161005225920.inflight
@@ -522,7 +526,7 @@ order (See Concepts). The below commands allow users to 
view the file-slices for
 
 <h4 id="statistics">Statistics</h4>
 
-<p>Since Hoodie directly manages file sizes for HDFS dataset, it might be good 
to get an overall picture</p>
+<p>Since Hudi directly manages file sizes for DFS datasets, it might be good to get an overall picture</p>
 
 <div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:trips-&gt;stats filesizes --partitionPath 
2016/09/01 --sortBy "95th" --desc true --limit 10
     
________________________________________________________________________________________________
@@ -534,7 +538,7 @@ order (See Concepts). The below commands allow users to 
view the file-slices for
 </code></pre>
 </div>
 
-<p>In case of Hoodie write taking much longer, it might be good to see the 
write amplification for any sudden increases</p>
+<p>In case a Hudi write is taking much longer than usual, it might be good to see the write amplification for any sudden increases</p>
 
 <div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:trips-&gt;stats wa
     __________________________________________________________________________
@@ -547,7 +551,7 @@ order (See Concepts). The below commands allow users to 
view the file-slices for
 
 <h4 id="archived-commits">Archived Commits</h4>
 
-<p>In order to limit the amount of growth of .commit files on HDFS, Hoodie 
archives older .commit files (with due respect to the cleaner policy) into a 
commits.archived file.
+<p>To limit the growth of .commit files on DFS, Hudi archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
 This is a sequence file that contains a mapping from commitNumber =&gt; json 
with raw information about the commit (same that is nicely rolled up above).</p>
 
 <h4 id="compactions">Compactions</h4>
@@ -692,7 +696,7 @@ No File renames needed to unschedule pending compaction. 
Operation successful.</
 <div class="highlighter-rouge"><pre class="highlight"><code>
 ##### Repair Compaction
 
-The above compaction unscheduling operations could sometimes fail partially 
(e:g -&gt; HDFS temporarily unavailable). With
+The above compaction unscheduling operations could sometimes fail partially (e.g. DFS temporarily unavailable). With
 partial failures, the compaction operation could become inconsistent with the 
state of file-slices. When you run
 `compaction validate`, you can notice invalid compaction operations if there 
is one.  In these cases, the repair
 command comes to the rescue, it will rearrange the file-slices so that there 
is no loss and the file-slices are
@@ -710,7 +714,7 @@ Compaction successfully repaired
 
 <h2 id="metrics">Metrics</h2>
 
-<p>Once the Hoodie Client is configured with the right datasetname and 
environment for metrics, it produces the following graphite metrics, that aid 
in debugging hoodie datasets</p>
+<p>Once the Hudi client is configured with the right dataset name and environment for metrics, it produces the following Graphite metrics, which aid in debugging Hudi datasets</p>
 
 <ul>
   <li><strong>Commit Duration</strong> - The amount of time it took to successfully commit a batch of records</li>
@@ -722,29 +726,29 @@ Compaction successfully repaired
 
 <p>These metrics can then be plotted on a standard tool like grafana. Below is 
a sample commit duration chart.</p>
 
-<figure><img class="docimage" src="images/hoodie_commit_duration.png" 
alt="hoodie_commit_duration.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_commit_duration.png" 
alt="hudi_commit_duration.png" style="max-width: 1000px" /></figure>
 
 <h2 id="troubleshooting-failures">Troubleshooting Failures</h2>
 
-<p>Section below generally aids in debugging Hoodie failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)</p>
+<p>The section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)</p>
 
 <ul>
-  <li><strong>_hoodie_record_key</strong> - Treated as a primary key within 
each HDFS partition, basis of all updates/inserts</li>
+  <li><strong>_hoodie_record_key</strong> - Treated as a primary key within each DFS partition, the basis of all updates/inserts</li>
   <li><strong>_hoodie_commit_time</strong> - Last commit that touched this 
record</li>
   <li><strong>_hoodie_file_name</strong> - Actual file name containing the 
record (super useful to triage duplicates)</li>
   <li><strong>_hoodie_partition_path</strong> - Path from basePath that 
identifies the partition containing this record</li>
 </ul>
 
-<div class="bs-callout bs-callout-warning">Note that as of now, Hoodie assumes 
the application passes in the same deterministic partitionpath for a given 
recordKey. i.e the uniqueness of record key is only enforced within each 
partition</div>
+<div class="bs-callout bs-callout-warning">Note that as of now, Hudi assumes the application passes in the same deterministic partitionPath for a given recordKey, i.e. the uniqueness of the record key is only enforced within each partition.</div>
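The scope of that uniqueness can be sketched with a HoodieKey-like (recordKey, partitionPath) pair; the class below is illustrative, not the actual library type:

```java
import java.util.Objects;

// Illustrative sketch, not the actual HoodieKey class: identity is the
// (recordKey, partitionPath) pair, so the same recordKey under two partition
// paths identifies two distinct records.
public class KeyScope {
    public static final class Key {
        final String recordKey;
        final String partitionPath;
        Key(String recordKey, String partitionPath) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return recordKey.equals(k.recordKey) && partitionPath.equals(k.partitionPath);
        }
        @Override public int hashCode() { return Objects.hash(recordKey, partitionPath); }
    }

    public static void main(String[] args) {
        Key a = new Key("trip-1", "2016/09/01");
        Key b = new Key("trip-1", "2016/09/02"); // same recordKey, different partition
        System.out.println(a.equals(b)); // false: treated as two separate records
    }
}
```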
 
 <h4 id="missing-records">Missing records</h4>
 
 <p>Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hoodie, but 
handed back to the application to decide what to do with it.</p>
+If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.</p>
 
 <h4 id="duplicates">Duplicates</h4>
 
-<p>First of all, please confirm if you do indeed have duplicates 
<strong>AFTER</strong> ensuring the query is accessing the Hoodie datasets <a 
href="sql_queries.html">properly</a> .</p>
+<p>First of all, please confirm if you do indeed have duplicates <strong>AFTER</strong> ensuring the query is accessing the Hudi dataset <a href="sql_queries.html">properly</a>.</p>
 
 <ul>
  <li>If confirmed, please use the metadata fields above to identify the physical files &amp; partition files containing the records.</li>
@@ -754,10 +758,10 @@ If you do find errors, then the record was not actually 
written by Hoodie, but h
 
 <h4 id="spark-failures">Spark failures</h4>
 
-<p>Typical upsert() DAG looks like below. Note that Hoodie client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
+<p>Typical upsert() DAG looks like below. Note that the Hudi client also caches intermediate RDDs to intelligently profile the workload, and to size files and Spark parallelism.
Also, the Spark UI shows sortByKey twice due to the probe job also being shown; nonetheless it's just a single sort.</p>
 
-<figure><img class="docimage" src="images/hoodie_upsert_dag.png" 
alt="hoodie_upsert_dag.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_dag.png" 
alt="hudi_upsert_dag.png" style="max-width: 1000px" /></figure>
 
 <p>At a high level, there are two steps</p>
 
@@ -777,7 +781,7 @@ Also Spark UI shows sortByKey twice due to the probe job 
also being shown, nonet
   <li>Job 7 : Actual writing of data (update + insert + insert turned to 
updates to maintain file size)</li>
 </ul>
 
-<p>Depending on the exception source (Hoodie/Spark), the above knowledge of 
the DAG can be used to pinpoint the actual issue. The most often encountered 
failures result from YARN/HDFS temporary failures.
+<p>Depending on the exception source (Hudi/Spark), the above knowledge of the 
DAG can be used to pinpoint the actual issue. The most often encountered 
failures result from YARN/DFS temporary failures.
In the future, a more sophisticated debug/management UI would be added to the project, which can help automate some of this debugging.</p>
 
 
diff --git a/content/community.html b/content/community.html
index 39488eb..34196f3 100644
--- a/content/community.html
+++ b/content/community.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="hudi, use cases, big data, apache">
 <title>Community | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -355,7 +359,7 @@
   <tbody>
     <tr>
       <td>For any general questions, user support, development discussions</td>
-      <td>Dev Mailing list (<a 
href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">Subscribe</a>,
 <a 
href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#117;&#110;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&
 [...]
+      <td>Dev Mailing list (<a 
href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">Subscribe</a>,
 <a 
href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#045;&#117;&#110;&#115;&#117;&#098;&#115;&#099;&#114;&#105;&#098;&#101;&#064;&#104;&#117;&#100;&#105;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&
 [...]
     </tr>
     <tr>
      <td>For reporting bugs or issues or discovering known issues</td>
@@ -389,9 +393,10 @@ Apache Hudi follows the typical Apache vulnerability 
handling <a href="https://a
   <li>Ask (and/or) answer questions on our support channels listed above.</li>
   <li>Review code or HIPs</li>
   <li>Help improve documentation</li>
+  <li>Author blogs on our wiki</li>
   <li>Testing; Improving out-of-box experience by reporting bugs</li>
   <li>Share new ideas/directions to pursue or propose a new HIP</li>
-  <li>Contributing code to the project</li>
+  <li>Contributing code to the project (<a 
href="https://issues.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+component+%3D+newbie";>newbie
 JIRAs</a>)</li>
 </ul>
 
 <h4 id="code-contributions">Code Contributions</h4>
diff --git a/content/comparison.html b/content/comparison.html
index 34082e0..59bcf75 100644
--- a/content/comparison.html
+++ b/content/comparison.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="apache, hudi, kafka, kudu, hive, hbase, stream 
processing">
 <title>Comparison | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -341,7 +345,7 @@
 
     
 
-  <p>Apache Hudi fills a big void for processing data on top of HDFS, and thus 
mostly co-exists nicely with these technologies. However,
+  <p>Apache Hudi fills a big void for processing data on top of DFS, and thus 
mostly co-exists nicely with these technologies. However,
it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems
and to bring out the different tradeoffs these systems have accepted in their design.</p>
 
@@ -380,16 +384,15 @@ just for analytics. Finally, HBase does not support 
incremental processing primi
 <p>A popular question, we get is : “How does Hudi relate to stream processing 
systems?”, which we will try to answer here. Simply put, Hudi can integrate with
 batch (<code class="highlighter-rouge">copy-on-write storage</code>) and 
streaming (<code class="highlighter-rouge">merge-on-read storage</code>) jobs 
of today, to store the computed results in Hadoop. For Spark apps, this can 
happen via direct
 integration of Hudi library with Spark/Spark streaming DAGs. In case of 
Non-Spark processing systems (eg: Flink, Hive), the processing can be done in 
the respective systems
-and later sent into a Hudi table via a Kafka topic/HDFS intermediate file. In 
more conceptual level, data processing
+and later sent into a Hudi table via a Kafka topic/DFS intermediate file. At a more conceptual level, data processing
 pipelines just consist of three components : <code 
class="highlighter-rouge">source</code>, <code 
class="highlighter-rouge">processing</code>, <code 
class="highlighter-rouge">sink</code>, with users ultimately running queries 
against the sink to use the results of the pipeline.
-Hudi can act as either a source or sink, that stores data on HDFS. 
Applicability of Hudi to a given stream processing pipeline ultimately boils 
down to suitability
+Hudi can act as either a source or sink that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to the suitability
 of Presto/SparkSQL/Hive for your queries.</p>
 
 <p>More advanced use cases revolve around the concepts of <a 
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop";>incremental
 processing</a>, which effectively
 uses Hudi even inside the <code class="highlighter-rouge">processing</code> 
engine to speed up typical batch pipelines. For e.g: Hudi can be used as a 
state store inside a processing DAG (similar
 to how <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend";>rocksDB</a>
 is used by Flink). This is an item on the roadmap
-and will eventually happen as a <a 
href="https://github.com/uber/hoodie/issues/8";>Beam Runner</a></p>
-
+and will eventually happen as a <a 
href="https://issues.apache.org/jira/browse/HUDI-60";>Beam Runner</a></p>
 
 
     <div class="tags">
diff --git a/content/concepts.html b/content/concepts.html
index 7e85d32..22754c4 100644
--- a/content/concepts.html
+++ b/content/concepts.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Here we introduce some basic concepts & give 
a broad technical overview of Hudi">
-<meta name="keywords" content=" concepts">
+<meta name="keywords" content="hudi, design, storage, views, timeline">
 <title>Concepts | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -343,7 +347,7 @@
 
     
 
-  <p>Apache Hudi (pronounced “Hudi”) provides the following primitives over 
datasets on HDFS</p>
+  <p>Apache Hudi (pronounced “hoodie”) provides the following primitives over datasets on DFS</p>
 
 <ul>
   <li>Upsert                     (how do I change the dataset?)</li>
diff --git a/content/configurations.html b/content/configurations.html
index 5f1adb8..73f66c9 100644
--- a/content/configurations.html
+++ b/content/configurations.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Here we list all possible configurations and 
what they mean">
-<meta name="keywords" content=" configurations">
+<meta name="keywords" content="garbage collection, hudi, jvm, configs, tuning">
 <title>Configurations | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -343,174 +347,360 @@
 
     
 
-  <h3 id="configuration">Configuration</h3>
+  <p>This page covers the different ways of configuring your job to write/read Hudi datasets.
+At a high level, you can control behaviour at a few levels.</p>
+
+<ul>
+  <li><strong><a href="#spark-datasource">Spark Datasource Configs</a></strong> : These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick out the write operation, specify how to merge records, or choose the view type to read.</li>
+  <li><strong><a href="#writeclient-configs">WriteClient Configs</a></strong> : Internally, the Hudi datasource uses an RDD-based <code class="highlighter-rouge">HoodieWriteClient</code> API to actually perform writes to storage. These configs provide deep control over lower level aspects like
+ file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.</li>
+  <li><strong><a href="#PAYLOAD_CLASS_OPT_KEY">RecordPayload Config</a></strong> : This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert, based on the incoming new record and the
+ stored old record. Hudi provides default implementations such as <code class="highlighter-rouge">OverwriteWithLatestAvroPayload</code>, which simply updates storage with the latest/last-written record.
+ This can be overridden with a custom class extending <code class="highlighter-rouge">HoodieRecordPayload</code>, at both the datasource and WriteClient levels.</li>
+</ul>
+
+<h3 id="talking-to-cloud-storage">Talking to Cloud Storage</h3>
+
+<p>Regardless of whether the RDD/WriteClient APIs or the Datasource is used, the following information helps configure access
+to cloud stores.</p>
+
+<ul>
+  <li><a href="s3_hoodie.html">AWS S3</a> <br />
+Configurations required for S3 and Hudi co-operability.</li>
+  <li><a href="gcs_hoodie.html">Google Cloud Storage</a> <br />
+Configurations required for GCS and Hudi co-operability.</li>
+</ul>
+
+<h3 id="spark-datasource">Spark Datasource Configs</h3>
+
+<p>Spark jobs using the datasource can be configured by passing the below 
options into the <code class="highlighter-rouge">option(k,v)</code> method as 
usual.
+The actual datasource level configs are listed below.</p>
+
+<h4 id="write-options">Write Options</h4>
+
+<p>Additionally, you can pass down any of the WriteClient level configs 
directly using <code class="highlighter-rouge">options()</code> or <code 
class="highlighter-rouge">option(k,v)</code> methods.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>inputDF.write()
+.format("com.uber.hoodie")
+.options(clientOpts) // any of the Hudi client opts can be passed in as well
+.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieWriteConfig.TABLE_NAME, tableName)
+.mode(SaveMode.Append)
+.save(basePath);
+</code></pre>
+</div>
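The `PRECOMBINE_FIELD_OPT_KEY` option set to `"timestamp"` above controls which of two same-key records survives; a minimal sketch of that pick-the-largest semantics (plain maps standing in for rows; not Hudi's internal code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of precombine semantics (not the library implementation): among
// records sharing the same key, keep the one whose precombine field
// compares largest, per Object.compareTo(..).
public class Precombine {
    @SuppressWarnings("unchecked")
    public static Map<String, Object> pickLatest(List<Map<String, Object>> sameKeyRecords,
                                                 String precombineField) {
        Map<String, Object> best = null;
        for (Map<String, Object> r : sameKeyRecords) {
            if (best == null
                    || ((Comparable<Object>) r.get(precombineField))
                           .compareTo(best.get(precombineField)) > 0) {
                best = r;
            }
        }
        return best;
    }

    // Hypothetical record layout, mirroring the "_row_key"/"timestamp" fields above.
    static Map<String, Object> rec(String key, long ts, String fare) {
        Map<String, Object> m = new HashMap<>();
        m.put("_row_key", key);
        m.put("timestamp", ts);
        m.put("fare", fare);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> dupes =
                Arrays.asList(rec("r1", 100L, "old"), rec("r1", 200L, "new"));
        System.out.println(pickLatest(dupes, "timestamp").get("fare")); // new
    }
}
```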
+
+<p>Options useful for writing datasets via <code 
class="highlighter-rouge">write.format.option(...)</code></p>
+
+<ul>
+  <li><a href="#TABLE_NAME_OPT_KEY">TABLE_NAME_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.table.name</code> 
[Required]<br />
+<span style="color:grey">Hive table name, to register the dataset 
into.</span></li>
+  <li><a href="#OPERATION_OPT_KEY">OPERATION_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.operation</code>, Default: 
<code class="highlighter-rouge">upsert</code><br />
+<span style="color:grey">Whether to do upsert, insert or bulkinsert for the write operation. Use <code class="highlighter-rouge">bulkinsert</code> to load new data into a table, and thereafter use <code class="highlighter-rouge">upsert</code>/<code class="highlighter-rouge">insert</code>.
+Bulk insert uses a disk based write path to scale to large inputs without the need to cache them.</span></li>
+  <li><a href="#STORAGE_TYPE_OPT_KEY">STORAGE_TYPE_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.storage.type</code>, Default: 
<code class="highlighter-rouge">COPY_ON_WRITE</code> <br />
+<span style="color:grey">The storage type for the underlying data, for this 
write. This can’t change between writes.</span></li>
+  <li><a href="#PRECOMBINE_FIELD_OPT_KEY">PRECOMBINE_FIELD_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.precombine.field</code>, 
Default: <code class="highlighter-rouge">ts</code> <br />
+<span style="color:grey">Field used in preCombining before actual write. When 
two records have the same key value,
+we will pick the one with the largest value for the precombine field, 
determined by Object.compareTo(..)</span></li>
+  <li><a href="#PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.payload.class</code>, 
Default: <code 
class="highlighter-rouge">com.uber.hoodie.OverwriteWithLatestAvroPayload</code> 
<br />
+<span style="color:grey">Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting.
+This will render any value set for <code class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code> ineffective</span></li>
+  <li><a href="#RECORDKEY_FIELD_OPT_KEY">RECORDKEY_FIELD_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.recordkey.field</code>, 
Default: <code class="highlighter-rouge">uuid</code> <br />
+<span style="color:grey">Record key field. Value to be used as the <code 
class="highlighter-rouge">recordKey</code> component of <code 
class="highlighter-rouge">HoodieKey</code>. Actual value
+will be obtained by invoking .toString() on the field value. Nested fields can 
be specified using
+the dot notation eg: <code class="highlighter-rouge">a.b.c</code></span></li>
+  <li><a 
href="#PARTITIONPATH_FIELD_OPT_KEY">PARTITIONPATH_FIELD_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.partitionpath.field</code>, 
Default: <code class="highlighter-rouge">partitionpath</code> <br />
+<span style="color:grey">Partition path field. Value to be used as the <code 
class="highlighter-rouge">partitionPath</code> component of <code 
class="highlighter-rouge">HoodieKey</code>.
+Actual value obtained by invoking .toString()</span></li>
+  <li><a href="#KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</a><br 
/>
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.keygenerator.class</code>, 
Default: <code 
class="highlighter-rouge">com.uber.hoodie.SimpleKeyGenerator</code> <br />
+<span style="color:grey">Key generator class, that extracts the
key out of the incoming <code class="highlighter-rouge">Row</code> 
object</span></li>
+  <li><a 
href="#COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</a><br
 />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.commitmeta.key.prefix</code>, 
Default: <code class="highlighter-rouge">_</code> <br />
+<span style="color:grey">Option keys beginning with this prefix are 
automatically added to the commit/deltacommit metadata.
+This is useful to store checkpointing information in a way that is consistent 
with the Hudi timeline</span></li>
+  <li><a href="#INSERT_DROP_DUPS_OPT_KEY">INSERT_DROP_DUPS_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.write.insert.drop.duplicates</code>,
 Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">If set to true, filters out all duplicate records 
from incoming dataframe, during insert operations. </span></li>
+  <li><a href="#HIVE_SYNC_ENABLED_OPT_KEY">HIVE_SYNC_ENABLED_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.enable</code>, Default: 
<code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">When set to true, register/sync the dataset to Apache 
Hive metastore</span></li>
+  <li><a href="#HIVE_DATABASE_OPT_KEY">HIVE_DATABASE_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.database</code>, Default: 
<code class="highlighter-rouge">default</code> <br />
+<span style="color:grey">database to sync to</span></li>
+  <li><a href="#HIVE_TABLE_OPT_KEY">HIVE_TABLE_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.table</code>, [Required] 
<br />
+<span style="color:grey">table to sync to</span></li>
+  <li><a href="#HIVE_USER_OPT_KEY">HIVE_USER_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.username</code>, Default: 
<code class="highlighter-rouge">hive</code> <br />
+<span style="color:grey">hive user name to use</span></li>
+  <li><a href="#HIVE_PASS_OPT_KEY">HIVE_PASS_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.password</code>, Default: 
<code class="highlighter-rouge">hive</code> <br />
+<span style="color:grey">hive password to use</span></li>
+  <li><a href="#HIVE_URL_OPT_KEY">HIVE_URL_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.jdbcurl</code>, Default: 
<code class="highlighter-rouge">jdbc:hive2://localhost:10000</code> <br />
+<span style="color:grey">Hive jdbc url to connect to</span></li>
+  <li><a 
href="#HIVE_PARTITION_FIELDS_OPT_KEY">HIVE_PARTITION_FIELDS_OPT_KEY</a><br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.partition_fields</code>, 
Default: <code class="highlighter-rouge"> </code> <br />
+<span style="color:grey">field in the dataset to use for determining hive 
partition columns.</span></li>
+  <li><a 
href="#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY">HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY</a><br
 />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.partition_extractor_class</code>,
 Default: <code 
class="highlighter-rouge">com.uber.hoodie.hive.SlashEncodedDayPartitionValueExtractor</code>
 <br />
+<span style="color:grey">Class used to extract partition field values into 
hive partition columns.</span></li>
+  <li><a 
href="#HIVE_ASSUME_DATE_PARTITION_OPT_KEY">HIVE_ASSUME_DATE_PARTITION_OPT_KEY</a><br
 />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.hive_sync.assume_date_partitioning</code>,
 Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">Assume partitioning is yyyy/mm/dd</span></li>
+</ul>
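The precombine semantics described above (when two records share the same key, keep the one with the largest precombine field, compared via Object.compareTo(..)) can be sketched as follows. This is an illustrative stand-in, not Hudi's actual payload code; the record shape here is made up:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the precombine rule: among records sharing a key,
// keep the one whose precombine field (here, a timestamp `ts`) is largest
// per Comparable.compareTo. Not Hudi's actual implementation.
class PrecombineSketch {
    // each record is {key, ts}
    public static Map<String, Long> precombine(String[][] records) {
        Map<String, Long> latest = new HashMap<>();
        for (String[] r : records) {
            String key = r[0];
            Long ts = Long.valueOf(r[1]);
            // keep the record with the largest precombine value
            latest.merge(key, ts, (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        return latest;
    }

    public static void main(String[] args) {
        String[][] incoming = {{"uuid-1", "100"}, {"uuid-1", "250"}, {"uuid-2", "50"}};
        // uuid-1 resolves to ts=250, the larger of the two duplicates
        System.out.println(precombine(incoming));
    }
}
```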
+
+<h4 id="read-options">Read Options</h4>
+
+<p>Options useful for reading datasets via <code 
class="highlighter-rouge">read.format.option(...)</code></p>
 
 <ul>
-  <li><a href="#HoodieWriteConfig">HoodieWriteConfig</a> <br />
-<span style="color:grey">Top Level Config which is passed in when 
HoodieWriteClent is created.</span>
+  <li><a href="#VIEW_TYPE_OPT_KEY">VIEW_TYPE_OPT_KEY</a> <br />
+Property: <code class="highlighter-rouge">hoodie.datasource.view.type</code>, 
Default: <code class="highlighter-rouge">read_optimized</code> <br />
+<span style="color:grey">Whether data needs to be read in incremental mode 
(new data since an instantTime),
+Read Optimized mode (obtain latest view, based on columnar data),
+or Real time mode (obtain latest view, based on row &amp; columnar 
data)</span></li>
+  <li><a href="#BEGIN_INSTANTTIME_OPT_KEY">BEGIN_INSTANTTIME_OPT_KEY</a> <br 
/> 
+Property: <code 
class="highlighter-rouge">hoodie.datasource.read.begin.instanttime</code>, 
[Required in incremental mode] <br />
+<span style="color:grey">Instant time to start incrementally pulling data 
from. The instanttime here need not
+necessarily correspond to an instant on the timeline. New data written with an
+ <code class="highlighter-rouge">instant_time &gt; BEGIN_INSTANTTIME</code> 
is fetched. For e.g: ‘20170901080000’ will fetch
+ all new data written after Sep 1, 2017 08:00AM.</span></li>
+  <li><a href="#END_INSTANTTIME_OPT_KEY">END_INSTANTTIME_OPT_KEY</a> <br />
+Property: <code 
class="highlighter-rouge">hoodie.datasource.read.end.instanttime</code>, 
Default: latest instant (i.e fetches all new data since begin instant time) <br 
/>
+<span style="color:grey"> Instant time to limit incrementally fetched data to. 
New data written with an
+<code class="highlighter-rouge">instant_time &lt;= END_INSTANTTIME</code> is 
fetched.</span></li>
+</ul>
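The incremental window above, where records with BEGIN_INSTANTTIME &lt; instant_time &lt;= END_INSTANTTIME are fetched, can be sketched as below. This is a hypothetical illustration of the filtering semantics only; instant times formatted as yyyyMMddHHmmss compare correctly as plain strings:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the incremental-pull window: commits with
// begin < instant_time <= end are selected. Instant times like
// "20170901080000" sort correctly under lexicographic comparison.
class IncrementalWindow {
    public static List<String> select(List<String> commitTimes, String begin, String end) {
        List<String> out = new ArrayList<>();
        for (String t : commitTimes) {
            // strictly after begin, up to and including end
            if (t.compareTo(begin) > 0 && t.compareTo(end) <= 0) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> commits = List.of("20170901080000", "20170902090000", "20170903100000");
        // pull everything after Sep 1, 2017 08:00AM, up to Sep 2 09:00AM
        System.out.println(select(commits, "20170901080000", "20170902090000")); // [20170902090000]
    }
}
```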
+
+<h3 id="writeclient-configs">WriteClient Configs</h3>
+
+<p>Jobs programming directly against the RDD level apis can build a <code 
class="highlighter-rouge">HoodieWriteConfig</code> object and pass it in to the 
<code class="highlighter-rouge">HoodieWriteClient</code> constructor. 
+HoodieWriteConfig can be built using a builder pattern as below.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieWriteConfig 
cfg = HoodieWriteConfig.newBuilder()
+        .withPath(basePath)
+        .forTable(tableName)
+        .withSchema(schemaStr)
+        .withProps(props) // pass raw k,v pairs from a property file.
+        
.withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
+        ...
+        .build();
+</code></pre>
+</div>
+
+<p>The following subsections go over different aspects of write configs, 
explaining the most important configs along with their property names and default values.</p>
+
+<ul>
+  <li><a href="#withPath">withPath</a> (hoodie_base_path) 
+Property: <code class="highlighter-rouge">hoodie.base.path</code> [Required] 
<br />
+<span style="color:grey">Base DFS path under which all the data partitions are 
created. Always prefix it explicitly with the storage scheme (e.g hdfs://, 
s3:// etc). Hudi stores all the main meta-data about commits, savepoints, 
cleaning audit logs etc in .hoodie directory under the base directory. 
</span></li>
+  <li><a href="#withSchema">withSchema</a> (schema_str) <br /> 
+Property: <code class="highlighter-rouge">hoodie.avro.schema</code> 
[Required]<br />
+<span style="color:grey">This is the current reader avro schema for the 
dataset. This is a string of the entire schema. HoodieWriteClient uses this 
schema to pass on to implementations of HoodieRecordPayload to convert from the 
source format to avro record. This is also used when re-writing records during 
an update. </span></li>
+  <li><a href="#forTable">forTable</a> (table_name)<br /> 
+Property: <code class="highlighter-rouge">hoodie.table.name</code> [Required] 
<br />
+ <span style="color:grey">Table name for the dataset, will be used for 
registering with Hive. Needs to be the same across runs.</span></li>
+  <li><a href="#withBulkInsertParallelism">withBulkInsertParallelism</a> 
(bulk_insert_parallelism = 1500) <br /> 
+Property: <code 
class="highlighter-rouge">hoodie.bulkinsert.shuffle.parallelism</code><br />
+<span style="color:grey">Bulk insert is meant to be used for large initial 
imports and this parallelism determines the initial number of files in your 
dataset. Tune this to achieve a desired optimal size during initial 
import.</span></li>
+  <li><a href="#withParallelism">withParallelism</a> 
(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500)<br /> 
+Property: <code 
class="highlighter-rouge">hoodie.insert.shuffle.parallelism</code>, <code 
class="highlighter-rouge">hoodie.upsert.shuffle.parallelism</code><br />
+<span style="color:grey">Once data has been initially imported, this 
parallelism controls the initial parallelism for reading input records. Ensure this 
value is high enough, say 1 partition for every 1 GB of input data</span></li>
+  <li><a href="#combineInput">combineInput</a> (on_insert = false, 
on_update=true)<br /> 
+Property: <code class="highlighter-rouge">hoodie.combine.before.insert</code>, 
<code class="highlighter-rouge">hoodie.combine.before.upsert</code><br />
+<span style="color:grey">Flag which first combines the input RDD and merges 
multiple partial records into a single record before inserting or updating in 
DFS</span></li>
+  <li><a href="#withWriteStatusStorageLevel">withWriteStatusStorageLevel</a> 
(level = MEMORY_AND_DISK_SER)<br /> 
+Property: <code 
class="highlighter-rouge">hoodie.write.status.storage.level</code><br />
+<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert 
return a persisted RDD[WriteStatus], because the client can 
inspect the WriteStatus and choose whether or not to commit, based on the failures. 
This configures the storage level for that RDD </span></li>
+  <li><a href="#withAutoCommit">withAutoCommit</a> (autoCommit = true)<br /> 
+Property: <code class="highlighter-rouge">hoodie.auto.commit</code><br />
+<span style="color:grey">Should HoodieWriteClient autoCommit after insert and 
upsert. The client can choose to turn off auto-commit and commit on a “defined 
success condition”</span></li>
+  <li><a href="#withAssumeDatePartitioning">withAssumeDatePartitioning</a> 
(assumeDatePartitioning = false)<br /> 
+Property: <code class="highlighter-rouge">hoodie.assume.date.partitioning</code><br />
+<span style="color:grey">Should HoodieWriteClient assume the data is 
partitioned by dates, i.e three levels from base path. This is a stop-gap to 
support tables created by versions &lt; 0.3.1. Will be removed eventually 
</span></li>
+  <li><a href="#withConsistencyCheckEnabled">withConsistencyCheckEnabled</a> 
(enabled = false)<br /> 
+Property: <code 
class="highlighter-rouge">hoodie.consistency.check.enabled</code><br />
+<span style="color:grey">Should HoodieWriteClient perform additional checks to 
ensure written files are listable on the underlying filesystem/storage. Set 
this to true, to work around S3’s eventual consistency model and ensure all data 
written as a part of a commit is faithfully available for queries. </span></li>
+</ul>
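As a rough sketch of the parallelism guidance above (1 partition for every 1 GB of input), the arithmetic looks like this. This is an illustration of the sizing rule, not a Hudi API:

```java
// Illustration of the sizing guidance above: roughly one shuffle partition
// per GB of input. Not a Hudi API; just the arithmetic behind the guidance.
class ParallelismSizing {
    static final long ONE_GB = 1024L * 1024L * 1024L;

    public static long suggestedParallelism(long inputBytes) {
        // round up, so even a small input gets at least one partition
        return Math.max(1, (inputBytes + ONE_GB - 1) / ONE_GB);
    }

    public static void main(String[] args) {
        System.out.println(suggestedParallelism(500L * ONE_GB)); // 500
    }
}
```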
+
+<h4 id="index-configs">Index configs</h4>
+<p>Following configs control indexing behavior, which tags incoming records as 
either inserts or updates to older records.</p>
+
+<ul>
+  <li><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) <br />
+  <span style="color:grey">This is pluggable, to use an external index (HBase) 
or the default bloom filter stored in the Parquet files</span>
     <ul>
-      <li><a href="#withPath">withPath</a> (hoodie_base_path) <br />
-  <span style="color:grey">Base HDFS path under which all the data partitions 
are created. Hoodie stores all the main meta-data about commits, savepoints, 
cleaning audit logs etc in .hoodie directory under the base directory. 
</span></li>
-      <li><a href="#withSchema">withSchema</a> (schema_str) <br />
-  <span style="color:grey">This is the current reader avro schema for the 
Hoodie Dataset. This is a string of the entire schema. HoodieWriteClient uses 
this schema to pass on to implementations of HoodieRecordPayload to convert 
from the source format to avro record. This is also used when re-writing 
records during an update. </span></li>
-      <li><a href="#withParallelism">withParallelism</a> 
(insert_shuffle_parallelism = 200, upsert_shuffle_parallelism = 200) <br />
-  <span style="color:grey">Insert DAG uses the insert_parallelism in every 
shuffle. Upsert DAG uses the upsert_parallelism in every shuffle. Typical 
workload is profiled and once a min parallelism is established, trade off 
between latency and cluster usage optimizations this is tuned and have a 
conservatively high number to optimize for latency and  </span></li>
-      <li><a href="#combineInput">combineInput</a> (on_insert = false, 
on_update=true) <br />
-  <span style="color:grey">Flag which first combines the input RDD and merges 
multiple partial records into a single record before inserting or updating in 
HDFS</span></li>
-      <li><a 
href="#withWriteStatusStorageLevel">withWriteStatusStorageLevel</a> (level = 
MEMORY_AND_DISK_SER) <br />
-  <span style="color:grey">HoodieWriteClient.insert and 
HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because 
the Client can choose to inspect the WriteStatus and choose and commit or not 
based on the failures. This is a configuration for the storage level for this 
RDD </span></li>
-      <li><a href="#withAutoCommit">withAutoCommit</a> (autoCommit = true) <br 
/>
-  <span style="color:grey">Should HoodieWriteClient autoCommit after insert 
and upsert. The client can choose to turn off auto-commit and commit on a 
“defined success condition”</span></li>
-      <li><a href="#withAssumeDatePartitioning">withAssumeDatePartitioning</a> 
(assumeDatePartitioning = false) <br />
-  <span style="color:grey">Should HoodieWriteClient assume the data is 
partitioned by dates, i.e three levels from base path. This is a stop-gap to 
support tables created by versions &lt; 0.3.1. Will be removed eventually 
</span></li>
-      <li>
-        <p><a 
href="#withConsistencyCheckEnabled">withConsistencyCheckEnabled</a> (enabled = 
false) <br />
-  <span style="color:grey">Should HoodieWriteClient perform additional checks 
to ensure written files’ are listable on the underlying filesystem/storage. Set 
this to true, to workaround S3’s eventual consistency model and ensure all data 
written as a part of a commit is faithfully available for queries. </span></p>
-      </li>
-      <li><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) 
<br />
-  <span style="color:grey">Hoodie uses a index to help find the FileID which 
contains an incoming record key. This is pluggable to have a external index 
(HBase) or use the default bloom filter stored in the Parquet files</span>
-        <ul>
-          <li><a href="#withIndexType">withIndexType</a> (indexType = BLOOM) 
<br />
+      <li><a href="#withIndexType">withIndexType</a> (indexType = BLOOM) <br />
+  Property: <code class="highlighter-rouge">hoodie.index.type</code> <br />
  <span style="color:grey">Type of index to use. Default is Bloom filter. 
Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters remove the 
dependency on an external system and are stored in the footer of the Parquet data 
files</span></li>
-          <li><a href="#bloomFilterNumEntries">bloomFilterNumEntries</a> 
(60000) <br />
-  <span style="color:grey">Only applies if index type is BLOOM. <br />This is 
the number of entries to be stored in the bloom filter. We assume the 
maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx 
a total of 130K records in a file. The default (60000) is roughly half of this 
approximation. <a href="https://github.com/uber/hoodie/issues/70";>#70</a> 
tracks computing this dynamically. Warning: Setting this very low, will 
generate a lot of false positives and in [...]
-          <li><a href="#bloomFilterFPP">bloomFilterFPP</a> (0.000000001) <br />
+      <li><a href="#bloomFilterNumEntries">bloomFilterNumEntries</a> 
(numEntries = 60000) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.index.bloom.num_entries</code> <br />
+  <span style="color:grey">Only applies if index type is BLOOM. <br />This is 
the number of entries to be stored in the bloom filter. We assume the 
maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx 
a total of 130K records in a file. The default (60000) is roughly half of this 
approximation. <a 
href="https://issues.apache.org/jira/browse/HUDI-56";>HUDI-56</a> tracks 
computing this dynamically. Warning: Setting this very low, will generate a lot 
of false positiv [...]
+      <li><a href="#bloomFilterFPP">bloomFilterFPP</a> (fpp = 0.000000001) <br 
/>
+  Property: <code class="highlighter-rouge">hoodie.index.bloom.fpp</code> <br 
/>
   <span style="color:grey">Only applies if index type is BLOOM. <br /> Error 
rate allowed given the number of entries. This is used to calculate how many 
bits should be assigned for the bloom filter and the number of hash functions. 
This is usually set very low (default: 0.000000001), we like to tradeoff disk 
space for lower false positives</span></li>
-          <li><a href="#bloomIndexPruneByRanges">bloomIndexPruneByRanges</a> 
(true) <br />
+      <li><a href="#bloomIndexPruneByRanges">bloomIndexPruneByRanges</a> 
(pruneRanges = true) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.bloom.index.prune.by.ranges</code> <br />
  <span style="color:grey">Only applies if index type is BLOOM. <br /> When 
true, range information from files is leveraged to speed up index lookups. 
Particularly helpful if the key has a monotonically increasing prefix, such as a 
timestamp.</span></li>
-          <li><a href="#bloomIndexUseCaching">bloomIndexUseCaching</a> (true) 
<br />
+      <li><a href="#bloomIndexUseCaching">bloomIndexUseCaching</a> (useCaching 
= true) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.bloom.index.use.caching</code> <br />
  <span style="color:grey">Only applies if index type is BLOOM. <br /> When 
true, the input RDD will be cached to speed up index lookup, by reducing IO for 
computing parallelism or affected partitions</span></li>
-          <li><a href="#bloomIndexParallelism">bloomIndexParallelism</a> (0) 
<br />
+      <li><a href="#bloomIndexParallelism">bloomIndexParallelism</a> (0) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
   <span style="color:grey">Only applies if index type is BLOOM. <br /> This is 
the amount of parallelism for index lookup, which involves a Spark Shuffle. By 
default, this is auto computed based on input workload 
characteristics</span></li>
-          <li><a href="#hbaseZkQuorum">hbaseZkQuorum</a> (zkString) <br />
+      <li><a href="#hbaseZkQuorum">hbaseZkQuorum</a> (zkString) [Required]<br 
/>
+  Property: <code class="highlighter-rouge">hoodie.index.hbase.zkquorum</code> 
<br />
  <span style="color:grey">Only applies if index type is HBASE. HBase ZK 
Quorum url to connect to.</span></li>
-          <li><a href="#hbaseZkPort">hbaseZkPort</a> (port) <br />
+      <li><a href="#hbaseZkPort">hbaseZkPort</a> (port) [Required]<br />
+  Property: <code class="highlighter-rouge">hoodie.index.hbase.zkport</code> 
<br />
  <span style="color:grey">Only applies if index type is HBASE. HBase ZK 
Quorum port to connect to.</span></li>
-          <li><a href="#hbaseTableName">hbaseTableName</a> (tableName) <br />
-  <span style="color:grey">Only application if index type is HBASE. HBase 
Table name to use as the index. Hoodie stores the row_key and [partition_path, 
fileID, commitTime] mapping in the table.</span></li>
-        </ul>
-      </li>
-      <li><a href="#withStorageConfig">withStorageConfig</a> 
(HoodieStorageConfig) <br />
-  <span style="color:grey">Storage related configs</span>
-        <ul>
-          <li><a href="#limitFileSize">limitFileSize</a> (size = 120MB) <br />
-  <span style="color:grey">Hoodie re-writes a single file during update 
(copy_on_write) or a compaction (merge_on_read). This is fundamental unit of 
parallelism. It is important that this is aligned with the underlying 
filesystem block size. </span></li>
-          <li><a href="#parquetBlockSize">parquetBlockSize</a> (rowgroupsize = 
120MB) <br />
-  <span style="color:grey">Parquet RowGroup size. Its better than this is 
aligned with the file size, so that a single column within a file is stored 
continuously on disk</span></li>
-          <li><a href="#parquetPageSize">parquetPageSize</a> (pagesize = 1MB) 
<br />
+      <li><a href="#hbaseTableName">hbaseTableName</a> (tableName) 
[Required]<br />
+  Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code> 
<br />
+  <span style="color:grey">Only applies if index type is HBASE. HBase 
Table name to use as the index. Hudi stores the row_key and [partition_path, 
fileID, commitTime] mapping in the table.</span></li>
+    </ul>
+  </li>
+</ul>
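The arithmetic behind the bloomFilterNumEntries default above can be sketched as below. The 128MB file size and 1024B average record size are the assumptions stated in the description; this is just the back-of-the-envelope calculation, not Hudi's code:

```java
// Arithmetic behind the bloom filter default described above: with a 128MB
// parquet file and ~1KB average records, a file holds ~130K records; the
// default num_entries (60000) is roughly half of that. Illustration only.
class BloomSizing {
    public static long approxRecordsPerFile(long maxFileBytes, long avgRecordBytes) {
        return maxFileBytes / avgRecordBytes;
    }

    public static void main(String[] args) {
        long records = approxRecordsPerFile(128L * 1024 * 1024, 1024);
        System.out.println(records);      // 131072, i.e. ~130K records per file
        System.out.println(records / 2);  // 65536; the default num_entries is 60000
    }
}
```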
+
+<h4 id="storage-configs">Storage configs</h4>
+<p>Controls aspects around sizing parquet and log files.</p>
+
+<ul>
+  <li><a href="#withStorageConfig">withStorageConfig</a> (HoodieStorageConfig) 
<br />
+    <ul>
+      <li><a href="#limitFileSize">limitFileSize</a> (size = 120MB) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.parquet.max.file.size</code> <br />
+  <span style="color:grey">Target size for parquet files produced by Hudi 
write phases. For DFS, this needs to be aligned with the underlying filesystem 
block size for optimal performance. </span></li>
+      <li><a href="#parquetBlockSize">parquetBlockSize</a> (rowgroupsize = 
120MB) <br />
+  Property: <code class="highlighter-rouge">hoodie.parquet.block.size</code> 
<br />
+  <span style="color:grey">Parquet RowGroup size. It’s better if this is the same 
as the file size, so that a single column within a file is stored continuously on 
disk</span></li>
+      <li><a href="#parquetPageSize">parquetPageSize</a> (pagesize = 1MB) <br 
/>
+  Property: <code class="highlighter-rouge">hoodie.parquet.page.size</code> 
<br />
  <span style="color:grey">Parquet page size. Page is the unit of read within 
a parquet file. Within a block, pages are compressed separately. </span></li>
-          <li><a href="#logFileMaxSize">logFileMaxSize</a> (logFileSize = 1GB) 
<br />
+      <li><a href="#parquetCompressionRatio">parquetCompressionRatio</a> 
(parquetCompressionRatio = 0.1) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.parquet.compression.ratio</code> <br />
+  <span style="color:grey">Expected compression ratio of parquet data, used by Hudi 
when it tries to size new parquet files. Increase this value if bulk_insert is 
producing files smaller than expected</span></li>
+      <li><a href="#logFileMaxSize">logFileMaxSize</a> (logFileSize = 1GB) <br 
/>
+  Property: <code class="highlighter-rouge">hoodie.logfile.max.size</code> <br 
/>
   <span style="color:grey">LogFile max size. This is the maximum size allowed 
for a log file before it is rolled over to the next version. </span></li>
-          <li><a href="#logFileDataBlockMaxSize">logFileDataBlockMaxSize</a> 
(dataBlockSize = 256MB) <br />
+      <li><a href="#logFileDataBlockMaxSize">logFileDataBlockMaxSize</a> 
(dataBlockSize = 256MB) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.logfile.data.block.max.size</code> <br />
  <span style="color:grey">LogFile Data block max size. This is the maximum 
size allowed for a single data block to be appended to a log file. This helps 
to make sure the data appended to the log file is broken up into sizable blocks 
to prevent OOM errors. This size should be smaller than the JVM memory. 
</span></li>
-        </ul>
-      </li>
-      <li><a href="#withCompactionConfig">withCompactionConfig</a> 
(HoodieCompactionConfig) <br />
-  <span style="color:grey">Cleaning and configurations related to compaction 
techniques</span>
-        <ul>
-          <li><a href="#withCleanerPolicy">withCleanerPolicy</a> (policy = 
KEEP_LATEST_COMMITS) <br />
-  <span style="color:grey">Hoodie Cleaning policy. Hoodie will delete older 
versions of parquet files to re-claim space. Any Query/Computation referring to 
this version of the file will fail. It is good to make sure that the data is 
retained for more than the maximum query execution time.</span></li>
-          <li><a href="#retainCommits">retainCommits</a> 
(no_of_commits_to_retain = 24) <br />
+      <li><a 
href="#logFileToParquetCompressionRatio">logFileToParquetCompressionRatio</a> 
(logFileToParquetCompressionRatio = 0.35) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.logfile.to.parquet.compression.ratio</code> 
<br />
+  <span style="color:grey">Expected additional compression as records move 
from log files to parquet. Used for merge_on_read storage to send inserts into 
log files &amp; control the size of compacted parquet file.</span></li>
+    </ul>
+  </li>
+</ul>
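To see how the compression ratio above feeds into file sizing: to produce a parquet file of the target size, a writer can pack roughly targetSize / ratio bytes of raw records, assuming parquet shrinks them by that ratio. An illustrative sketch of the arithmetic, not Hudi's actual sizing code:

```java
// Illustration of how a compression ratio feeds file sizing: to hit a target
// parquet file size, pack targetSize / ratio bytes of raw record data,
// assuming parquet compresses them by that ratio. Not Hudi's implementation.
class FileSizing {
    public static long rawBytesForTarget(long targetFileBytes, double compressionRatio) {
        return Math.round(targetFileBytes / compressionRatio);
    }

    public static void main(String[] args) {
        long target = 120L * 1024 * 1024;                   // 120MB parquet target
        System.out.println(rawBytesForTarget(target, 0.1)); // 1258291200 bytes, i.e. ~1.2GB of raw records
    }
}
```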
+
+<h4 id="compaction-configs">Compaction configs</h4>
+<p>Configs that control compaction (merging of log files onto a new parquet 
base file) and cleaning (reclamation of older/unused file groups).</p>
+
+<ul>
+  <li><a href="#withCompactionConfig">withCompactionConfig</a> 
(HoodieCompactionConfig) <br />
+    <ul>
+      <li><a href="#withCleanerPolicy">withCleanerPolicy</a> (policy = 
KEEP_LATEST_COMMITS) <br />
+  Property: <code class="highlighter-rouge">hoodie.cleaner.policy</code> <br />
+  <span style="color:grey"> Cleaning policy to be used. Hudi will delete older 
versions of parquet files to reclaim space. Any query/computation referring to 
this version of the file will fail. It is good to make sure that the data is 
retained for more than the maximum query execution time.</span></li>
+      <li><a href="#retainCommits">retainCommits</a> (no_of_commits_to_retain 
= 24) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.cleaner.commits.retained</code> <br />
   <span style="color:grey">Number of commits to retain. So data will be 
retained for num_of_commits * time_between_commits (scheduled). This also 
directly translates into how much you can incrementally pull on this 
dataset</span></li>
-          <li><a href="#archiveCommitsWith">archiveCommitsWith</a> (minCommits 
= 96, maxCommits = 128) <br />
-  <span style="color:grey">Each commit is a small file in the .hoodie 
directory. Since HDFS is not designed to handle multiple small files, hoodie 
archives older commits into a sequential log. A commit is published atomically 
by a rename of the commit file.</span></li>
-          <li><a href="#compactionSmallFileSize">compactionSmallFileSize</a> 
(size = 0) <br />
-  <span style="color:grey">Small files can always happen because of the number 
of insert records in a paritition in a batch. Hoodie has an option to 
auto-resolve small files by masking inserts into this partition as updates to 
existing small files. The size here is the minimum file size considered as a 
“small file size”. This should be less &lt; maxFileSize and setting it to 0, 
turns off this feature. </span></li>
-          <li><a href="#insertSplitSize">insertSplitSize</a> (size = 500000) 
<br />
+      <li><a href="#archiveCommitsWith">archiveCommitsWith</a> (minCommits = 
96, maxCommits = 128) <br />
+  Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>, 
<code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
+  <span style="color:grey">Each commit is a small file in the <code 
class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does 
not favor lots of small files, Hudi archives older commits into a sequential 
log. A commit is published atomically by a rename of the commit 
file.</span></li>
+      <li><a href="#compactionSmallFileSize">compactionSmallFileSize</a> (size 
= 0) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.parquet.small.file.limit</code> <br />
+  <span style="color:grey">This should be less than maxFileSize; setting it 
to 0 turns off this feature. Small files can always happen because of the 
number of insert records in a partition in a batch. Hudi has an option to 
auto-resolve small files by masking inserts into this partition as updates to 
existing small files. The size here is the minimum file size considered as a 
“small file size”.</span></li>
+      <li><a href="#insertSplitSize">insertSplitSize</a> (size = 500000) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.copyonwrite.insert.split.size</code> <br />
  <span style="color:grey">Insert Write Parallelism. Number of inserts grouped 
for a single partition. Writing out 100MB files, with at least 1kb records, 
means 100K records per file. Default is to overprovision to 500K. To improve 
insert latency, tune this to match the number of records in a single file. 
Setting this to a low number will result in small files (particularly when 
compactionSmallFileSize is 0)</span></li>
-          <li><a href="#autoTuneInsertSplits">autoTuneInsertSplits</a> (true) 
<br />
-  <span style="color:grey">Should hoodie dynamically compute the 
insertSplitSize based on the last 24 commit’s metadata. Turned off by default. 
</span></li>
-          <li><a href="#approxRecordSize">approxRecordSize</a> () <br />
-  <span style="color:grey">The average record size. If specified, hoodie will 
use this and not compute dynamically based on the last 24 commit’s metadata. No 
value set as default. This is critical in computing the insert parallelism and 
bin-packing inserts into small files. See above.</span></li>
-          <li><a 
href="#withCompactionLazyBlockReadEnabled">withCompactionLazyBlockReadEnabled</a>
 (true) <br />
-  <span style="color:grey">When a CompactedLogScanner merges all log files, 
this config helps to choose whether the logblocks should be read lazily or not. 
Choose true to use I/O intensive lazy block reading (low memory usage) or false 
for Memory intensive immediate block read (high memory usage)</span></li>
-          <li><a 
href="#withMaxNumDeltaCommitsBeforeCompaction">withMaxNumDeltaCommitsBeforeCompaction</a>
 (maxNumDeltaCommitsBeforeCompaction = 10) <br />
+      <li><a href="#autoTuneInsertSplits">autoTuneInsertSplits</a> (true) <br 
/>
+  Property: <code 
class="highlighter-rouge">hoodie.copyonwrite.insert.auto.split</code> <br />
+  <span style="color:grey">Should Hudi dynamically compute the insertSplitSize 
based on the last 24 commits’ metadata. Turned off by default. </span></li>
+      <li><a href="#approxRecordSize">approxRecordSize</a> () <br />
+  Property: <code 
class="highlighter-rouge">hoodie.copyonwrite.record.size.estimate</code> <br />
+  <span style="color:grey">The average record size. If specified, Hudi will 
use this and not compute it dynamically based on the last 24 commits’ metadata. No 
value is set as default. This is critical in computing the insert parallelism and 
bin-packing inserts into small files. See above.</span></li>
+      <li><a href="#withInlineCompaction">withInlineCompaction</a> 
(inlineCompaction = false) <br />
+  Property: <code class="highlighter-rouge">hoodie.compact.inline</code> <br />
+  <span style="color:grey">When set to true, compaction is triggered by the 
ingestion itself, right after a commit/deltacommit action as part of 
insert/upsert/bulk_insert</span></li>
+      <li><a 
href="#withMaxNumDeltaCommitsBeforeCompaction">withMaxNumDeltaCommitsBeforeCompaction</a>
 (maxNumDeltaCommitsBeforeCompaction = 10) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.compact.inline.max.delta.commits</code> <br />
   <span style="color:grey">Number of max delta commits to keep before 
triggering an inline compaction</span></li>
-          <li><a 
href="#withCompactionReverseLogReadEnabled">withCompactionReverseLogReadEnabled</a>
 (false) <br />
+      <li><a 
href="#withCompactionLazyBlockReadEnabled">withCompactionLazyBlockReadEnabled</a>
 (true) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.compaction.lazy.block.read</code> <br />
+  <span style="color:grey">When a CompactedLogScanner merges all log files, 
this config helps to choose whether the logblocks should be read lazily or not. 
Choose true for I/O-intensive lazy block reading (low memory usage) or false 
for memory-intensive immediate block reading (high memory usage)</span></li>
+      <li><a 
href="#withCompactionReverseLogReadEnabled">withCompactionReverseLogReadEnabled</a>
 (false) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.compaction.reverse.log.read</code> <br />
   <span style="color:grey">HoodieLogFormatReader reads a logfile in the 
forward direction starting from pos=0 to pos=file_length. If this config is set 
to true, the Reader reads the logfile in reverse direction, from 
pos=file_length to pos=0</span></li>
-        </ul>
-      </li>
-      <li><a href="#withMetricsConfig">withMetricsConfig</a> 
(HoodieMetricsConfig) <br />
-  <span style="color:grey">Hoodie publishes metrics on every commit, clean, 
rollback etc.</span>
-        <ul>
-          <li><a href="#on">on</a> (true) <br />
+      <li><a href="#withCleanerParallelism">withCleanerParallelism</a> 
(cleanerParallelism = 200) <br />
+  Property: <code class="highlighter-rouge">hoodie.cleaner.parallelism</code> 
<br />
+  <span style="color:grey">Increase this if cleaning becomes slow.</span></li>
+      <li><a href="#withCompactionStrategy">withCompactionStrategy</a> 
(compactionStrategy = 
com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.strategy</code> 
<br />
+  <span style="color:grey">Compaction strategy decides which file groups are 
picked up for compaction during each compaction run. By default, Hudi picks the 
log file with the most accumulated unmerged data</span></li>
+      <li><a 
href="#withTargetIOPerCompactionInMB">withTargetIOPerCompactionInMB</a> 
(targetIOPerCompactionInMB = 500000) <br />
+  Property: <code class="highlighter-rouge">hoodie.compaction.target.io</code> 
<br />
+  <span style="color:grey">Amount of MBs to spend during a compaction run for 
the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion 
latency while compaction is run in inline mode.</span></li>
+      <li><a 
href="#withTargetPartitionsPerDayBasedCompaction">withTargetPartitionsPerDayBasedCompaction</a>
 (targetPartitionsPerCompaction = 10) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.compaction.daybased.target</code> <br />
+  <span style="color:grey">Used by 
com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the 
number of latest partitions to compact during a compaction run.</span></li>
+      <li><a href="#payloadClassName">withPayloadClass</a> (payloadClassName = 
com.uber.hoodie.common.model.HoodieAvroPayload) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.compaction.payload.class</code> <br />
+  <span style="color:grey">This needs to be the same as the class used during 
inserts/upserts. Just like writing, compaction also uses the record payload 
class to merge records in the log against each other, merge again with the base 
file and produce the final record to be written after compaction.</span></li>
+    </ul>
+  </li>
+</ul>
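Taken together, the compaction knobs above can be expressed as a plain properties fragment. This is an illustrative sketch using the property keys listed above; the values shown are the documented defaults, except `hoodie.compact.inline`, which is flipped on here:

```properties
# Trigger compaction from the writer itself, right after each commit/deltacommit
hoodie.compact.inline=true
# Compact once this many delta commits have accumulated
hoodie.compact.inline.max.delta.commits=10
# Lazy block reads trade extra I/O for lower memory usage during log scanning
hoodie.compaction.lazy.block.read=true
# IO budget (in MB) per run for LogFileSizeBasedCompactionStrategy
hoodie.compaction.target.io=500000
```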
+
+<h4 id="metrics-configs">Metrics configs</h4>
+<p>Enables reporting of Hudi metrics to Graphite.</p>
+
+<ul>
+  <li><a href="#withMetricsConfig">withMetricsConfig</a> (HoodieMetricsConfig) 
<br />
+<span style="color:grey">Hudi publishes metrics on every commit, clean, 
rollback etc.</span>
+    <ul>
+      <li><a href="#on">on</a> (metricsOn = true) <br />
+  Property: <code class="highlighter-rouge">hoodie.metrics.on</code> <br />
   <span style="color:grey">Turn sending metrics on/off. On by 
default.</span></li>
-          <li><a href="#withReporterType">withReporterType</a> (GRAPHITE) <br 
/>
+      <li><a href="#withReporterType">withReporterType</a> (reporterType = 
GRAPHITE) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.metrics.reporter.type</code> <br />
   <span style="color:grey">Type of metrics reporter. Graphite is the default 
and the only value supported.</span></li>
-          <li><a href="#toGraphiteHost">toGraphiteHost</a> () <br />
+      <li><a href="#toGraphiteHost">toGraphiteHost</a> (host = localhost) <br 
/>
+  Property: <code 
class="highlighter-rouge">hoodie.metrics.graphite.host</code> <br />
   <span style="color:grey">Graphite host to connect to</span></li>
-          <li><a href="#onGraphitePort">onGraphitePort</a> () <br />
+      <li><a href="#onGraphitePort">onGraphitePort</a> (port = 4756) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.metrics.graphite.port</code> <br />
   <span style="color:grey">Graphite port to connect to</span></li>
-          <li><a href="#usePrefix">usePrefix</a> () <br />
-  <span style="color:grey">Standard prefix for all metrics</span></li>
-        </ul>
-      </li>
-      <li><a href="#withMemoryConfig">withMemoryConfig</a> 
(HoodieMemoryConfig) <br />
-  <span style="color:grey">Memory related configs</span>
-        <ul>
-          <li><a 
href="#withMaxMemoryFractionPerPartitionMerge">withMaxMemoryFractionPerPartitionMerge</a>
 (maxMemoryFractionPerPartitionMerge = 0.6) <br />
-  <span style="color:grey">This fraction is multiplied with the user memory 
fraction (1 - spark.memory.fraction) to get a final fraction of heap space to 
use during merge </span></li>
-          <li><a 
href="#withMaxMemorySizePerCompactionInBytes">withMaxMemorySizePerCompactionInBytes</a>
 (maxMemorySizePerCompactionInBytes = 1GB) <br />
-  <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts 
records to HoodieRecords and then merges these log blocks and records. At any 
point, the number of entries in a log block can be less than or equal to the 
number of entries in the corresponding parquet file. This can lead to OOM in 
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use 
this config to set the max allowable inMemory footprint of the spillable 
map.</span></li>
-        </ul>
-      </li>
-      <li>
-        <p><a href="s3_hoodie.html">S3Configs</a> (Hoodie S3 Configs) <br />
-  <span style="color:grey">Configurations required for S3 and Hoodie 
co-operability.</span></p>
-      </li>
-      <li><a href="gcs_hoodie.html">GCSConfigs</a> (Hoodie GCS Configs) <br />
-  <span style="color:grey">Configurations required for GCS and Hoodie 
co-operability.</span></li>
+      <li><a href="#usePrefix">usePrefix</a> (prefix = “”) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.metrics.graphite.metric.prefix</code> <br />
+  <span style="color:grey">Standard prefix applied to all metrics. This helps 
add datacenter and environment information, for example.</span></li>
     </ul>
   </li>
-  <li><a href="#datasource">Hoodie Datasource</a> <br />
-<span style="color:grey">Configs for datasource</span>
+</ul>
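As a sketch, the metrics options above map onto properties like the following; the host, port, and prefix values here are placeholders, not recommendations:

```properties
hoodie.metrics.on=true
# Graphite is currently the only supported reporter type
hoodie.metrics.reporter.type=GRAPHITE
hoodie.metrics.graphite.host=graphite.example.com
hoodie.metrics.graphite.port=4756
# Optional prefix, e.g. to encode datacenter/environment into metric names
hoodie.metrics.graphite.metric.prefix=dc1.prod
```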
+
+<h4 id="memory-configs">Memory configs</h4>
+<p>Controls memory usage for compaction and merges, performed internally by 
Hudi.</p>
+
+<ul>
+  <li><a href="#withMemoryConfig">withMemoryConfig</a> (HoodieMemoryConfig) 
<br />
+<span style="color:grey">Memory related configs</span>
     <ul>
-      <li><a href="#writeoptions">write options</a> (write.format.option(…)) 
<br />
-  <span style="color:grey"> Options useful for writing datasets </span>
-        <ul>
-          <li><a href="#OPERATION_OPT_KEY">OPERATION_OPT_KEY</a> (Default: 
upsert) <br />
-  <span style="color:grey">whether to do upsert, insert or bulkinsert for the 
write operation</span></li>
-          <li><a href="#STORAGE_TYPE_OPT_KEY">STORAGE_TYPE_OPT_KEY</a> 
(Default: COPY_ON_WRITE) <br />
-  <span style="color:grey">The storage type for the underlying data, for this 
write. This can’t change between writes.</span></li>
-          <li><a href="#TABLE_NAME_OPT_KEY">TABLE_NAME_OPT_KEY</a> (Default: 
None (mandatory)) <br />
-  <span style="color:grey">Hive table name, to register the dataset 
into.</span></li>
-          <li><a href="#PRECOMBINE_FIELD_OPT_KEY">PRECOMBINE_FIELD_OPT_KEY</a> 
(Default: ts) <br />
-  <span style="color:grey">Field used in preCombining before actual write. 
When two records have the same key value,
-  we will pick the one with the largest value for the precombine field, 
determined by Object.compareTo(..)</span></li>
-          <li><a href="#PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</a> 
(Default: com.uber.hoodie.OverwriteWithLatestAvroPayload) <br />
-  <span style="color:grey">Payload class used. Override this, if you like to 
roll your own merge logic, when upserting/inserting.
-  This will render any value set for <code 
class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code> 
in-effective</span></li>
-          <li><a href="#RECORDKEY_FIELD_OPT_KEY">RECORDKEY_FIELD_OPT_KEY</a> 
(Default: uuid) <br />
-  <span style="color:grey">Record key field. Value to be used as the <code 
class="highlighter-rouge">recordKey</code> component of <code 
class="highlighter-rouge">HoodieKey</code>. Actual value
-  will be obtained by invoking .toString() on the field value. Nested fields 
can be specified using
-  the dot notation eg: <code class="highlighter-rouge">a.b.c</code></span></li>
-          <li><a 
href="#PARTITIONPATH_FIELD_OPT_KEY">PARTITIONPATH_FIELD_OPT_KEY</a> (Default: 
partitionpath) <br />
-  <span style="color:grey">Partition path field. Value to be used at the <code 
class="highlighter-rouge">partitionPath</code> component of <code 
class="highlighter-rouge">HoodieKey</code>.
-  Actual value ontained by invoking .toString()</span></li>
-          <li><a 
href="#KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</a> (Default: 
com.uber.hoodie.SimpleKeyGenerator) <br />
-  <span style="color:grey">Key generator class, that implements will extract 
the key out of incoming <code class="highlighter-rouge">Row</code> 
object</span></li>
-          <li><a 
href="#COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</a> 
(Default: <code class="highlighter-rouge">_</code>) <br />
-  <span style="color:grey">Option keys beginning with this prefix, are 
automatically added to the commit/deltacommit metadata.
-  This is useful to store checkpointing information, in a consistent way with 
the hoodie timeline</span></li>
-        </ul>
-      </li>
-      <li><a href="#readoptions">read options</a> (read.format.option(…)) <br 
/>
-  <span style="color:grey">Options useful for reading datasets</span>
-        <ul>
-          <li><a href="#VIEW_TYPE_OPT_KEY">VIEW_TYPE_OPT_KEY</a> (Default:  = 
read_optimized) <br />
-  <span style="color:grey">Whether data needs to be read, in incremental mode 
(new data since an instantTime)
-  (or) Read Optimized mode (obtain latest view, based on columnar data)
-  (or) Real time mode (obtain latest view, based on row &amp; columnar 
data)</span></li>
-          <li><a 
href="#BEGIN_INSTANTTIME_OPT_KEY">BEGIN_INSTANTTIME_OPT_KEY</a> (Default: None 
(Mandatory in incremental mode)) <br />
-  <span style="color:grey">Instant time to start incrementally pulling data 
from. The instanttime here need not
-  necessarily correspond to an instant on the timeline. New data written with 
an
-   <code class="highlighter-rouge">instant_time &gt; BEGIN_INSTANTTIME</code> 
are fetched out. For e.g: ‘20170901080000’ will get
-   all new data written after Sep 1, 2017 08:00AM.</span></li>
-          <li><a href="#END_INSTANTTIME_OPT_KEY">END_INSTANTTIME_OPT_KEY</a> 
(Default: latest instant (i.e fetches all new data since begin instant time)) 
<br />
-  <span style="color:grey"> Instant time to limit incrementally fetched data 
to. New data written with an
-  <code class="highlighter-rouge">instant_time &lt;= END_INSTANTTIME</code> 
are fetched out.</span></li>
-        </ul>
-      </li>
+      <li><a 
href="#withMaxMemoryFractionPerPartitionMerge">withMaxMemoryFractionPerPartitionMerge</a>
 (maxMemoryFractionPerPartitionMerge = 0.6) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.memory.merge.fraction</code> <br />
+  <span style="color:grey">This fraction is multiplied with the user memory 
fraction (1 - spark.memory.fraction) to get a final fraction of heap space to 
use during merge </span></li>
+      <li><a 
href="#withMaxMemorySizePerCompactionInBytes">withMaxMemorySizePerCompactionInBytes</a>
 (maxMemorySizePerCompactionInBytes = 1GB) <br />
+  Property: <code 
class="highlighter-rouge">hoodie.memory.compaction.fraction</code> <br />
+  <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts 
records to HoodieRecords and then merges these log blocks and records. At any 
point, the number of entries in a log block can be less than or equal to the 
number of entries in the corresponding parquet file. This can lead to OOM in 
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use 
this config to set the max allowable inMemory footprint of the spillable 
map.</span></li>
     </ul>
   </li>
 </ul>
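As a back-of-the-envelope check on the merge fraction above, the heap available for a partition merge is roughly executor heap × (1 - `spark.memory.fraction`) × `hoodie.memory.merge.fraction`. A minimal sketch; the exact accounting inside Spark and Hudi may differ:

```python
def merge_heap_bytes(executor_heap_bytes, spark_memory_fraction=0.6,
                     merge_fraction=0.6):
    """Estimate heap available for a partition merge.

    Per the config description above, hoodie.memory.merge.fraction is
    multiplied with the *user* memory fraction (1 - spark.memory.fraction).
    """
    user_fraction = 1.0 - spark_memory_fraction
    return int(executor_heap_bytes * user_fraction * merge_fraction)

# e.g. an 8 GB executor with Spark defaults leaves roughly 1.9 GB for merges
print(merge_heap_bytes(8 * 1024**3))
```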
@@ -519,14 +709,11 @@
 
 <p>Writing data via Hudi happens as a Spark job and thus the general rules of 
Spark debugging apply here too. Below is a list of things to keep in mind, if 
you are looking to improve performance or reliability.</p>
 
-<p><strong>Write operations</strong> : Use <code 
class="highlighter-rouge">bulkinsert</code> to load new data into a table, and 
there on use <code class="highlighter-rouge">upsert</code>/<code 
class="highlighter-rouge">insert</code>.
- Difference between them is that bulk insert uses a disk based write path to 
scale to load large inputs without need to cache it.</p>
-
-<p><strong>Input Parallelism</strong> : By default, Hoodie tends to 
over-partition input (i.e <code 
class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark 
partition stays within the 2GB limit for inputs upto 500GB. Bump this up 
accordingly if you have larger inputs. We recommend having shuffle parallelism 
<code 
class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code>
 such that its atleast input_data_size/500MB</p>
+<p><strong>Input Parallelism</strong> : By default, Hudi tends to 
over-partition input (i.e. <code 
class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark 
partition stays within the 2GB limit for inputs up to 500GB. Bump this up 
accordingly if you have larger inputs. We recommend setting shuffle parallelism 
<code 
class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code>
 such that it is at least input_data_size/500MB</p>
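The rule of thumb above (shuffle parallelism of at least input_data_size/500MB) is simple enough to compute directly; a small sketch:

```python
import math

def recommended_shuffle_parallelism(input_bytes,
                                    target_partition_bytes=500 * 1024**2):
    """Suggest a value for hoodie.[insert|upsert|bulkinsert].shuffle.parallelism.

    Per the guideline above: at least input_data_size / 500MB, so each
    Spark partition stays comfortably under the 2GB limit.
    """
    return max(1, math.ceil(input_bytes / target_partition_bytes))

# e.g. a 750 GB input suggests a parallelism of 1536
print(recommended_shuffle_parallelism(750 * 1024**3))
```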
 
-<p><strong>Off-heap memory</strong> : Hoodie writes parquet files and that 
needs good amount of off-heap memory proportional to schema width. Consider 
setting something like <code 
class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code 
class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are 
running into such failures.</p>
+<p><strong>Off-heap memory</strong> : Hudi writes parquet files, and that 
needs a good amount of off-heap memory proportional to schema width. Consider 
setting something like <code 
class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code 
class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are 
running into such failures.</p>
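For instance, the overhead settings could be passed on the spark-submit command line; the sizes below are placeholders, to be scaled with your schema width:

```
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  ...   # rest of the Hudi write job arguments
```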
 
-<p><strong>Spark Memory</strong> : Typically, hoodie needs to be able to read 
a single file into memory to perform merges or compactions and thus the 
executor memory should be sufficient to accomodate this. In addition, Hoodie 
caches the input to be able to intelligently place data and thus leaving some 
<code class="highlighter-rouge">spark.storage.memoryFraction</code> will 
generally help boost performance.</p>
+<p><strong>Spark Memory</strong> : Typically, Hudi needs to be able to read a 
single file into memory to perform merges or compactions and thus the executor 
memory should be sufficient to accommodate this. In addition, Hudi caches the 
input to be able to intelligently place data and thus leaving some <code 
class="highlighter-rouge">spark.storage.memoryFraction</code> will generally 
help boost performance.</p>
 
 <p><strong>Sizing files</strong> : Set <code 
class="highlighter-rouge">limitFileSize</code> above judiciously, to balance 
ingest/write latency vs number of files &amp; consequently metadata overhead 
associated with it.</p>
 
diff --git a/content/contributing.html b/content/contributing.html
index 1901952..9c9e61d 100644
--- a/content/contributing.html
+++ b/content/contributing.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" developer setup">
+<meta name="keywords" content="hudi, ide, developer, setup">
 <title>Developer Setup | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -380,6 +384,8 @@ have an open source license <a 
href="https://www.apache.org/legal/resolved.html#
       <li>Add adequate tests for your new functionality</li>
       <li>[Optional] For involved changes, its best to also run the entire 
integration test suite using <code class="highlighter-rouge">mvn clean 
install</code></li>
       <li>For website changes, please build the site locally &amp; test 
navigation, formatting &amp; links thoroughly</li>
+      <li>If your code change changes some aspect of documentation (e.g new 
config, default value change), 
+please ensure there is a another PR to <a 
href="https://github.com/apache/incubator-hudi/blob/asf-site/docs/README.md";>update
 the docs</a> as well.</li>
     </ul>
   </li>
   <li>Format commit messages and the pull request title like <code 
class="highlighter-rouge">[HUDI-XXX] Fixes bug in Spark Datasource</code>,
diff --git a/content/css/customstyles.css b/content/css/customstyles.css
index d6667a5..56dcdba 100644
--- a/content/css/customstyles.css
+++ b/content/css/customstyles.css
@@ -1,5 +1,5 @@
 body {
-    font-size:15px;
+    font-size:14px;
 }
 
 .bs-callout {
@@ -607,7 +607,7 @@ a.fa.fa-envelope-o.mailto {
     font-weight: 600;
 }
 
-h3 {color: #ED1951; font-weight:normal; font-size:130%;}
+h3 {color: #545253; font-weight:normal; font-size:130%;}
 h4 {color: #808080; font-weight:normal; font-size:120%; font-style:italic;}
 
 .alert, .callout {
diff --git a/content/css/theme-blue.css b/content/css/theme-blue.css
index 9a923ef..46fbd0d 100644
--- a/content/css/theme-blue.css
+++ b/content/css/theme-blue.css
@@ -5,7 +5,7 @@
 }
 
 
-h3 {color: #ED1951; }
+h3 {color: #545253; }
 h4 {color: #808080; }
 
 .nav-tabs > li.active > a, .nav-tabs > li.active > a:hover, .nav-tabs > 
li.active > a:focus {
diff --git a/content/feed.xml b/content/feed.xml
index b21704e..cd76d50 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
         <description>Apache Hudi (pronounced “Hoodie”) provides upserts and 
incremental processing capaibilities on Big Data</description>
         <link>http://0.0.0.0:4000/</link>
         <atom:link href="http://0.0.0.0:4000/feed.xml"; rel="self" 
type="application/rss+xml"/>
-        <pubDate>Mon, 25 Feb 2019 20:49:33 +0000</pubDate>
-        <lastBuildDate>Mon, 25 Feb 2019 20:49:33 +0000</lastBuildDate>
+        <pubDate>Sat, 09 Mar 2019 21:08:53 +0000</pubDate>
+        <lastBuildDate>Sat, 09 Mar 2019 21:08:53 +0000</lastBuildDate>
         <generator>Jekyll v3.3.1</generator>
         
         <item>
@@ -25,7 +25,7 @@
         
         <item>
             <title>Connect with us at Strata San Jose March 2017</title>
-            <description>&lt;p&gt;We will be presenting Hoodie &amp;amp; 
general concepts around how incremental processing works at Uber.
+            <description>&lt;p&gt;We will be presenting Hudi &amp;amp; general 
concepts around how incremental processing works at Uber.
 Catch our talk &lt;strong&gt;“Incremental Processing on Hadoop At 
Uber”&lt;/strong&gt;&lt;/p&gt;
 
 </description>
diff --git a/content/gcs_hoodie.html b/content/gcs_hoodie.html
index f90992d..cb96011 100644
--- a/content/gcs_hoodie.html
+++ b/content/gcs_hoodie.html
@@ -4,8 +4,8 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we go over how to configure 
hudi with Google Cloud Storage.">
-<meta name="keywords" content=" sql hive gcs spark presto">
-<title>GCS Filesystem (experimental) | Hudi</title>
+<meta name="keywords" content="hudi, hive, google cloud, storage, spark, 
presto">
+<title>GCS Filesystem | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -158,7 +162,7 @@
 
 
 
-  <a class="email" title="Submit feedback" href="#" 
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi 
Documentation feedback&body=I have some feedback about the GCS Filesystem 
(experimental) page: ' + window.location.href;"><i class="fa 
fa-envelope-o"></i> Feedback</a>
+  <a class="email" title="Submit feedback" href="#" 
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi 
Documentation feedback&body=I have some feedback about the GCS Filesystem page: 
' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
 
 <li>
 
@@ -176,7 +180,7 @@
                                 searchInput: 
document.getElementById('search-input'),
                                 resultsContainer: 
document.getElementById('results-container'),
                                 dataSource: 'search.json',
-                                searchResultTemplate: '<li><a href="{url}" 
title="GCS Filesystem (experimental)">{title}</a></li>',
+                                searchResultTemplate: '<li><a href="{url}" 
title="GCS Filesystem">{title}</a></li>',
                     noResultsText: 'No results found.',
                             limit: 10,
                             fuzzy: true,
@@ -327,7 +331,7 @@
     <!-- Content Column -->
     <div class="col-md-9">
         <div class="post-header">
-   <h1 class="post-title-main">GCS Filesystem (experimental)</h1>
+   <h1 class="post-title-main">GCS Filesystem</h1>
 </div>
 
 
@@ -343,7 +347,7 @@
 
     
 
-  <p>Hudi works with HDFS by default and GCS <strong>regional</strong> buckets 
provide an HDFS API with strong consistency.</p>
+  <p>For Hudi storage on GCS, <strong>regional</strong> buckets provide a DFS 
API with strong consistency.</p>
 
 <h2 id="gcs-configs">GCS Configs</h2>
 
diff --git a/content/images/hoodie_commit_duration.png 
b/content/images/hudi_commit_duration.png
similarity index 100%
rename from content/images/hoodie_commit_duration.png
rename to content/images/hudi_commit_duration.png
diff --git a/content/images/hoodie_intro_1.png b/content/images/hudi_intro_1.png
similarity index 100%
rename from content/images/hoodie_intro_1.png
rename to content/images/hudi_intro_1.png
diff --git a/content/images/hoodie_log_format_v2.png 
b/content/images/hudi_log_format_v2.png
similarity index 100%
rename from content/images/hoodie_log_format_v2.png
rename to content/images/hudi_log_format_v2.png
diff --git a/content/images/hoodie_query_perf_hive.png 
b/content/images/hudi_query_perf_hive.png
similarity index 100%
rename from content/images/hoodie_query_perf_hive.png
rename to content/images/hudi_query_perf_hive.png
diff --git a/content/images/hoodie_query_perf_presto.png 
b/content/images/hudi_query_perf_presto.png
similarity index 100%
rename from content/images/hoodie_query_perf_presto.png
rename to content/images/hudi_query_perf_presto.png
diff --git a/content/images/hoodie_query_perf_spark.png 
b/content/images/hudi_query_perf_spark.png
similarity index 100%
rename from content/images/hoodie_query_perf_spark.png
rename to content/images/hudi_query_perf_spark.png
diff --git a/content/images/hoodie_upsert_dag.png 
b/content/images/hudi_upsert_dag.png
similarity index 100%
rename from content/images/hoodie_upsert_dag.png
rename to content/images/hudi_upsert_dag.png
diff --git a/content/images/hoodie_upsert_perf1.png 
b/content/images/hudi_upsert_perf1.png
similarity index 100%
rename from content/images/hoodie_upsert_perf1.png
rename to content/images/hudi_upsert_perf1.png
diff --git a/content/images/hoodie_upsert_perf2.png 
b/content/images/hudi_upsert_perf2.png
similarity index 100%
rename from content/images/hoodie_upsert_perf2.png
rename to content/images/hudi_upsert_perf2.png
diff --git a/content/implementation.html b/content/implementation.html
index d649a70..e524ec6 100644
--- a/content/implementation.html
+++ b/content/implementation.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" implementation">
+<meta name="keywords" content="hudi, index, storage, compaction, cleaning, 
implementation">
 <title>Implementation | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -347,7 +351,7 @@ Hudi upsert/insert is merely a Spark DAG, that can be 
broken into two big pieces
 
 <ul>
   <li>
-    <p><strong>Indexing</strong> :  A big part of Hoodie’s efficiency comes 
from indexing the mapping from record keys to the file ids, to which they 
belong to.
+    <p><strong>Indexing</strong> :  A big part of Hudi’s efficiency comes from 
indexing the mapping from record keys to the file ids, to which they belong to.
  This index also helps the <code 
class="highlighter-rouge">HoodieWriteClient</code> separate upserted records 
into inserts and updates, so they can be treated differently.
  <code class="highlighter-rouge">HoodieReadClient</code> supports operations 
such as <code class="highlighter-rouge">filterExists</code> (used for 
de-duplication of table) and an efficient batch <code 
class="highlighter-rouge">read(keys)</code> api, that
  can read out the records corresponding to the keys using the index much 
quickly, than a typical scan via a query. The index is also atomically
@@ -406,7 +410,7 @@ Any remaining records after that, are again packed into new 
file id groups, agai
 <p>In the case of Copy-On-Write, a single parquet file constitutes one <code 
class="highlighter-rouge">file slice</code> which contains one complete version 
of
 the file</p>
 
-<figure><img class="docimage" src="images/hoodie_log_format_v2.png" 
alt="hoodie_log_format_v2.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_log_format_v2.png" 
alt="hudi_log_format_v2.png" style="max-width: 1000px" /></figure>
 
 <h4 id="merge-on-read">Merge On Read</h4>
 
@@ -575,7 +579,7 @@ incremental ingestion (writer at DC6) happened before the 
compaction (some time
 The below description is with regards to compaction from file-group 
perspective.
     <ul>
       <li><code class="highlighter-rouge">Reader querying at time between 
ingestion completion time for DC6 and compaction finish “Tc”</code>:
-Hoodie’s implementation will be changed to become aware of file-groups 
currently waiting for compaction and
+Hudi’s implementation will be changed to become aware of file-groups currently 
waiting for compaction and
 merge log-files corresponding to DC2-DC6 with the base-file corresponding to 
SC1. In essence, Hudi will create
 a pseudo file-slice by combining the 2 file-slices starting at base-commits 
SC1 and SC5 to one.
 For file-groups not waiting for compaction, the reader behavior is essentially 
the same - read latest file-slice
@@ -602,12 +606,12 @@ the conventional alternatives for achieving these 
tasks.</p>
 <p>Following shows the speed up obtained for NoSQL ingestion, by switching 
from bulk loads off HBase to Parquet to incrementally upserting
 on a Hudi dataset, on 5 tables ranging from small to huge.</p>
 
-<figure><img class="docimage" src="images/hoodie_upsert_perf1.png" 
alt="hoodie_upsert_perf1.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_perf1.png" 
alt="hudi_upsert_perf1.png" style="max-width: 1000px" /></figure>
 
 <p>Given Hudi can build the dataset incrementally, it opens doors for also 
scheduling ingesting more frequently thus reducing latency, with
 significant savings on the overall compute cost.</p>
 
-<figure><img class="docimage" src="images/hoodie_upsert_perf2.png" 
alt="hoodie_upsert_perf2.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_perf2.png" 
alt="hudi_upsert_perf2.png" style="max-width: 1000px" /></figure>
 
 <p>Hudi upserts have been stress tested upto 4TB in a single commit across the 
t1 table.</p>
 
@@ -618,15 +622,15 @@ with no impact on queries. Following charts compare the 
Hudi vs non-Hudi dataset
 
 <p><strong>Hive</strong></p>
 
-<figure><img class="docimage" src="images/hoodie_query_perf_hive.png" 
alt="hoodie_query_perf_hive.png" style="max-width: 800px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_hive.png" 
alt="hudi_query_perf_hive.png" style="max-width: 800px" /></figure>
 
 <p><strong>Spark</strong></p>
 
-<figure><img class="docimage" src="images/hoodie_query_perf_spark.png" 
alt="hoodie_query_perf_spark.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_spark.png" 
alt="hudi_query_perf_spark.png" style="max-width: 1000px" /></figure>
 
 <p><strong>Presto</strong></p>
 
-<figure><img class="docimage" src="images/hoodie_query_perf_presto.png" 
alt="hoodie_query_perf_presto.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_presto.png" 
alt="hudi_query_perf_presto.png" style="max-width: 1000px" /></figure>
 
 
 
diff --git a/content/incremental_processing.html 
b/content/incremental_processing.html
index a694881..c487368 100644
--- a/content/incremental_processing.html
+++ b/content/incremental_processing.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we will discuss some available 
tools for ingesting data incrementally & consuming the changes.">
-<meta name="keywords" content=" incremental processing">
+<meta name="keywords" content="hudi, incremental, batch, stream, processing, 
Hive, ETL, Spark SQL">
 <title>Incremental Processing | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -349,7 +353,7 @@ discusses a few tools that can be used to achieve these on 
different contexts.</
 
 <h2 id="incremental-ingestion">Incremental Ingestion</h2>
 
-<p>Following means can be used to apply a delta or an incremental change to a 
Hudi dataset. For e.g, the incremental changes could be from a Kafka topic or 
files uploaded to HDFS or
+<p>Following means can be used to apply a delta or an incremental change to a 
Hudi dataset. For e.g, the incremental changes could be from a Kafka topic or 
files uploaded to DFS or
 even changes pulled from another Hudi dataset.</p>
 
 <h4 id="deltastreamer-tool">DeltaStreamer Tool</h4>
@@ -360,9 +364,10 @@ from different sources such as DFS or Kafka.</p>
 <p>The tool is a spark job (part of hoodie-utilities), that provides the 
following functionality</p>
 
 <ul>
-  <li>Ability to consume new events from Kafka, incremental imports from Sqoop 
or output of <code class="highlighter-rouge">HiveIncrementalPuller</code> or 
files under a folder on HDFS</li>
+  <li>Ability to consume new events from Kafka, incremental imports from Sqoop 
or output of <code class="highlighter-rouge">HiveIncrementalPuller</code> or 
files under a folder on DFS</li>
   <li>Support json, avro or a custom payload types for the incoming data</li>
-  <li>New data is written to a Hudi dataset, with support for checkpointing 
&amp; schemas and registered onto Hive</li>
+  <li>Pick up avro schemas from DFS or Confluent <a 
href="https://github.com/confluentinc/schema-registry">schema registry</a>.</li>
+  <li>New data is written to a Hudi dataset, with support for checkpointing 
and registered onto Hive</li>
 </ul>
 
 <p>Command line options describe capabilities in more detail (first build 
hoodie-utilities using <code class="highlighter-rouge">mvn clean 
package</code>).</p>
@@ -423,10 +428,10 @@ Usage: &lt;main class&gt; [options]
   * --target-table
       name of the target table in Hive
     --transformer-class
-      subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to 
-      transform raw source dataset to a target dataset (conforming to target 
-      schema) before writing. Default : Not set. E:g - 
-      com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which 
+      subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to
+      transform raw source dataset to a target dataset (conforming to target
+      schema) before writing. Default : Not set. E:g -
+      com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
       allows a SQL query template to be passed as a transformation function)
 
 </code></pre>
@@ -453,7 +458,7 @@ provided under <code 
class="highlighter-rouge">hoodie-utilities/src/test/resourc
 </code></pre>
 </div>
 
-<p>In some cases, you may want to convert your existing dataset into Hoodie, 
before you can begin ingesting new data. This can be accomplished using the 
<code class="highlighter-rouge">hdfsparquetimport</code> command on the <code 
class="highlighter-rouge">hoodie-cli</code>.
+<p>In some cases, you may want to convert your existing dataset into Hudi, 
before you can begin ingesting new data. This can be accomplished using the 
<code class="highlighter-rouge">hdfsparquetimport</code> command on the <code 
class="highlighter-rouge">hoodie-cli</code>.
 Currently, there is support for converting parquet datasets.</p>
 
 <h4 id="via-custom-spark-job">Via Custom Spark Job</h4>
@@ -503,8 +508,6 @@ Usage: &lt;main class&gt; [options]
 </code></pre>
 </div>
 
-<div class="bs-callout bs-callout-info">Note that for now, due to jar 
mismatches between Spark &amp; Hive, its recommended to run this as a separate 
Java task in your workflow manager/cron. This is getting fix <a 
href="https://github.com/uber/hoodie/issues/123">here</a></div>
-
 <h2 id="incrementally-pulling">Incrementally Pulling</h2>
 
 <p>Hudi datasets can be pulled incrementally, which means you can get ALL and 
ONLY the updated &amp; new rows since a specified commit timestamp.
@@ -530,7 +533,7 @@ This class can be used within existing Spark jobs and 
offers the following funct
 
 <p>Please refer to <a href="configurations.html">configurations</a> section, 
to view all datasource options.</p>
 
-<p>Additionally, <code class="highlighter-rouge">HoodieReadClient</code> 
offers the following functionality using Hoodie’s implicit indexing.</p>
+<p>Additionally, <code class="highlighter-rouge">HoodieReadClient</code> 
offers the following functionality using Hudi’s implicit indexing.</p>
 
 <table>
   <tbody>
@@ -540,7 +543,7 @@ This class can be used within existing Spark jobs and 
offers the following funct
     </tr>
     <tr>
       <td>read(keys)</td>
-      <td>Read out the data corresponding to the keys as a DataFrame, using 
Hoodie’s own index for faster lookup</td>
+      <td>Read out the data corresponding to the keys as a DataFrame, using 
Hudi’s own index for faster lookup</td>
     </tr>
     <tr>
       <td>filterExists()</td>
@@ -590,7 +593,7 @@ e.g: <code 
class="highlighter-rouge">/app/incremental-hql/intermediate/{source_t
     </tr>
     <tr>
       <td>tmp</td>
-      <td>Directory where the temporary delta data is stored in HDFS. The 
directory structure will follow conventions. Please see the below section.</td>
+      <td>Directory where the temporary delta data is stored in DFS. The 
directory structure will follow conventions. Please see the below section.</td>
       <td> </td>
     </tr>
     <tr>
@@ -610,12 +613,12 @@ e.g: <code 
class="highlighter-rouge">/app/incremental-hql/intermediate/{source_t
     </tr>
     <tr>
       <td>sourceDataPath</td>
-      <td>Source HDFS Base Path. This is where the Hudi metadata will be 
read.</td>
+      <td>Source DFS Base Path. This is where the Hudi metadata will be 
read.</td>
       <td> </td>
     </tr>
     <tr>
       <td>targetDataPath</td>
-      <td>Target HDFS Base path. This is needed to compute the fromCommitTime. 
This is not needed if fromCommitTime is specified explicitly.</td>
+      <td>Target DFS Base path. This is needed to compute the fromCommitTime. 
This is not needed if fromCommitTime is specified explicitly.</td>
       <td> </td>
     </tr>
     <tr>
@@ -647,7 +650,6 @@ it will automatically use the backfill configuration, since 
applying the last 24
 is the lack of support for self-joining the same table in mixed mode (normal 
and incremental modes).</p>
 
 
-
     <div class="tags">
         
     </div>
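The incremental pull described in this file's hunks boils down to selecting only the commits made after a given commit timestamp from the dataset's timeline (the `fromCommitTime` computed from `targetDataPath`). A minimal Python sketch of that selection — illustrative only: `commits_to_pull` is a made-up helper, not a Hudi API, and commit times are assumed to be `yyyyMMddHHmmss` strings as Hudi formats them.

```python
def commits_to_pull(commit_timeline, from_commit_time):
    """Return commit times strictly after from_commit_time, oldest first.

    The begin timestamp is exclusive: "ALL and ONLY the updated & new rows
    since a specified commit timestamp" means that commit itself was
    already consumed.
    """
    return sorted(t for t in commit_timeline if t > from_commit_time)

# Hypothetical timeline with three commits:
timeline = ["20190308120000", "20190308130000", "20190309090000"]
assert commits_to_pull(timeline, "20190308120000") == [
    "20190308130000",
    "20190309090000",
]
```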
diff --git a/content/index.html b/content/index.html
index bd31b4d..1a1c5ff 100644
--- a/content/index.html
+++ b/content/index.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Hudi brings stream processing to big data, 
providing fresh data while being an order of magnitude efficient over 
traditional batch processing.">
-<meta name="keywords" content="getting_started,  homepage">
+<meta name="keywords" content="big data, stream processing, cloud, hdfs, 
storage, upserts, change capture">
 <title>What is Hudi? | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -366,7 +370,7 @@ $('#toc').on('click', 'a', function() {
 
     
 
-  <p>Hudi (pronounced “Hoodie”) ingests &amp; manages storage of large 
analytical datasets on <a 
href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html"
 or cloud stores and provides three logical views for query access.</p>
+  <p>Hudi (pronounced “Hoodie”) ingests &amp; manages storage of large 
analytical datasets over DFS (<a 
href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html"
 or cloud stores) and provides three logical views for query access.</p>
 
 <ul>
   <li><strong>Read Optimized View</strong> - Provides excellent query 
performance on pure columnar storage, much like plain <a 
href="https://parquet.apache.org/">Parquet</a> tables.</li>
@@ -374,7 +378,7 @@ $('#toc').on('click', 'a', function() {
   <li><strong>Near-Real time Table</strong> - Provides queries on real-time 
data, using a combination of columnar &amp; row based storage (e.g Parquet + <a 
href="http://avro.apache.org/docs/current/mr.html">Avro</a>)</li>
 </ul>
 
-<figure><img class="docimage" src="images/hoodie_intro_1.png" 
alt="hoodie_intro_1.png" /></figure>
+<figure><img class="docimage" src="images/hudi_intro_1.png" 
alt="hudi_intro_1.png" /></figure>
 
 <p>By carefully managing how data is laid out in storage &amp; how it’s 
exposed to queries, Hudi is able to power a rich data ecosystem where external 
sources can be ingested in near real-time and made available for interactive 
SQL Engines like <a href="https://prestodb.io">Presto</a> &amp; <a 
href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable 
of being consumed incrementally from processing/ETL frameworks like <a 
href="https://hive.apache.org/">Hive</a> &amp;  [...]
 
diff --git a/content/js/mydoc_scroll.html b/content/js/mydoc_scroll.html
index b23a6ad..ee70719 100644
--- a/content/js/mydoc_scroll.html
+++ b/content/js/mydoc_scroll.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="This page demonstrates how you the 
integration of a script called ScrollTo, which is used here to link definitions 
of a JSON code sample to a list of definit...">
-<meta name="keywords" content="special_layouts,  json, scrolling, scrollto, 
jquery plugin">
+<meta name="keywords" content="json, scrolling, scrollto, jquery plugin">
 <title>Scroll layout | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/migration_guide.html b/content/migration_guide.html
index 7bcfa1d..03ea8a1 100644
--- a/content/migration_guide.html
+++ b/content/migration_guide.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we will discuss some available 
tools for migrating your existing dataset into a Hudi dataset">
-<meta name="keywords" content=" migration guide">
+<meta name="keywords" content="hudi, migration, use case">
 <title>Migration Guide | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -362,7 +366,7 @@ Take this approach if your dataset is an append only type 
of dataset and you do
 
 <p>Import your existing dataset into a Hudi managed dataset. Since all the 
data is Hudi managed, none of the limitations
  of Approach 1 apply here. Updates spanning any partitions can be applied to 
this dataset and Hudi will efficiently
- make the update available to queries. Note that not only do you get to use 
all Hoodie primitives on this dataset,
+ make the update available to queries. Note that not only do you get to use 
all Hudi primitives on this dataset,
  there are other additional advantages of doing this. Hudi automatically 
manages file sizes of a Hudi managed dataset
  . You can define the desired file size when converting this dataset and Hudi 
will ensure it writes out files
  adhering to the config. It will also ensure that smaller files later get 
corrected by routing some new inserts into
@@ -371,9 +375,8 @@ Take this approach if your dataset is an append only type 
of dataset and you do
 <p>There are a few options when choosing this approach.</p>
 
 <h4 id="option-1">Option 1</h4>
-<p>Use the HDFSParquetImporter tool. As the name suggests, this only works if 
your existing dataset is in
-parquet file
-format. This tool essentially starts a Spark Job to read the existing parquet 
dataset and converts it into a HUDI managed dataset by re-writing all the 
data.</p>
+<p>Use the HDFSParquetImporter tool. As the name suggests, this only works if 
your existing dataset is in parquet file format.
+This tool essentially starts a Spark Job to read the existing parquet dataset 
and converts it into a HUDI managed dataset by re-writing all the data.</p>
 
 <h4 id="option-2">Option 2</h4>
 <p>For huge datasets, this could be as simple as : for partition in [list of 
partitions in source dataset] {
@@ -385,7 +388,7 @@ format. This tool essentially starts a Spark Job to read 
the existing parquet da
 <p>Write your own custom logic of how to load an existing dataset into a Hudi 
managed one. Please read about the RDD API
  <a href="quickstart.html">here</a>.</p>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>Using the 
HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean install 
-DskipTests`, the shell can be
+<div class="highlighter-rouge"><pre class="highlight"><code>Using the 
HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install 
-DskipTests`, the shell can be
 fired by via `cd hoodie-cli &amp;&amp; ./hoodie-cli.sh`.
 
 hoodie-&gt;hdfsparquetimport
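Option 2 in the migration_guide.html hunks above ("for partition in [list of partitions in source dataset] { read partition; upsert }") only sketches the partition-wise import in pseudocode. A hedged Python rendering of that loop — the `read_partition` and `upsert` callables are placeholders standing in for a Spark read and a Hudi upsert, not real APIs:

```python
def bootstrap(partitions, read_partition, upsert):
    """Import an existing dataset partition by partition, as in Option 2."""
    for partition in partitions:          # [list of partitions in source dataset]
        rows = read_partition(partition)  # read the raw (e.g. parquet) partition
        upsert(partition, rows)           # write it into the Hudi-managed dataset

# Tiny fake reader/writer to show the control flow only:
loaded = {}
bootstrap(
    ["2019/03/08", "2019/03/09"],
    lambda p: [p],                         # fake reader returns one "row"
    lambda p, rows: loaded.update({p: rows}),
)
assert loaded == {"2019/03/08": ["2019/03/08"], "2019/03/09": ["2019/03/09"]}
```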
diff --git a/content/news.html b/content/news.html
index 645bae0..43d92a3 100644
--- a/content/news.html
+++ b/content/news.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" news, blog, updates, release notes, 
announcements">
+<meta name="keywords" content="apache, hudi, news, blog, updates, release 
notes, announcements">
 <title>News | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -266,7 +270,7 @@
                 <a href="tag_news.html">news</a>
 
                 </span>
-        <p> We will be presenting Hoodie &amp; general concepts around how 
incremental processing works at Uber.
+        <p> We will be presenting Hudi &amp; general concepts around how 
incremental processing works at Uber.
 Catch our talk “Incremental Processing on Hadoop At Uber”
 
  </p>
diff --git a/content/news_archive.html b/content/news_archive.html
index 4d80715..d1986b5 100644
--- a/content/news_archive.html
+++ b/content/news_archive.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" news, blog, updates, release notes, 
announcements">
+<meta name="keywords" content="news, blog, updates, release notes, 
announcements">
 <title>News | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/powered_by.html b/content/powered_by.html
index 8f4b0d4..99991ca 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" talks">
+<meta name="keywords" content="hudi, talks, presentation">
 <title>Talks & Powered By | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -383,7 +387,6 @@ October 2018, Spark+AI Summit Europe, London, UK</p>
 </ol>
 
 
-
     <div class="tags">
         
     </div>
diff --git a/content/privacy.html b/content/privacy.html
index 704bd3d..1804b9f 100644
--- a/content/privacy.html
+++ b/content/privacy.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content=" privacy">
+<meta name="keywords" content="hudi, privacy">
 <title>Privacy Policy | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/quickstart.html b/content/quickstart.html
index a73534d..b7781b3 100644
--- a/content/quickstart.html
+++ b/content/quickstart.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content="quickstart,  quickstart">
+<meta name="keywords" content="hudi, quickstart">
 <title>Quickstart | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -362,7 +366,8 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
 
 <h2 id="version-compatibility">Version Compatibility</h2>
 
-<p>Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. 
We have verified that Hudi works with the following combination of 
Hadoop/Hive/Spark.</p>
+<p>Hudi requires Java 8 to be installed on a *nix system. Hudi works with 
Spark-2.x versions. 
+Further, we have verified that Hudi works with the following combination of 
Hadoop/Hive/Spark.</p>
 
 <table>
   <thead>
@@ -395,8 +400,9 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
   </tbody>
 </table>
 
-<p>If your environment has other versions of hadoop/hive/spark, please try out 
Hudi and let us know if there are any issues. We are limited by our bandwidth 
to certify other combinations.
-It would be of great help if you can reach out to us with your setup and 
experience with hoodie.</p>
+<p>If your environment has other versions of hadoop/hive/spark, please try out 
Hudi and let us know if there are any issues.
+We are limited by our bandwidth to certify other combinations (e.g Docker on 
Windows).
+It would be of great help if you can reach out to us with your setup and 
experience with hudi.</p>
 
 <h2 id="generate-a-hudi-dataset">Generate a Hudi Dataset</h2>
 
@@ -424,7 +430,7 @@ Use the RDD API to perform more involved actions on a Hudi 
dataset</p>
 
 <h4 id="datasource-api">DataSource API</h4>
 
-<p>Run <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong> class, 
to place a two commits (commit 1 =&gt; 100 inserts, commit 2 =&gt; 100 updates 
to previously inserted 100 records) onto your HDFS/local filesystem. Use the 
wrapper script
+<p>Run <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong> class, 
to place a two commits (commit 1 =&gt; 100 inserts, commit 2 =&gt; 100 updates 
to previously inserted 100 records) onto your DFS/local filesystem. Use the 
wrapper script
 to run from command-line</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>cd hoodie-spark
@@ -679,9 +685,9 @@ data infrastructure is brought up in a local docker cluster 
within your computer
 
 <h3 id="setting-up-docker-cluster">Setting up Docker Cluster</h3>
 
-<h4 id="build-hoodie">Build Hoodie</h4>
+<h4 id="build-hudi">Build Hudi</h4>
 
-<p>The first step is to build hoodie
+<p>The first step is to build hudi
 <code class="highlighter-rouge">
 cd &lt;HUDI_WORKSPACE&gt;
 mvn package -DskipTests
@@ -801,7 +807,7 @@ automatically initializes the datasets in the file-system 
if they do not exist y
 <div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it 
adhoc-2 /bin/bash
 
 # Run the following spark-submit command to execute the delta-streamer and 
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties
+spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties 
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 ....
 ....
 2018-09-24 22:20:00 INFO  
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - 
OutputCommitCoordinator stopped!
@@ -1329,7 +1335,7 @@ scala&gt; spark.sql("select `_hoodie_commit_time`, 
symbol, ts, volume, open, clo
 Again, You can use Hudi CLI to manually schedule and run compaction</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it 
adhoc-1 /bin/bash
-^[[Aroot@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
+root@adhoc-1:/opt#   /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
 ============================================
 *                                          *
 *     _    _                 _ _           *
@@ -1514,7 +1520,7 @@ scala&gt; spark.sql("select `_hoodie_commit_time`, 
symbol, ts, volume, open, clo
 
 <h2 id="testing-hudi-in-local-docker-environment">Testing Hudi in Local Docker 
environment</h2>
 
-<p>You can bring up a hadoop docker environment containing Hadoop, Hive and 
Spark services with support for hoodie.
+<p>You can bring up a hadoop docker environment containing Hadoop, Hive and 
Spark services with support for hudi.
 <code class="highlighter-rouge">
 $ mvn pre-integration-test -DskipTests
 </code>
diff --git a/content/s3_hoodie.html b/content/s3_hoodie.html
index 217005c..0366721 100644
--- a/content/s3_hoodie.html
+++ b/content/s3_hoodie.html
@@ -4,8 +4,8 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we go over how to configure 
Hudi with S3 filesystem.">
-<meta name="keywords" content=" sql hive s3 spark presto">
-<title>S3 Filesystem (experimental) | Hudi</title>
+<meta name="keywords" content="hudi, hive, aws, s3, spark, presto">
+<title>S3 Filesystem | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi" 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -158,7 +162,7 @@
 
 
 
-  <a class="email" title="Submit feedback" href="#" 
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi 
Documentation feedback&body=I have some feedback about the S3 Filesystem 
(experimental) page: ' + window.location.href;"><i class="fa 
fa-envelope-o"></i> Feedback</a>
+  <a class="email" title="Submit feedback" href="#" 
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi 
Documentation feedback&body=I have some feedback about the S3 Filesystem page: 
' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
 
 <li>
 
@@ -176,7 +180,7 @@
                                 searchInput: 
document.getElementById('search-input'),
                                 resultsContainer: 
document.getElementById('results-container'),
                                 dataSource: 'search.json',
-                                searchResultTemplate: '<li><a href="{url}" 
title="S3 Filesystem (experimental)">{title}</a></li>',
+                                searchResultTemplate: '<li><a href="{url}" 
title="S3 Filesystem">{title}</a></li>',
                     noResultsText: 'No results found.',
                             limit: 10,
                             fuzzy: true,
@@ -327,7 +331,7 @@
     <!-- Content Column -->
     <div class="col-md-9">
         <div class="post-header">
-   <h1 class="post-title-main">S3 Filesystem (experimental)</h1>
+   <h1 class="post-title-main">S3 Filesystem</h1>
 </div>
 
 
@@ -343,11 +347,11 @@
 
     
 
-  <p>Hudi works with HDFS by default. There is an experimental work going on 
Hoodie-S3 compatibility.</p>
+  <p>In this page, we explain how to get your Hudi spark job to store into AWS 
S3.</p>
 
 <h2 id="aws-configs">AWS configs</h2>
 
-<p>There are two configurations required for Hoodie-S3 compatibility:</p>
+<p>There are two configurations required for Hudi-S3 compatibility:</p>
 
 <ul>
   <li>Adding AWS Credentials for Hudi</li>
@@ -415,7 +419,6 @@ export 
HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
 </ul>
 
 
-
     <div class="tags">
         
     </div>
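The `export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem` line in the s3_hoodie.html hunk above illustrates the naming convention for passing Hadoop configs through environment variables: drop the `HOODIE_ENV_` prefix and replace each `_DOT_` with a dot. A small Python sketch of that translation — inferred from the example shown, not taken from Hudi's source:

```python
def env_to_hadoop_conf(name, value):
    """Map a HOODIE_ENV_* variable to a (hadoop config key, value) pair.

    Assumption: keys are encoded by replacing '.' with '_DOT_' so they
    survive as shell identifiers, mirroring the example in the page.
    """
    key = name[len("HOODIE_ENV_"):].replace("_DOT_", ".")
    return key, value

assert env_to_hadoop_conf(
    "HOODIE_ENV_fs_DOT_s3n_DOT_impl",
    "org.apache.hadoop.fs.s3a.S3AFileSystem",
) == ("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
```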
diff --git a/content/search.json b/content/search.json
index 3f7eb15..0473b34 100644
--- a/content/search.json
+++ b/content/search.json
@@ -6,7 +6,7 @@
 {
 "title": "Admin Guide",
 "tags": "",
-"keywords": "admin",
+"keywords": "hudi, administration, operation, devops",
 "url": "admin_guide.html",
 "summary": "This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets"
 }
@@ -17,7 +17,7 @@
 {
 "title": "Community",
 "tags": "",
-"keywords": "usecases",
+"keywords": "hudi, use cases, big data, apache",
 "url": "community.html",
 "summary": ""
 }
@@ -28,7 +28,7 @@
 {
 "title": "Comparison",
 "tags": "",
-"keywords": "usecases",
+"keywords": "apache, hudi, kafka, kudu, hive, hbase, stream processing",
 "url": "comparison.html",
 "summary": ""
 }
@@ -39,7 +39,7 @@
 {
 "title": "Concepts",
 "tags": "",
-"keywords": "concepts",
+"keywords": "hudi, design, storage, views, timeline",
 "url": "concepts.html",
 "summary": "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
 }
@@ -50,7 +50,7 @@
 {
 "title": "Configurations",
 "tags": "",
-"keywords": "configurations",
+"keywords": "garbage collection, hudi, jvm, configs, tuning",
 "url": "configurations.html",
 "summary": "Here we list all possible configurations and what they mean"
 }
@@ -61,7 +61,7 @@
 {
 "title": "Developer Setup",
 "tags": "",
-"keywords": "developer setup",
+"keywords": "hudi, ide, developer, setup",
 "url": "contributing.html",
 "summary": ""
 }
@@ -72,9 +72,9 @@
 
 
 {
-"title": "GCS Filesystem (experimental)",
+"title": "GCS Filesystem",
 "tags": "",
-"keywords": "sql hive gcs spark presto",
+"keywords": "hudi, hive, google cloud, storage, spark, presto",
 "url": "gcs_hoodie.html",
 "summary": "In this page, we go over how to configure hudi with Google Cloud 
Storage."
 }
@@ -85,7 +85,7 @@
 {
 "title": "Implementation",
 "tags": "",
-"keywords": "implementation",
+"keywords": "hudi, index, storage, compaction, cleaning, implementation",
 "url": "implementation.html",
 "summary": ""
 }
@@ -96,7 +96,7 @@
 {
 "title": "Incremental Processing",
 "tags": "",
-"keywords": "incremental processing",
+"keywords": "hudi, incremental, batch, stream, processing, Hive, ETL, Spark 
SQL",
 "url": "incremental_processing.html",
 "summary": "In this page, we will discuss some available tools for ingesting 
data incrementally & consuming the changes."
 }
@@ -107,7 +107,7 @@
 {
 "title": "What is Hudi?",
 "tags": "getting_started",
-"keywords": "homepage",
+"keywords": "big data, stream processing, cloud, hdfs, storage, upserts, 
change capture",
 "url": "index.html",
 "summary": "Hudi brings stream processing to big data, providing fresh data 
while being an order of magnitude efficient over traditional batch processing."
 }
@@ -118,7 +118,7 @@
 {
 "title": "Migration Guide",
 "tags": "",
-"keywords": "migration guide",
+"keywords": "hudi, migration, use case",
 "url": "migration_guide.html",
 "summary": "In this page, we will discuss some available tools for migrating 
your existing dataset into a Hudi dataset"
 }
@@ -140,7 +140,7 @@
 {
 "title": "News",
 "tags": "",
-"keywords": "news, blog, updates, release notes, announcements",
+"keywords": "apache, hudi, news, blog, updates, release notes, announcements",
 "url": "news.html",
 "summary": ""
 }
@@ -162,7 +162,7 @@
 {
 "title": "Talks &amp; Powered By",
 "tags": "",
-"keywords": "talks",
+"keywords": "hudi, talks, presentation",
 "url": "powered_by.html",
 "summary": ""
 }
@@ -173,7 +173,7 @@
 {
 "title": "Privacy Policy",
 "tags": "",
-"keywords": "privacy",
+"keywords": "hudi, privacy",
 "url": "privacy.html",
 "summary": ""
 }
@@ -184,7 +184,7 @@
 {
 "title": "Quickstart",
 "tags": "quickstart",
-"keywords": "quickstart",
+"keywords": "hudi, quickstart",
 "url": "quickstart.html",
 "summary": ""
 }
@@ -193,9 +193,9 @@
 
 
 {
-"title": "S3 Filesystem (experimental)",
+"title": "S3 Filesystem",
 "tags": "",
-"keywords": "sql hive s3 spark presto",
+"keywords": "hudi, hive, aws, s3, spark, presto",
 "url": "s3_hoodie.html",
 "summary": "In this page, we go over how to configure Hudi with S3 filesystem."
 }
@@ -210,7 +210,7 @@
 {
 "title": "SQL Queries",
 "tags": "",
-"keywords": "sql hive spark presto",
+"keywords": "hudi, hive, spark, sql, presto",
 "url": "sql_queries.html",
 "summary": "In this page, we go over how to enable SQL queries on Hudi built 
tables."
 }
@@ -221,7 +221,7 @@
 {
 "title": "Use Cases",
 "tags": "",
-"keywords": "usecases",
+"keywords": "hudi, data ingestion, etl, real time, use cases",
 "url": "use_cases.html",
 "summary": "Following are some sample use-cases for Hudi, which illustrate the 
benefits in terms of faster processing & increased efficiency"
 }
diff --git a/content/sql_queries.html b/content/sql_queries.html
index 6936191..d7fa8cc 100644
--- a/content/sql_queries.html
+++ b/content/sql_queries.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="In this page, we go over how to enable SQL 
queries on Hudi built tables.">
-<meta name="keywords" content=" sql hive spark presto">
+<meta name="keywords" content="hudi, hive, spark, sql, presto">
 <title>SQL Queries | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -368,8 +372,6 @@ to using the Hive Serde to read the data 
(planning/executions is still Spark). T
 towards Parquet reading, which we will address in the next method based on 
path filters.
 However benchmarks have not revealed any real performance degradation with 
Hudi &amp; SparkSQL, compared to native support.</p>
 
-<div class="bs-callout bs-callout-info">Get involved to improve this 
integration <a href="https://github.com/uber/hoodie/issues/7";>here</a> and <a 
href="https://issues.apache.org/jira/browse/SPARK-19351";>here</a> </div>
-
 <p>Sample command is provided below to spin up Spark Shell</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>$ spark-shell 
--jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path 
/etc/hive/conf  --packages com.databricks:spark-avro_2.11:4.0.0 --conf 
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 
7g --executor-memory 2g  --master yarn-client
diff --git a/content/strata-talk.html b/content/strata-talk.html
index 13a8375..58b6f8a 100644
--- a/content/strata-talk.html
+++ b/content/strata-talk.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="">
-<meta name="keywords" content="news,  ">
+<meta name="keywords" content="">
 <title>Hudi entered Apache Incubator | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
diff --git a/content/use_cases.html b/content/use_cases.html
index 6df8c34..dcdf403 100644
--- a/content/use_cases.html
+++ b/content/use_cases.html
@@ -4,7 +4,7 @@
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta name="description" content="Following are some sample use-cases for 
Hudi, which illustrate the benefits in terms of faster processing & increased 
efficiency">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="hudi, data ingestion, etl, real time, use 
cases">
 <title>Use Cases | Hudi</title>
 <link rel="stylesheet" href="css/syntax.css">
 
@@ -149,6 +149,10 @@
                         <li><a 
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI";
 target="_blank">Blog</a></li>
                         
                         
+                        
+                        <li><a 
href="https://projects.apache.org/project.html?incubator-hudi"; 
target="_blank">Team</a></li>
+                        
+                        
                     </ul>
                 </li>
                 
@@ -350,7 +354,7 @@ In most (if not all) Hadoop deployments, it is 
unfortunately solved in a pieceme
 even though this data is arguably the most valuable for the entire 
organization.</p>
 
 <p>For RDBMS ingestion, Hudi provides <strong>faster loads via 
Upserts</strong>, as opposed costly &amp; inefficient bulk loads. For e.g, you 
can read the MySQL BIN log or <a 
href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports";>Sqoop
 Incremental Import</a> and apply them to an
-equivalent Hudi table on HDFS. This would be much faster/efficient than a <a 
href="https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457";>bulk
 merge job</a>
+equivalent Hudi table on DFS. This would be much faster/efficient than a <a 
href="https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457";>bulk
 merge job</a>
 or <a 
href="http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/";>complicated
 handcrafted merge workflows</a></p>
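As a toy illustration of the upsert model described above (plain Python dictionaries standing in for a Hudi table and hypothetical record shapes, not Hudi APIs), applying a change log touches only the affected keys instead of rewriting the whole table:

```python
# Sketch: upserting a batch of change records (e.g. from a BIN log or an
# incremental Sqoop import) into a table keyed by record key. Only changed
# keys are written; untouched rows are left alone, unlike a bulk merge.
def apply_changes(table, changes):
    """Insert-or-update each change record into `table` by its key."""
    for rec in changes:
        table[rec["key"]] = rec  # insert if new, overwrite if existing
    return table

table = {"u1": {"key": "u1", "city": "SF"}}
changes = [
    {"key": "u1", "city": "NYC"},  # update of an existing row
    {"key": "u2", "city": "LA"},   # brand new row
]
apply_changes(table, changes)
```

The record fields here are illustrative; the point is only the insert-or-update semantics that replaces a full-table rewrite.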
 
 <p>For NoSQL datastores like <a 
href="http://cassandra.apache.org/";>Cassandra</a> / <a 
href="http://www.project-voldemort.com/voldemort/";>Voldemort</a> / <a 
href="https://hbase.apache.org/";>HBase</a>, even moderately big installations 
store billions of rows.
@@ -367,13 +371,13 @@ This is absolutely perfect for lower scale (<a 
href="https://blog.twitter.com/20
 But, typically these systems end up getting abused for less interactive 
queries also since data on Hadoop is intolerably stale. This leads to under 
utilization &amp; wasteful hardware/license costs.</p>
 
 <p>On the other hand, interactive SQL solutions on Hadoop such as Presto &amp; 
SparkSQL excel in <strong>queries that finish within few seconds</strong>.
-By bringing <strong>data freshness to a few minutes</strong>, Hudi can provide 
a much efficient alternative, as well unlock real-time analytics on 
<strong>several magnitudes larger datasets</strong> stored in HDFS.
+By bringing <strong>data freshness to a few minutes</strong>, Hudi can provide 
a much efficient alternative, as well unlock real-time analytics on 
<strong>several magnitudes larger datasets</strong> stored in DFS.
 Also, Hudi has no external dependencies (like a dedicated HBase cluster, 
purely used for real-time analytics) and thus enables faster analytics on much 
fresher analytics, without increasing the operational overhead.</p>
 
 <h2 id="incremental-processing-pipelines">Incremental Processing Pipelines</h2>
 
 <p>One fundamental ability Hadoop provides is to build a chain of datasets 
derived from each other via DAGs expressed as workflows.
-Workflows often depend on new data being output by multiple upstream workflows 
and traditionally, availability of new data is indicated by a new HDFS 
Folder/Hive Partition.
+Workflows often depend on new data being output by multiple upstream workflows 
and traditionally, availability of new data is indicated by a new DFS 
Folder/Hive Partition.
 Let’s take a concrete example to illustrate this. An upstream workflow <code 
class="highlighter-rouge">U</code> can create a Hive partition for every hour, 
with data for that hour (event_time) at the end of each hour (processing_time), 
providing effective freshness of 1 hour.
 Then, a downstream workflow <code class="highlighter-rouge">D</code>, kicks 
off immediately after <code class="highlighter-rouge">U</code> finishes, and 
does its own processing for the next hour, increasing the effective latency to 
2 hours.</p>
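The incremental pattern the paragraphs above describe can be sketched in a few lines of plain Python (a commit timeline standing in for Hudi's, with illustrative names, not Hudi's actual API): the downstream job pulls only records committed after its last checkpoint, rather than reprocessing whole hourly partitions.

```python
# Sketch: a dataset that tracks commits on a timeline, letting a downstream
# consumer ask "what is new since my last checkpoint?" instead of re-reading
# an entire Folder/Partition.
class Dataset:
    def __init__(self):
        self.commits = []  # list of (commit_time, records), in commit order

    def commit(self, commit_time, records):
        """Record a batch of writes under a monotonically increasing time."""
        self.commits.append((commit_time, records))

    def incremental_pull(self, since):
        """Return all records from commits strictly after `since`."""
        return [r for t, recs in self.commits if t > since for r in recs]

upstream = Dataset()
upstream.commit("001", ["a", "b"])
upstream.commit("002", ["c"])

checkpoint = "001"  # downstream job D already consumed commit 001
new_records = upstream.incremental_pull(checkpoint)  # only commit 002's data
```

Run on a shorter schedule, a downstream job using this pull pattern processes only the delta each time, which is how the end-to-end latency drops from hours toward minutes.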
 
@@ -388,19 +392,18 @@ like 15 mins, and providing an end-end latency of 30 mins 
at <code class="highli
 
 <div class="bs-callout bs-callout-info">To achieve this, Hudi has embraced 
similar concepts from stream processing frameworks like <a 
href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations";>Spark
 Streaming</a> , Pub/Sub systems like <a 
href="http://kafka.apache.org/documentation/#theconsumer";>Kafka</a>
 or database replication technologies like <a 
href="https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187";>Oracle
 XStream</a>.
-For the more curious, a more detailed explanation of the benefits of 
Incremetal Processing (compared to Stream Processing &amp; Batch Processing) 
can be found <a 
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop";>here</a></div>
+For the more curious, a more detailed explanation of the benefits of 
Incremental Processing (compared to Stream Processing &amp; Batch Processing) 
can be found <a 
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop";>here</a></div>
 
-<h2 id="data-dispersal-from-hadoop">Data Dispersal From Hadoop</h2>
+<h2 id="data-dispersal-from-dfs">Data Dispersal From DFS</h2>
 
 <p>A popular use-case for Hadoop, is to crunch data and then disperse it back 
to an online serving store, to be used by an application.
 For e.g, a Spark Pipeline can <a 
href="https://eng.uber.com/telematics/";>determine hard braking events on 
Hadoop</a> and load them into a serving store like ElasticSearch, to be used by 
the Uber application to increase safe driving. Typical architectures for this 
employ a <code class="highlighter-rouge">queue</code> between Hadoop and 
serving store, to prevent overwhelming the target serving store.
-A popular choice for this queue is Kafka and this model often results in 
<strong>redundant storage of same data on HDFS (for offline analysis on 
computed results) and Kafka (for dispersal)</strong></p>
+A popular choice for this queue is Kafka and this model often results in 
<strong>redundant storage of same data on DFS (for offline analysis on computed 
results) and Kafka (for dispersal)</strong></p>
 
 <p>Once again Hudi can efficiently solve this problem, by having the Spark 
Pipeline upsert output from
 each run into a Hudi dataset, which can then be incrementally tailed (just 
like a Kafka topic) for new data &amp; written into the serving store.</p>
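A minimal sketch of that dispersal loop, in plain Python under stated assumptions (the commit map, checkpoint, and store are all illustrative stand-ins, e.g. the dict plays the role of an ElasticSearch index): the sync job tails commits past its checkpoint and applies just those records, with no second copy of the data parked in a queue.

```python
# Sketch: incrementally tailing a dataset's commits to keep a serving store
# in sync, instead of double-writing results to both DFS and Kafka.
commits = {
    "001": [("trip1", "hard_brake")],
    "002": [("trip2", "hard_brake")],
}

serving_store = {}   # stand-in for the online store being served
last_synced = "001"  # checkpoint: dispersal job already applied commit 001

def sync(commits, store, since):
    """Apply records from every commit after `since`; return new checkpoint."""
    latest = since
    for t in sorted(commits):
        if t > since:
            store.update(commits[t])  # apply (key, value) pairs of this commit
            latest = max(latest, t)
    return latest

last_synced = sync(commits, serving_store, last_synced)
```

Each run picks up exactly the commits written since the previous run, which is the "tail it like a Kafka topic" behavior described above.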
 
 
-
     <div class="tags">
         
     </div>
