This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 160b6a3  Updating Apache Hudi Website with latest changes in docs
160b6a3 is described below

commit 160b6a31515c6d9dcfff11dbbf5e540340c07141
Author: Balaji Varadarajan <varad...@uber.com>
AuthorDate: Sun May 5 17:02:50 2019 -0700

    Updating Apache Hudi Website with latest changes in docs
---
 content/admin_guide.html | 41 ++++++++++++++------------------------
 content/docker_demo.html | 51 ++++++++++++++++++++++++++----------------------
 content/feed.xml         |  4 ++--
 content/powered_by.html  |  8 ++++++++
 docs/powered_by.md       |  4 ++++
 5 files changed, 57 insertions(+), 51 deletions(-)

diff --git a/content/admin_guide.html b/content/admin_guide.html
index dde7918..da7d1be 100644
--- a/content/admin_guide.html
+++ b/content/admin_guide.html
@@ -372,8 +372,7 @@ hoodie-&gt;create --path /user/hive/warehouse/table1 
--tableName hoodie_table_1
 
 <p>To see the description of hudi table, use the command:</p>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-hoodie:hoodie_table_1-&gt;desc
+<div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:hoodie_table_1-&gt;desc
 18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
     _________________________________________________________
     | Property                | Value                        |
@@ -384,7 +383,6 @@ hoodie:hoodie_table_1-&gt;desc
     | hoodie.table.name       | hoodie_table_1               |
     | hoodie.table.type       | COPY_ON_WRITE                |
     | hoodie.archivelog.folder|                              |
-
 </code></pre>
 </div>
 
@@ -450,7 +448,6 @@ Each commit has a monotonically increasing string/number 
called the <strong>comm
     ....
     ....
 hoodie:trips-&gt;
-
 </code></pre>
 </div>
 
@@ -551,7 +548,6 @@ pending compactions.</p>
     |==================================================================|
     | &lt;INSTANT_1&gt;            | REQUESTED| 35                           |
     | &lt;INSTANT_2&gt;            | INFLIGHT | 27                           |
-
 </code></pre>
 </div>
 
@@ -649,8 +645,6 @@ hoodie:stock_ticks_mor-&gt;compaction validate --instant 
20181005222601
     | File Id                             | Base Instant Time| Base Data File  
                                                                                
                                 | Num Delta Files| Valid| Error                
                                                           |
     
|=====================================================================================================================================================================================================================================================================================================|
     | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet|
 1              | false| All log files specified in compaction operation is not 
present. Missing ....    |
-
-
 </code></pre>
 </div>
 
@@ -664,40 +658,35 @@ so that are preserved. Hudi provides the following CLI to 
support it</p>
 
 <h5 id="unscheduling-compaction">UnScheduling Compaction</h5>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-hoodie:trips-&gt;compaction unscheduleFileId --fileId &lt;FileUUID&gt;
+<div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:trips-&gt;compaction unscheduleFileId --fileId 
&lt;FileUUID&gt;
 ....
 No File renames needed to unschedule file from pending compaction. Operation 
successful.
-
 </code></pre>
 </div>
 
-<p>In other cases, an entire compaction plan needs to be reverted. This is 
supported by the following CLI
-```</p>
+<p>In other cases, an entire compaction plan needs to be reverted. This is 
supported by the following CLI</p>
 
-<p>hoodie:trips-&gt;compaction unschedule –compactionInstant 
<compactionInstant>
+<div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:trips-&gt;compaction unschedule 
--compactionInstant &lt;compactionInstant&gt;
 .....
-No File renames needed to unschedule pending compaction. Operation 
successful.</compactionInstant></p>
+No File renames needed to unschedule pending compaction. Operation successful.
+</code></pre>
+</div>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-##### Repair Compaction
+<h5 id="repair-compaction">Repair Compaction</h5>
 
-The above compaction unscheduling operations could sometimes fail partially 
(e:g -&gt; DFS temporarily unavailable). With
+<p>The above compaction unscheduling operations could sometimes fail partially 
(e.g., DFS temporarily unavailable). With
 partial failures, the compaction operation could become inconsistent with the 
state of file-slices. When you run
-`compaction validate`, you can notice invalid compaction operations if there 
is one.  In these cases, the repair
+<code class="highlighter-rouge">compaction validate</code>, it will flag any 
invalid compaction operations.  In these cases, the repair
 command comes to the rescue: it will rearrange the file-slices so that there 
is no loss and the file-slices are
-consistent with the compaction plan
+consistent with the compaction plan.</p>
 
+<div class="highlighter-rouge"><pre 
class="highlight"><code>hoodie:stock_ticks_mor-&gt;compaction repair --instant 
20181005222611
+......
+Compaction successfully repaired
+.....
 </code></pre>
 </div>
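(After a repair, one way to confirm that the compaction plan and the file-slices agree again is to re-run the validate command shown earlier against the same instant, e.g.:)

    hoodie:stock_ticks_mor->compaction validate --instant 20181005222611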
 
-<p>hoodie:stock_ticks_mor-&gt;compaction repair –instant 20181005222611
-……
-Compaction successfully repaired
-…..</p>
-
-<p>```</p>
-
 <h2 id="metrics">Metrics</h2>
 
 <p>Once the Hudi Client is configured with the right datasetname and 
environment for metrics, it produces the following Graphite metrics that aid 
in debugging Hudi datasets</p>
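(As a rough illustration of the "configured with the right datasetname and environment for metrics" step above, a Hudi writer config along these lines turns on the Graphite reporter. The property names below are assumptions and should be checked against the Hudi version in use:)

    # illustrative Hudi client metrics settings -- key names are assumptions, verify for your version
    hoodie.metrics.on=true
    hoodie.metrics.graphite.host=graphite
    hoodie.metrics.graphite.port=2003
    hoodie.metrics.graphite.metric.prefix=<datasetname>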
diff --git a/content/docker_demo.html b/content/docker_demo.html
index a59d879..e54125e 100644
--- a/content/docker_demo.html
+++ b/content/docker_demo.html
@@ -489,8 +489,11 @@ spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
 ....
 2018-09-24 22:20:00 INFO  
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - 
OutputCommitCoordinator stopped!
 2018-09-24 22:20:00 INFO  SparkContext:54 - Successfully stopped SparkContext
+
+
+
 # Run the following spark-submit command to execute the delta-streamer and 
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table 
stock_ticks_mor --props /var/demo/config/kafka-source.properties
+spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table 
stock_ticks_mor --props /var/demo/config/kafka-source.properties 
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 ....
 2018-09-24 22:22:01 INFO  
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - 
OutputCommitCoordinator stopped!
 2018-09-24 22:22:01 INFO  SparkContext:54 - Successfully stopped SparkContext
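(The spark-submit invocations above and below point DeltaStreamer at --props /var/demo/config/kafka-source.properties. As a hedged sketch of what such a properties file typically carries -- the keys and values here are illustrative assumptions, not the exact contents of that file:)

    # illustrative DeltaStreamer source/schema settings -- names and values are assumptions
    hoodie.datasource.write.recordkey.field=key
    hoodie.datasource.write.partitionpath.field=date
    hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
    hoodie.deltastreamer.source.kafka.topic=stock_ticks
    metadata.broker.list=kafkabroker:9092
    auto.offset.reset=smallest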
@@ -757,14 +760,16 @@ partitions, there is no need to run hive-sync</p>
 docker exec -it adhoc-2 /bin/bash
 
 # Run the following spark-submit command to execute the delta-streamer and 
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties
+spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties 
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+
 
 # Run the following spark-submit command to execute the delta-streamer and 
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table 
stock_ticks_mor --props /var/demo/config/kafka-source.properties
+spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table 
stock_ticks_mor --props /var/demo/config/kafka-source.properties 
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
 
 exit
 </code></pre>
 </div>
+
 <p>With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in 
a new version of Parquet file getting created.
 See <code 
class="highlighter-rouge">http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31</code></p>
 
@@ -920,41 +925,41 @@ exit
 
 <p>With 2 batches of data ingested, let's showcase the support for incremental 
queries in Hudi Copy-On-Write datasets</p>
 
-<p>Lets take the same projection query example
-```
-docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 –hiveconf 
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat –hiveconf 
hive.stats.autogather=false</p>
+<p>Let's take the same projection query example</p>
 
-<p>0: jdbc:hive2://hiveserver:10000&gt; select <code 
class="highlighter-rouge">_hoodie_commit_time</code>, symbol, ts, volume, open, 
close  from stock_ticks_cow where  symbol = ‘GOOG’;
-+———————-+———+———————-+———+————+———–+–+
+<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it 
adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf 
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf 
hive.stats.autogather=false
+
+0: jdbc:hive2://hiveserver:10000&gt; select `_hoodie_commit_time`, symbol, ts, 
volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
 | _hoodie_commit_time  | symbol  |          ts          | volume  |    open    
|   close   |
-+———————-+———+———————-+———+————+———–+–+
++----------------------+---------+----------------------+---------+------------+-----------+--+
 | 20180924064621       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     
| 1230.02   |
 | 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  
| 1227.215  |
-+———————-+———+———————-+———+————+———–+–+</p>
++----------------------+---------+----------------------+---------+------------+-----------+--+
+</code></pre>
+</div>
 
-<div class="highlighter-rouge"><pre class="highlight"><code>
-As you notice from the above queries, there are 2 commits - 20180924064621 and 
20180924065039 in timeline order.
+<p>As you notice from the above queries, there are 2 commits - 20180924064621 
and 20180924065039 in timeline order.
 When you follow the steps, you will be getting different timestamps for 
commits. Substitute them
-in place of the above timestamps.
+in place of the above timestamps.</p>
 
-To show the effects of incremental-query, let us assume that a reader has 
already seen the changes as part of
+<p>To show the effects of incremental-query, let us assume that a reader has 
already seen the changes as part of
 ingesting the first batch. Now, for the reader to see the effect of the second batch, 
he/she has to set the start timestamp to
-the commit time of the first batch (20180924064621) and run incremental query
+the commit time of the first batch (20180924064621) and run an incremental 
query</p>
 
-`Hudi incremental mode` provides efficient scanning for incremental queries by 
filtering out files that do not have any
-candidate rows using hudi-managed metadata.
+<p>Hudi incremental mode provides efficient scanning for incremental queries 
by filtering out files that do not have any
+candidate rows using hudi-managed metadata.</p>
 
-</code></pre>
-</div>
-<p>docker exec -it adhoc-2 /bin/bash
-beeline -u jdbc:hive2://hiveserver:10000 –hiveconf 
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat –hiveconf 
hive.stats.autogather=false
+<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it 
adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf 
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf 
hive.stats.autogather=false
 0: jdbc:hive2://hiveserver:10000&gt; set 
hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
 No rows affected (0.009 seconds)
 0: jdbc:hive2://hiveserver:10000&gt;  set 
hoodie.stock_ticks_cow.consume.max.commits=3;
 No rows affected (0.009 seconds)
 0: jdbc:hive2://hiveserver:10000&gt; set 
hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
-```</p>
+</code></pre>
+</div>
 
 <p>With the above setting, file-ids that do not have any updates from the 
commit 20180924065039 are filtered out without scanning.
Here is the incremental query:</p>
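(The query itself is not visible in this hunk. As a sketch, it follows the same projection used earlier, restricted to rows committed after the configured start timestamp; the commit time below is this example's first-batch commit and will differ in your run:)

    0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';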
diff --git a/content/feed.xml b/content/feed.xml
index 0c27805..af8b991 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
        <description>Apache Hudi (pronounced “Hoodie”) provides upserts and 
incremental processing capabilities on Big Data</description>
         <link>http://0.0.0.0:4000/</link>
         <atom:link href="http://0.0.0.0:4000/feed.xml"; rel="self" 
type="application/rss+xml"/>
-        <pubDate>Fri, 22 Mar 2019 19:49:42 +0000</pubDate>
-        <lastBuildDate>Fri, 22 Mar 2019 19:49:42 +0000</lastBuildDate>
+        <pubDate>Mon, 06 May 2019 21:51:11 +0000</pubDate>
+        <lastBuildDate>Mon, 06 May 2019 21:51:11 +0000</lastBuildDate>
         <generator>Jekyll v3.3.1</generator>
         
         <item>
diff --git a/content/powered_by.html b/content/powered_by.html
index 2294ea1..870ca11 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -339,6 +339,14 @@
 It has been in production since Aug 2016, powering ~100 highly business-critical 
tables on Hadoop, worth 100s of TBs (including the top 10 tables such as 
trips, riders, and partners).
It also powers several incremental Hive ETL pipelines and is currently being 
integrated into Uber’s data dispersal system.</p>
 
+<h4 id="emis-health">EMIS Health</h4>
+
+<p><a href="https://www.emishealth.com/">EMIS Health</a> is the largest provider of 
Primary Care IT software in the UK, with datasets including more than 500Bn 
healthcare records. HUDI is used to manage their analytics datasets in 
production and keep them up-to-date with their upstream source. Presto is 
being used to query the data written in HUDI format.</p>
+
+<h4 id="yieldsio">Yields.io</h4>
+
+<p>Yields.io is the first FinTech platform that uses AI for automated model 
validation and real-time monitoring on an enterprise-wide scale. Their data 
lake is managed by Hudi. They are also actively building their infrastructure 
for incremental, cross language/platform machine learning using Hudi.</p>
+
 <h2 id="talks--presentations">Talks &amp; Presentations</h2>
 
 <ol>
diff --git a/docs/powered_by.md b/docs/powered_by.md
index e4058fd..36abcb0 100644
--- a/docs/powered_by.md
+++ b/docs/powered_by.md
@@ -18,6 +18,10 @@ It also powers several incremental Hive ETL pipelines and 
being currently integr
 
 [EMIS Health](https://www.emishealth.com/) is the largest provider of Primary 
Care IT software in the UK, with datasets including more than 500Bn healthcare 
records. HUDI is used to manage their analytics datasets in production and 
keep them up-to-date with their upstream source. Presto is being used to 
query the data written in HUDI format.
 
+#### Yields.io
+
+Yields.io is the first FinTech platform that uses AI for automated model 
validation and real-time monitoring on an enterprise-wide scale. Their data 
lake is managed by Hudi. They are also actively building their infrastructure 
for incremental, cross language/platform machine learning using Hudi.
+ 
 
 ## Talks & Presentations
 
