This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a9539c1 Updating site with latest content from docs folder (#783)
a9539c1 is described below
commit a9539c19fe1926e03d17b4ab660a3e882ee45933
Author: vinoth chandar <[email protected]>
AuthorDate: Thu Jul 11 23:04:53 2019 -0700
Updating site with latest content from docs folder (#783)
- yotpo usage
- hoodie-utilities-bundle jar replacement in deltastreamer commands
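
For reference, the jar replacement called out above amounts to swapping the
jar path passed to spark-submit; a minimal before/after sketch, assembled
from the doc changes below (your build output location may vary):

  # before: the plain hoodie-utilities jar
  spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    `ls hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help

  # after: the packaged utilities bundle
  spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    `ls packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` --help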
---
content/Gemfile | 10 ---
content/Gemfile.lock | 156 ---------------------------------------------
content/contributing.html | 4 +-
content/docker_demo.html | 14 +++-
content/feed.xml | 4 +-
content/powered_by.html | 14 +++-
content/querying_data.html | 7 ++
content/writing_data.html | 6 +-
docs/contributing.md | 2 +-
9 files changed, 39 insertions(+), 178 deletions(-)
diff --git a/content/Gemfile b/content/Gemfile
deleted file mode 100644
index b301eda..0000000
--- a/content/Gemfile
+++ /dev/null
@@ -1,10 +0,0 @@
-source "https://rubygems.org"
-
-
-gem "jekyll", "3.3.1"
-
-
-group :jekyll_plugins do
- gem "jekyll-feed", "~> 0.6"
- gem 'github-pages', '~> 106'
-end
diff --git a/content/Gemfile.lock b/content/Gemfile.lock
deleted file mode 100644
index b72b9b1..0000000
--- a/content/Gemfile.lock
+++ /dev/null
@@ -1,156 +0,0 @@
-GEM
- remote: https://rubygems.org/
- specs:
- activesupport (4.2.7)
- i18n (~> 0.7)
- json (~> 1.7, >= 1.7.7)
- minitest (~> 5.1)
- thread_safe (~> 0.3, >= 0.3.4)
- tzinfo (~> 1.1)
- addressable (2.4.0)
- coffee-script (2.4.1)
- coffee-script-source
- execjs
- coffee-script-source (1.12.2)
- colorator (1.1.0)
- concurrent-ruby (1.1.4)
- ethon (0.12.0)
- ffi (>= 1.3.0)
- execjs (2.7.0)
- faraday (0.15.4)
- multipart-post (>= 1.2, < 3)
- ffi (1.10.0)
- forwardable-extended (2.6.0)
- gemoji (2.1.0)
- github-pages (106)
- activesupport (= 4.2.7)
- github-pages-health-check (= 1.2.0)
- jekyll (= 3.3.1)
- jekyll-avatar (= 0.4.2)
- jekyll-coffeescript (= 1.0.1)
- jekyll-feed (= 0.8.0)
- jekyll-gist (= 1.4.0)
- jekyll-github-metadata (= 2.2.0)
- jekyll-mentions (= 1.2.0)
- jekyll-paginate (= 1.1.0)
- jekyll-redirect-from (= 0.11.0)
- jekyll-relative-links (= 0.2.1)
- jekyll-sass-converter (= 1.3.0)
- jekyll-seo-tag (= 2.1.0)
- jekyll-sitemap (= 0.12.0)
- jekyll-swiss (= 0.4.0)
- jemoji (= 0.7.0)
- kramdown (= 1.11.1)
- liquid (= 3.0.6)
- listen (= 3.0.6)
- mercenary (~> 0.3)
- minima (= 2.0.0)
- rouge (= 1.11.1)
- terminal-table (~> 1.4)
- github-pages-health-check (1.2.0)
- addressable (~> 2.3)
- net-dns (~> 0.8)
- octokit (~> 4.0)
- public_suffix (~> 1.4)
- typhoeus (~> 0.7)
- html-pipeline (2.10.0)
- activesupport (>= 2)
- nokogiri (>= 1.4)
- i18n (0.9.5)
- concurrent-ruby (~> 1.0)
- jekyll (3.3.1)
- addressable (~> 2.4)
- colorator (~> 1.0)
- jekyll-sass-converter (~> 1.0)
- jekyll-watch (~> 1.1)
- kramdown (~> 1.3)
- liquid (~> 3.0)
- mercenary (~> 0.3.3)
- pathutil (~> 0.9)
- rouge (~> 1.7)
- safe_yaml (~> 1.0)
- jekyll-avatar (0.4.2)
- jekyll (~> 3.0)
- jekyll-coffeescript (1.0.1)
- coffee-script (~> 2.2)
- jekyll-feed (0.8.0)
- jekyll (~> 3.3)
- jekyll-gist (1.4.0)
- octokit (~> 4.2)
- jekyll-github-metadata (2.2.0)
- jekyll (~> 3.1)
- octokit (~> 4.0, != 4.4.0)
- jekyll-mentions (1.2.0)
- activesupport (~> 4.0)
- html-pipeline (~> 2.3)
- jekyll (~> 3.0)
- jekyll-paginate (1.1.0)
- jekyll-redirect-from (0.11.0)
- jekyll (>= 2.0)
- jekyll-relative-links (0.2.1)
- jekyll (~> 3.3)
- jekyll-sass-converter (1.3.0)
- sass (~> 3.2)
- jekyll-seo-tag (2.1.0)
- jekyll (~> 3.3)
- jekyll-sitemap (0.12.0)
- jekyll (~> 3.3)
- jekyll-swiss (0.4.0)
- jekyll-watch (1.5.1)
- listen (~> 3.0)
- jemoji (0.7.0)
- activesupport (~> 4.0)
- gemoji (~> 2.0)
- html-pipeline (~> 2.2)
- jekyll (>= 3.0)
- json (2.1.0)
- kramdown (1.11.1)
- liquid (3.0.6)
- listen (3.0.6)
- rb-fsevent (>= 0.9.3)
- rb-inotify (>= 0.9.7)
- mercenary (0.3.6)
- mini_portile2 (2.4.0)
- minima (2.0.0)
- minitest (5.11.3)
- multipart-post (2.0.0)
- net-dns (0.9.0)
- nokogiri (1.10.1)
- mini_portile2 (~> 2.4.0)
- octokit (4.13.0)
- sawyer (~> 0.8.0, >= 0.5.3)
- pathutil (0.16.2)
- forwardable-extended (~> 2.6)
- public_suffix (1.5.3)
- rb-fsevent (0.10.3)
- rb-inotify (0.10.0)
- ffi (~> 1.0)
- rouge (1.11.1)
- safe_yaml (1.0.4)
- sass (3.7.3)
- sass-listen (~> 4.0.0)
- sass-listen (4.0.0)
- rb-fsevent (~> 0.9, >= 0.9.4)
- rb-inotify (~> 0.9, >= 0.9.7)
- sawyer (0.8.1)
- addressable (>= 2.3.5, < 2.6)
- faraday (~> 0.8, < 1.0)
- terminal-table (1.8.0)
- unicode-display_width (~> 1.1, >= 1.1.1)
- thread_safe (0.3.6)
- typhoeus (0.8.0)
- ethon (>= 0.8.0)
- tzinfo (1.2.5)
- thread_safe (~> 0.1)
- unicode-display_width (1.4.1)
-
-PLATFORMS
- ruby
-
-DEPENDENCIES
- github-pages (~> 106)
- jekyll (= 3.3.1)
- jekyll-feed (~> 0.6)
-
-BUNDLED WITH
- 1.14.3
diff --git a/content/contributing.html b/content/contributing.html
index 662a8bf..6ce7219 100644
--- a/content/contributing.html
+++ b/content/contributing.html
@@ -350,7 +350,7 @@ Software Foundation (ASF).</li>
<p>To contribute, you would need to fork the Hudi code on GitHub & then
clone your own fork locally. Once cloned, we recommend building as per
instructions on <a href="quickstart.html">quickstart</a></p>
-<p>We have embraced the code style largely based on <a
href="https://google.github.io/styleguide/javaguide.html">google format</a>.
Please set up your IDE with style files from <a href="../style/">here</a>.
+<p>We have embraced the code style largely based on <a
href="https://google.github.io/styleguide/javaguide.html">google format</a>.
Please set up your IDE with style files from <a
href="https://github.com/apache/incubator-hudi/tree/master/style">here</a>.
These instructions have been tested on IntelliJ. We also recommend setting up
the <a href="https://plugins.jetbrains.com/plugin/7642-save-actions">Save
Action Plugin</a> to auto-format & organize imports on save. The Maven
compile lifecycle will fail if there are checkstyle violations.</p>
<h2 id="lifecycle">Lifecycle</h2>
@@ -431,7 +431,7 @@ Discussion about contributing code to Hudi happens on the
<a href="community.htm
<li><code class="highlighter-rouge">hoodie-integ-test</code> : Longer
running integration test processes</li>
<li><code class="highlighter-rouge">hoodie-spark</code> : Spark datasource
for writing and reading Hudi datasets. Streaming sink.</li>
<li><code class="highlighter-rouge">hoodie-utilities</code> : Houses tools
like DeltaStreamer, SnapshotCopier</li>
- <li><code class="highlighter-rouge">packaging</code> : Poms for building out
bundles for easier drop in to Spark, Hive, Presto</li>
+ <li><code class="highlighter-rouge">packaging</code> : Poms for building out
bundles for easier drop in to Spark, Hive, Presto, Utilities</li>
<li><code class="highlighter-rouge">style</code> : Code formatting,
checkstyle files</li>
</ul>
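
As a companion to the contributing instructions above, a typical first-time
contributor flow might look like the following; a sketch, assuming a standard
Maven build (the authoritative build steps live on the quickstart page, and
<your-user> is a placeholder for your GitHub username):

  # clone your fork of the Hudi repository
  git clone https://github.com/<your-user>/incubator-hudi.git
  cd incubator-hudi

  # build; checkstyle violations will fail the Maven compile lifecycle
  mvn clean install -DskipTests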
diff --git a/content/docker_demo.html b/content/docker_demo.html
index e54125e..3f9d0a6 100644
--- a/content/docker_demo.html
+++ b/content/docker_demo.html
@@ -493,7 +493,7 @@ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
....
2018-09-24 22:22:01 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
2018-09-24 22:22:01 INFO SparkContext:54 - Successfully stopped SparkContext
@@ -764,7 +764,7 @@ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
exit
</code></pre>
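
Both spark-submit commands above reference $HUDI_UTILITIES_BUNDLE rather than
a hardcoded jar path; a sketch of how that variable might be defined inside
the demo container, assuming the bundle jar location used elsewhere in this
commit:

  # hypothetical: point the variable at the built utilities bundle jar
  export HUDI_UTILITIES_BUNDLE=`ls packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar`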
@@ -1048,6 +1048,9 @@ hoodie:stock_ticks_mor->compactions show all
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
+
+
+
# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks_mor->compaction schedule
....
@@ -1062,6 +1065,8 @@ hoodie:stock_ticks->connect --path
/user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
+
+
hoodie:stock_ticks_mor->compactions show all
18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants
[[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED],
[20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED],
[==>20180924063245__compaction__REQUESTED]]
___________________________________________________________________
@@ -1069,6 +1074,9 @@ hoodie:stock_ticks_mor->compactions show all
|==================================================================|
| 20180924070031 | REQUESTED| 1 |
+
+
+
# Execute the compaction. The compaction instant value passed below must be
the one displayed in the above "compactions show all" query
hoodie:stock_ticks_mor->compaction run --compactionInstant 20180924070031
--parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc
--retry 1
....
@@ -1084,6 +1092,8 @@ hoodie:stock_ticks_mor->connect --path
/user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
+
+
hoodie:stock_ticks->compactions show all
18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants
[[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED],
[20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED],
[20180924070031__commit__COMPLETED]]
___________________________________________________________________
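
Pulling the CLI fragments above together, the merge-on-read compaction
workflow in this demo runs roughly as follows (instant times differ per run;
<instant> stands for the value reported by "compactions show all"):

  hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
  hoodie:stock_ticks_mor->compactions show all
  hoodie:stock_ticks_mor->compaction schedule
  # re-connect to refresh the timeline, then execute the scheduled instant
  hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
  hoodie:stock_ticks_mor->compaction run --compactionInstant <instant> --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1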
diff --git a/content/feed.xml b/content/feed.xml
index 6e0b82a..993ea91 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
<description>Apache Hudi (pronounced “Hoodie”) provides upserts and
incremental processing capabilities on Big Data</description>
<link>http://0.0.0.0:4000/</link>
<atom:link href="http://0.0.0.0:4000/feed.xml" rel="self"
type="application/rss+xml"/>
- <pubDate>Tue, 14 May 2019 12:07:03 +0000</pubDate>
- <lastBuildDate>Tue, 14 May 2019 12:07:03 +0000</lastBuildDate>
+ <pubDate>Fri, 12 Jul 2019 05:57:42 +0000</pubDate>
+ <lastBuildDate>Fri, 12 Jul 2019 05:57:42 +0000</lastBuildDate>
<generator>Jekyll v3.3.1</generator>
<item>
diff --git a/content/powered_by.html b/content/powered_by.html
index 870ca11..5ebc76c 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -347,6 +347,10 @@ It also powers several incremental Hive ETL pipelines and
being currently integr
<p>Yields.io is the first FinTech platform that uses AI for automated model
validation and real-time monitoring on an enterprise-wide scale. Their data
lake is managed by Hudi. They are also actively building their infrastructure
for incremental, cross language/platform machine learning using Hudi.</p>
+<h4 id="yotpo">Yotpo</h4>
+
+<p>Hudi is used at Yotpo in several ways. It is integrated as a writer in
their open source ETL framework https://github.com/YotpoLtd/metorikku, and
serves as the output writer for a CDC pipeline, in which events generated from
database binlog streams are published to Kafka and then written to
S3.</p>
+
<h2 id="talks--presentations">Talks & Presentations</h2>
<ol>
@@ -367,10 +371,16 @@ June 2017, Spark Summit 2017, San Francisco, CA. <a
href="https://www.slideshare
September 2018, Strata Data Conference, New York, NY</p>
</li>
<li>
- <p><a href="https://databricks
-.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale">“Hudi:
Large-Scale, Near Real-Time Pipelines at Uber”</a> - By Vinoth Chander &
Nishith Agarwal
+ <p><a
href="https://databricks.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale">“Hudi:
Large-Scale, Near Real-Time Pipelines at Uber”</a> - By Vinoth Chandar &
Nishith Agarwal
October 2018, Spark+AI Summit Europe, London, UK</p>
</li>
+ <li>
+ <p><a href="https://www.youtube.com/watch?v=1w3IpavhSWA">“Powering Uber’s
global network analytics pipelines in real-time with Apache Hudi”</a> - By
Ethan Guo & Nishith Agarwal, April 2019, Data Council SF19, San Francisco,
CA.</p>
+ </li>
+ <li>
+ <p><a
href="https://www.slideshare.net/ChesterChen/sf-big-analytics-20190612-building-highly-efficient-data-lakes-using-apache-hudi">“Building
highly efficient data lakes using Apache Hudi (Incubating)”</a> - By Vinoth
Chandar
+June 2019, SF Big Analytics Meetup, San Mateo, CA</p>
+ </li>
</ol>
<h2 id="articles">Articles</h2>
diff --git a/content/querying_data.html b/content/querying_data.html
index c970927..446cee8 100644
--- a/content/querying_data.html
+++ b/content/querying_data.html
@@ -462,6 +462,13 @@ then the utility can determine if the target dataset has
no commits or is behind
it will automatically use the backfill configuration, since applying the last
24 hours incrementally could take more time than doing a backfill. The current
limitation of the tool
is the lack of support for self-joining the same table in mixed mode (normal
and incremental modes).</p>
+<p><strong>NOTE on Hive queries that are executed using Fetch tasks:</strong>
+Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie
metadata can end up being listed for
+every such listStatus() call. To avoid this, it may be useful to disable fetch
tasks for incremental queries
+using the Hive session property: <code
class="highlighter-rouge">set hive.fetch.task.conversion=none;</code> This
+ensures MapReduce execution is chosen for the Hive query, which combines
partitions (comma
+separated) and calls InputFormat.listStatus() only once with all those
partitions.</p>
+
<h2 id="spark">Spark</h2>
<p>Spark provides much easier deployment & management of Hudi jars and
bundles into jobs/notebooks. At a high level, there are two ways to access Hudi
datasets in Spark.</p>
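
To make the fetch-task note above concrete, the session property is set before
issuing the incremental query; a sketch, assuming a beeline session (the JDBC
URL and the query itself are illustrative, not from the docs):

  beeline -u jdbc:hive2://hiveserver:10000 \
    -e "set hive.fetch.task.conversion=none; select symbol, max(ts) from stock_ticks_mor group by symbol;"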
diff --git a/content/writing_data.html b/content/writing_data.html
index 061abb7..30e9be3 100644
--- a/content/writing_data.html
+++ b/content/writing_data.html
@@ -338,7 +338,7 @@ speeding up large Spark jobs via upserts using the <a
href="#datasource-writer">
<h2 id="deltastreamer">DeltaStreamer</h2>
-<p>The <code class="highlighter-rouge">HoodieDeltaStreamer</code> utility
(part of hoodie-utilities) provides a way to ingest from different sources
such as DFS or Kafka, with the following capabilities.</p>
+<p>The <code class="highlighter-rouge">HoodieDeltaStreamer</code> utility
(part of hoodie-utilities-bundle) provides a way to ingest from different
sources such as DFS or Kafka, with the following capabilities.</p>
<ul>
<li>Exactly once ingestion of new events from Kafka, <a
href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports">incremental
imports</a> from Sqoop or output of <code
class="highlighter-rouge">HiveIncrementalPuller</code> or files under a DFS
folder</li>
@@ -350,7 +350,7 @@ speeding up large Spark jobs via upserts using the <a
href="#datasource-writer">
<p>Command line options describe capabilities in more detail</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
+<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--commit-on-errors
@@ -439,7 +439,7 @@ provided under <code
class="highlighter-rouge">hoodie-utilities/src/test/resourc
<p>and then ingest it as follows.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` \
+<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` \
--props
file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
--schemaprovider-class
com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
--source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
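
Reassembled from the wrapped diff lines above, the updated ingest command
reads roughly as follows; a sketch using only the options visible in this
hunk (the remaining options are truncated in the diff and are left out here):

  [hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    `ls packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` \
    --props file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
    --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
    --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource ...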
diff --git a/docs/contributing.md b/docs/contributing.md
index 71673c8..c8c505d 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -23,7 +23,7 @@ To contribute code, you need
To contribute, you would need to fork the Hudi code on GitHub & then clone
your own fork locally. Once cloned, we recommend building as per instructions
on [quickstart](quickstart.html)
-We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html). Please set up your
IDE with style files from [here](../style/).
+We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html). Please set up your
IDE with style files from
[here](https://github.com/apache/incubator-hudi/tree/master/style).
These instructions have been tested on IntelliJ. We also recommend setting up
the [Save Action
Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto-format
& organize imports on save. The Maven compile lifecycle will fail if there
are checkstyle violations.