This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a9539c1 Updating site with latest content from docs folder (#783)
a9539c1 is described below
commit a9539c19fe1926e03d17b4ab660a3e882ee45933
Author: vinoth chandar <[email protected]>
AuthorDate: Thu Jul 11 23:04:53 2019 -0700
Updating site with latest content from docs folder (#783)
- yotpo usage
- hoodie-utilities-bundle jar replacement in deltastreamer commands
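
For reference, the jar replacement called out above amounts to swapping the
jar path passed to spark-submit; a minimal before/after sketch, assembled
from the doc changes below (your build output location may vary):

  # before: the plain hoodie-utilities jar
  spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    `ls hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help

  # after: the packaged utilities bundle
  spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    `ls packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` --help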
---
content/Gemfile | 10 ---
content/Gemfile.lock | 156 ---------------------------------------------
content/contributing.html | 4 +-
content/docker_demo.html | 14 +++-
content/feed.xml | 4 +-
content/powered_by.html | 14 +++-
content/querying_data.html | 7 ++
content/writing_data.html | 6 +-
docs/contributing.md | 2 +-
9 files changed, 39 insertions(+), 178 deletions(-)
diff --git a/content/Gemfile b/content/Gemfile
deleted file mode 100644
index b301eda..0000000
--- a/content/Gemfile
+++ /dev/null
@@ -1,10 +0,0 @@
-source "https://rubygems.org"
-
-
-gem "jekyll", "3.3.1"
-
-
-group :jekyll_plugins do
- gem "jekyll-feed", "~> 0.6"
- gem 'github-pages', '~> 106'
-end
diff --git a/content/Gemfile.lock b/content/Gemfile.lock
deleted file mode 100644
index b72b9b1..0000000
--- a/content/Gemfile.lock
+++ /dev/null
@@ -1,156 +0,0 @@
-GEM
- remote: https://rubygems.org/
- specs:
- activesupport (4.2.7)
- i18n (~> 0.7)
- json (~> 1.7, >= 1.7.7)
- minitest (~> 5.1)
- thread_safe (~> 0.3, >= 0.3.4)
- tzinfo (~> 1.1)
- addressable (2.4.0)
- coffee-script (2.4.1)
- coffee-script-source
- execjs
- coffee-script-source (1.12.2)
- colorator (1.1.0)
- concurrent-ruby (1.1.4)
- ethon (0.12.0)
- ffi (>= 1.3.0)
- execjs (2.7.0)
- faraday (0.15.4)
- multipart-post (>= 1.2, < 3)
- ffi (1.10.0)
- forwardable-extended (2.6.0)
- gemoji (2.1.0)
- github-pages (106)
- activesupport (= 4.2.7)
- github-pages-health-check (= 1.2.0)
- jekyll (= 3.3.1)
- jekyll-avatar (= 0.4.2)
- jekyll-coffeescript (= 1.0.1)
- jekyll-feed (= 0.8.0)
- jekyll-gist (= 1.4.0)
- jekyll-github-metadata (= 2.2.0)
- jekyll-mentions (= 1.2.0)
- jekyll-paginate (= 1.1.0)
- jekyll-redirect-from (= 0.11.0)
- jekyll-relative-links (= 0.2.1)
- jekyll-sass-converter (= 1.3.0)
- jekyll-seo-tag (= 2.1.0)
- jekyll-sitemap (= 0.12.0)
- jekyll-swiss (= 0.4.0)
- jemoji (= 0.7.0)
- kramdown (= 1.11.1)
- liquid (= 3.0.6)
- listen (= 3.0.6)
- mercenary (~> 0.3)
- minima (= 2.0.0)
- rouge (= 1.11.1)
- terminal-table (~> 1.4)
- github-pages-health-check (1.2.0)
- addressable (~> 2.3)
- net-dns (~> 0.8)
- octokit (~> 4.0)
- public_suffix (~> 1.4)
- typhoeus (~> 0.7)
- html-pipeline (2.10.0)
- activesupport (>= 2)
- nokogiri (>= 1.4)
- i18n (0.9.5)
- concurrent-ruby (~> 1.0)
- jekyll (3.3.1)
- addressable (~> 2.4)
- colorator (~> 1.0)
- jekyll-sass-converter (~> 1.0)
- jekyll-watch (~> 1.1)
- kramdown (~> 1.3)
- liquid (~> 3.0)
- mercenary (~> 0.3.3)
- pathutil (~> 0.9)
- rouge (~> 1.7)
- safe_yaml (~> 1.0)
- jekyll-avatar (0.4.2)
- jekyll (~> 3.0)
- jekyll-coffeescript (1.0.1)
- coffee-script (~> 2.2)
- jekyll-feed (0.8.0)
- jekyll (~> 3.3)
- jekyll-gist (1.4.0)
- octokit (~> 4.2)
- jekyll-github-metadata (2.2.0)
- jekyll (~> 3.1)
- octokit (~> 4.0, != 4.4.0)
- jekyll-mentions (1.2.0)
- activesupport (~> 4.0)
- html-pipeline (~> 2.3)
- jekyll (~> 3.0)
- jekyll-paginate (1.1.0)
- jekyll-redirect-from (0.11.0)
- jekyll (>= 2.0)
- jekyll-relative-links (0.2.1)
- jekyll (~> 3.3)
- jekyll-sass-converter (1.3.0)
- sass (~> 3.2)
- jekyll-seo-tag (2.1.0)
- jekyll (~> 3.3)
- jekyll-sitemap (0.12.0)
- jekyll (~> 3.3)
- jekyll-swiss (0.4.0)
- jekyll-watch (1.5.1)
- listen (~> 3.0)
- jemoji (0.7.0)
- activesupport (~> 4.0)
- gemoji (~> 2.0)
- html-pipeline (~> 2.2)
- jekyll (>= 3.0)
- json (2.1.0)
- kramdown (1.11.1)
- liquid (3.0.6)
- listen (3.0.6)
- rb-fsevent (>= 0.9.3)
- rb-inotify (>= 0.9.7)
- mercenary (0.3.6)
- mini_portile2 (2.4.0)
- minima (2.0.0)
- minitest (5.11.3)
- multipart-post (2.0.0)
- net-dns (0.9.0)
- nokogiri (1.10.1)
- mini_portile2 (~> 2.4.0)
- octokit (4.13.0)
- sawyer (~> 0.8.0, >= 0.5.3)
- pathutil (0.16.2)
- forwardable-extended (~> 2.6)
- public_suffix (1.5.3)
- rb-fsevent (0.10.3)
- rb-inotify (0.10.0)
- ffi (~> 1.0)
- rouge (1.11.1)
- safe_yaml (1.0.4)
- sass (3.7.3)
- sass-listen (~> 4.0.0)
- sass-listen (4.0.0)
- rb-fsevent (~> 0.9, >= 0.9.4)
- rb-inotify (~> 0.9, >= 0.9.7)
- sawyer (0.8.1)
- addressable (>= 2.3.5, < 2.6)
- faraday (~> 0.8, < 1.0)
- terminal-table (1.8.0)
- unicode-display_width (~> 1.1, >= 1.1.1)
- thread_safe (0.3.6)
- typhoeus (0.8.0)
- ethon (>= 0.8.0)
- tzinfo (1.2.5)
- thread_safe (~> 0.1)
- unicode-display_width (1.4.1)
-
-PLATFORMS
- ruby
-
-DEPENDENCIES
- github-pages (~> 106)
- jekyll (= 3.3.1)
- jekyll-feed (~> 0.6)
-
-BUNDLED WITH
- 1.14.3
diff --git a/content/contributing.html b/content/contributing.html
index 662a8bf..6ce7219 100644
--- a/content/contributing.html
+++ b/content/contributing.html
@@ -350,7 +350,7 @@ Software Foundation (ASF).</li>
<p>To contribute, you would need to fork the Hudi code on GitHub & then
clone your own fork locally. Once cloned, we recommend building as per
instructions on <a href="quickstart.html">quickstart</a></p>
-<p>We have embraced the code style largely based on <a
href="https://google.github.io/styleguide/javaguide.html">google format</a>.
Please set up your IDE with style files from <a href="../style/">here</a>.
+<p>We have embraced the code style largely based on <a
href="https://google.github.io/styleguide/javaguide.html">google format</a>.
Please set up your IDE with style files from <a
href="https://github.com/apache/incubator-hudi/tree/master/style">here</a>.
These instructions have been tested on IntelliJ. We also recommend setting up
the <a href="https://plugins.jetbrains.com/plugin/7642-save-actions">Save
Action Plugin</a> to auto-format & organize imports on save. The Maven
compile lifecycle will fail if there are checkstyle violations.</p>
<h2 id="lifecycle">Lifecycle</h2>
@@ -431,7 +431,7 @@ Discussion about contributing code to Hudi happens on the
<a href="community.htm
<li><code class="highlighter-rouge">hoodie-integ-test</code> : Longer
running integration test processes</li>
<li><code class="highlighter-rouge">hoodie-spark</code> : Spark datasource
for writing and reading Hudi datasets. Streaming sink.</li>
<li><code class="highlighter-rouge">hoodie-utilities</code> : Houses tools
like DeltaStreamer, SnapshotCopier</li>
- <li><code class="highlighter-rouge">packaging</code> : Poms for building out
bundles for easier drop in to Spark, Hive, Presto</li>
+ <li><code class="highlighter-rouge">packaging</code> : Poms for building out
bundles for easier drop in to Spark, Hive, Presto, Utilities</li>
<li><code class="highlighter-rouge">style</code> : Code formatting,
checkstyle files</li>
</ul>
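
As a companion to the contributing instructions above, a typical first-time
contributor flow might look like the following; a sketch, assuming a standard
Maven build (the authoritative build steps live on the quickstart page, and
<your-user> is a placeholder for your GitHub username):

  # clone your fork of the Hudi repository
  git clone https://github.com/<your-user>/incubator-hudi.git
  cd incubator-hudi

  # build; checkstyle violations will fail the Maven compile lifecycle
  mvn clean install -DskipTests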
diff --git a/content/docker_demo.html b/content/docker_demo.html
index e54125e..3f9d0a6 100644
--- a/content/docker_demo.html
+++ b/content/docker_demo.html
@@ -493,7 +493,7 @@ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
....
2018-09-24 22:22:01 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
2018-09-24 22:22:01 INFO SparkContext:54 - Successfully stopped SparkContext
@@ -764,7 +764,7 @@ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
exit
</code></pre>
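
Both spark-submit commands above reference $HUDI_UTILITIES_BUNDLE rather than
a hardcoded jar path; a sketch of how that variable might be defined inside
the demo container, assuming the bundle jar location used elsewhere in this
commit:

  # hypothetical: point the variable at the built utilities bundle jar
  export HUDI_UTILITIES_BUNDLE=`ls packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar`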
@@ -1048,6 +1048,9 @@ hoodie:stock_ticks_mor->compactions show all
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
+
+
+
# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks_mor->compaction schedule
....
@@ -1062,6 +1065,8 @@ hoodie:stock_ticks->connect --path
/user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
+
+
hoodie:stock_ticks_mor->compactions show all
18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants
[[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED],
[20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED],
[==>20180924063245__compaction__REQUESTED]]
___________________________________________________________________
@@ -1069,6 +1074,9 @@ hoodie:stock_ticks_mor->compactions show all
|==================================================================|
| 20180924070031 | REQUESTED| 1 |
+
+
+
# Execute the compaction. The compaction instant value passed below must be
the one displayed in the above "compactions show all" query
hoodie:stock_ticks_mor->compaction run --compactionInstant 20180924070031
--parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc
--retry 1
....
@@ -1084,6 +1092,8 @@ hoodie:stock_ticks_mor->connect --path
/user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of
type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
+
+
hoodie:stock_ticks->compactions show all
18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants
[[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED],
[20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED],
[20180924070031__commit__COMPLETED]]
___________________________________________________________________
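
Pulling the CLI fragments above together, the merge-on-read compaction
workflow in this demo runs roughly as follows (instant times differ per run;
<instant> stands for the value reported by "compactions show all"):

  hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
  hoodie:stock_ticks_mor->compactions show all
  hoodie:stock_ticks_mor->compaction schedule
  # re-connect to refresh the timeline, then execute the scheduled instant
  hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
  hoodie:stock_ticks_mor->compaction run --compactionInstant <instant> --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1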
diff --git a/content/feed.xml b/content/feed.xml
index 6e0b82a..993ea91 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
<description>Apache Hudi (pronounced “Hoodie”) provides upserts and
incremental processing capabilities on Big Data</description>
<link>http://0.0.0.0:4000/</link>
<atom:link href="http://0.0.0.0:4000/feed.xml" rel="self"
type="application/rss+xml"/>
- <pubDate>Tue, 14 May 2019 12:07:03 +0000</pubDate>
- <lastBuildDate>Tue, 14 May 2019 12:07:03 +0000</lastBuildDate>
+ <pubDate>Fri, 12 Jul 2019 05:57:42 +0000</pubDate>
+ <lastBuildDate>Fri, 12 Jul 2019 05:57:42 +0000</lastBuildDate>
<generator>Jekyll v3.3.1</generator>
<item>
diff --git a/content/powered_by.html b/content/powered_by.html
index 870ca11..5ebc76c 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -347,6 +347,10 @@ It also powers several incremental Hive ETL pipelines and
being currently integr
<p>Yields.io is the first FinTech platform that uses AI for automated model
validation and real-time monitoring on an enterprise-wide scale. Their data
lake is managed by Hudi. They are also actively building their infrastructure
for incremental, cross language/platform machine learning using Hudi.</p>
+<h4 id="yotpo">Yotpo</h4>
+
+<p>Hudi is used at Yotpo in several ways. It is integrated as a writer in
their open source ETL framework https://github.com/YotpoLtd/metorikku, and
serves as the output writer for a CDC pipeline, in which events generated from
database binlog streams are published to Kafka and then written to
S3.</p>
+
<h2 id="talks--presentations">Talks & Presentations</h2>
<ol>
@@ -367,10 +371,16 @@ June 2017, Spark Summit 2017, San Francisco, CA. <a
href="https://www.slideshare
September 2018, Strata Data Conference, New York, NY</p>
</li>
<li>
- <p><a href="https://databricks
-.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale">“Hudi:
Large-Scale, Near Real-Time Pipelines at Uber”</a> - By Vinoth Chander &
Nishith Agarwal
+ <p><a
href="https://databricks.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale">“Hudi:
Large-Scale, Near Real-Time Pipelines at Uber”</a> - By Vinoth Chandar &
Nishith Agarwal
October 2018, Spark+AI Summit Europe, London, UK</p>
</li>
+ <li>
+ <p><a href="https://www.youtube.com/watch?v=1w3IpavhSWA">“Powering Uber’s
global network analytics pipelines in real-time with Apache Hudi”</a> - By
Ethan Guo & Nishith Agarwal, April 2019, Data Council SF19, San Francisco,
CA.</p>
+ </li>
+ <li>
+ <p><a
href="https://www.slideshare.net/ChesterChen/sf-big-analytics-20190612-building-highly-efficient-data-lakes-using-apache-hudi">“Building
highly efficient data lakes using Apache Hudi (Incubating)”</a> - By Vinoth
Chandar
+June 2019, SF Big Analytics Meetup, San Mateo, CA</p>
+ </li>
</ol>
<h2 id="articles">Articles</h2>
diff --git a/content/querying_data.html b/content/querying_data.html
index c970927..446cee8 100644
--- a/content/querying_data.html
+++ b/content/querying_data.html
@@ -462,6 +462,13 @@ then the utility can determine if the target dataset has
no commits or is behind
it will automatically use the backfill configuration, since applying the last
24 hours incrementally could take more time than doing a backfill. The current
limitation of the tool
is the lack of support for self-joining the same table in mixed mode (normal
and incremental modes).</p>
+<p><strong>NOTE on Hive queries that are executed using Fetch tasks:</strong>
+Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie
metadata can end up being listed for
+every such listStatus() call. To avoid this, it may be useful to disable fetch
tasks for incremental queries
+using the Hive session property: <code
class="highlighter-rouge">set hive.fetch.task.conversion=none;</code> This
+ensures MapReduce execution is chosen for the Hive query, which combines
partitions (comma
+separated) and calls InputFormat.listStatus() only once with all those
partitions.</p>
+
<h2 id="spark">Spark</h2>
<p>Spark provides much easier deployment & management of Hudi jars and
bundles into jobs/notebooks. At a high level, there are two ways to access Hudi
datasets in Spark.</p>
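
To make the fetch-task note above concrete, the session property is set before
issuing the incremental query; a sketch, assuming a beeline session (the JDBC
URL and the query itself are illustrative, not from the docs):

  beeline -u jdbc:hive2://hiveserver:10000 \
    -e "set hive.fetch.task.conversion=none; select symbol, max(ts) from stock_ticks_mor group by symbol;"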
diff --git a/content/writing_data.html b/content/writing_data.html
index 061abb7..30e9be3 100644
--- a/content/writing_data.html
+++ b/content/writing_data.html
@@ -338,7 +338,7 @@ speeding up large Spark jobs via upserts using the <a
href="#datasource-writer">
<h2 id="deltastreamer">DeltaStreamer</h2>
-<p>The <code class="highlighter-rouge">HoodieDeltaStreamer</code> utility
(part of hoodie-utilities) provides a way to ingest from different sources
such as DFS or Kafka, with the following capabilities.</p>
+<p>The <code class="highlighter-rouge">HoodieDeltaStreamer</code> utility
(part of hoodie-utilities-bundle) provides a way to ingest from different
sources such as DFS or Kafka, with the following capabilities.</p>
<ul>
<li>Exactly once ingestion of new events from Kafka, <a
href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports">incremental
imports</a> from Sqoop or output of <code
class="highlighter-rouge">HiveIncrementalPuller</code> or files under a DFS
folder</li>
@@ -350,7 +350,7 @@ speeding up large Spark jobs via upserts using the <a
href="#datasource-writer">
<p>Command line options describe capabilities in more detail</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
+<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--commit-on-errors
@@ -439,7 +439,7 @@ provided under <code
class="highlighter-rouge">hoodie-utilities/src/test/resourc
<p>and then ingest it as follows.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` \
+<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` \
--props
file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
--schemaprovider-class
com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
--source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
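
Reassembled from the wrapped diff lines above, the updated ingest command
reads roughly as follows; a sketch using only the options visible in this
hunk (the remaining options are truncated in the diff and are left out here):

  [hoodie]$ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    `ls packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` \
    --props file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
    --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
    --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource ...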
diff --git a/docs/contributing.md b/docs/contributing.md
index 71673c8..c8c505d 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -23,7 +23,7 @@ To contribute code, you need
To contribute, you would need to fork the Hudi code on GitHub & then clone
your own fork locally. Once cloned, we recommend building as per instructions
on [quickstart](quickstart.html)
-We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html). Please set up your
IDE with style files from [here](../style/).
+We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html). Please set up your
IDE with style files from
[here](https://github.com/apache/incubator-hudi/tree/master/style).
These instructions have been tested on IntelliJ. We also recommend setting up
the [Save Action
Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto-format
& organize imports on save. The Maven compile lifecycle will fail if there
are checkstyle violations.