This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8c871ca HUDI-201 Updated docs to reflect migration from com.uber.hoodie to org.apache.hudi (#831)
8c871ca is described below
commit 8c871caa2cae60880cbb5e90a2a5dd80beebc64f
Author: Balaji Varadarajan <[email protected]>
AuthorDate: Sun Aug 11 17:59:09 2019 -0700
HUDI-201 Updated docs to reflect migration from com.uber.hoodie to org.apache.hudi (#831)
---
content/admin_guide.html     | 16 ++++-----
content/community.html       |  4 +--
content/configurations.html  | 14 ++++----
content/contributing.html    | 18 +++++-----
content/docker_demo.html     | 84 ++++++++++++++++++++++----------------------
content/feed.xml             |  4 +--
content/gcs_hoodie.html      |  2 +-
content/migration_guide.html |  6 ++--
content/querying_data.html   | 18 +++++-----
content/quickstart.html      | 18 +++++-----
content/s3_hoodie.html       |  2 +-
content/writing_data.html    | 50 +++++++++++++-------------
docs/admin_guide.md          | 16 ++++-----
docs/configurations.md       | 14 ++++----
docs/contributing.md         | 16 ++++-----
docs/docker_demo.md          | 84 ++++++++++++++++++++++----------------------
docs/gcs_filesystem.md       |  2 +-
docs/migration_guide.md      |  6 ++--
docs/querying_data.md        | 18 +++++-----
docs/quickstart.md           | 18 +++++-----
docs/s3_filesystem.md        |  2 +-
docs/writing_data.md         | 50 +++++++++++++-------------
22 files changed, 231 insertions(+), 231 deletions(-)
diff --git a/content/admin_guide.html b/content/admin_guide.html
index 6e85f8d..20323a4 100644
--- a/content/admin_guide.html
+++ b/content/admin_guide.html
@@ -345,7 +345,7 @@
<h2 id="admin-cli">Admin CLI</h2>
-<p>Once hudi has been built, the shell can be fired by via <code
class="highlighter-rouge">cd hoodie-cli && ./hoodie-cli.sh</code>.
+<p>Once hudi has been built, the shell can be fired by via <code
class="highlighter-rouge">cd hudi-cli && ./hudi-cli.sh</code>.
A hudi dataset resides on DFS, in a location referred to as the
<strong>basePath</strong> and we would need this location in order to connect
to a Hudi dataset.
Hudi library effectively manages this dataset internally, using .hoodie
subfolder to track all metadata</p>
@@ -354,17 +354,17 @@ Hudi library effectively manages this dataset internally,
using .hoodie subfolde
<div class="highlighter-rouge"><pre class="highlight"><code>18/09/06 15:56:52
INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330
'javax.inject.Inject' annotation found and supported for autowiring
============================================
* *
-* _ _ _ _ *
-* | | | | | (_) *
-* | |__| | ___ ___ __| |_ ___ *
-* | __ |/ _ \ / _ \ / _` | |/ _ \ *
-* | | | | (_) | (_) | (_| | | __/ *
-* |_| |_|\___/ \___/ \__,_|_|\___| *
+* _ _ _ _ *
+* | | | | | | (_) *
+* | |__| | __| | - *
+* | __ || | / _` | || *
+* | | | || || (_| | || *
+* |_| |_|\___/ \____/ || *
* *
============================================
Welcome to Hoodie CLI. Please type help if you are looking for help.
-hoodie->create --path /user/hive/warehouse/table1 --tableName
hoodie_table_1 --tableType COPY_ON_WRITE
+hudi->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1
--tableType COPY_ON_WRITE
.....
18/09/06 15:57:15 INFO table.HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from ...
</code></pre>
diff --git a/content/community.html b/content/community.html
index 853a5f6..89eb6c7 100644
--- a/content/community.html
+++ b/content/community.html
@@ -46,7 +46,7 @@
<script
src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
-<link rel="alternate" type="application/rss+xml" title=""
href="http://0.0.0.0:4000feed.xml">
+<link rel="alternate" type="application/rss+xml" title=""
href="http://localhost:4000feed.xml">
<script>
$(document).ready(function() {
@@ -470,4 +470,4 @@ Specifically, please refer to the detailed <a
href="contributing.html">contribut
</body>
-</html>
+</html>
\ No newline at end of file
diff --git a/content/configurations.html b/content/configurations.html
index bb073f0..09c4f8e 100644
--- a/content/configurations.html
+++ b/content/configurations.html
@@ -390,7 +390,7 @@ The actual datasource level configs are listed below.</p>
<p>Additionally, you can pass down any of the WriteClient level configs
directly using <code class="highlighter-rouge">options()</code> or <code
class="highlighter-rouge">option(k,v)</code> methods.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>inputDF.write()
-.format("com.uber.hoodie")
+.format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
@@ -422,7 +422,7 @@ The actual datasource level configs are listed below.</p>
we will pick the one with the largest value for the precombine field,
determined by Object.compareTo(..)</span></p>
<h5 id="PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</h5>
-<p>Property: <code
class="highlighter-rouge">hoodie.datasource.write.payload.class</code>,
Default: <code
class="highlighter-rouge">com.uber.hoodie.OverwriteWithLatestAvroPayload</code>
<br />
+<p>Property: <code
class="highlighter-rouge">hoodie.datasource.write.payload.class</code>,
Default: <code
class="highlighter-rouge">org.apache.hudi.OverwriteWithLatestAvroPayload</code>
<br />
<span style="color:grey">Payload class used. Override this, if you like to
roll your own merge logic, when upserting/inserting.
This will render any value set for <code
class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code>
in-effective</span></p>
@@ -438,7 +438,7 @@ the dot notation eg: <code
class="highlighter-rouge">a.b.c</code></span></p>
Actual value ontained by invoking .toString()</span></p>
<h5 id="KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</h5>
-<p>Property: <code
class="highlighter-rouge">hoodie.datasource.write.keygenerator.class</code>,
Default: <code
class="highlighter-rouge">com.uber.hoodie.SimpleKeyGenerator</code> <br />
+<p>Property: <code
class="highlighter-rouge">hoodie.datasource.write.keygenerator.class</code>,
Default: <code
class="highlighter-rouge">org.apache.hudi.SimpleKeyGenerator</code> <br />
<span style="color:grey">Key generator class, that implements will extract
the key out of incoming <code class="highlighter-rouge">Row</code>
object</span></p>
<h5
id="COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</h5>
@@ -479,7 +479,7 @@ This is useful to store checkpointing information, in a
consistent way with the
<span style="color:grey">field in the dataset to use for determining hive
partition columns.</span></p>
<h5
id="HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY">HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY</h5>
-<p>Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.partition_extractor_class</code>,
Default: <code
class="highlighter-rouge">com.uber.hoodie.hive.SlashEncodedDayPartitionValueExtractor</code>
<br />
+<p>Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.partition_extractor_class</code>,
Default: <code
class="highlighter-rouge">org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor</code>
<br />
<span style="color:grey">Class used to extract partition field values into
hive partition columns.</span></p>
<h5
id="HIVE_ASSUME_DATE_PARTITION_OPT_KEY">HIVE_ASSUME_DATE_PARTITION_OPT_KEY</h5>
@@ -722,7 +722,7 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code class="highlighter-rouge">hoodie.cleaner.parallelism</code>
<br />
<span style="color:grey">Increase this if cleaning becomes slow.</span></p>
-<h5 id="withCompactionStrategy">withCompactionStrategy(compactionStrategy =
com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy)</h5>
+<h5 id="withCompactionStrategy">withCompactionStrategy(compactionStrategy =
org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy)</h5>
<p>Property: <code class="highlighter-rouge">hoodie.compaction.strategy</code>
<br />
<span style="color:grey">Compaction strategy decides which file groups are
picked up for compaction during each compaction run. By default. Hudi picks the
log file with most accumulated unmerged data</span></p>
@@ -732,9 +732,9 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<h5
id="withTargetPartitionsPerDayBasedCompaction">withTargetPartitionsPerDayBasedCompaction(targetPartitionsPerCompaction
= 10)</h5>
<p>Property: <code
class="highlighter-rouge">hoodie.compaction.daybased.target</code> <br />
-<span style="color:grey">Used by
com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the
number of latest partitions to compact during a compaction run.</span></p>
+<span style="color:grey">Used by
org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the
number of latest partitions to compact during a compaction run.</span></p>
-<h5 id="payloadClassName">withPayloadClass(payloadClassName =
com.uber.hoodie.common.model.HoodieAvroPayload)</h5>
+<h5 id="payloadClassName">withPayloadClass(payloadClassName =
org.apache.hudi.common.model.HoodieAvroPayload)</h5>
<p>Property: <code
class="highlighter-rouge">hoodie.compaction.payload.class</code> <br />
<span style="color:grey">This needs to be same as class used during
insert/upserts. Just like writing, compaction also uses the record payload
class to merge records in the log against each other, merge again with the base
file and produce the final record to be written after compaction.</span></p>
diff --git a/content/contributing.html b/content/contributing.html
index 62d54ac..6a5090c 100644
--- a/content/contributing.html
+++ b/content/contributing.html
@@ -46,7 +46,7 @@
<script
src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
-<link rel="alternate" type="application/rss+xml" title=""
href="http://0.0.0.0:4000feed.xml">
+<link rel="alternate" type="application/rss+xml" title=""
href="http://localhost:4000feed.xml">
<script>
$(document).ready(function() {
@@ -425,14 +425,14 @@ Discussion about contributing code to Hudi happens on the
<a href="community.htm
<ul>
<li><code class="highlighter-rouge">docker</code> : Docker containers used
by demo and integration tests. Brings up a mini data ecosystem locally</li>
- <li><code class="highlighter-rouge">hoodie-cli</code> : CLI to inspect,
manage and administer datasets</li>
- <li><code class="highlighter-rouge">hoodie-client</code> : Spark client
library to take a bunch of inserts + updates and apply them to a Hoodie
table</li>
- <li><code class="highlighter-rouge">hoodie-common</code> : Common classes
used across modules</li>
- <li><code class="highlighter-rouge">hoodie-hadoop-mr</code> : InputFormat
implementations for ReadOptimized, Incremental, Realtime views</li>
- <li><code class="highlighter-rouge">hoodie-hive</code> : Manage hive tables
off Hudi datasets and houses the HiveSyncTool</li>
- <li><code class="highlighter-rouge">hoodie-integ-test</code> : Longer
running integration test processes</li>
- <li><code class="highlighter-rouge">hoodie-spark</code> : Spark datasource
for writing and reading Hudi datasets. Streaming sink.</li>
- <li><code class="highlighter-rouge">hoodie-utilities</code> : Houses tools
like DeltaStreamer, SnapshotCopier</li>
+ <li><code class="highlighter-rouge">hudi-cli</code> : CLI to inspect, manage
and administer datasets</li>
+ <li><code class="highlighter-rouge">hudi-client</code> : Spark client
library to take a bunch of inserts + updates and apply them to a Hoodie
table</li>
+ <li><code class="highlighter-rouge">hudi-common</code> : Common classes used
across modules</li>
+ <li><code class="highlighter-rouge">hudi-hadoop-mr</code> : InputFormat
implementations for ReadOptimized, Incremental, Realtime views</li>
+ <li><code class="highlighter-rouge">hudi-hive</code> : Manage hive tables
off Hudi datasets and houses the HiveSyncTool</li>
+ <li><code class="highlighter-rouge">hudi-integ-test</code> : Longer running
integration test processes</li>
+ <li><code class="highlighter-rouge">hudi-spark</code> : Spark datasource for
writing and reading Hudi datasets. Streaming sink.</li>
+ <li><code class="highlighter-rouge">hudi-utilities</code> : Houses tools
like DeltaStreamer, SnapshotCopier</li>
<li><code class="highlighter-rouge">packaging</code> : Poms for building out
bundles for easier drop in to Spark, Hive, Presto, Utilities</li>
<li><code class="highlighter-rouge">style</code> : Code formatting,
checkstyle files</li>
</ul>
diff --git a/content/docker_demo.html b/content/docker_demo.html
index 60999f4..b904d84 100644
--- a/content/docker_demo.html
+++ b/content/docker_demo.html
@@ -484,7 +484,7 @@ automatically initializes the datasets in the file-system
if they do not exist y
<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it
adhoc-2 /bin/bash
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
....
....
2018-09-24 22:20:00 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
@@ -493,7 +493,7 @@ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
....
2018-09-24 22:22:01 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
2018-09-24 22:22:01 INFO SparkContext:54 - Successfully stopped SparkContext
@@ -523,13 +523,13 @@ inorder to run Hive queries against those datasets.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it
adhoc-2 /bin/bash
# THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS
and sync the HDFS state to Hive
-/var/hoodie/ws/hoodie-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table
stock_ticks_cow
+/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table
stock_ticks_cow
.....
2018-09-24 22:22:45,568 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
.....
# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR
storage)
-/var/hoodie/ws/hoodie-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_mor --database default --table
stock_ticks_mor
+/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_mor --database default --table
stock_ticks_mor
...
2018-09-24 22:23:09,171 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
...
@@ -760,11 +760,11 @@ partitions, there is no need to run hive-sync</p>
docker exec -it adhoc-2 /bin/bash
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
exit
</code></pre>
@@ -990,11 +990,11 @@ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit
Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
-scala> import com.uber.hoodie.DataSourceReadOptions
-import com.uber.hoodie.DataSourceReadOptions
+scala> import org.apache.hudi.DataSourceReadOptions
+import org.apache.hudi.DataSourceReadOptions
# In the below query, 20180925045257 is the first commit's timestamp
-scala> val hoodieIncViewDF =
spark.read.format("com.uber.hoodie").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY,
"20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
+scala> val hoodieIncViewDF =
spark.read.format("org.apache.hudi").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY,
"20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
@@ -1019,20 +1019,20 @@ scala> spark.sql("select `_hoodie_commit_time`,
symbol, ts, volume, open, clo
Again, You can use Hudi CLI to manually schedule and run compaction</p>
<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it
adhoc-1 /bin/bash
-root@adhoc-1:/opt# /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
+root@adhoc-1:/opt# /var/hoodie/ws/hudi-cli/hudi-cli.sh
============================================
* *
-* _ _ _ _ *
-* | | | | | (_) *
-* | |__| | ___ ___ __| |_ ___ *
-* | __ |/ _ \ / _ \ / _` | |/ _ \ *
-* | | | | (_) | (_) | (_| | | __/ *
-* |_| |_|\___/ \___/ \__,_|_|\___| *
+* _ _ _ _ *
+* | | | | | | (_) *
+* | |__| | __| | - *
+* | __ || | / _` | || *
+* | | | || || (_| | || *
+* |_| |_|\___/ \____/ || *
* *
============================================
Welcome to Hoodie CLI. Please type help if you are looking for help.
-hoodie->connect --path /user/hive/warehouse/stock_ticks_mor
+hudi->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml,
hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root
(auth:SIMPLE)]]]
@@ -1224,20 +1224,20 @@ currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark
(v2.3.1) in docker images
<p>To bring down the containers
<code class="highlighter-rouge">
-$ cd hoodie-integ-test
+$ cd hudi-integ-test
$ mvn docker-compose:down
</code></p>
<p>If you want to bring up the docker containers, use
<code class="highlighter-rouge">
-$ cd hoodie-integ-test
+$ cd hudi-integ-test
$ mvn docker-compose:up -DdetachedMode=true
</code></p>
<p>Hudi is a library that is operated in a broader data analytics/ingestion
environment
involving Hadoop, Hive and Spark. Interoperability with all these systems is a
key objective for us. We are
-actively adding integration-tests under
<strong>hoodie-integ-test/src/test/java</strong> that makes use of this
-docker environment (See
<strong>hoodie-integ-test/src/test/java/com/uber/hoodie/integ/ITTestHoodieSanity.java</strong>
)</p>
+actively adding integration-tests under
<strong>hudi-integ-test/src/test/java</strong> that makes use of this
+docker environment (See
<strong>hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java</strong>
)</p>
<h4 id="building-local-docker-containers">Building Local Docker
Containers:</h4>
@@ -1265,27 +1265,27 @@ run the script
[INFO] Reactor Summary:
[INFO]
[INFO] hoodie ............................................. SUCCESS [ 1.709 s]
-[INFO] hoodie-common ...................................... SUCCESS [ 9.015 s]
-[INFO] hoodie-hadoop-mr ................................... SUCCESS [ 1.108 s]
-[INFO] hoodie-client ...................................... SUCCESS [ 4.409 s]
-[INFO] hoodie-hive ........................................ SUCCESS [ 0.976 s]
-[INFO] hoodie-spark ....................................... SUCCESS [ 26.522 s]
-[INFO] hoodie-utilities ................................... SUCCESS [ 16.256 s]
-[INFO] hoodie-cli ......................................... SUCCESS [ 11.341 s]
-[INFO] hoodie-hadoop-mr-bundle ............................ SUCCESS [ 1.893 s]
-[INFO] hoodie-hive-bundle ................................. SUCCESS [ 14.099 s]
-[INFO] hoodie-spark-bundle ................................ SUCCESS [ 58.252 s]
-[INFO] hoodie-hadoop-docker ............................... SUCCESS [ 0.612 s]
-[INFO] hoodie-hadoop-base-docker .......................... SUCCESS [04:04 min]
-[INFO] hoodie-hadoop-namenode-docker ...................... SUCCESS [ 6.142 s]
-[INFO] hoodie-hadoop-datanode-docker ...................... SUCCESS [ 7.763 s]
-[INFO] hoodie-hadoop-history-docker ....................... SUCCESS [ 5.922 s]
-[INFO] hoodie-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
-[INFO] hoodie-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
-[INFO] hoodie-hadoop-sparkmaster-docker ................... SUCCESS [ 2.964 s]
-[INFO] hoodie-hadoop-sparkworker-docker ................... SUCCESS [ 3.032 s]
-[INFO] hoodie-hadoop-sparkadhoc-docker .................... SUCCESS [ 2.764 s]
-[INFO] hoodie-integ-test .................................. SUCCESS [ 1.785 s]
+[INFO] hudi-common ...................................... SUCCESS [ 9.015 s]
+[INFO] hudi-hadoop-mr ................................... SUCCESS [ 1.108 s]
+[INFO] hudi-client ...................................... SUCCESS [ 4.409 s]
+[INFO] hudi-hive ........................................ SUCCESS [ 0.976 s]
+[INFO] hudi-spark ....................................... SUCCESS [ 26.522 s]
+[INFO] hudi-utilities ................................... SUCCESS [ 16.256 s]
+[INFO] hudi-cli ......................................... SUCCESS [ 11.341 s]
+[INFO] hudi-hadoop-mr-bundle ............................ SUCCESS [ 1.893 s]
+[INFO] hudi-hive-bundle ................................. SUCCESS [ 14.099 s]
+[INFO] hudi-spark-bundle ................................ SUCCESS [ 58.252 s]
+[INFO] hudi-hadoop-docker ............................... SUCCESS [ 0.612 s]
+[INFO] hudi-hadoop-base-docker .......................... SUCCESS [04:04 min]
+[INFO] hudi-hadoop-namenode-docker ...................... SUCCESS [ 6.142 s]
+[INFO] hudi-hadoop-datanode-docker ...................... SUCCESS [ 7.763 s]
+[INFO] hudi-hadoop-history-docker ....................... SUCCESS [ 5.922 s]
+[INFO] hudi-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
+[INFO] hudi-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
+[INFO] hudi-hadoop-sparkmaster-docker ................... SUCCESS [ 2.964 s]
+[INFO] hudi-hadoop-sparkworker-docker ................... SUCCESS [ 3.032 s]
+[INFO] hudi-hadoop-sparkadhoc-docker .................... SUCCESS [ 2.764 s]
+[INFO] hudi-integ-test .................................. SUCCESS [ 1.785 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
diff --git a/content/feed.xml b/content/feed.xml
index eb343ae..3ff5547 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
<description>Apache Hudi (pronounced “Hoodie”) provides upserts and
incremental processing capaibilities on Big Data</description>
<link>http://localhost:4000/</link>
<atom:link href="http://localhost:4000/feed.xml" rel="self"
type="application/rss+xml"/>
- <pubDate>Fri, 02 Aug 2019 05:38:42 -0700</pubDate>
- <lastBuildDate>Fri, 02 Aug 2019 05:38:42 -0700</lastBuildDate>
+ <pubDate>Sun, 11 Aug 2019 17:25:16 -0700</pubDate>
+ <lastBuildDate>Sun, 11 Aug 2019 17:25:16 -0700</lastBuildDate>
<generator>Jekyll v3.3.1</generator>
<item>
diff --git a/content/gcs_hoodie.html b/content/gcs_hoodie.html
index e5a3cd4..0902ab1 100644
--- a/content/gcs_hoodie.html
+++ b/content/gcs_hoodie.html
@@ -350,7 +350,7 @@
<div class="language-xml highlighter-rouge"><pre class="highlight"><code>
<span class="nt">&lt;property&gt;</span>
<span class="nt">&lt;name&gt;</span>fs.defaultFS<span
class="nt">&lt;/name&gt;</span>
- <span class="nt">&lt;value&gt;</span>gs://hoodie-bucket<span
class="nt">&lt;/value&gt;</span>
+ <span class="nt">&lt;value&gt;</span>gs://hudi-bucket<span
class="nt">&lt;/value&gt;</span>
<span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;property&gt;</span>
diff --git a/content/migration_guide.html b/content/migration_guide.html
index c4172dd..7cb85db 100644
--- a/content/migration_guide.html
+++ b/content/migration_guide.html
@@ -367,7 +367,7 @@ This tool essentially starts a Spark Job to read the
existing parquet dataset an
<h4 id="option-2">Option 2</h4>
<p>For huge datasets, this could be as simple as : for partition in [list of
partitions in source dataset] {
val inputDF =
spark.read.format(“any_input_format”).load(“partition_path”)
- inputDF.write.format(“com.uber.hoodie”).option()….save(“basePath”)
+ inputDF.write.format(“org.apache.hudi”).option()….save(“basePath”)
}</p>
<h4 id="option-3">Option 3</h4>
@@ -375,9 +375,9 @@ This tool essentially starts a Spark Job to read the
existing parquet dataset an
<a href="quickstart.html">here</a>.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Using the
HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install
-DskipTests`, the shell can be
-fired by via `cd hoodie-cli && ./hoodie-cli.sh`.
+fired by via `cd hudi-cli && ./hudi-cli.sh`.
-hoodie->hdfsparquetimport
+hudi->hdfsparquetimport
--upsert false
--srcPath /user/parquet/dataset/basepath
--targetPath
diff --git a/content/querying_data.html b/content/querying_data.html
index 0078bbd..0e1b9f1 100644
--- a/content/querying_data.html
+++ b/content/querying_data.html
@@ -355,13 +355,13 @@ with special configurations that indicates to query
planning that only increment
<h2 id="hive">Hive</h2>
-<p>In order for Hive to recognize Hudi datasets and query correctly, the
HiveServer2 needs to be provided with the <code
class="highlighter-rouge">hoodie-hadoop-hive-bundle-x.y.z-SNAPSHOT.jar</code>
+<p>In order for Hive to recognize Hudi datasets and query correctly, the
HiveServer2 needs to be provided with the <code
class="highlighter-rouge">hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar</code>
in its <a
href="https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr">aux
jars path</a>. This will ensure the input format
classes with its dependencies are available for query planning &
execution.</p>
<h3 id="hive-ro-view">Read Optimized table</h3>
<p>In addition to setup above, for beeline cli access, the <code
class="highlighter-rouge">hive.input.format</code> variable needs to be set to
the fully qualified path name of the
-inputformat <code
class="highlighter-rouge">com.uber.hoodie.hadoop.HoodieInputFormat</code>. For
Tez, additionally the <code
class="highlighter-rouge">hive.tez.input.format</code> needs to be set
+inputformat <code
class="highlighter-rouge">org.apache.hudi.hadoop.HoodieInputFormat</code>. For
Tez, additionally the <code
class="highlighter-rouge">hive.tez.input.format</code> needs to be set
to <code
class="highlighter-rouge">org.apache.hadoop.hive.ql.io.HiveInputFormat</code></p>
<h3 id="hive-rt-view">Real time table</h3>
@@ -478,20 +478,20 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
<li><strong>Read as Hive tables</strong> : Supports all three views,
including the real time view, relying on the custom Hudi input formats again
like Hive.</li>
</ul>
-<p>In general, your spark job needs a dependency to <code
class="highlighter-rouge">hoodie-spark</code> or <code
class="highlighter-rouge">hoodie-spark-bundle-x.y.z.jar</code> needs to be on
the class path of driver & executors (hint: use <code
class="highlighter-rouge">--jars</code> argument)</p>
+<p>In general, your spark job needs a dependency to <code
class="highlighter-rouge">hudi-spark</code> or <code
class="highlighter-rouge">hudi-spark-bundle-x.y.z.jar</code> needs to be on the
class path of driver & executors (hint: use <code
class="highlighter-rouge">--jars</code> argument)</p>
<h3 id="spark-ro-view">Read Optimized table</h3>
<p>To read RO table as a Hive table using SparkSQL, simply push a path filter
into sparkContext as follows.
This method retains Spark built-in optimizations for reading Parquet files
like vectorized reading on Hudi tables.</p>
-<div class="highlighter-rouge"><pre
class="highlight"><code>spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
+<div class="highlighter-rouge"><pre
class="highlight"><code>spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
</code></pre>
</div>
<p>If you prefer to glob paths on DFS via the datasource, you can simply do
something like below to get a Spark dataframe to work with.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>Dataset<Row>
hoodieROViewDF = spark.read().format("com.uber.hoodie")
+<div class="highlighter-rouge"><pre class="highlight"><code>Dataset<Row>
hoodieROViewDF = spark.read().format("org.apache.hudi")
// pass any path glob, can include hudi & non-hudi datasets
.load("/glob/path/pattern");
</code></pre>
@@ -501,18 +501,18 @@ This method retains Spark built-in optimizations for reading Parquet files like
<p>Currently, real time table can only be queried as a Hive table in Spark. In
order to do this, set <code
class="highlighter-rouge">spark.sql.hive.convertMetastoreParquet=false</code>,
forcing Spark to fallback
to using the Hive Serde to read the data (planning/executions is still
Spark).</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>$ spark-shell
--jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path
/etc/hive/conf --packages com.databricks:spark-avro_2.11:4.0.0 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
+<div class="highlighter-rouge"><pre class="highlight"><code>$ spark-shell
--jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf
--packages com.databricks:spark-avro_2.11:4.0.0 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
scala> sqlContext.sql("select count(*) from hudi_rt where datestr =
'2016-10-02'").show()
</code></pre>
</div>
<h3 id="spark-incr-pull">Incremental Pulling</h3>
-<p>The <code class="highlighter-rouge">hoodie-spark</code> module offers the
DataSource API, a more elegant way to pull data from Hudi dataset and process
it via Spark.
+<p>The <code class="highlighter-rouge">hudi-spark</code> module offers the
DataSource API, a more elegant way to pull data from Hudi dataset and process
it via Spark.
A sample incremental pull, that will obtain all records written since <code
class="highlighter-rouge">beginInstantTime</code>, looks like below.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>
Dataset<Row> hoodieIncViewDF = spark.read()
- .format("com.uber.hoodie")
+ .format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
@@ -549,7 +549,7 @@ A sample incremental pull, that will obtain all records written since <code clas
<h2 id="presto">Presto</h2>
<p>Presto is a popular query engine, providing interactive query performance.
Hudi RO tables can be queries seamlessly in Presto.
-This requires the <code class="highlighter-rouge">hoodie-presto-bundle</code>
jar to be placed into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>,
across the installation.</p>
+This requires the <code class="highlighter-rouge">hudi-presto-bundle</code>
jar to be placed into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>,
across the installation.</p>
<div class="tags">
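The hunks above follow one pattern: the Java package prefix changes from com.uber.hoodie to org.apache.hudi, and module or bundle names change from hoodie-* to hudi-*. Class names such as HoodieInputFormat keep their Hoodie prefix. A minimal sketch of that rewrite, for updating your own scripts or configs, might look like the following. The class name HudiRenameSketch and the module list (taken from the names that appear in this diff) are assumptions for illustration; this is not a tool shipped with Hudi.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hudi: rewrites old names in a line of text.
final class HudiRenameSketch {
    // Old and new Java package prefixes, as they appear in this commit.
    private static final String OLD_PACKAGE = "com.uber.hoodie";
    private static final String NEW_PACKAGE = "org.apache.hudi";

    // Module and bundle name stems seen in this diff (an assumed, possibly incomplete list).
    private static final Pattern OLD_MODULE = Pattern.compile(
        "\\bhoodie-(cli|client|common|hadoop-mr|hive|integ-test|presto|spark|utilities)\\b");

    // Replaces the package prefix literally, then renames hoodie-<module> to hudi-<module>.
    static String rewrite(String text) {
        String renamedPackages = text.replace(OLD_PACKAGE, NEW_PACKAGE);
        return OLD_MODULE.matcher(renamedPackages).replaceAll("hudi-$1");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("com.uber.hoodie.hadoop.HoodieInputFormat"));
        System.out.println(rewrite("--jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar"));
        // Property keys such as hoodie.datasource.* are deliberately left alone.
        System.out.println(rewrite("hoodie.datasource.write.payload.class"));
    }
}
```

The sketch only touches the package prefix and the listed module stems, so configuration keys beginning with "hoodie." and the HOODIE_ENV_ prefix, which this commit leaves unchanged, pass through untouched.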
diff --git a/content/quickstart.html b/content/quickstart.html
index 453ba6f..ccbcbdd 100644
--- a/content/quickstart.html
+++ b/content/quickstart.html
@@ -341,7 +341,7 @@ refer to <a href="migration_guide.html">migration guide</a>.</p>
<h2 id="download-hudi">Download Hudi</h2>
-<p>Check out <a href="https://github.com/apache/incubator-hudi">code</a> or
download <a
href="https://github.com/apache/incubator-hudi/archive/hoodie-0.4.5.zip">latest
release</a>
+<p>Check out <a href="https://github.com/apache/incubator-hudi">code</a> or
download <a
href="https://github.com/apache/incubator-hudi/archive/hudi-0.4.5.zip">latest
release</a>
and normally build the maven project, from command line</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ mvn clean
install -DskipTests -DskipITs
@@ -416,10 +416,10 @@ export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$P
<h3 id="run-hoodiejavaapp">Run HoodieJavaApp</h3>
-<p>Run <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong> class,
to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates
to previously inserted 100 records) onto your DFS/local filesystem. Use the
wrapper script
+<p>Run <strong>hudi-spark/src/test/java/HoodieJavaApp.java</strong> class, to
place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates to
previously inserted 100 records) onto your DFS/local filesystem. Use the
wrapper script
to run from command-line</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>cd hoodie-spark
+<div class="highlighter-rouge"><pre class="highlight"><code>cd hudi-spark
./run_hoodie_app.sh --help
Usage: <main class> [options]
Options:
@@ -437,7 +437,7 @@ Usage: <main class> [options]
</code></pre>
</div>
-<p>The class lets you choose table names, output paths and one of the storage
types. In your own applications, be sure to include the <code
class="highlighter-rouge">hoodie-spark</code> module as dependency
+<p>The class lets you choose table names, output paths and one of the storage
types. In your own applications, be sure to include the <code
class="highlighter-rouge">hudi-spark</code> module as dependency
and follow a similar pattern to write/read datasets via the datasource.</p>
<h2 id="query-a-hudi-dataset">Query a Hudi dataset</h2>
@@ -454,7 +454,7 @@ bin/hiveserver2 \
--hiveconf hive.root.logger=INFO,console \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false \
- --hiveconf
hive.aux.jars.path=/path/to/packaging/hoodie-hive-bundle/target/hoodie-hive-bundle-0.4.6-SNAPSHOT.jar
+ --hiveconf
hive.aux.jars.path=/path/to/packaging/hudi-hive-bundle/target/hudi-hive-bundle-0.4.6-SNAPSHOT.jar
</code></pre>
</div>
@@ -464,7 +464,7 @@ bin/hiveserver2 \
It uses an incremental approach by storing the last commit time synced in the
TBLPROPERTIES and only syncing the commits from the last sync commit time
stored.
Both <a href="writing_data.html#datasource-writer">Spark Datasource</a> &
<a href="writing_data.html#deltastreamer">DeltaStreamer</a> have capability to
do this, after each write.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>cd hoodie-hive
+<div class="highlighter-rouge"><pre class="highlight"><code>cd hudi-hive
./run_sync_tool.sh
--user hive
--pass hive
@@ -485,7 +485,7 @@ follow <a href="https://cwiki.apache.org/confluence/display/HUDI/Registering+sam
<div class="highlighter-rouge"><pre class="highlight"><code>hive> set
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> set hive.stats.autogather=false;
-hive> add jar file:///path/to/hoodie-hive-bundle-0.4.6-SNAPSHOT.jar;
+hive> add jar file:///path/to/hudi-hive-bundle-0.4.6-SNAPSHOT.jar;
hive> select count(*) from hoodie_test;
...
OK
@@ -500,7 +500,7 @@ hive>
<p>Spark is super easy, once you get Hive working as above. Just spin up a
Spark Shell as below</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ cd $SPARK_INSTALL
-$ spark-shell --jars
$HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.6-SNAPSHOT.jar
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --packages
com.databricks:spark-avro_2.11:4.0.0
+$ spark-shell --jars
$HUDI_SRC/packaging/hudi-spark-bundle/target/hudi-spark-bundle-0.4.6-SNAPSHOT.jar
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --packages
com.databricks:spark-avro_2.11:4.0.0
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.sql("show tables").show(10000)
@@ -515,7 +515,7 @@ scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
<p>Checkout the ‘master’ branch on OSS Presto, build it, and place your
installation somewhere.</p>
<ul>
- <li>Copy the
hudi/packaging/hoodie-presto-bundle/target/hoodie-presto-bundle-*.jar into
$PRESTO_INSTALL/plugin/hive-hadoop2/</li>
+ <li>Copy the
hudi/packaging/hudi-presto-bundle/target/hudi-presto-bundle-*.jar into
$PRESTO_INSTALL/plugin/hive-hadoop2/</li>
<li>Startup your server and you should be able to query the same Hive table
via Presto</li>
</ul>
diff --git a/content/s3_hoodie.html b/content/s3_hoodie.html
index b01fea7..1a36a0c 100644
--- a/content/s3_hoodie.html
+++ b/content/s3_hoodie.html
@@ -382,7 +382,7 @@
</code></pre>
</div>
-<p>Utilities such as hoodie-cli or deltastreamer tool, can pick up s3 creds
via environmental variable prefixed with <code
class="highlighter-rouge">HOODIE_ENV_</code>. For e.g below is a bash snippet
to setup
+<p>Utilities such as hudi-cli or deltastreamer tool, can pick up s3 creds via
environmental variable prefixed with <code
class="highlighter-rouge">HOODIE_ENV_</code>. For e.g below is a bash snippet
to setup
such variables and then have cli be able to work on datasets stored in s3</p>
<div class="highlighter-rouge"><pre class="highlight"><code>export
HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
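The variable name in the snippet above suggests a simple mapping: drop the HOODIE_ENV_ prefix and turn each _DOT_ into a dot, so that HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key becomes the Hadoop property fs.s3a.access.key. A minimal sketch of that mapping follows, assuming exactly this rule. The class name HoodieEnvSketch is hypothetical and this is not Hudi's own code; it only illustrates how such variables can be translated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Hudi's implementation.
final class HoodieEnvSketch {
    private static final String PREFIX = "HOODIE_ENV_";

    // Keeps only variables carrying the prefix, strips it, and maps _DOT_ to '.'.
    static Map<String, String> toHadoopProps(Map<String, String> env) {
        Map<String, String> props = new TreeMap<>();
        for (Map.Entry<String, String> entry : env.entrySet()) {
            if (entry.getKey().startsWith(PREFIX)) {
                String key = entry.getKey().substring(PREFIX.length()).replace("_DOT_", ".");
                props.put(key, entry.getValue());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Sample values for a deterministic run; pass System.getenv() to read the real environment.
        Map<String, String> env = new HashMap<>();
        env.put("HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key", "example-access-key");
        env.put("PATH", "/usr/bin");
        System.out.println(toHadoopProps(env));
    }
}
```

Variables without the prefix, such as PATH in the sample, are ignored, which keeps unrelated environment settings out of the resulting property map.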
diff --git a/content/writing_data.html b/content/writing_data.html
index 23f7e65..70c8ac9 100644
--- a/content/writing_data.html
+++ b/content/writing_data.html
@@ -355,7 +355,7 @@ can be chosen/changed across each commit/deltacommit issued against the dataset.
<h2 id="deltastreamer">DeltaStreamer</h2>
-<p>The <code class="highlighter-rouge">HoodieDeltaStreamer</code> utility
(part of hoodie-utilities-bundle) provides the way to ingest from different
sources such as DFS or Kafka, with the following capabilities.</p>
+<p>The <code class="highlighter-rouge">HoodieDeltaStreamer</code> utility
(part of hudi-utilities-bundle) provides the way to ingest from different
sources such as DFS or Kafka, with the following capabilities.</p>
<ul>
<li>Exactly once ingestion of new events from Kafka, <a
href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports">incremental
imports</a> from Sqoop or output of <code
class="highlighter-rouge">HiveIncrementalPuller</code> or files under a DFS
folder</li>
@@ -367,7 +367,7 @@ can be chosen/changed across each commit/deltacommit issued against the dataset.
<p>Command line options describe capabilities in more detail</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` --help
+<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--commit-on-errors
@@ -381,7 +381,7 @@ Usage: <main class> [options]
insert/bulk-insert
Default: false
--help, -h
- --hoodie-conf
+ --hudi-conf
Any configuration that can be set in the properties file (using the
CLI
parameter "--propsFilePath") can also be passed command line using
this
parameter
@@ -395,7 +395,7 @@ Usage: <main class> [options]
subclass of HoodieRecordPayload, that works off a GenericRecord.
Implement your own, if you want to do something other than overwriting
existing value
- Default: com.uber.hoodie.OverwriteWithLatestAvroPayload
+ Default: org.apache.hudi.OverwriteWithLatestAvroPayload
--props
path to properties file on localfs or dfs, with configurations for
Hudi client, schema provider, key generator and data source. For
@@ -404,15 +404,15 @@ Usage: <main class> [options]
sources, referto individual classes, for supported properties.
Default:
file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
--schemaprovider-class
- subclass of com.uber.hoodie.utilities.schema.SchemaProvider to attach
+ subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
schemas to input & target table data, built in options:
FilebasedSchemaProvider
- Default: com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+ Default: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--source-class
- Subclass of com.uber.hoodie.utilities.sources to read data. Built-in
- options: com.uber.hoodie.utilities.sources.{JsonDFSSource (default),
+ Subclass of org.apache.hudi.utilities.sources to read data. Built-in
+ options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
AvroDFSSource, JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}
- Default: com.uber.hoodie.utilities.sources.JsonDFSSource
+ Default: org.apache.hudi.utilities.sources.JsonDFSSource
--source-limit
Maximum amount of data to read from source. Default: No limit For e.g:
DFSSource => max bytes to read, KafkaSource => max events to read
@@ -431,16 +431,16 @@ Usage: <main class> [options]
* --target-table
name of the target table in Hive
--transformer-class
- subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to
+ subclass of org.apache.hudi.utilities.transform.Transformer. UDF to
transform raw source dataset to a target dataset (conforming to target
schema) before writing. Default : Not set. E:g -
- com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
+ org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
allows a SQL query template to be passed as a transformation function)
</code></pre>
</div>
<p>The tool takes a hierarchically composed property file and has pluggable
interfaces for extracting data, key generation and providing schema. Sample
configs for ingesting from kafka and dfs are
-provided under <code
class="highlighter-rouge">hoodie-utilities/src/test/resources/delta-streamer-config</code>.</p>
+provided under <code
class="highlighter-rouge">hudi-utilities/src/test/resources/delta-streamer-config</code>.</p>
<p>For e.g: once you have Confluent Kafka, Schema registry up & running,
produce some test data using (<a
href="https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html">impressions.avro</a>
provided by schema-registry repo)</p>
@@ -450,12 +450,12 @@ provided under <code class="highlighter-rouge">hoodie-utilities/src/test/resourc
<p>and then ingest it as follows.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` \
- --props
file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
- --schemaprovider-class
com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
- --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
+<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+ --props
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
+ --schemaprovider-class
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field impresssiontime \
- --target-base-path file:///tmp/hoodie-deltastreamer-op --target-table
uber.impressions \
+ --target-base-path file:///tmp/hudi-deltastreamer-op --target-table
uber.impressions \
--op BULK_INSERT
</code></pre>
</div>
@@ -464,12 +464,12 @@ provided under <code class="highlighter-rouge">hoodie-utilities/src/test/resourc
<h2 id="datasource-writer">Datasource Writer</h2>
-<p>The <code class="highlighter-rouge">hoodie-spark</code> module offers the
DataSource API to write (and also read) any data frame into a Hudi dataset.
+<p>The <code class="highlighter-rouge">hudi-spark</code> module offers the
DataSource API to write (and also read) any data frame into a Hudi dataset.
Following is how we can upsert a dataframe, while specifying the field names
that need to be used
for <code class="highlighter-rouge">recordKey => _row_key</code>, <code
class="highlighter-rouge">partitionPath => partition</code> and <code
class="highlighter-rouge">precombineKey => timestamp</code></p>
<div class="highlighter-rouge"><pre class="highlight"><code>inputDF.write()
- .format("com.uber.hoodie")
+ .format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as
well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
"partition")
@@ -484,11 +484,11 @@ for <code class="highlighter-rouge">recordKey => _row_key</code>, <code class
<p>Both tools above support syncing of the dataset’s latest schema to Hive
metastore, such that queries can pick up new columns and partitions.
In case, its preferable to run this from commandline or in an independent jvm,
Hudi provides a <code class="highlighter-rouge">HiveSyncTool</code>, which can
be invoked as below,
-once you have built the hoodie-hive module.</p>
+once you have built the hudi-hive module.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>cd hoodie-hive
+<div class="highlighter-rouge"><pre class="highlight"><code>cd hudi-hive
./run_sync_tool.sh
- [hoodie-hive]$ ./run_sync_tool.sh --help
+ [hudi-hive]$ ./run_sync_tool.sh --help
Usage: <main class> [options]
Options:
* --base-path
@@ -516,14 +516,14 @@ Usage: <main class> [options]
<li><strong>Soft Deletes</strong> : With soft deletes, user wants to retain
the key but just null out the values for all other fields.
This can be simply achieved by ensuring the appropriate fields are nullable
in the dataset schema and simply upserting the dataset after setting these
fields to null.</li>
<li><strong>Hard Deletes</strong> : A stronger form of delete is to
physically remove any trace of the record from the dataset. This can be
achieved by issuing an upsert with a custom payload implementation
- via either DataSource or DeltaStreamer which always returns Optional.Empty as
the combined value. Hudi ships with a built-in <code
class="highlighter-rouge">com.uber.hoodie.EmptyHoodieRecordPayload</code> class
that does exactly this.</li>
+ via either DataSource or DeltaStreamer which always returns Optional.Empty as
the combined value. Hudi ships with a built-in <code
class="highlighter-rouge">org.apache.hudi.EmptyHoodieRecordPayload</code> class
that does exactly this.</li>
</ul>
<div class="highlighter-rouge"><pre class="highlight"><code> deleteDF //
dataframe containing just records to be deleted
- .write().format("com.uber.hoodie")
+ .write().format("org.apache.hudi")
.option(...) // Add HUDI options like record-key, partition-path and others
as needed for your setup
// specify record_key, partition_key, precombine_fieldkey & usual params
- .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
"com.uber.hoodie.EmptyHoodieRecordPayload")
+ .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
"org.apache.hudi.EmptyHoodieRecordPayload")
</code></pre>
</div>
diff --git a/docs/admin_guide.md b/docs/admin_guide.md
index eeae8bf..5a1bdfe 100644
--- a/docs/admin_guide.md
+++ b/docs/admin_guide.md
@@ -17,7 +17,7 @@ This section provides a glimpse into each of these, with some general guidance o
## Admin CLI {#admin-cli}
-Once hudi has been built, the shell can be fired by via `cd hoodie-cli &&
./hoodie-cli.sh`.
+Once hudi has been built, the shell can be fired by via `cd hudi-cli &&
./hudi-cli.sh`.
A hudi dataset resides on DFS, in a location referred to as the **basePath**
and we would need this location in order to connect to a Hudi dataset.
Hudi library effectively manages this dataset internally, using .hoodie
subfolder to track all metadata
@@ -27,17 +27,17 @@ To initialize a hudi table, use the following command.
18/09/06 15:56:52 INFO annotation.AutowiredAnnotationBeanPostProcessor:
JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
============================================
* *
-* _ _ _ _ *
-* | | | | | (_) *
-* | |__| | ___ ___ __| |_ ___ *
-* | __ |/ _ \ / _ \ / _` | |/ _ \ *
-* | | | | (_) | (_) | (_| | | __/ *
-* |_| |_|\___/ \___/ \__,_|_|\___| *
+* _ _ _ _ *
+* | | | | | | (_) *
+* | |__| | __| | - *
+* | __ || | / _` | || *
+* | | | || || (_| | || *
+* |_| |_|\___/ \____/ || *
* *
============================================
Welcome to Hoodie CLI. Please type help if you are looking for help.
-hoodie->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1
--tableType COPY_ON_WRITE
+hudi->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1
--tableType COPY_ON_WRITE
.....
18/09/06 15:57:15 INFO table.HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from ...
```
diff --git a/docs/configurations.md b/docs/configurations.md
index 9580aa3..997f553 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -41,7 +41,7 @@ Additionally, you can pass down any of the WriteClient level configs directly us
```
inputDF.write()
-.format("com.uber.hoodie")
+.format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
@@ -72,7 +72,7 @@ Options useful for writing datasets via `write.format.option(...)`
we will pick the one with the largest value for the precombine field,
determined by Object.compareTo(..)</span>
##### PAYLOAD_CLASS_OPT_KEY {#PAYLOAD_CLASS_OPT_KEY}
- Property: `hoodie.datasource.write.payload.class`, Default:
`com.uber.hoodie.OverwriteWithLatestAvroPayload` <br/>
+ Property: `hoodie.datasource.write.payload.class`, Default:
`org.apache.hudi.OverwriteWithLatestAvroPayload` <br/>
<span style="color:grey">Payload class used. Override this, if you like to
roll your own merge logic, when upserting/inserting.
This will render any value set for `PRECOMBINE_FIELD_OPT_VAL`
in-effective</span>
@@ -88,7 +88,7 @@ the dot notation eg: `a.b.c`</span>
Actual value ontained by invoking .toString()</span>
##### KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY}
- Property: `hoodie.datasource.write.keygenerator.class`, Default:
`com.uber.hoodie.SimpleKeyGenerator` <br/>
+ Property: `hoodie.datasource.write.keygenerator.class`, Default:
`org.apache.hudi.SimpleKeyGenerator` <br/>
<span style="color:grey">Key generator class, that implements will extract
the key out of incoming `Row` object</span>
##### COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
@@ -129,7 +129,7 @@ This is useful to store checkpointing information, in a consistent way with the
<span style="color:grey">field in the dataset to use for determining hive
partition columns.</span>
##### HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
{#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY}
- Property: `hoodie.datasource.hive_sync.partition_extractor_class`, Default:
`com.uber.hoodie.hive.SlashEncodedDayPartitionValueExtractor` <br/>
+ Property: `hoodie.datasource.hive_sync.partition_extractor_class`, Default:
`org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor` <br/>
<span style="color:grey">Class used to extract partition field values into
hive partition columns.</span>
##### HIVE_ASSUME_DATE_PARTITION_OPT_KEY {#HIVE_ASSUME_DATE_PARTITION_OPT_KEY}
@@ -374,7 +374,7 @@ Property: `hoodie.compaction.reverse.log.read` <br/>
Property: `hoodie.cleaner.parallelism` <br/>
<span style="color:grey">Increase this if cleaning becomes slow.</span>
-##### withCompactionStrategy(compactionStrategy =
com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy)
{#withCompactionStrategy}
+##### withCompactionStrategy(compactionStrategy =
org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy)
{#withCompactionStrategy}
Property: `hoodie.compaction.strategy` <br/>
<span style="color:grey">Compaction strategy decides which file groups are
picked up for compaction during each compaction run. By default. Hudi picks the
log file with most accumulated unmerged data</span>
@@ -384,9 +384,9 @@ Property: `hoodie.compaction.target.io` <br/>
##### withTargetPartitionsPerDayBasedCompaction(targetPartitionsPerCompaction
= 10) {#withTargetPartitionsPerDayBasedCompaction}
Property: `hoodie.compaction.daybased.target` <br/>
-<span style="color:grey">Used by
com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the
number of latest partitions to compact during a compaction run.</span>
+<span style="color:grey">Used by
org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the
number of latest partitions to compact during a compaction run.</span>
-##### withPayloadClass(payloadClassName =
com.uber.hoodie.common.model.HoodieAvroPayload) {#payloadClassName}
+##### withPayloadClass(payloadClassName =
org.apache.hudi.common.model.HoodieAvroPayload) {#payloadClassName}
Property: `hoodie.compaction.payload.class` <br/>
<span style="color:grey">This needs to be same as class used during
insert/upserts. Just like writing, compaction also uses the record payload
class to merge records in the log against each other, merge again with the base
file and produce the final record to be written after compaction.</span>
diff --git a/docs/contributing.md b/docs/contributing.md
index 02c6375..f79ef03 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -86,14 +86,14 @@ Discussion about contributing code to Hudi happens on the [dev@ mailing list](co
## Code & Project Structure
* `docker` : Docker containers used by demo and integration tests. Brings up
a mini data ecosystem locally
- * `hoodie-cli` : CLI to inspect, manage and administer datasets
- * `hoodie-client` : Spark client library to take a bunch of inserts +
updates and apply them to a Hoodie table
- * `hoodie-common` : Common classes used across modules
- * `hoodie-hadoop-mr` : InputFormat implementations for ReadOptimized,
Incremental, Realtime views
- * `hoodie-hive` : Manage hive tables off Hudi datasets and houses the
HiveSyncTool
- * `hoodie-integ-test` : Longer running integration test processes
- * `hoodie-spark` : Spark datasource for writing and reading Hudi datasets.
Streaming sink.
- * `hoodie-utilities` : Houses tools like DeltaStreamer, SnapshotCopier
+ * `hudi-cli` : CLI to inspect, manage and administer datasets
+ * `hudi-client` : Spark client library to take a bunch of inserts + updates
and apply them to a Hoodie table
+ * `hudi-common` : Common classes used across modules
+ * `hudi-hadoop-mr` : InputFormat implementations for ReadOptimized,
Incremental, Realtime views
+ * `hudi-hive` : Manage hive tables off Hudi datasets and houses the
HiveSyncTool
+ * `hudi-integ-test` : Longer running integration test processes
+ * `hudi-spark` : Spark datasource for writing and reading Hudi datasets.
Streaming sink.
+ * `hudi-utilities` : Houses tools like DeltaStreamer, SnapshotCopier
* `packaging` : Poms for building out bundles for easier drop in to Spark,
Hive, Presto, Utilities
* `style` : Code formatting, checkstyle files
diff --git a/docs/docker_demo.md b/docs/docker_demo.md
index 89f7e7f..d363a89 100644
--- a/docs/docker_demo.md
+++ b/docs/docker_demo.md
@@ -163,7 +163,7 @@ automatically initializes the datasets in the file-system if they do not exist y
docker exec -it adhoc-2 /bin/bash
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
....
....
2018-09-24 22:20:00 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
@@ -172,7 +172,7 @@ spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
....
2018-09-24 22:22:01 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
2018-09-24 22:22:01 INFO SparkContext:54 - Successfully stopped SparkContext
@@ -203,13 +203,13 @@ inorder to run Hive queries against those datasets.
docker exec -it adhoc-2 /bin/bash
# THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS
and sync the HDFS state to Hive
-/var/hoodie/ws/hoodie-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table
stock_ticks_cow
+/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table
stock_ticks_cow
.....
2018-09-24 22:22:45,568 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
.....
# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR
storage)
-/var/hoodie/ws/hoodie-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_mor --database default --table
stock_ticks_mor
+/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url
jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt
--base-path /user/hive/warehouse/stock_ticks_mor --database default --table
stock_ticks_mor
...
2018-09-24 22:23:09,171 INFO [main] hive.HiveSyncTool
(HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
...
@@ -440,11 +440,11 @@ cat docker/demo/data/batch_2.json | kafkacat -b
kafkabroker -t stock_ticks -P
docker exec -it adhoc-2 /bin/bash
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_mor dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
--disable-compaction
+spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_mor --target-table
stock_ticks_mor --props /var/demo/config/kafka-source.properties
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--disable-compaction
exit
```
@@ -670,11 +670,11 @@ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit
Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
-scala> import com.uber.hoodie.DataSourceReadOptions
-import com.uber.hoodie.DataSourceReadOptions
+scala> import org.apache.hudi.DataSourceReadOptions
+import org.apache.hudi.DataSourceReadOptions
# In the below query, 20180925045257 is the first commit's timestamp
-scala> val hoodieIncViewDF =
spark.read.format("com.uber.hoodie").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY,
"20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
+scala> val hoodieIncViewDF =
spark.read.format("org.apache.hudi").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY,
"20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
@@ -700,20 +700,20 @@ Again, You can use Hudi CLI to manually schedule and run
compaction
```
docker exec -it adhoc-1 /bin/bash
-root@adhoc-1:/opt# /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
+root@adhoc-1:/opt# /var/hoodie/ws/hudi-cli/hudi-cli.sh
============================================
* *
-* _ _ _ _ *
-* | | | | | (_) *
-* | |__| | ___ ___ __| |_ ___ *
-* | __ |/ _ \ / _ \ / _` | |/ _ \ *
-* | | | | (_) | (_) | (_| | | __/ *
-* |_| |_|\___/ \___/ \__,_|_|\___| *
+* _ _ _ _ *
+* | | | | | | (_) *
+* | |__| | __| | - *
+* | __ || | / _` | || *
+* | | | || || (_| | || *
+* |_| |_|\___/ \____/ || *
* *
============================================
Welcome to Hoodie CLI. Please type help if you are looking for help.
-hoodie->connect --path /user/hive/warehouse/stock_ticks_mor
+hudi->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml,
mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml,
hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root
(auth:SIMPLE)]]]
@@ -905,20 +905,20 @@ currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark
(v2.3.1) in docker images
To bring down the containers
```
-$ cd hoodie-integ-test
+$ cd hudi-integ-test
$ mvn docker-compose:down
```
If you want to bring up the docker containers, use
```
-$ cd hoodie-integ-test
+$ cd hudi-integ-test
$ mvn docker-compose:up -DdetachedMode=true
```
Hudi is a library that is operated in a broader data analytics/ingestion
environment
involving Hadoop, Hive and Spark. Interoperability with all these systems is a
key objective for us. We are
-actively adding integration-tests under __hoodie-integ-test/src/test/java__
that makes use of this
-docker environment (See
__hoodie-integ-test/src/test/java/com/uber/hoodie/integ/ITTestHoodieSanity.java__
)
+actively adding integration-tests under __hudi-integ-test/src/test/java__ that
makes use of this
+docker environment (See
__hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java__
)
#### Building Local Docker Containers:
@@ -946,27 +946,27 @@ cd docker
[INFO] Reactor Summary:
[INFO]
[INFO] hoodie ............................................. SUCCESS [ 1.709 s]
-[INFO] hoodie-common ...................................... SUCCESS [ 9.015 s]
-[INFO] hoodie-hadoop-mr ................................... SUCCESS [ 1.108 s]
-[INFO] hoodie-client ...................................... SUCCESS [ 4.409 s]
-[INFO] hoodie-hive ........................................ SUCCESS [ 0.976 s]
-[INFO] hoodie-spark ....................................... SUCCESS [ 26.522 s]
-[INFO] hoodie-utilities ................................... SUCCESS [ 16.256 s]
-[INFO] hoodie-cli ......................................... SUCCESS [ 11.341 s]
-[INFO] hoodie-hadoop-mr-bundle ............................ SUCCESS [ 1.893 s]
-[INFO] hoodie-hive-bundle ................................. SUCCESS [ 14.099 s]
-[INFO] hoodie-spark-bundle ................................ SUCCESS [ 58.252 s]
-[INFO] hoodie-hadoop-docker ............................... SUCCESS [ 0.612 s]
-[INFO] hoodie-hadoop-base-docker .......................... SUCCESS [04:04 min]
-[INFO] hoodie-hadoop-namenode-docker ...................... SUCCESS [ 6.142 s]
-[INFO] hoodie-hadoop-datanode-docker ...................... SUCCESS [ 7.763 s]
-[INFO] hoodie-hadoop-history-docker ....................... SUCCESS [ 5.922 s]
-[INFO] hoodie-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
-[INFO] hoodie-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
-[INFO] hoodie-hadoop-sparkmaster-docker ................... SUCCESS [ 2.964 s]
-[INFO] hoodie-hadoop-sparkworker-docker ................... SUCCESS [ 3.032 s]
-[INFO] hoodie-hadoop-sparkadhoc-docker .................... SUCCESS [ 2.764 s]
-[INFO] hoodie-integ-test .................................. SUCCESS [ 1.785 s]
+[INFO] hudi-common ...................................... SUCCESS [ 9.015 s]
+[INFO] hudi-hadoop-mr ................................... SUCCESS [ 1.108 s]
+[INFO] hudi-client ...................................... SUCCESS [ 4.409 s]
+[INFO] hudi-hive ........................................ SUCCESS [ 0.976 s]
+[INFO] hudi-spark ....................................... SUCCESS [ 26.522 s]
+[INFO] hudi-utilities ................................... SUCCESS [ 16.256 s]
+[INFO] hudi-cli ......................................... SUCCESS [ 11.341 s]
+[INFO] hudi-hadoop-mr-bundle ............................ SUCCESS [ 1.893 s]
+[INFO] hudi-hive-bundle ................................. SUCCESS [ 14.099 s]
+[INFO] hudi-spark-bundle ................................ SUCCESS [ 58.252 s]
+[INFO] hudi-hadoop-docker ............................... SUCCESS [ 0.612 s]
+[INFO] hudi-hadoop-base-docker .......................... SUCCESS [04:04 min]
+[INFO] hudi-hadoop-namenode-docker ...................... SUCCESS [ 6.142 s]
+[INFO] hudi-hadoop-datanode-docker ...................... SUCCESS [ 7.763 s]
+[INFO] hudi-hadoop-history-docker ....................... SUCCESS [ 5.922 s]
+[INFO] hudi-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
+[INFO] hudi-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
+[INFO] hudi-hadoop-sparkmaster-docker ................... SUCCESS [ 2.964 s]
+[INFO] hudi-hadoop-sparkworker-docker ................... SUCCESS [ 3.032 s]
+[INFO] hudi-hadoop-sparkadhoc-docker .................... SUCCESS [ 2.764 s]
+[INFO] hudi-integ-test .................................. SUCCESS [ 1.785 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
diff --git a/docs/gcs_filesystem.md b/docs/gcs_filesystem.md
index 3919fdf..d07e320 100644
--- a/docs/gcs_filesystem.md
+++ b/docs/gcs_filesystem.md
@@ -22,7 +22,7 @@ Add the required configs in your core-site.xml from where
Hudi can fetch them. R
```xml
<property>
<name>fs.defaultFS</name>
- <value>gs://hoodie-bucket</value>
+ <value>gs://hudi-bucket</value>
</property>
<property>
diff --git a/docs/migration_guide.md b/docs/migration_guide.md
index e415ed3..6f3ed59 100644
--- a/docs/migration_guide.md
+++ b/docs/migration_guide.md
@@ -44,7 +44,7 @@ This tool essentially starts a Spark Job to read the existing
parquet dataset an
#### Option 2
For huge datasets, this could be as simple as : for partition in [list of
partitions in source dataset] {
val inputDF =
spark.read.format("any_input_format").load("partition_path")
- inputDF.write.format("com.uber.hoodie").option()....save("basePath")
+ inputDF.write.format("org.apache.hudi").option()....save("basePath")
}
#### Option 3
@@ -53,9 +53,9 @@ Write your own custom logic of how to load an existing
dataset into a Hudi manag
```
Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean
install -DskipTests`, the shell can be
-fired by via `cd hoodie-cli && ./hoodie-cli.sh`.
+fired by via `cd hudi-cli && ./hudi-cli.sh`.
-hoodie->hdfsparquetimport
+hudi->hdfsparquetimport
--upsert false
--srcPath /user/parquet/dataset/basepath
--targetPath
diff --git a/docs/querying_data.md b/docs/querying_data.md
index f96b328..3a6fd0f 100644
--- a/docs/querying_data.md
+++ b/docs/querying_data.md
@@ -27,13 +27,13 @@ In sections, below we will discuss in detail how to access
all the 3 views on ea
## Hive
-In order for Hive to recognize Hudi datasets and query correctly, the
HiveServer2 needs to be provided with the
`hoodie-hadoop-hive-bundle-x.y.z-SNAPSHOT.jar`
+In order for Hive to recognize Hudi datasets and query correctly, the
HiveServer2 needs to be provided with the
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
in its [aux jars
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
This will ensure the input format
classes with its dependencies are available for query planning & execution.
### Read Optimized table {#hive-ro-view}
In addition to setup above, for beeline cli access, the `hive.input.format`
variable needs to be set to the fully qualified path name of the
-inputformat `com.uber.hoodie.hadoop.HoodieInputFormat`. For Tez, additionally
the `hive.tez.input.format` needs to be set
+inputformat `org.apache.hudi.hadoop.HoodieInputFormat`. For Tez, additionally
the `hive.tez.input.format` needs to be set
to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
### Real time table {#hive-rt-view}
@@ -85,7 +85,7 @@ Spark provides much easier deployment & management of Hudi
jars and bundles into
- **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to
how standard datasources (e.g: `spark.read.parquet`) work.
- **Read as Hive tables** : Supports all three views, including the real time
view, relying on the custom Hudi input formats again like Hive.
- In general, your spark job needs a dependency to `hoodie-spark` or
`hoodie-spark-bundle-x.y.z.jar` needs to be on the class path of driver &
executors (hint: use `--jars` argument)
+ In general, your spark job needs a dependency to `hudi-spark` or
`hudi-spark-bundle-x.y.z.jar` needs to be on the class path of driver &
executors (hint: use `--jars` argument)
### Read Optimized table {#spark-ro-view}
@@ -93,13 +93,13 @@ To read RO table as a Hive table using SparkSQL, simply
push a path filter into
This method retains Spark built-in optimizations for reading Parquet files
like vectorized reading on Hudi tables.
```
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
+spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
```
If you prefer to glob paths on DFS via the datasource, you can simply do
something like below to get a Spark dataframe to work with.
```
-Dataset<Row> hoodieROViewDF = spark.read().format("com.uber.hoodie")
+Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
// pass any path glob, can include hudi & non-hudi datasets
.load("/glob/path/pattern");
```
@@ -109,18 +109,18 @@ Currently, real time table can only be queried as a Hive
table in Spark. In orde
to using the Hive Serde to read the data (planning/executions is still Spark).
```
-$ spark-shell --jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar
--driver-class-path /etc/hive/conf --packages
com.databricks:spark-avro_2.11:4.0.0 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
+$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path
/etc/hive/conf --packages com.databricks:spark-avro_2.11:4.0.0 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
scala> sqlContext.sql("select count(*) from hudi_rt where datestr =
'2016-10-02'").show()
```
### Incremental Pulling {#spark-incr-pull}
-The `hoodie-spark` module offers the DataSource API, a more elegant way to
pull data from Hudi dataset and process it via Spark.
+The `hudi-spark` module offers the DataSource API, a more elegant way to pull
data from Hudi dataset and process it via Spark.
A sample incremental pull, that will obtain all records written since
`beginInstantTime`, looks like below.
```
Dataset<Row> hoodieIncViewDF = spark.read()
- .format("com.uber.hoodie")
+ .format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
@@ -141,4 +141,4 @@ Additionally, `HoodieReadClient` offers the following
functionality using Hudi's
## Presto
Presto is a popular query engine, providing interactive query performance.
Hudi RO tables can be queried seamlessly in Presto.
-This requires the `hoodie-presto-bundle` jar to be placed into
`<presto_install>/plugin/hive-hadoop2/`, across the installation.
+This requires the `hudi-presto-bundle` jar to be placed into
`<presto_install>/plugin/hive-hadoop2/`, across the installation.
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 416aca9..d045c6a 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -16,7 +16,7 @@ If you have Hive, Hadoop, Spark installed already & prefer to
do it on your own
## Download Hudi
-Check out [code](https://github.com/apache/incubator-hudi) or download [latest
release](https://github.com/apache/incubator-hudi/archive/hoodie-0.4.5.zip)
+Check out [code](https://github.com/apache/incubator-hudi) or download [latest
release](https://github.com/apache/incubator-hudi/archive/hudi-0.4.5.zip)
and normally build the maven project, from command line
```
@@ -68,11 +68,11 @@ export
PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$P
### Run HoodieJavaApp
-Run __hoodie-spark/src/test/java/HoodieJavaApp.java__ class, to place a two
commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously
inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
+Run __hudi-spark/src/test/java/HoodieJavaApp.java__ class, to place a two
commits (commit 1 => 100 inserts, commit 2 => 100 updates to previously
inserted 100 records) onto your DFS/local filesystem. Use the wrapper script
to run from command-line
```
-cd hoodie-spark
+cd hudi-spark
./run_hoodie_app.sh --help
Usage: <main class> [options]
Options:
@@ -89,7 +89,7 @@ Usage: <main class> [options]
Default: COPY_ON_WRITE
```
-The class lets you choose table names, output paths and one of the storage
types. In your own applications, be sure to include the `hoodie-spark` module
as dependency
+The class lets you choose table names, output paths and one of the storage
types. In your own applications, be sure to include the `hudi-spark` module as
dependency
and follow a similar pattern to write/read datasets via the datasource.
## Query a Hudi dataset
@@ -107,7 +107,7 @@ bin/hiveserver2 \
--hiveconf hive.root.logger=INFO,console \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false \
- --hiveconf
hive.aux.jars.path=/path/to/packaging/hoodie-hive-bundle/target/hoodie-hive-bundle-0.4.6-SNAPSHOT.jar
+ --hiveconf
hive.aux.jars.path=/path/to/packaging/hudi-hive-bundle/target/hudi-hive-bundle-0.4.6-SNAPSHOT.jar
```
@@ -117,7 +117,7 @@ It uses an incremental approach by storing the last commit
time synced in the TB
Both [Spark Datasource](writing_data.html#datasource-writer) &
[DeltaStreamer](writing_data.html#deltastreamer) have capability to do this,
after each write.
```
-cd hoodie-hive
+cd hudi-hive
./run_sync_tool.sh
--user hive
--pass hive
@@ -140,7 +140,7 @@ Let's first perform a query on the latest committed
snapshot of the table
```
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> set hive.stats.autogather=false;
-hive> add jar file:///path/to/hoodie-hive-bundle-0.4.6-SNAPSHOT.jar;
+hive> add jar file:///path/to/hudi-hive-bundle-0.4.6-SNAPSHOT.jar;
hive> select count(*) from hoodie_test;
...
OK
@@ -155,7 +155,7 @@ Spark is super easy, once you get Hive working as above.
Just spin up a Spark Sh
```
$ cd $SPARK_INSTALL
-$ spark-shell --jars
$HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.6-SNAPSHOT.jar
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --packages
com.databricks:spark-avro_2.11:4.0.0
+$ spark-shell --jars
$HUDI_SRC/packaging/hudi-spark-bundle/target/hudi-spark-bundle-0.4.6-SNAPSHOT.jar
--driver-class-path $HADOOP_CONF_DIR --conf
spark.sql.hive.convertMetastoreParquet=false --packages
com.databricks:spark-avro_2.11:4.0.0
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.sql("show tables").show(10000)
@@ -168,7 +168,7 @@ scala> sqlContext.sql("select count(*) from
hoodie_test").show(10000)
Checkout the 'master' branch on OSS Presto, build it, and place your
installation somewhere.
-* Copy the
hudi/packaging/hoodie-presto-bundle/target/hoodie-presto-bundle-*.jar into
$PRESTO_INSTALL/plugin/hive-hadoop2/
+* Copy the hudi/packaging/hudi-presto-bundle/target/hudi-presto-bundle-*.jar
into $PRESTO_INSTALL/plugin/hive-hadoop2/
* Startup your server and you should be able to query the same Hive table via
Presto
```
diff --git a/docs/s3_filesystem.md b/docs/s3_filesystem.md
index de16123..fe9a442 100644
--- a/docs/s3_filesystem.md
+++ b/docs/s3_filesystem.md
@@ -54,7 +54,7 @@ Alternatively, add the required configs in your core-site.xml
from where Hudi ca
```
-Utilities such as hoodie-cli or deltastreamer tool, can pick up s3 creds via
environmental variable prefixed with `HOODIE_ENV_`. For e.g below is a bash
snippet to setup
+Utilities such as hudi-cli or deltastreamer tool, can pick up s3 creds via
environmental variable prefixed with `HOODIE_ENV_`. For e.g below is a bash
snippet to setup
such variables and then have cli be able to work on datasets stored in s3
```
diff --git a/docs/writing_data.md b/docs/writing_data.md
index 3036d0f..c727266 100644
--- a/docs/writing_data.md
+++ b/docs/writing_data.md
@@ -30,7 +30,7 @@ can be chosen/changed across each commit/deltacommit issued
against the dataset.
## DeltaStreamer
-The `HoodieDeltaStreamer` utility (part of hoodie-utilities-bundle) provides
the way to ingest from different sources such as DFS or Kafka, with the
following capabilities.
+The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides the
way to ingest from different sources such as DFS or Kafka, with the following
capabilities.
- Exactly once ingestion of new events from Kafka, [incremental
imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
- Support json, avro or a custom record types for the incoming data
@@ -41,7 +41,7 @@ The `HoodieDeltaStreamer` utility (part of
hoodie-utilities-bundle) provides the
Command line options describe capabilities in more detail
```
-[hoodie]$ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` --help
+[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--commit-on-errors
@@ -55,7 +55,7 @@ Usage: <main class> [options]
insert/bulk-insert
Default: false
--help, -h
- --hoodie-conf
+ --hudi-conf
Any configuration that can be set in the properties file (using the
CLI
parameter "--propsFilePath") can also be passed command line using
this
parameter
@@ -69,7 +69,7 @@ Usage: <main class> [options]
subclass of HoodieRecordPayload, that works off a GenericRecord.
Implement your own, if you want to do something other than overwriting
existing value
- Default: com.uber.hoodie.OverwriteWithLatestAvroPayload
+ Default: org.apache.hudi.OverwriteWithLatestAvroPayload
--props
path to properties file on localfs or dfs, with configurations for
Hudi client, schema provider, key generator and data source. For
@@ -78,15 +78,15 @@ Usage: <main class> [options]
sources, referto individual classes, for supported properties.
Default:
file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
--schemaprovider-class
- subclass of com.uber.hoodie.utilities.schema.SchemaProvider to attach
+ subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
schemas to input & target table data, built in options:
FilebasedSchemaProvider
- Default: com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
+ Default: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--source-class
- Subclass of com.uber.hoodie.utilities.sources to read data. Built-in
- options: com.uber.hoodie.utilities.sources.{JsonDFSSource (default),
+ Subclass of org.apache.hudi.utilities.sources to read data. Built-in
+ options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
AvroDFSSource, JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}
- Default: com.uber.hoodie.utilities.sources.JsonDFSSource
+ Default: org.apache.hudi.utilities.sources.JsonDFSSource
--source-limit
Maximum amount of data to read from source. Default: No limit For e.g:
DFSSource => max bytes to read, KafkaSource => max events to read
@@ -105,15 +105,15 @@ Usage: <main class> [options]
* --target-table
name of the target table in Hive
--transformer-class
- subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to
+ subclass of org.apache.hudi.utilities.transform.Transformer. UDF to
transform raw source dataset to a target dataset (conforming to target
schema) before writing. Default : Not set. E:g -
- com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
+ org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
allows a SQL query template to be passed as a transformation function)
```
The tool takes a hierarchically composed property file and has pluggable
interfaces for extracting data, key generation and providing schema. Sample
configs for ingesting from kafka and dfs are
-provided under `hoodie-utilities/src/test/resources/delta-streamer-config`.
+provided under `hudi-utilities/src/test/resources/delta-streamer-config`.
For e.g: once you have Confluent Kafka, Schema registry up & running, produce
some test data using
([impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html)
provided by schema-registry repo)
@@ -124,12 +124,12 @@ For e.g: once you have Confluent Kafka, Schema registry
up & running, produce so
and then ingest it as follows.
```
-[hoodie]$ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle-*.jar` \
- --props
file://${PWD}/hoodie-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
- --schemaprovider-class
com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
- --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
+[hoodie]$ spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+ --props
file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties
\
+ --schemaprovider-class
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+ --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field impresssiontime \
- --target-base-path file:///tmp/hoodie-deltastreamer-op --target-table
uber.impressions \
+ --target-base-path file:///tmp/hudi-deltastreamer-op --target-table
uber.impressions \
--op BULK_INSERT
```
@@ -137,14 +137,14 @@ In some cases, you may want to migrate your existing
dataset into Hudi beforehan
## Datasource Writer
-The `hoodie-spark` module offers the DataSource API to write (and also read)
any data frame into a Hudi dataset.
+The `hudi-spark` module offers the DataSource API to write (and also read) any
data frame into a Hudi dataset.
Following is how we can upsert a dataframe, while specifying the field names
that need to be used
for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey
=> timestamp`
```
inputDF.write()
- .format("com.uber.hoodie")
+ .format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as
well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
"partition")
@@ -158,12 +158,12 @@ inputDF.write()
Both tools above support syncing of the dataset's latest schema to Hive
metastore, such that queries can pick up new columns and partitions.
In case it's preferable to run this from the command line or in an independent JVM,
Hudi provides a `HiveSyncTool`, which can be invoked as below,
-once you have built the hoodie-hive module.
+once you have built the hudi-hive module.
```
-cd hoodie-hive
+cd hudi-hive
./run_sync_tool.sh
- [hoodie-hive]$ ./run_sync_tool.sh --help
+ [hudi-hive]$ ./run_sync_tool.sh --help
Usage: <main class> [options]
Options:
* --base-path
@@ -189,14 +189,14 @@ Hudi supports implementing two types of deletes on data
stored in Hudi datasets,
- **Soft Deletes** : With soft deletes, user wants to retain the key but just
null out the values for all other fields.
This can be simply achieved by ensuring the appropriate fields are nullable
in the dataset schema and simply upserting the dataset after setting these
fields to null.
- **Hard Deletes** : A stronger form of delete is to physically remove any
trace of the record from the dataset. This can be achieved by issuing an upsert
with a custom payload implementation
- via either DataSource or DeltaStreamer which always returns Optional.Empty as
the combined value. Hudi ships with a built-in
`com.uber.hoodie.EmptyHoodieRecordPayload` class that does exactly this.
+ via either DataSource or DeltaStreamer which always returns Optional.Empty as
the combined value. Hudi ships with a built-in
`org.apache.hudi.EmptyHoodieRecordPayload` class that does exactly this.
```
deleteDF // dataframe containing just records to be deleted
- .write().format("com.uber.hoodie")
+ .write().format("org.apache.hudi")
.option(...) // Add HUDI options like record-key, partition-path and others
as needed for your setup
// specify record_key, partition_key, precombine_fieldkey & usual params
- .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
"com.uber.hoodie.EmptyHoodieRecordPayload")
+ .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
"org.apache.hudi.EmptyHoodieRecordPayload")
```