This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 97b3106 Refreshing site content
97b3106 is described below
commit 97b3106520c489612dd2187eb9ce4796d5f5c49f
Author: Vinoth Chandar <[email protected]>
AuthorDate: Sat Mar 9 13:18:07 2019 -0800
Refreshing site content
---
content/.gitignore | 1 -
content/404.html | 6 +-
content/admin_guide.html | 50 ++-
content/community.html | 11 +-
content/comparison.html | 15 +-
content/concepts.html | 8 +-
content/configurations.html | 489 ++++++++++++++-------
content/contributing.html | 8 +-
content/css/customstyles.css | 4 +-
content/css/theme-blue.css | 2 +-
content/feed.xml | 6 +-
content/gcs_hoodie.html | 16 +-
...ommit_duration.png => hudi_commit_duration.png} | Bin
.../{hoodie_intro_1.png => hudi_intro_1.png} | Bin
...ie_log_format_v2.png => hudi_log_format_v2.png} | Bin
...uery_perf_hive.png => hudi_query_perf_hive.png} | Bin
..._perf_presto.png => hudi_query_perf_presto.png} | Bin
...ry_perf_spark.png => hudi_query_perf_spark.png} | Bin
.../{hoodie_upsert_dag.png => hudi_upsert_dag.png} | Bin
...odie_upsert_perf1.png => hudi_upsert_perf1.png} | Bin
...odie_upsert_perf2.png => hudi_upsert_perf2.png} | Bin
content/implementation.html | 22 +-
content/incremental_processing.html | 36 +-
content/index.html | 10 +-
content/js/mydoc_scroll.html | 6 +-
content/migration_guide.html | 15 +-
content/news.html | 8 +-
content/news_archive.html | 6 +-
content/powered_by.html | 7 +-
content/privacy.html | 6 +-
content/quickstart.html | 26 +-
content/s3_hoodie.html | 19 +-
content/search.json | 40 +-
content/sql_queries.html | 8 +-
content/strata-talk.html | 6 +-
content/use_cases.html | 19 +-
36 files changed, 558 insertions(+), 292 deletions(-)
diff --git a/content/.gitignore b/content/.gitignore
deleted file mode 100644
index e43b0f9..0000000
--- a/content/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-.DS_Store
diff --git a/content/404.html b/content/404.html
index 9491810..fedef9b 100644
--- a/content/404.html
+++ b/content/404.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" ">
+<meta name="keywords" content="">
<title>Page Not Found | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
diff --git a/content/admin_guide.html b/content/admin_guide.html
index 470a219..3625cee 100644
--- a/content/admin_guide.html
+++ b/content/admin_guide.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="This section offers an overview of tools
available to operate an ecosystem of Hudi datasets">
-<meta name="keywords" content=" admin">
+<meta name="keywords" content="hudi, administration, operation, devops">
<title>Admin Guide | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -355,11 +359,11 @@
<h2 id="admin-cli">Admin CLI</h2>
-<p>Once hoodie has been built via <code class="highlighter-rouge">mvn clean
install -DskipTests</code>, the shell can be fired by via <code
class="highlighter-rouge">cd hoodie-cli && ./hoodie-cli.sh</code>.
-A hoodie dataset resides on HDFS, in a location referred to as the
<strong>basePath</strong> and we would need this location in order to connect
to a Hoodie dataset.
-Hoodie library effectively manages this HDFS dataset internally, using .hoodie
subfolder to track all metadata</p>
+<p>Once Hudi has been built, the shell can be fired up via <code
class="highlighter-rouge">cd hoodie-cli && ./hoodie-cli.sh</code>.
+A Hudi dataset resides on DFS, in a location referred to as the
<strong>basePath</strong>, and we need this location in order to connect
to a Hudi dataset.
+The Hudi library effectively manages this dataset internally, using the .hoodie
subfolder to track all metadata.</p>
-<p>To initialize a hoodie table, use the following command.</p>
+<p>To initialize a Hudi table, use the following command.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>18/09/06 15:56:52
INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330
'javax.inject.Inject' annotation found and supported for autowiring
============================================
@@ -380,7 +384,7 @@ hoodie->create --path /user/hive/warehouse/table1
--tableName hoodie_table_1
</code></pre>
</div>
-<p>To see the description of hoodie table, use the command:</p>
+<p>To see the description of a Hudi table, use the command:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>
hoodie:hoodie_table_1->desc
@@ -398,7 +402,7 @@ hoodie:hoodie_table_1->desc
</code></pre>
</div>
-<p>Following is a sample command to connect to a Hoodie dataset contains uber
trips.</p>
+<p>Following is a sample command to connect to a Hudi dataset containing uber
trips.</p>
<div class="highlighter-rouge"><pre
class="highlight"><code>hoodie:trips->connect --path /app/uber/trips
@@ -447,7 +451,7 @@ hoodie:trips->
<h4 id="inspecting-commits">Inspecting Commits</h4>
-<p>The task of upserting or inserting a batch of incoming records is known as
a <strong>commit</strong> in Hoodie. A commit provides basic atomicity
guarantees such that only commited data is available for querying.
+<p>The task of upserting or inserting a batch of incoming records is known as
a <strong>commit</strong> in Hudi. A commit provides basic atomicity guarantees
such that only committed data is available for querying.
Each commit has a monotonically increasing string/number called the
<strong>commit number</strong>. Typically, this is the time at which we started
the commit.</p>
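As a minimal sketch of this property (illustrative Python only, not Hudi code): a commit number derived from the wall-clock start time, formatted so that string order matches time order, matches instants like `20161005225920` seen in the listings in this guide.

```python
from datetime import datetime

def make_commit_time(started_at: datetime) -> str:
    # Format as yyyyMMddHHmmss; lexicographic order then
    # matches chronological order, so commit numbers are
    # monotonically increasing strings.
    return started_at.strftime("%Y%m%d%H%M%S")

earlier = make_commit_time(datetime(2016, 10, 5, 22, 59, 20))
later = make_commit_time(datetime(2016, 10, 5, 23, 18, 0))
assert earlier == "20161005225920"
assert earlier < later  # string comparison agrees with time order
```

The helper name `make_commit_time` is hypothetical; it only demonstrates why a timestamp-based commit number can double as a sortable identifier.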
<p>To view some basic information about the last 10 commits,</p>
@@ -464,7 +468,7 @@ hoodie:trips->
</code></pre>
</div>
-<p>At the start of each write, Hoodie also writes a .inflight commit to the
.hoodie folder. You can use the timestamp there to estimate how long the commit
has been inflight</p>
+<p>At the start of each write, Hudi also writes a .inflight commit to the
.hoodie folder. You can use the timestamp there to estimate how long the commit
has been inflight</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ hdfs dfs -ls
/app/uber/trips/.hoodie/*.inflight
-rw-r--r-- 3 vinoth supergroup 321984 2016-10-05 23:18
/app/uber/trips/.hoodie/20161005225920.inflight
@@ -522,7 +526,7 @@ order (See Concepts). The below commands allow users to
view the file-slices for
<h4 id="statistics">Statistics</h4>
-<p>Since Hoodie directly manages file sizes for HDFS dataset, it might be good
to get an overall picture</p>
+<p>Since Hudi directly manages file sizes for a DFS dataset, it might be good to
get an overall picture</p>
<div class="highlighter-rouge"><pre
class="highlight"><code>hoodie:trips->stats filesizes --partitionPath
2016/09/01 --sortBy "95th" --desc true --limit 10
________________________________________________________________________________________________
@@ -534,7 +538,7 @@ order (See Concepts). The below commands allow users to
view the file-slices for
</code></pre>
</div>
-<p>In case of Hoodie write taking much longer, it might be good to see the
write amplification for any sudden increases</p>
+<p>In case a Hudi write is taking much longer than usual, it might be good to
check the write amplification for any sudden increases.</p>
<div class="highlighter-rouge"><pre
class="highlight"><code>hoodie:trips->stats wa
__________________________________________________________________________
@@ -547,7 +551,7 @@ order (See Concepts). The below commands allow users to
view the file-slices for
<h4 id="archived-commits">Archived Commits</h4>
-<p>In order to limit the amount of growth of .commit files on HDFS, Hoodie
archives older .commit files (with due respect to the cleaner policy) into a
commits.archived file.
+<p>In order to limit the amount of growth of .commit files on DFS, Hudi
archives older .commit files (with due respect to the cleaner policy) into a
commits.archived file.
This is a sequence file that contains a mapping from commitNumber => json
with raw information about the commit (same that is nicely rolled up above).</p>
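The archival described above can be sketched as follows (a simplified in-memory illustration, not Hudi's actual sequence-file code; the `totalWrites` field is hypothetical): commits beyond a retained window are rolled into a single archive mapping commitNumber => raw JSON.

```python
import json

def archive_commits(commits, retain_last):
    """Roll all but the newest `retain_last` commits into one archive
    map of commitNumber -> raw JSON (a stand-in for commits.archived)."""
    ordered = sorted(commits)            # commit numbers sort chronologically
    cutoff = len(ordered) - retain_last
    archived = {c: json.dumps(commits[c]) for c in ordered[:cutoff]}
    active = {c: commits[c] for c in ordered[cutoff:]}
    return active, archived

commits = {
    "20161005225920": {"totalWrites": 100},
    "20161005231800": {"totalWrites": 50},
    "20161006090000": {"totalWrites": 75},
}
active, archived = archive_commits(commits, retain_last=2)
assert list(archived) == ["20161005225920"]  # oldest commit got archived
```

This mirrors the idea that the cleaner policy bounds the number of live `.commit` files while older commit metadata remains queryable from the archive.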
<h4 id="compactions">Compactions</h4>
@@ -692,7 +696,7 @@ No File renames needed to unschedule pending compaction.
Operation successful.</
<div class="highlighter-rouge"><pre class="highlight"><code>
##### Repair Compaction
-The above compaction unscheduling operations could sometimes fail partially
(e:g -> HDFS temporarily unavailable). With
+The above compaction unscheduling operations could sometimes fail partially
(e.g. DFS temporarily unavailable). With
partial failures, the compaction operation could become inconsistent with the
state of file-slices. When you run
`compaction validate`, you can notice invalid compaction operations if there
is one. In these cases, the repair
command comes to the rescue, it will rearrange the file-slices so that there
is no loss and the file-slices are
@@ -710,7 +714,7 @@ Compaction successfully repaired
<h2 id="metrics">Metrics</h2>
-<p>Once the Hoodie Client is configured with the right datasetname and
environment for metrics, it produces the following graphite metrics, that aid
in debugging hoodie datasets</p>
+<p>Once the Hudi client is configured with the right dataset name and
environment for metrics, it produces the following Graphite metrics, which aid
in debugging Hudi datasets.</p>
<ul>
<li><strong>Commit Duration</strong> - This is amount of time it took to
successfully commit a batch of records</li>
@@ -722,29 +726,29 @@ Compaction successfully repaired
<p>These metrics can then be plotted on a standard tool like grafana. Below is
a sample commit duration chart.</p>
-<figure><img class="docimage" src="images/hoodie_commit_duration.png"
alt="hoodie_commit_duration.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_commit_duration.png"
alt="hudi_commit_duration.png" style="max-width: 1000px" /></figure>
<h2 id="troubleshooting-failures">Troubleshooting Failures</h2>
-<p>Section below generally aids in debugging Hoodie failures. Off the bat, the
following metadata is added to every record to help triage issues easily using
standard Hadoop SQL engines (Hive/Presto/Spark)</p>
+<p>The section below generally aids in debugging Hudi failures. Off the bat, the
following metadata is added to every record to help triage issues easily using
standard Hadoop SQL engines (Hive/Presto/Spark):</p>
<ul>
- <li><strong>_hoodie_record_key</strong> - Treated as a primary key within
each HDFS partition, basis of all updates/inserts</li>
+ <li><strong>_hoodie_record_key</strong> - Treated as a primary key within
each DFS partition, basis of all updates/inserts</li>
<li><strong>_hoodie_commit_time</strong> - Last commit that touched this
record</li>
<li><strong>_hoodie_file_name</strong> - Actual file name containing the
record (super useful to triage duplicates)</li>
<li><strong>_hoodie_partition_path</strong> - Path from basePath that
identifies the partition containing this record</li>
</ul>
-<div class="bs-callout bs-callout-warning">Note that as of now, Hoodie assumes
the application passes in the same deterministic partitionpath for a given
recordKey. i.e the uniqueness of record key is only enforced within each
partition</div>
+<div class="bs-callout bs-callout-warning">Note that as of now, Hudi assumes
the application passes in the same deterministic partitionpath for a given
recordKey, i.e. the uniqueness of the record key is only enforced within each
partition.</div>
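The callout above can be illustrated with a small sketch (hypothetical helper, not Hudi code): the effective identity of a record is the (partitionPath, recordKey) pair, so the same record key in two partitions yields two distinct records.

```python
def upsert(dataset, batch):
    """Apply a batch of records; identity is the (partitionPath, recordKey)
    pair, so the same record key may legally appear in two partitions."""
    for rec in batch:
        key = (rec["_hoodie_partition_path"], rec["_hoodie_record_key"])
        dataset[key] = rec
    return dataset

ds = {}
upsert(ds, [{"_hoodie_record_key": "id1",
             "_hoodie_partition_path": "2016/09/01", "fare": 10}])
upsert(ds, [{"_hoodie_record_key": "id1",
             "_hoodie_partition_path": "2016/09/02", "fare": 20}])
assert len(ds) == 2  # duplicate key, but in different partitions
```

This is exactly why apparent "duplicates" should first be checked against `_hoodie_partition_path`: identical keys in different partitions are not duplicates from Hudi's point of view.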
<h4 id="missing-records">Missing records</h4>
<p>Please check if there were any write errors using the admin commands above,
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hoodie, but
handed back to the application to decide what to do with it.</p>
+If you do find errors, then the record was not actually written by Hudi, but
handed back to the application to decide what to do with it.</p>
<h4 id="duplicates">Duplicates</h4>
-<p>First of all, please confirm if you do indeed have duplicates
<strong>AFTER</strong> ensuring the query is accessing the Hoodie datasets <a
href="sql_queries.html">properly</a> .</p>
+<p>First of all, please confirm if you do indeed have duplicates
<strong>AFTER</strong> ensuring the query is accessing the Hudi datasets <a
href="sql_queries.html">properly</a>.</p>
<ul>
<li>If confirmed, please use the metadata fields above, to identify the
physical files & partition files containing the records .</li>
@@ -754,10 +758,10 @@ If you do find errors, then the record was not actually
written by Hoodie, but h
<h4 id="spark-failures">Spark failures</h4>
-<p>Typical upsert() DAG looks like below. Note that Hoodie client also caches
intermediate RDDs to intelligently profile workload and size files and spark
parallelism.
+<p>A typical upsert() DAG looks like below. Note that the Hudi client also caches
intermediate RDDs to intelligently profile the workload, and to size files and
Spark parallelism.
Also Spark UI shows sortByKey twice due to the probe job also being shown,
nonetheless its just a single sort.</p>
-<figure><img class="docimage" src="images/hoodie_upsert_dag.png"
alt="hoodie_upsert_dag.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_dag.png"
alt="hudi_upsert_dag.png" style="max-width: 1000px" /></figure>
<p>At a high level, there are two steps</p>
@@ -777,7 +781,7 @@ Also Spark UI shows sortByKey twice due to the probe job
also being shown, nonet
<li>Job 7 : Actual writing of data (update + insert + insert turned to
updates to maintain file size)</li>
</ul>
-<p>Depending on the exception source (Hoodie/Spark), the above knowledge of
the DAG can be used to pinpoint the actual issue. The most often encountered
failures result from YARN/HDFS temporary failures.
+<p>Depending on the exception source (Hudi/Spark), the above knowledge of the
DAG can be used to pinpoint the actual issue. The most often encountered
failures result from YARN/DFS temporary failures.
In the future, a more sophisticated debug/management UI would be added to the
project, that can help automate some of this debugging.</p>
diff --git a/content/community.html b/content/community.html
index 39488eb..34196f3 100644
--- a/content/community.html
+++ b/content/community.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="hudi, use cases, big data, apache">
<title>Community | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -355,7 +359,7 @@
<tbody>
<tr>
<td>For any general questions, user support, development discussions</td>
- <td>Dev Mailing list (<a
href="mailto:dev-subscribe@hudi.apache.org">Subscribe</a>,
<a
href="mailto:dev-unsubscribe@hudi.apache.or&
[...]
+ <td>Dev Mailing list (<a
href="mailto:dev-subscribe@hudi.apache.org">Subscribe</a>,
<a
href="mailto:dev-unsubscribe@hudi.apache.or&
[...]
</tr>
<tr>
<td>For reporting bugs or issues or discover known issues</td>
@@ -389,9 +393,10 @@ Apache Hudi follows the typical Apache vulnerability
handling <a href="https://a
<li>Ask (and/or) answer questions on our support channels listed above.</li>
<li>Review code or HIPs</li>
<li>Help improve documentation</li>
+ <li>Author blogs on our wiki</li>
<li>Testing; Improving out-of-box experience by reporting bugs</li>
<li>Share new ideas/directions to pursue or propose a new HIP</li>
- <li>Contributing code to the project</li>
+ <li>Contributing code to the project (<a
href="https://issues.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+component+%3D+newbie">newbie
JIRAs</a>)</li>
</ul>
<h4 id="code-contributions">Code Contributions</h4>
diff --git a/content/comparison.html b/content/comparison.html
index 34082e0..59bcf75 100644
--- a/content/comparison.html
+++ b/content/comparison.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="apache, hudi, kafka, kudu, hive, hbase, stream
processing">
<title>Comparison | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -341,7 +345,7 @@
- <p>Apache Hudi fills a big void for processing data on top of HDFS, and thus
mostly co-exists nicely with these technologies. However,
+ <p>Apache Hudi fills a big void for processing data on top of DFS, and thus
mostly co-exists nicely with these technologies. However,
it would be useful to understand how Hudi fits into the current big data
ecosystem, contrasting it with a few related systems
and bring out the different tradeoffs these systems have accepted in their
design.</p>
@@ -380,16 +384,15 @@ just for analytics. Finally, HBase does not support
incremental processing primi
<p>A popular question, we get is : “How does Hudi relate to stream processing
systems?”, which we will try to answer here. Simply put, Hudi can integrate with
batch (<code class="highlighter-rouge">copy-on-write storage</code>) and
streaming (<code class="highlighter-rouge">merge-on-read storage</code>) jobs
of today, to store the computed results in Hadoop. For Spark apps, this can
happen via direct
integration of Hudi library with Spark/Spark streaming DAGs. In case of
Non-Spark processing systems (eg: Flink, Hive), the processing can be done in
the respective systems
-and later sent into a Hudi table via a Kafka topic/HDFS intermediate file. In
more conceptual level, data processing
+and later sent into a Hudi table via a Kafka topic/DFS intermediate file. At a
more conceptual level, data processing
pipelines just consist of three components : <code
class="highlighter-rouge">source</code>, <code
class="highlighter-rouge">processing</code>, <code
class="highlighter-rouge">sink</code>, with users ultimately running queries
against the sink to use the results of the pipeline.
-Hudi can act as either a source or sink, that stores data on HDFS.
Applicability of Hudi to a given stream processing pipeline ultimately boils
down to suitability
+Hudi can act as either a source or sink, that stores data on DFS.
Applicability of Hudi to a given stream processing pipeline ultimately boils
down to suitability
of Presto/SparkSQL/Hive for your queries.</p>
<p>More advanced use cases revolve around the concepts of <a
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">incremental
processing</a>, which effectively
uses Hudi even inside the <code class="highlighter-rouge">processing</code>
engine to speed up typical batch pipelines. For e.g: Hudi can be used as a
state store inside a processing DAG (similar
to how <a
href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend">rocksDB</a>
is used by Flink). This is an item on the roadmap
-and will eventually happen as a <a
href="https://github.com/uber/hoodie/issues/8">Beam Runner</a></p>
-
+and will eventually happen as a <a
href="https://issues.apache.org/jira/browse/HUDI-60">Beam Runner</a></p>
<div class="tags">
diff --git a/content/concepts.html b/content/concepts.html
index 7e85d32..22754c4 100644
--- a/content/concepts.html
+++ b/content/concepts.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Here we introduce some basic concepts & give
a broad technical overview of Hudi">
-<meta name="keywords" content=" concepts">
+<meta name="keywords" content="hudi, design, storage, views, timeline">
<title>Concepts | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -343,7 +347,7 @@
- <p>Apache Hudi (pronounced “Hudi”) provides the following primitives over
datasets on HDFS</p>
+ <p>Apache Hudi (pronounced “Hudi”) provides the following primitives over
datasets on DFS</p>
<ul>
<li>Upsert (how do I change the dataset?)</li>
diff --git a/content/configurations.html b/content/configurations.html
index 5f1adb8..73f66c9 100644
--- a/content/configurations.html
+++ b/content/configurations.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Here we list all possible configurations and
what they mean">
-<meta name="keywords" content=" configurations">
+<meta name="keywords" content="garbage collection, hudi, jvm, configs, tuning">
<title>Configurations | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -343,174 +347,360 @@
- <h3 id="configuration">Configuration</h3>
+ <p>This page covers the different ways of configuring your job to write/read
Hudi datasets.
+At a high level, you can control behaviour at a few levels.</p>
+
+<ul>
+ <li><strong><a href="#spark-datasource">Spark Datasource
Configs</a></strong> : These configs control the Hudi Spark Datasource,
providing the ability to define keys/partitioning, pick the write operation,
specify how to merge records, or choose the view type to read.</li>
+ <li><strong><a href="#writeclient-configs">WriteClient Configs</a></strong>
: Internally, the Hudi datasource uses an RDD-based <code
class="highlighter-rouge">HoodieWriteClient</code> API to actually perform
writes to storage. These configs provide deep control over lower level aspects
like
+ file sizing, compression, parallelism, compaction, write schema, cleaning
etc. Although Hudi provides sane defaults, from time to time these configs may
need to be tweaked to optimize for specific workloads.</li>
+ <li><strong><a href="#PAYLOAD_CLASS_OPT_KEY">RecordPayload
Config</a></strong> : This is the lowest level of customization offered by
Hudi. Record payloads define how to produce new values to upsert based on the
incoming new record and
+ the stored old record. Hudi provides default implementations such as <code
class="highlighter-rouge">OverwriteWithLatestAvroPayload</code>, which simply
updates storage with the latest/last-written record.
+ This can be overridden with a custom class extending the <code
class="highlighter-rouge">HoodieRecordPayload</code> class, at both the datasource
and WriteClient levels.</li>
+</ul>
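The default payload behaviour described above can be sketched in a few lines (a plain-dict stand-in; the class and method names here are hypothetical, simplified from the Avro-based `OverwriteWithLatestAvroPayload`):

```python
class OverwriteWithLatestPayload:
    """Sketch of the default merge semantics: the latest/last-written
    record simply replaces the stored one."""
    def __init__(self, record):
        self.record = record

    def combine(self, stored_record):
        # Default behaviour: the incoming record wins outright. A custom
        # payload could instead merge fields from both records.
        return self.record

stored = {"uuid": "id1", "fare": 10, "ts": 1}
incoming = OverwriteWithLatestPayload({"uuid": "id1", "fare": 12, "ts": 2})
merged = incoming.combine(stored)
assert merged["fare"] == 12  # stored value is overwritten
```

A custom payload class would override the combine step to implement, say, partial-field merges instead of wholesale overwrite.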
+
+<h3 id="talking-to-cloud-storage">Talking to Cloud Storage</h3>
+
+<p>Immaterial of whether RDD/WriteClient APIs or Datasource is used, the
following information helps configure access
+to cloud stores.</p>
+
+<ul>
+ <li><a href="s3_hoodie.html">AWS S3</a> <br />
+Configurations required for S3 and Hudi co-operability.</li>
+ <li><a href="gcs_hoodie.html">Google Cloud Storage</a> <br />
+Configurations required for GCS and Hudi co-operability.</li>
+</ul>
+
+<h3 id="spark-datasource">Spark Datasource Configs</h3>
+
+<p>Spark jobs using the datasource can be configured by passing the below
options into the <code class="highlighter-rouge">option(k,v)</code> method as
usual.
+The actual datasource level configs are listed below.</p>
+
+<h4 id="write-options">Write Options</h4>
+
+<p>Additionally, you can pass down any of the WriteClient level configs
directly using <code class="highlighter-rouge">options()</code> or <code
class="highlighter-rouge">option(k,v)</code> methods.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>inputDF.write()
+.format("com.uber.hoodie")
+.options(clientOpts) // any of the Hudi client opts can be passed in as well
+.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieWriteConfig.TABLE_NAME, tableName)
+.mode(SaveMode.Append)
+.save(basePath);
+</code></pre>
+</div>
+
+<p>Options useful for writing datasets via <code
class="highlighter-rouge">write.format.option(...)</code></p>
+
+<ul>
+ <li><a href="#TABLE_NAME_OPT_KEY">TABLE_NAME_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.table.name</code>
[Required]<br />
+<span style="color:grey">Hive table name, to register the dataset
into.</span></li>
+ <li><a href="#OPERATION_OPT_KEY">OPERATION_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.operation</code>, Default:
<code class="highlighter-rouge">upsert</code><br />
+<span style="color:grey">Whether to do upsert, insert or bulkinsert for the
write operation. Use <code class="highlighter-rouge">bulkinsert</code> to load
new data into a table, and from there on use <code
class="highlighter-rouge">upsert</code>/<code
class="highlighter-rouge">insert</code>.
+Bulk insert uses a disk-based write path to scale to large inputs without
needing to cache them.</span></li>
+ <li><a href="#STORAGE_TYPE_OPT_KEY">STORAGE_TYPE_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.storage.type</code>, Default:
<code class="highlighter-rouge">COPY_ON_WRITE</code> <br />
+<span style="color:grey">The storage type for the underlying data, for this
write. This can’t change between writes.</span></li>
+ <li><a href="#PRECOMBINE_FIELD_OPT_KEY">PRECOMBINE_FIELD_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.precombine.field</code>,
Default: <code class="highlighter-rouge">ts</code> <br />
+<span style="color:grey">Field used in preCombining before actual write. When
two records have the same key value,
+we will pick the one with the largest value for the precombine field,
determined by Object.compareTo(..)</span></li>
+ <li><a href="#PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.payload.class</code>,
Default: <code
class="highlighter-rouge">com.uber.hoodie.OverwriteWithLatestAvroPayload</code>
<br />
+<span style="color:grey">Payload class used. Override this if you would like to
roll your own merge logic when upserting/inserting.
+This will render any value set for <code
class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code>
ineffective</span></li>
+ <li><a href="#RECORDKEY_FIELD_OPT_KEY">RECORDKEY_FIELD_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.recordkey.field</code>,
Default: <code class="highlighter-rouge">uuid</code> <br />
+<span style="color:grey">Record key field. Value to be used as the <code
class="highlighter-rouge">recordKey</code> component of <code
class="highlighter-rouge">HoodieKey</code>. Actual value
+will be obtained by invoking .toString() on the field value. Nested fields can
be specified using
+the dot notation eg: <code class="highlighter-rouge">a.b.c</code></span></li>
+ <li><a
href="#PARTITIONPATH_FIELD_OPT_KEY">PARTITIONPATH_FIELD_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.partitionpath.field</code>,
Default: <code class="highlighter-rouge">partitionpath</code> <br />
+<span style="color:grey">Partition path field. Value to be used as the <code
class="highlighter-rouge">partitionPath</code> component of <code
class="highlighter-rouge">HoodieKey</code>.
+Actual value obtained by invoking .toString()</span></li>
+ <li><a href="#KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</a><br
/>
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.keygenerator.class</code>,
Default: <code
class="highlighter-rouge">com.uber.hoodie.SimpleKeyGenerator</code> <br />
+<span style="color:grey">Key generator class that extracts the
key out of the incoming <code class="highlighter-rouge">Row</code>
object</span></li>
+ <li><a
href="#COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</a><br
/>
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.commitmeta.key.prefix</code>,
Default: <code class="highlighter-rouge">_</code> <br />
+<span style="color:grey">Option keys beginning with this prefix, are
automatically added to the commit/deltacommit metadata.
+This is useful to store checkpointing information, in a consistent way with
the hudi timeline</span></li>
+ <li><a href="#INSERT_DROP_DUPS_OPT_KEY">INSERT_DROP_DUPS_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.write.insert.drop.duplicates</code>,
Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span></li>
+ <li><a href="#HIVE_SYNC_ENABLED_OPT_KEY">HIVE_SYNC_ENABLED_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.enable</code>, Default:
<code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">When set to true, register/sync the dataset to Apache
Hive metastore</span></li>
+ <li><a href="#HIVE_DATABASE_OPT_KEY">HIVE_DATABASE_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.database</code>, Default:
<code class="highlighter-rouge">default</code> <br />
+<span style="color:grey">database to sync to</span></li>
+ <li><a href="#HIVE_TABLE_OPT_KEY">HIVE_TABLE_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.table</code>, [Required]
<br />
+<span style="color:grey">table to sync to</span></li>
+ <li><a href="#HIVE_USER_OPT_KEY">HIVE_USER_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.username</code>, Default:
<code class="highlighter-rouge">hive</code> <br />
+<span style="color:grey">hive user name to use</span></li>
+ <li><a href="#HIVE_PASS_OPT_KEY">HIVE_PASS_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.password</code>, Default:
<code class="highlighter-rouge">hive</code> <br />
+<span style="color:grey">hive password to use</span></li>
+ <li><a href="#HIVE_URL_OPT_KEY">HIVE_URL_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.jdbcurl</code>, Default:
<code class="highlighter-rouge">jdbc:hive2://localhost:10000</code> <br />
+<span style="color:grey">Hive metastore url</span></li>
+ <li><a
href="#HIVE_PARTITION_FIELDS_OPT_KEY">HIVE_PARTITION_FIELDS_OPT_KEY</a><br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.partition_fields</code>,
Default: <code class="highlighter-rouge"> </code> <br />
+<span style="color:grey">Field in the dataset to use for determining Hive
partition columns.</span></li>
+ <li><a
href="#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY">HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY</a><br
/>
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.partition_extractor_class</code>,
Default: <code
class="highlighter-rouge">com.uber.hoodie.hive.SlashEncodedDayPartitionValueExtractor</code>
<br />
+<span style="color:grey">Class used to extract partition field values into
hive partition columns.</span></li>
+ <li><a
href="#HIVE_ASSUME_DATE_PARTITION_OPT_KEY">HIVE_ASSUME_DATE_PARTITION_OPT_KEY</a><br
/>
+Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.assume_date_partitioning</code>,
Default: <code class="highlighter-rouge">false</code> <br />
+<span style="color:grey">Assume partitioning is yyyy/mm/dd</span></li>
+</ul>
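+
+<p>For illustration, the Hive sync options above can be passed directly on a
+Spark datasource write. This is only a sketch: the dataframe, table name and
+base path below are hypothetical placeholders.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>inputDF.write
+  .format("com.uber.hoodie")
+  .option("hoodie.datasource.write.insert.drop.duplicates", "true")
+  .option("hoodie.datasource.hive_sync.enable", "true")
+  .option("hoodie.datasource.hive_sync.database", "default")
+  .option("hoodie.datasource.hive_sync.table", "hudi_trips") // hypothetical table name
+  .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://localhost:10000")
+  .mode(SaveMode.Append)
+  .save(basePath) // hypothetical base path
+</code></pre>
+</div>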
+
+<h4 id="read-options">Read Options</h4>
+
+<p>Options useful for reading datasets via <code
class="highlighter-rouge">read.format.option(...)</code></p>
<ul>
- <li><a href="#HoodieWriteConfig">HoodieWriteConfig</a> <br />
-<span style="color:grey">Top Level Config which is passed in when
HoodieWriteClent is created.</span>
+ <li><a href="#VIEW_TYPE_OPT_KEY">VIEW_TYPE_OPT_KEY</a> <br />
+Property: <code class="highlighter-rouge">hoodie.datasource.view.type</code>,
Default: <code class="highlighter-rouge">read_optimized</code> <br />
+<span style="color:grey">Whether data needs to be read in incremental mode
(new data since an instantTime),
+Read Optimized mode (obtain latest view, based on columnar data),
+or Real Time mode (obtain latest view, based on row &amp; columnar
data)</span></li>
+ <li><a href="#BEGIN_INSTANTTIME_OPT_KEY">BEGIN_INSTANTTIME_OPT_KEY</a> <br
/>
+Property: <code
class="highlighter-rouge">hoodie.datasource.read.begin.instanttime</code>,
[Required in incremental mode] <br />
+<span style="color:grey">Instant time to start incrementally pulling data
from. The instanttime here need not
+necessarily correspond to an instant on the timeline. New data written with an
+ <code class="highlighter-rouge">instant_time &gt; BEGIN_INSTANTTIME</code>
is fetched. E.g., ‘20170901080000’ will fetch
+ all new data written after Sep 1, 2017 08:00 AM.</span></li>
+ <li><a href="#END_INSTANTTIME_OPT_KEY">END_INSTANTTIME_OPT_KEY</a> <br />
+Property: <code
class="highlighter-rouge">hoodie.datasource.read.end.instanttime</code>,
Default: latest instant (i.e., fetches all new data since the begin instant time) <br
/>
+<span style="color:grey">Instant time to limit incrementally fetched data to.
New data written with an
+<code class="highlighter-rouge">instant_time &lt;= END_INSTANTTIME</code> is
fetched.</span></li>
+</ul>
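+
+<p>As an illustration, an incremental pull using the options above might look
+like the following sketch (the begin instant time and base path are just
+example values):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>val incrementalDF = spark.read
+  .format("com.uber.hoodie")
+  .option("hoodie.datasource.view.type", "incremental")
+  .option("hoodie.datasource.read.begin.instanttime", "20170901080000")
+  .load(basePath) // hypothetical base path
+</code></pre>
+</div>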
+
+<h3 id="writeclient-configs">WriteClient Configs</h3>
+
+<p>Jobs programming directly against the RDD-level APIs can build a <code
class="highlighter-rouge">HoodieWriteConfig</code> object and pass it in to the
<code class="highlighter-rouge">HoodieWriteClient</code> constructor.
+HoodieWriteConfig can be built using a builder pattern as shown below.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieWriteConfig
cfg = HoodieWriteConfig.newBuilder()
+ .withPath(basePath)
+ .forTable(tableName)
+ .withSchema(schemaStr)
+ .withProps(props) // pass raw k,v pairs from a property file.
+
.withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
+ .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
+ ...
+ .build();
+</code></pre>
+</div>
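+
+<p>A typical flow then hands the resulting config to the client. The sketch
+below assumes an existing <code class="highlighter-rouge">JavaSparkContext</code>
+(<code class="highlighter-rouge">jsc</code>) and a
+<code class="highlighter-rouge">JavaRDD</code> of HoodieRecords
+(<code class="highlighter-rouge">recordsRDD</code>); generics are elided for
+brevity.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieWriteClient client = new HoodieWriteClient(jsc, cfg);
+String commitTime = client.startCommit();
+JavaRDD&lt;WriteStatus&gt; statuses = client.upsert(recordsRDD, commitTime);
+// inspect statuses for failures before committing, if auto-commit is off
+client.commit(commitTime, statuses);
+</code></pre>
+</div>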
+
+<p>The following subsections go over different aspects of write configs,
explaining the most important ones along with their property names and default values.</p>
+
+<ul>
+  <li><a href="#withPath">withPath</a> (hoodie_base_path) <br />
+Property: <code class="highlighter-rouge">hoodie.base.path</code> [Required]
<br />
+<span style="color:grey">Base DFS path under which all the data partitions are
created. Always prefix it explicitly with the storage scheme (e.g. hdfs://,
s3:// etc.). Hudi stores all the main metadata about commits, savepoints,
cleaning audit logs etc. in the .hoodie directory under the base directory.
</span></li>
+ <li><a href="#withSchema">withSchema</a> (schema_str) <br />
+Property: <code class="highlighter-rouge">hoodie.avro.schema</code>
[Required]<br />
+<span style="color:grey">This is the current reader avro schema for the
dataset. This is a string of the entire schema. HoodieWriteClient uses this
schema to pass on to implementations of HoodieRecordPayload to convert from the
source format to avro record. This is also used when re-writing records during
an update. </span></li>
+ <li><a href="#forTable">forTable</a> (table_name)<br />
+Property: <code class="highlighter-rouge">hoodie.table.name</code> [Required]
<br />
+  <span style="color:grey">Table name for the dataset; will be used for
registering with Hive. Needs to be the same across runs.</span></li>
+ <li><a href="#withBulkInsertParallelism">withBulkInsertParallelism</a>
(bulk_insert_parallelism = 1500) <br />
+Property: <code
class="highlighter-rouge">hoodie.bulkinsert.shuffle.parallelism</code><br />
+<span style="color:grey">Bulk insert is meant to be used for large initial
imports and this parallelism determines the initial number of files in your
dataset. Tune this to achieve a desired optimal size during initial
import.</span></li>
+ <li><a href="#withParallelism">withParallelism</a>
(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500)<br />
+Property: <code
class="highlighter-rouge">hoodie.insert.shuffle.parallelism</code>, <code
class="highlighter-rouge">hoodie.upsert.shuffle.parallelism</code><br />
+<span style="color:grey">Once data has been initially imported, this
parallelism controls the initial parallelism for reading input records. Ensure this
value is high enough, e.g. 1 partition for 1 GB of input data</span></li>
+ <li><a href="#combineInput">combineInput</a> (on_insert = false,
on_update=true)<br />
+Property: <code class="highlighter-rouge">hoodie.combine.before.insert</code>,
<code class="highlighter-rouge">hoodie.combine.before.upsert</code><br />
+<span style="color:grey">Flag which first combines the input RDD and merges
multiple partial records into a single record before inserting or updating in
DFS</span></li>
+ <li><a href="#withWriteStatusStorageLevel">withWriteStatusStorageLevel</a>
(level = MEMORY_AND_DISK_SER)<br />
+Property: <code
class="highlighter-rouge">hoodie.write.status.storage.level</code><br />
+<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert
return a persisted RDD[WriteStatus], because the client can inspect the
WriteStatus and choose to commit or not based on the failures.
This configures the storage level for that RDD </span></li>
+ <li><a href="#withAutoCommit">withAutoCommit</a> (autoCommit = true)<br />
+Property: <code class="highlighter-rouge">hoodie.auto.commit</code><br />
+<span style="color:grey">Should HoodieWriteClient autoCommit after insert and
upsert. The client can choose to turn off auto-commit and commit on a “defined
success condition”</span></li>
+ <li><a href="#withAssumeDatePartitioning">withAssumeDatePartitioning</a>
(assumeDatePartitioning = false)<br />
+Property: <code class="highlighter-rouge">hoodie.assume.date.partitioning</code><br />
+<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e., three levels from the base path. This is a stop-gap to
support tables created by versions &lt; 0.3.1. Will be removed eventually
</span></li>
+ <li><a href="#withConsistencyCheckEnabled">withConsistencyCheckEnabled</a>
(enabled = false)<br />
+Property: <code
class="highlighter-rouge">hoodie.consistency.check.enabled</code><br />
+<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files are listable on the underlying filesystem/storage. Set
this to true to work around S3’s eventual consistency model and ensure all data
written as part of a commit is faithfully available for queries. </span></li>
+</ul>
+
+<h4 id="index-configs">Index configs</h4>
+<p>The following configs control indexing behavior, which tags incoming records
as either inserts or updates to older records.</p>
+
+<ul>
+ <li><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) <br />
+  <span style="color:grey">This is pluggable, to either use an external index (HBase)
or the default bloom filter stored in the Parquet files</span>
<ul>
- <li><a href="#withPath">withPath</a> (hoodie_base_path) <br />
- <span style="color:grey">Base HDFS path under which all the data partitions
are created. Hoodie stores all the main meta-data about commits, savepoints,
cleaning audit logs etc in .hoodie directory under the base directory.
</span></li>
- <li><a href="#withSchema">withSchema</a> (schema_str) <br />
- <span style="color:grey">This is the current reader avro schema for the
Hoodie Dataset. This is a string of the entire schema. HoodieWriteClient uses
this schema to pass on to implementations of HoodieRecordPayload to convert
from the source format to avro record. This is also used when re-writing
records during an update. </span></li>
- <li><a href="#withParallelism">withParallelism</a>
(insert_shuffle_parallelism = 200, upsert_shuffle_parallelism = 200) <br />
- <span style="color:grey">Insert DAG uses the insert_parallelism in every
shuffle. Upsert DAG uses the upsert_parallelism in every shuffle. Typical
workload is profiled and once a min parallelism is established, trade off
between latency and cluster usage optimizations this is tuned and have a
conservatively high number to optimize for latency and </span></li>
- <li><a href="#combineInput">combineInput</a> (on_insert = false,
on_update=true) <br />
- <span style="color:grey">Flag which first combines the input RDD and merges
multiple partial records into a single record before inserting or updating in
HDFS</span></li>
- <li><a
href="#withWriteStatusStorageLevel">withWriteStatusStorageLevel</a> (level =
MEMORY_AND_DISK_SER) <br />
- <span style="color:grey">HoodieWriteClient.insert and
HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because
the Client can choose to inspect the WriteStatus and choose and commit or not
based on the failures. This is a configuration for the storage level for this
RDD </span></li>
- <li><a href="#withAutoCommit">withAutoCommit</a> (autoCommit = true) <br
/>
- <span style="color:grey">Should HoodieWriteClient autoCommit after insert
and upsert. The client can choose to turn off auto-commit and commit on a
“defined success condition”</span></li>
- <li><a href="#withAssumeDatePartitioning">withAssumeDatePartitioning</a>
(assumeDatePartitioning = false) <br />
- <span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e three levels from base path. This is a stop-gap to
support tables created by versions < 0.3.1. Will be removed eventually
</span></li>
- <li>
- <p><a
href="#withConsistencyCheckEnabled">withConsistencyCheckEnabled</a> (enabled =
false) <br />
- <span style="color:grey">Should HoodieWriteClient perform additional checks
to ensure written files’ are listable on the underlying filesystem/storage. Set
this to true, to workaround S3’s eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span></p>
- </li>
- <li><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig)
<br />
- <span style="color:grey">Hoodie uses a index to help find the FileID which
contains an incoming record key. This is pluggable to have a external index
(HBase) or use the default bloom filter stored in the Parquet files</span>
- <ul>
- <li><a href="#withIndexType">withIndexType</a> (indexType = BLOOM)
<br />
+ <li><a href="#withIndexType">withIndexType</a> (indexType = BLOOM) <br />
+ Property: <code class="highlighter-rouge">hoodie.index.type</code> <br />
    <span style="color:grey">Type of index to use. Default is Bloom filter.
Possible options are [BLOOM | HBASE | INMEMORY]. The Bloom filter removes the
dependency on an external system and is stored in the footer of the Parquet data
files</span></li>
- <li><a href="#bloomFilterNumEntries">bloomFilterNumEntries</a>
(60000) <br />
- <span style="color:grey">Only applies if index type is BLOOM. <br />This is
the number of entries to be stored in the bloom filter. We assume the
maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx
a total of 130K records in a file. The default (60000) is roughly half of this
approximation. <a href="https://github.com/uber/hoodie/issues/70">#70</a>
tracks computing this dynamically. Warning: Setting this very low, will
generate a lot of false positives and in [...]
- <li><a href="#bloomFilterFPP">bloomFilterFPP</a> (0.000000001) <br />
+ <li><a href="#bloomFilterNumEntries">bloomFilterNumEntries</a>
(numEntries = 60000) <br />
+ Property: <code
class="highlighter-rouge">hoodie.index.bloom.num_entries</code> <br />
+  <span style="color:grey">Only applies if index type is BLOOM. <br />This is
the number of entries to be stored in the bloom filter. We assume the
maxParquetFileSize is 128MB and averageRecordSize is 1024B, and hence approximate
a total of 130K records in a file. The default (60000) is roughly half of this
approximation. <a
href="https://issues.apache.org/jira/browse/HUDI-56">HUDI-56</a> tracks
computing this dynamically. Warning: Setting this very low will generate a lot
of false positiv [...]
+ <li><a href="#bloomFilterFPP">bloomFilterFPP</a> (fpp = 0.000000001) <br
/>
+ Property: <code class="highlighter-rouge">hoodie.index.bloom.fpp</code> <br
/>
    <span style="color:grey">Only applies if index type is BLOOM. <br /> Error
rate allowed given the number of entries. This is used to calculate how many
bits should be assigned for the bloom filter and the number of hash functions.
This is usually set very low (default: 0.000000001), as we prefer to trade off disk
space for lower false positives</span></li>
- <li><a href="#bloomIndexPruneByRanges">bloomIndexPruneByRanges</a>
(true) <br />
+ <li><a href="#bloomIndexPruneByRanges">bloomIndexPruneByRanges</a>
(pruneRanges = true) <br />
+ Property: <code
class="highlighter-rouge">hoodie.bloom.index.prune.by.ranges</code> <br />
    <span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, range information from files is leveraged to speed up index lookups.
Particularly helpful if the key has a monotonically increasing prefix, such as a
timestamp.</span></li>
- <li><a href="#bloomIndexUseCaching">bloomIndexUseCaching</a> (true)
<br />
+ <li><a href="#bloomIndexUseCaching">bloomIndexUseCaching</a> (useCaching
= true) <br />
+ Property: <code
class="highlighter-rouge">hoodie.bloom.index.use.caching</code> <br />
    <span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, the input RDD will be cached to speed up index lookup by reducing IO for
computing parallelism or affected partitions</span></li>
- <li><a href="#bloomIndexParallelism">bloomIndexParallelism</a> (0)
<br />
+ <li><a href="#bloomIndexParallelism">bloomIndexParallelism</a> (0) <br />
+ Property: <code
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload
characteristics</span></li>
- <li><a href="#hbaseZkQuorum">hbaseZkQuorum</a> (zkString) <br />
+ <li><a href="#hbaseZkQuorum">hbaseZkQuorum</a> (zkString) [Required]<br
/>
+ Property: <code class="highlighter-rouge">hoodie.index.hbase.zkquorum</code>
<br />
    <span style="color:grey">Only applicable if index type is HBASE. HBase ZK
Quorum URL to connect to.</span></li>
- <li><a href="#hbaseZkPort">hbaseZkPort</a> (port) <br />
+ <li><a href="#hbaseZkPort">hbaseZkPort</a> (port) [Required]<br />
+ Property: <code class="highlighter-rouge">hoodie.index.hbase.zkport</code>
<br />
    <span style="color:grey">Only applicable if index type is HBASE. HBase ZK
Quorum port to connect to.</span></li>
- <li><a href="#hbaseTableName">hbaseTableName</a> (tableName) <br />
- <span style="color:grey">Only application if index type is HBASE. HBase
Table name to use as the index. Hoodie stores the row_key and [partition_path,
fileID, commitTime] mapping in the table.</span></li>
- </ul>
- </li>
- <li><a href="#withStorageConfig">withStorageConfig</a>
(HoodieStorageConfig) <br />
- <span style="color:grey">Storage related configs</span>
- <ul>
- <li><a href="#limitFileSize">limitFileSize</a> (size = 120MB) <br />
- <span style="color:grey">Hoodie re-writes a single file during update
(copy_on_write) or a compaction (merge_on_read). This is fundamental unit of
parallelism. It is important that this is aligned with the underlying
filesystem block size. </span></li>
- <li><a href="#parquetBlockSize">parquetBlockSize</a> (rowgroupsize =
120MB) <br />
- <span style="color:grey">Parquet RowGroup size. Its better than this is
aligned with the file size, so that a single column within a file is stored
continuously on disk</span></li>
- <li><a href="#parquetPageSize">parquetPageSize</a> (pagesize = 1MB)
<br />
+ <li><a href="#hbaseTableName">hbaseTableName</a> (tableName)
[Required]<br />
+ Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code>
<br />
+    <span style="color:grey">Only applicable if index type is HBASE. HBase
table name to use as the index. Hudi stores the row_key and [partition_path,
fileID, commitTime] mapping in the table.</span></li>
+ </ul>
+ </li>
+</ul>
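+
+<p>Using the builder methods listed above, a bloom index could be configured as
+in the sketch below (values shown are the defaults; enum and method names are
+taken from the anchors above and may differ slightly by version):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieIndexConfig indexConfig = HoodieIndexConfig.newBuilder()
+    .withIndexType(HoodieIndex.IndexType.BLOOM)
+    .bloomFilterNumEntries(60000)       // entries per bloom filter
+    .bloomFilterFPP(0.000000001)        // allowed false positive rate
+    .build();
+</code></pre>
+</div>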
+
+<h4 id="storage-configs">Storage configs</h4>
+<p>Controls aspects around sizing parquet and log files.</p>
+
+<ul>
+ <li><a href="#withStorageConfig">withStorageConfig</a> (HoodieStorageConfig)
<br />
+ <ul>
+ <li><a href="#limitFileSize">limitFileSize</a> (size = 120MB) <br />
+ Property: <code
class="highlighter-rouge">hoodie.parquet.max.file.size</code> <br />
+ <span style="color:grey">Target size for parquet files produced by Hudi
write phases. For DFS, this needs to be aligned with the underlying filesystem
block size for optimal performance. </span></li>
+ <li><a href="#parquetBlockSize">parquetBlockSize</a> (rowgroupsize =
120MB) <br />
+ Property: <code class="highlighter-rouge">hoodie.parquet.block.size</code>
<br />
+    <span style="color:grey">Parquet RowGroup size. It is better if this is the
same as the file size, so that a single column within a file is stored
continuously on disk</span></li>
+ <li><a href="#parquetPageSize">parquetPageSize</a> (pagesize = 1MB) <br
/>
+ Property: <code class="highlighter-rouge">hoodie.parquet.page.size</code>
<br />
    <span style="color:grey">Parquet page size. Page is the unit of read within
a parquet file. Within a block, pages are compressed separately. </span></li>
- <li><a href="#logFileMaxSize">logFileMaxSize</a> (logFileSize = 1GB)
<br />
+ <li><a href="#parquetCompressionRatio">parquetCompressionRatio</a>
(parquetCompressionRatio = 0.1) <br />
+ Property: <code
class="highlighter-rouge">hoodie.parquet.compression.ratio</code> <br />
+    <span style="color:grey">Expected compression of parquet data, used by Hudi
when it tries to size new parquet files. Increase this value if bulk_insert is
producing smaller-than-expected files</span></li>
+ <li><a href="#logFileMaxSize">logFileMaxSize</a> (logFileSize = 1GB) <br
/>
+ Property: <code class="highlighter-rouge">hoodie.logfile.max.size</code> <br
/>
<span style="color:grey">LogFile max size. This is the maximum size allowed
for a log file before it is rolled over to the next version. </span></li>
- <li><a href="#logFileDataBlockMaxSize">logFileDataBlockMaxSize</a>
(dataBlockSize = 256MB) <br />
+ <li><a href="#logFileDataBlockMaxSize">logFileDataBlockMaxSize</a>
(dataBlockSize = 256MB) <br />
+ Property: <code
class="highlighter-rouge">hoodie.logfile.data.block.max.size</code> <br />
<span style="color:grey">LogFile Data block max size. This is the maximum
size allowed for a single data block to be appended to a log file. This helps
to make sure the data appended to the log file is broken up into sizable blocks
to prevent from OOM errors. This size should be greater than the JVM memory.
</span></li>
- </ul>
- </li>
- <li><a href="#withCompactionConfig">withCompactionConfig</a>
(HoodieCompactionConfig) <br />
- <span style="color:grey">Cleaning and configurations related to compaction
techniques</span>
- <ul>
- <li><a href="#withCleanerPolicy">withCleanerPolicy</a> (policy =
KEEP_LATEST_COMMITS) <br />
- <span style="color:grey">Hoodie Cleaning policy. Hoodie will delete older
versions of parquet files to re-claim space. Any Query/Computation referring to
this version of the file will fail. It is good to make sure that the data is
retained for more than the maximum query execution time.</span></li>
- <li><a href="#retainCommits">retainCommits</a>
(no_of_commits_to_retain = 24) <br />
+ <li><a
href="#logFileToParquetCompressionRatio">logFileToParquetCompressionRatio</a>
(logFileToParquetCompressionRatio = 0.35) <br />
+ Property: <code
class="highlighter-rouge">hoodie.logfile.to.parquet.compression.ratio</code>
<br />
+    <span style="color:grey">Expected additional compression as records move
from log files to parquet. Used for merge_on_read storage to send inserts into
log files &amp; control the size of the compacted parquet file.</span></li>
+ </ul>
+ </li>
+</ul>
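+
+<p>Putting the sizing knobs above together, a storage config could be built as
+in the sketch below (sizes in bytes; method names per the anchors above):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieStorageConfig storageConfig = HoodieStorageConfig.newBuilder()
+    .limitFileSize(120 * 1024 * 1024)     // target parquet file size (120MB)
+    .parquetBlockSize(120 * 1024 * 1024)  // align row group size with file size
+    .parquetPageSize(1024 * 1024)         // 1MB pages
+    .build();
+</code></pre>
+</div>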
+
+<h4 id="compaction-configs">Compaction configs</h4>
+<p>Configs that control compaction (merging of log files onto a new parquet
base file) and cleaning (reclamation of older/unused file groups).</p>
+
+<ul>
+ <li><a href="#withCompactionConfig">withCompactionConfig</a>
(HoodieCompactionConfig) <br />
+ <ul>
+ <li><a href="#withCleanerPolicy">withCleanerPolicy</a> (policy =
KEEP_LATEST_COMMITS) <br />
+ Property: <code class="highlighter-rouge">hoodie.cleaner.policy</code> <br />
+  <span style="color:grey">Cleaning policy to be used. Hudi will delete older
versions of parquet files to reclaim space. Any query/computation referring to
this version of the file will fail. It is good to make sure that the data is
retained for more than the maximum query execution time.</span></li>
+ <li><a href="#retainCommits">retainCommits</a> (no_of_commits_to_retain
= 24) <br />
+ Property: <code
class="highlighter-rouge">hoodie.cleaner.commits.retained</code> <br />
<span style="color:grey">Number of commits to retain. So data will be
retained for num_of_commits * time_between_commits (scheduled). This also
directly translates into how much you can incrementally pull on this
dataset</span></li>
- <li><a href="#archiveCommitsWith">archiveCommitsWith</a> (minCommits
= 96, maxCommits = 128) <br />
- <span style="color:grey">Each commit is a small file in the .hoodie
directory. Since HDFS is not designed to handle multiple small files, hoodie
archives older commits into a sequential log. A commit is published atomically
by a rename of the commit file.</span></li>
- <li><a href="#compactionSmallFileSize">compactionSmallFileSize</a>
(size = 0) <br />
- <span style="color:grey">Small files can always happen because of the number
of insert records in a paritition in a batch. Hoodie has an option to
auto-resolve small files by masking inserts into this partition as updates to
existing small files. The size here is the minimum file size considered as a
“small file size”. This should be less < maxFileSize and setting it to 0,
turns off this feature. </span></li>
- <li><a href="#insertSplitSize">insertSplitSize</a> (size = 500000)
<br />
+ <li><a href="#archiveCommitsWith">archiveCommitsWith</a> (minCommits =
96, maxCommits = 128) <br />
+ Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>,
<code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
+ <span style="color:grey">Each commit is a small file in the <code
class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does
not favor lots of small files, Hudi archives older commits into a sequential
log. A commit is published atomically by a rename of the commit
file.</span></li>
+ <li><a href="#compactionSmallFileSize">compactionSmallFileSize</a> (size
= 0) <br />
+ Property: <code
class="highlighter-rouge">hoodie.parquet.small.file.limit</code> <br />
+  <span style="color:grey">This should be less than maxFileSize; setting it
to 0 turns off this feature. Small files can always happen because of the
number of insert records in a partition in a batch. Hudi has an option to
auto-resolve small files by masking inserts into this partition as updates to
existing small files. The size here is the minimum file size considered as a
“small file size”.</span></li>
+ <li><a href="#insertSplitSize">insertSplitSize</a> (size = 500000) <br />
+ Property: <code
class="highlighter-rouge">hoodie.copyonwrite.insert.split.size</code> <br />
    <span style="color:grey">Insert write parallelism. Number of inserts grouped
for a single partition. Writing out 100MB files, with at least 1KB records,
means 100K records per file. Default is to overprovision to 500K. To improve
insert latency, tune this to match the number of records in a single file.
Setting this to a low number will result in small files (particularly when
compactionSmallFileSize is 0)</span></li>
- <li><a href="#autoTuneInsertSplits">autoTuneInsertSplits</a> (true)
<br />
- <span style="color:grey">Should hoodie dynamically compute the
insertSplitSize based on the last 24 commit’s metadata. Turned off by default.
</span></li>
- <li><a href="#approxRecordSize">approxRecordSize</a> () <br />
- <span style="color:grey">The average record size. If specified, hoodie will
use this and not compute dynamically based on the last 24 commit’s metadata. No
value set as default. This is critical in computing the insert parallelism and
bin-packing inserts into small files. See above.</span></li>
- <li><a
href="#withCompactionLazyBlockReadEnabled">withCompactionLazyBlockReadEnabled</a>
(true) <br />
- <span style="color:grey">When a CompactedLogScanner merges all log files,
this config helps to choose whether the logblocks should be read lazily or not.
Choose true to use I/O intensive lazy block reading (low memory usage) or false
for Memory intensive immediate block read (high memory usage)</span></li>
- <li><a
href="#withMaxNumDeltaCommitsBeforeCompaction">withMaxNumDeltaCommitsBeforeCompaction</a>
(maxNumDeltaCommitsBeforeCompaction = 10) <br />
+ <li><a href="#autoTuneInsertSplits">autoTuneInsertSplits</a> (true) <br
/>
+ Property: <code
class="highlighter-rouge">hoodie.copyonwrite.insert.auto.split</code> <br />
+  <span style="color:grey">Should Hudi dynamically compute the insertSplitSize
based on the last 24 commits’ metadata. Turned off by default. </span></li>
+ <li><a href="#approxRecordSize">approxRecordSize</a> () <br />
+ Property: <code
class="highlighter-rouge">hoodie.copyonwrite.record.size.estimate</code> <br />
+  <span style="color:grey">The average record size. If specified, Hudi will
use this and not compute dynamically based on the last 24 commits’ metadata. No
value is set as default. This is critical in computing the insert parallelism and
bin-packing inserts into small files. See above.</span></li>
+ <li><a href="#withInlineCompaction">withInlineCompaction</a>
(inlineCompaction = false) <br />
+ Property: <code class="highlighter-rouge">hoodie.compact.inline</code> <br />
+ <span style="color:grey">When set to true, compaction is triggered by the
ingestion itself, right after a commit/deltacommit action as part of
insert/upsert/bulk_insert</span></li>
+ <li><a
href="#withMaxNumDeltaCommitsBeforeCompaction">withMaxNumDeltaCommitsBeforeCompaction</a>
(maxNumDeltaCommitsBeforeCompaction = 10) <br />
+ Property: <code
class="highlighter-rouge">hoodie.compact.inline.max.delta.commits</code> <br />
<span style="color:grey">Number of max delta commits to keep before
triggering an inline compaction</span></li>
- <li><a
href="#withCompactionReverseLogReadEnabled">withCompactionReverseLogReadEnabled</a>
(false) <br />
+ <li><a
href="#withCompactionLazyBlockReadEnabled">withCompactionLazyBlockReadEnabled</a>
(true) <br />
+ Property: <code
class="highlighter-rouge">hoodie.compaction.lazy.block.read</code> <br />
+  <span style="color:grey">When a CompactedLogScanner merges all log files,
this config helps to choose whether the log blocks should be read lazily.
Choose true for I/O-intensive lazy block reading (low memory usage) or false
for memory-intensive immediate block reading (high memory usage)</span></li>
+ <li><a
href="#withCompactionReverseLogReadEnabled">withCompactionReverseLogReadEnabled</a>
(false) <br />
+ Property: <code
class="highlighter-rouge">hoodie.compaction.reverse.log.read</code> <br />
<span style="color:grey">HoodieLogFormatReader reads a logfile in the
forward direction starting from pos=0 to pos=file_length. If this config is set
to true, the Reader reads the logfile in reverse direction, from
pos=file_length to pos=0</span></li>
- </ul>
- </li>
- <li><a href="#withMetricsConfig">withMetricsConfig</a>
(HoodieMetricsConfig) <br />
- <span style="color:grey">Hoodie publishes metrics on every commit, clean,
rollback etc.</span>
- <ul>
- <li><a href="#on">on</a> (true) <br />
+ <li><a href="#withCleanerParallelism">withCleanerParallelism</a>
(cleanerParallelism = 200) <br />
+ Property: <code class="highlighter-rouge">hoodie.cleaner.parallelism</code>
<br />
+ <span style="color:grey">Increase this if cleaning becomes slow.</span></li>
+ <li><a href="#withCompactionStrategy">withCompactionStrategy</a>
(compactionStrategy =
com.uber.hoodie.io.compact.strategy.LogFileSizeBasedCompactionStrategy) <br />
+ Property: <code class="highlighter-rouge">hoodie.compaction.strategy</code>
<br />
+  <span style="color:grey">Compaction strategy decides which file groups are
picked up for compaction during each compaction run. By default, Hudi picks the
log file with the most accumulated unmerged data</span></li>
+ <li><a
href="#withTargetIOPerCompactionInMB">withTargetIOPerCompactionInMB</a>
(targetIOPerCompactionInMB = 500000) <br />
+ Property: <code class="highlighter-rouge">hoodie.compaction.target.io</code>
<br />
+  <span style="color:grey">Amount of MBs to spend during a compaction run for
the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion
latency while compaction is run in inline mode.</span></li>
+ <li><a
href="#withTargetPartitionsPerDayBasedCompaction">withTargetPartitionsPerDayBasedCompaction</a>
(targetPartitionsPerCompaction = 10) <br />
+ Property: <code
class="highlighter-rouge">hoodie.compaction.daybased.target</code> <br />
+ <span style="color:grey">Used by
com.uber.hoodie.io.compact.strategy.DayBasedCompactionStrategy to denote the
number of latest partitions to compact during a compaction run.</span></li>
+ <li><a href="#payloadClassName">withPayloadClass</a> (payloadClassName =
com.uber.hoodie.common.model.HoodieAvroPayload) <br />
+ Property: <code
class="highlighter-rouge">hoodie.compaction.payload.class</code> <br />
+  <span style="color:grey">This needs to be the same as the class used during
inserts/upserts. Just like writing, compaction also uses the record payload
class to merge records in the log against each other, merge again with the base
file and produce the final record to be written after compaction.</span></li>
+ </ul>
+ </li>
+</ul>
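+
+<p>For instance, a merge_on_read dataset could bound compaction frequency and
+retention with the builder methods above. This is only a sketch: the
+HoodieCleaningPolicy enum name is assumed and the values are illustrative, not
+recommendations.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>HoodieCompactionConfig compactionConfig = HoodieCompactionConfig.newBuilder()
+    .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+    .retainCommits(24)                            // also bounds incremental pull window
+    .withInlineCompaction(true)                   // compact right after each deltacommit
+    .withMaxNumDeltaCommitsBeforeCompaction(10)
+    .build();
+</code></pre>
+</div>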
+
+<h4 id="metrics-configs">Metrics configs</h4>
+<p>Enables reporting of Hudi metrics to Graphite.</p>
+
+<ul>
+ <li><a href="#withMetricsConfig">withMetricsConfig</a> (HoodieMetricsConfig)
<br />
+<span style="color:grey">Hudi publishes metrics on every commit, clean,
rollback etc.</span>
+ <ul>
+ <li><a href="#on">on</a> (metricsOn = true) <br />
+ Property: <code class="highlighter-rouge">hoodie.metrics.on</code> <br />
<span style="color:grey">Turn sending metrics on/off. On by
default.</span></li>
- <li><a href="#withReporterType">withReporterType</a> (GRAPHITE) <br
/>
+ <li><a href="#withReporterType">withReporterType</a> (reporterType =
GRAPHITE) <br />
+ Property: <code
class="highlighter-rouge">hoodie.metrics.reporter.type</code> <br />
<span style="color:grey">Type of metrics reporter. Graphite is the default
and the only value supported.</span></li>
- <li><a href="#toGraphiteHost">toGraphiteHost</a> () <br />
+ <li><a href="#toGraphiteHost">toGraphiteHost</a> (host = localhost) <br
/>
+ Property: <code
class="highlighter-rouge">hoodie.metrics.graphite.host</code> <br />
<span style="color:grey">Graphite host to connect to</span></li>
- <li><a href="#onGraphitePort">onGraphitePort</a> () <br />
+ <li><a href="#onGraphitePort">onGraphitePort</a> (port = 4756) <br />
+ Property: <code
class="highlighter-rouge">hoodie.metrics.graphite.port</code> <br />
<span style="color:grey">Graphite port to connect to</span></li>
- <li><a href="#usePrefix">usePrefix</a> () <br />
- <span style="color:grey">Standard prefix for all metrics</span></li>
- </ul>
- </li>
- <li><a href="#withMemoryConfig">withMemoryConfig</a>
(HoodieMemoryConfig) <br />
- <span style="color:grey">Memory related configs</span>
- <ul>
- <li><a
href="#withMaxMemoryFractionPerPartitionMerge">withMaxMemoryFractionPerPartitionMerge</a>
(maxMemoryFractionPerPartitionMerge = 0.6) <br />
- <span style="color:grey">This fraction is multiplied with the user memory
fraction (1 - spark.memory.fraction) to get a final fraction of heap space to
use during merge </span></li>
- <li><a
href="#withMaxMemorySizePerCompactionInBytes">withMaxMemorySizePerCompactionInBytes</a>
(maxMemorySizePerCompactionInBytes = 1GB) <br />
- <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts
records to HoodieRecords and then merges these log blocks and records. At any
point, the number of entries in a log block can be less than or equal to the
number of entries in the corresponding parquet file. This can lead to OOM in
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use
this config to set the max allowable inMemory footprint of the spillable
map.</span></li>
- </ul>
- </li>
- <li>
- <p><a href="s3_hoodie.html">S3Configs</a> (Hoodie S3 Configs) <br />
- <span style="color:grey">Configurations required for S3 and Hoodie
co-operability.</span></p>
- </li>
- <li><a href="gcs_hoodie.html">GCSConfigs</a> (Hoodie GCS Configs) <br />
- <span style="color:grey">Configurations required for GCS and Hoodie
co-operability.</span></li>
+ <li><a href="#usePrefix">usePrefix</a> (prefix = “”) <br />
+ Property: <code
class="highlighter-rouge">hoodie.metrics.graphite.metric.prefix</code> <br />
+ <span style="color:grey">Standard prefix applied to all metrics. This helps
to add e.g. datacenter or environment information.</span></li>
</ul>
</li>
- <li><a href="#datasource">Hoodie Datasource</a> <br />
-<span style="color:grey">Configs for datasource</span>
+</ul>
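The metrics options above can likewise be set through the builder. A minimal sketch under the same assumptions as before: the method names come from the option list above, while the `MetricsReporterType` enum name and package locations are assumptions.

```java
// Sketch only: builder method names taken from the list above; the enum
// type and hostname are illustrative assumptions.
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withMetricsConfig(HoodieMetricsConfig.newBuilder()
        .on(true)                                       // hoodie.metrics.on
        .withReporterType(MetricsReporterType.GRAPHITE) // hoodie.metrics.reporter.type
        .toGraphiteHost("graphite.example.com")         // hoodie.metrics.graphite.host
        .onGraphitePort(4756)                           // hoodie.metrics.graphite.port
        .usePrefix("prod.hudi")                         // hoodie.metrics.graphite.metric.prefix
        .build())
    .build();
```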
+
+<h4 id="memory-configs">Memory configs</h4>
+<p>Controls memory usage for compaction and merges, performed internally by
Hudi.</p>
+
+<ul>
+ <li><a href="#withMemoryConfig">withMemoryConfig</a> (HoodieMemoryConfig)
<br />
+<span style="color:grey">Memory related configs</span>
<ul>
- <li><a href="#writeoptions">write options</a> (write.format.option(…))
<br />
- <span style="color:grey"> Options useful for writing datasets </span>
- <ul>
- <li><a href="#OPERATION_OPT_KEY">OPERATION_OPT_KEY</a> (Default:
upsert) <br />
- <span style="color:grey">whether to do upsert, insert or bulkinsert for the
write operation</span></li>
- <li><a href="#STORAGE_TYPE_OPT_KEY">STORAGE_TYPE_OPT_KEY</a>
(Default: COPY_ON_WRITE) <br />
- <span style="color:grey">The storage type for the underlying data, for this
write. This can’t change between writes.</span></li>
- <li><a href="#TABLE_NAME_OPT_KEY">TABLE_NAME_OPT_KEY</a> (Default:
None (mandatory)) <br />
- <span style="color:grey">Hive table name, to register the dataset
into.</span></li>
- <li><a href="#PRECOMBINE_FIELD_OPT_KEY">PRECOMBINE_FIELD_OPT_KEY</a>
(Default: ts) <br />
- <span style="color:grey">Field used in preCombining before actual write.
When two records have the same key value,
- we will pick the one with the largest value for the precombine field,
determined by Object.compareTo(..)</span></li>
- <li><a href="#PAYLOAD_CLASS_OPT_KEY">PAYLOAD_CLASS_OPT_KEY</a>
(Default: com.uber.hoodie.OverwriteWithLatestAvroPayload) <br />
- <span style="color:grey">Payload class used. Override this, if you like to
roll your own merge logic, when upserting/inserting.
- This will render any value set for <code
class="highlighter-rouge">PRECOMBINE_FIELD_OPT_VAL</code>
in-effective</span></li>
- <li><a href="#RECORDKEY_FIELD_OPT_KEY">RECORDKEY_FIELD_OPT_KEY</a>
(Default: uuid) <br />
- <span style="color:grey">Record key field. Value to be used as the <code
class="highlighter-rouge">recordKey</code> component of <code
class="highlighter-rouge">HoodieKey</code>. Actual value
- will be obtained by invoking .toString() on the field value. Nested fields
can be specified using
- the dot notation eg: <code class="highlighter-rouge">a.b.c</code></span></li>
- <li><a
href="#PARTITIONPATH_FIELD_OPT_KEY">PARTITIONPATH_FIELD_OPT_KEY</a> (Default:
partitionpath) <br />
- <span style="color:grey">Partition path field. Value to be used at the <code
class="highlighter-rouge">partitionPath</code> component of <code
class="highlighter-rouge">HoodieKey</code>.
- Actual value ontained by invoking .toString()</span></li>
- <li><a
href="#KEYGENERATOR_CLASS_OPT_KEY">KEYGENERATOR_CLASS_OPT_KEY</a> (Default:
com.uber.hoodie.SimpleKeyGenerator) <br />
- <span style="color:grey">Key generator class, that implements will extract
the key out of incoming <code class="highlighter-rouge">Row</code>
object</span></li>
- <li><a
href="#COMMIT_METADATA_KEYPREFIX_OPT_KEY">COMMIT_METADATA_KEYPREFIX_OPT_KEY</a>
(Default: <code class="highlighter-rouge">_</code>) <br />
- <span style="color:grey">Option keys beginning with this prefix, are
automatically added to the commit/deltacommit metadata.
- This is useful to store checkpointing information, in a consistent way with
the hoodie timeline</span></li>
- </ul>
- </li>
- <li><a href="#readoptions">read options</a> (read.format.option(…)) <br
/>
- <span style="color:grey">Options useful for reading datasets</span>
- <ul>
- <li><a href="#VIEW_TYPE_OPT_KEY">VIEW_TYPE_OPT_KEY</a> (Default: =
read_optimized) <br />
- <span style="color:grey">Whether data needs to be read, in incremental mode
(new data since an instantTime)
- (or) Read Optimized mode (obtain latest view, based on columnar data)
- (or) Real time mode (obtain latest view, based on row & columnar
data)</span></li>
- <li><a
href="#BEGIN_INSTANTTIME_OPT_KEY">BEGIN_INSTANTTIME_OPT_KEY</a> (Default: None
(Mandatory in incremental mode)) <br />
- <span style="color:grey">Instant time to start incrementally pulling data
from. The instanttime here need not
- necessarily correspond to an instant on the timeline. New data written with
an
- <code class="highlighter-rouge">instant_time > BEGIN_INSTANTTIME</code>
are fetched out. For e.g: ‘20170901080000’ will get
- all new data written after Sep 1, 2017 08:00AM.</span></li>
- <li><a href="#END_INSTANTTIME_OPT_KEY">END_INSTANTTIME_OPT_KEY</a>
(Default: latest instant (i.e fetches all new data since begin instant time))
<br />
- <span style="color:grey"> Instant time to limit incrementally fetched data
to. New data written with an
- <code class="highlighter-rouge">instant_time <= END_INSTANTTIME</code>
are fetched out.</span></li>
- </ul>
- </li>
+ <li><a
href="#withMaxMemoryFractionPerPartitionMerge">withMaxMemoryFractionPerPartitionMerge</a>
(maxMemoryFractionPerPartitionMerge = 0.6) <br />
+ Property: <code
class="highlighter-rouge">hoodie.memory.merge.fraction</code> <br />
+ <span style="color:grey">This fraction is multiplied with the user memory
fraction (1 - spark.memory.fraction) to get a final fraction of heap space to
use during merge </span></li>
+ <li><a
href="#withMaxMemorySizePerCompactionInBytes">withMaxMemorySizePerCompactionInBytes</a>
(maxMemorySizePerCompactionInBytes = 1GB) <br />
+ Property: <code
class="highlighter-rouge">hoodie.memory.compaction.fraction</code> <br />
+ <span style="color:grey">HoodieCompactedLogScanner reads log blocks, converts
records to HoodieRecords and then merges these log blocks and records. At any
point, the number of entries in a log block can be less than or equal to the
number of entries in the corresponding parquet file. This can lead to OOM in
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use
this config to set the max allowable in-memory footprint of the spillable
map.</span></li>
</ul>
</li>
</ul>
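The two memory options above combine the same way. A minimal sketch, again using only the builder method names listed above (package locations assumed):

```java
// Sketch only: assumes the com.uber.hoodie client jars are on the classpath.
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withMemoryConfig(HoodieMemoryConfig.newBuilder()
        // hoodie.memory.merge.fraction: share of user memory usable during merge
        .withMaxMemoryFractionPerPartitionMerge(0.6)
        // hoodie.memory.compaction.fraction: cap the spillable map's
        // in-memory footprint during compaction (1GB here)
        .withMaxMemorySizePerCompactionInBytes(1024L * 1024 * 1024)
        .build())
    .build();
```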
@@ -519,14 +709,11 @@
<p>Writing data via Hudi happens as a Spark job and thus the general rules of
Spark debugging apply here too. Below is a list of things to keep in mind, if
you are looking to improve performance or reliability.</p>
-<p><strong>Write operations</strong> : Use <code
class="highlighter-rouge">bulkinsert</code> to load new data into a table, and
there on use <code class="highlighter-rouge">upsert</code>/<code
class="highlighter-rouge">insert</code>.
- Difference between them is that bulk insert uses a disk based write path to
scale to load large inputs without need to cache it.</p>
-
-<p><strong>Input Parallelism</strong> : By default, Hoodie tends to
over-partition input (i.e <code
class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark
partition stays within the 2GB limit for inputs upto 500GB. Bump this up
accordingly if you have larger inputs. We recommend having shuffle parallelism
<code
class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code>
such that its atleast input_data_size/500MB</p>
+<p><strong>Input Parallelism</strong> : By default, Hudi tends to
over-partition input (i.e. <code
class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark
partition stays within the 2GB limit for inputs up to 500GB. Bump this up
accordingly if you have larger inputs. We recommend setting the shuffle
parallelism <code
class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code>
such that it is at least input_data_size/500MB</p>
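The input_data_size/500MB rule of thumb above can be computed mechanically. A small self-contained sketch (not part of Hudi; the class and method names are illustrative):

```java
public class ShuffleParallelism {
    // Recommended shuffle parallelism per the rule of thumb above:
    // at least inputBytes / 500MB, rounded up, with a floor of 1.
    public static long recommended(long inputBytes) {
        long targetPartitionBytes = 500L * 1024 * 1024; // 500MB per Spark partition
        return Math.max(1, (inputBytes + targetPartitionBytes - 1) / targetPartitionBytes);
    }

    public static void main(String[] args) {
        long oneTB = 1024L * 1024 * 1024 * 1024;
        System.out.println(recommended(oneTB)); // ceil(1TB / 500MB) = 2098
    }
}
```

For a 1TB input this suggests setting `hoodie.upsert.shuffle.parallelism` to roughly 2100.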
-<p><strong>Off-heap memory</strong> : Hoodie writes parquet files and that
needs good amount of off-heap memory proportional to schema width. Consider
setting something like <code
class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code
class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are
running into such failures.</p>
+<p><strong>Off-heap memory</strong> : Hudi writes parquet files and that needs
a good amount of off-heap memory proportional to schema width. Consider setting
something like <code
class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code
class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are
running into such failures.</p>
-<p><strong>Spark Memory</strong> : Typically, hoodie needs to be able to read
a single file into memory to perform merges or compactions and thus the
executor memory should be sufficient to accomodate this. In addition, Hoodie
caches the input to be able to intelligently place data and thus leaving some
<code class="highlighter-rouge">spark.storage.memoryFraction</code> will
generally help boost performance.</p>
+<p><strong>Spark Memory</strong> : Typically, Hudi needs to be able to read a
single file into memory to perform merges or compactions and thus the executor
memory should be sufficient to accommodate this. In addition, Hudi caches the
input to be able to intelligently place data and thus leaving some <code
class="highlighter-rouge">spark.storage.memoryFraction</code> will generally
help boost performance.</p>
<p><strong>Sizing files</strong> : Set <code
class="highlighter-rouge">limitFileSize</code> above judiciously, to balance
ingest/write latency vs number of files & consequently metadata overhead
associated with it.</p>
diff --git a/content/contributing.html b/content/contributing.html
index 1901952..9c9e61d 100644
--- a/content/contributing.html
+++ b/content/contributing.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" developer setup">
+<meta name="keywords" content="hudi, ide, developer, setup">
<title>Developer Setup | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -380,6 +384,8 @@ have an open source license <a
href="https://www.apache.org/legal/resolved.html#
<li>Add adequate tests for your new functionality</li>
<li>[Optional] For involved changes, it's best to also run the entire
integration test suite using <code class="highlighter-rouge">mvn clean
install</code></li>
<li>For website changes, please build the site locally & test
navigation, formatting & links thoroughly</li>
+ <li>If your code change affects some aspect of documentation (e.g. a new
config, or a default value change),
+please ensure there is another PR to <a
href="https://github.com/apache/incubator-hudi/blob/asf-site/docs/README.md">update
the docs</a> as well.</li>
</ul>
</li>
<li>Format commit messages and the pull request title like <code
class="highlighter-rouge">[HUDI-XXX] Fixes bug in Spark Datasource</code>,
diff --git a/content/css/customstyles.css b/content/css/customstyles.css
index d6667a5..56dcdba 100644
--- a/content/css/customstyles.css
+++ b/content/css/customstyles.css
@@ -1,5 +1,5 @@
body {
- font-size:15px;
+ font-size:14px;
}
.bs-callout {
@@ -607,7 +607,7 @@ a.fa.fa-envelope-o.mailto {
font-weight: 600;
}
-h3 {color: #ED1951; font-weight:normal; font-size:130%;}
+h3 {color: #545253; font-weight:normal; font-size:130%;}
h4 {color: #808080; font-weight:normal; font-size:120%; font-style:italic;}
.alert, .callout {
diff --git a/content/css/theme-blue.css b/content/css/theme-blue.css
index 9a923ef..46fbd0d 100644
--- a/content/css/theme-blue.css
+++ b/content/css/theme-blue.css
@@ -5,7 +5,7 @@
}
-h3 {color: #ED1951; }
+h3 {color: #545253; }
h4 {color: #808080; }
.nav-tabs > li.active > a, .nav-tabs > li.active > a:hover, .nav-tabs >
li.active > a:focus {
diff --git a/content/feed.xml b/content/feed.xml
index b21704e..cd76d50 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
<description>Apache Hudi (pronounced “Hoodie”) provides upserts and
incremental processing capabilities on Big Data</description>
<link>http://0.0.0.0:4000/</link>
<atom:link href="http://0.0.0.0:4000/feed.xml" rel="self"
type="application/rss+xml"/>
- <pubDate>Mon, 25 Feb 2019 20:49:33 +0000</pubDate>
- <lastBuildDate>Mon, 25 Feb 2019 20:49:33 +0000</lastBuildDate>
+ <pubDate>Sat, 09 Mar 2019 21:08:53 +0000</pubDate>
+ <lastBuildDate>Sat, 09 Mar 2019 21:08:53 +0000</lastBuildDate>
<generator>Jekyll v3.3.1</generator>
<item>
@@ -25,7 +25,7 @@
<item>
<title>Connect with us at Strata San Jose March 2017</title>
- <description><p>We will be presenting Hoodie &amp;
general concepts around how incremental processing works at Uber.
+ <description><p>We will be presenting Hudi &amp; general
concepts around how incremental processing works at Uber.
Catch our talk <strong>“Incremental Processing on Hadoop At
Uber”</strong></p>
</description>
diff --git a/content/gcs_hoodie.html b/content/gcs_hoodie.html
index f90992d..cb96011 100644
--- a/content/gcs_hoodie.html
+++ b/content/gcs_hoodie.html
@@ -4,8 +4,8 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="In this page, we go over how to configure
hudi with Google Cloud Storage.">
-<meta name="keywords" content=" sql hive gcs spark presto">
-<title>GCS Filesystem (experimental) | Hudi</title>
+<meta name="keywords" content="hudi, hive, google cloud, storage, spark,
presto">
+<title>GCS Filesystem | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -158,7 +162,7 @@
- <a class="email" title="Submit feedback" href="#"
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi
Documentation feedback&body=I have some feedback about the GCS Filesystem
(experimental) page: ' + window.location.href;"><i class="fa
fa-envelope-o"></i> Feedback</a>
+ <a class="email" title="Submit feedback" href="#"
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi
Documentation feedback&body=I have some feedback about the GCS Filesystem page:
' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
<li>
@@ -176,7 +180,7 @@
searchInput:
document.getElementById('search-input'),
resultsContainer:
document.getElementById('results-container'),
dataSource: 'search.json',
- searchResultTemplate: '<li><a href="{url}"
title="GCS Filesystem (experimental)">{title}</a></li>',
+ searchResultTemplate: '<li><a href="{url}"
title="GCS Filesystem">{title}</a></li>',
noResultsText: 'No results found.',
limit: 10,
fuzzy: true,
@@ -327,7 +331,7 @@
<!-- Content Column -->
<div class="col-md-9">
<div class="post-header">
- <h1 class="post-title-main">GCS Filesystem (experimental)</h1>
+ <h1 class="post-title-main">GCS Filesystem</h1>
</div>
@@ -343,7 +347,7 @@
- <p>Hudi works with HDFS by default and GCS <strong>regional</strong> buckets
provide an HDFS API with strong consistency.</p>
+ <p>For Hudi storage on GCS, <strong>regional</strong> buckets provide a DFS
API with strong consistency.</p>
<h2 id="gcs-configs">GCS Configs</h2>
diff --git a/content/images/hoodie_commit_duration.png
b/content/images/hudi_commit_duration.png
similarity index 100%
rename from content/images/hoodie_commit_duration.png
rename to content/images/hudi_commit_duration.png
diff --git a/content/images/hoodie_intro_1.png b/content/images/hudi_intro_1.png
similarity index 100%
rename from content/images/hoodie_intro_1.png
rename to content/images/hudi_intro_1.png
diff --git a/content/images/hoodie_log_format_v2.png
b/content/images/hudi_log_format_v2.png
similarity index 100%
rename from content/images/hoodie_log_format_v2.png
rename to content/images/hudi_log_format_v2.png
diff --git a/content/images/hoodie_query_perf_hive.png
b/content/images/hudi_query_perf_hive.png
similarity index 100%
rename from content/images/hoodie_query_perf_hive.png
rename to content/images/hudi_query_perf_hive.png
diff --git a/content/images/hoodie_query_perf_presto.png
b/content/images/hudi_query_perf_presto.png
similarity index 100%
rename from content/images/hoodie_query_perf_presto.png
rename to content/images/hudi_query_perf_presto.png
diff --git a/content/images/hoodie_query_perf_spark.png
b/content/images/hudi_query_perf_spark.png
similarity index 100%
rename from content/images/hoodie_query_perf_spark.png
rename to content/images/hudi_query_perf_spark.png
diff --git a/content/images/hoodie_upsert_dag.png
b/content/images/hudi_upsert_dag.png
similarity index 100%
rename from content/images/hoodie_upsert_dag.png
rename to content/images/hudi_upsert_dag.png
diff --git a/content/images/hoodie_upsert_perf1.png
b/content/images/hudi_upsert_perf1.png
similarity index 100%
rename from content/images/hoodie_upsert_perf1.png
rename to content/images/hudi_upsert_perf1.png
diff --git a/content/images/hoodie_upsert_perf2.png
b/content/images/hudi_upsert_perf2.png
similarity index 100%
rename from content/images/hoodie_upsert_perf2.png
rename to content/images/hudi_upsert_perf2.png
diff --git a/content/implementation.html b/content/implementation.html
index d649a70..e524ec6 100644
--- a/content/implementation.html
+++ b/content/implementation.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" implementation">
+<meta name="keywords" content="hudi, index, storage, compaction, cleaning,
implementation">
<title>Implementation | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -347,7 +351,7 @@ Hudi upsert/insert is merely a Spark DAG, that can be
broken into two big pieces
<ul>
<li>
- <p><strong>Indexing</strong> : A big part of Hoodie’s efficiency comes
from indexing the mapping from record keys to the file ids, to which they
belong to.
+ <p><strong>Indexing</strong> : A big part of Hudi’s efficiency comes from
indexing the mapping from record keys to the file ids to which they belong.
This index also helps the <code
class="highlighter-rouge">HoodieWriteClient</code> separate upserted records
into inserts and updates, so they can be treated differently.
<code class="highlighter-rouge">HoodieReadClient</code> supports operations
such as <code class="highlighter-rouge">filterExists</code> (used for
de-duplication of table) and an efficient batch <code
class="highlighter-rouge">read(keys)</code> api, that
can read out the records corresponding to the keys using the index much more
quickly than a typical scan via a query. The index is also atomically
@@ -406,7 +410,7 @@ Any remaining records after that, are again packed into new
file id groups, agai
<p>In the case of Copy-On-Write, a single parquet file constitutes one <code
class="highlighter-rouge">file slice</code> which contains one complete version
of
the file</p>
-<figure><img class="docimage" src="images/hoodie_log_format_v2.png"
alt="hoodie_log_format_v2.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_log_format_v2.png"
alt="hudi_log_format_v2.png" style="max-width: 1000px" /></figure>
<h4 id="merge-on-read">Merge On Read</h4>
@@ -575,7 +579,7 @@ incremental ingestion (writer at DC6) happened before the
compaction (some time
The below description is with regards to compaction from file-group
perspective.
<ul>
<li><code class="highlighter-rouge">Reader querying at time between
ingestion completion time for DC6 and compaction finish “Tc”</code>:
-Hoodie’s implementation will be changed to become aware of file-groups
currently waiting for compaction and
+Hudi’s implementation will be changed to become aware of file-groups currently
waiting for compaction and
merge log-files corresponding to DC2-DC6 with the base-file corresponding to
SC1. In essence, Hudi will create
a pseudo file-slice by combining the 2 file-slices starting at base-commits
SC1 and SC5 to one.
For file-groups not waiting for compaction, the reader behavior is essentially
the same - read latest file-slice
@@ -602,12 +606,12 @@ the conventional alternatives for achieving these
tasks.</p>
<p>Following shows the speed up obtained for NoSQL ingestion, by switching
from bulk loads off HBase to Parquet to incrementally upserting
on a Hudi dataset, on 5 tables ranging from small to huge.</p>
-<figure><img class="docimage" src="images/hoodie_upsert_perf1.png"
alt="hoodie_upsert_perf1.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_perf1.png"
alt="hudi_upsert_perf1.png" style="max-width: 1000px" /></figure>
<p>Given Hudi can build the dataset incrementally, it opens doors for also
scheduling ingesting more frequently thus reducing latency, with
significant savings on the overall compute cost.</p>
-<figure><img class="docimage" src="images/hoodie_upsert_perf2.png"
alt="hoodie_upsert_perf2.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_upsert_perf2.png"
alt="hudi_upsert_perf2.png" style="max-width: 1000px" /></figure>
<p>Hudi upserts have been stress tested up to 4TB in a single commit across the
t1 table.</p>
@@ -618,15 +622,15 @@ with no impact on queries. Following charts compare the
Hudi vs non-Hudi dataset
<p><strong>Hive</strong></p>
-<figure><img class="docimage" src="images/hoodie_query_perf_hive.png"
alt="hoodie_query_perf_hive.png" style="max-width: 800px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_hive.png"
alt="hudi_query_perf_hive.png" style="max-width: 800px" /></figure>
<p><strong>Spark</strong></p>
-<figure><img class="docimage" src="images/hoodie_query_perf_spark.png"
alt="hoodie_query_perf_spark.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_spark.png"
alt="hudi_query_perf_spark.png" style="max-width: 1000px" /></figure>
<p><strong>Presto</strong></p>
-<figure><img class="docimage" src="images/hoodie_query_perf_presto.png"
alt="hoodie_query_perf_presto.png" style="max-width: 1000px" /></figure>
+<figure><img class="docimage" src="images/hudi_query_perf_presto.png"
alt="hudi_query_perf_presto.png" style="max-width: 1000px" /></figure>
diff --git a/content/incremental_processing.html
b/content/incremental_processing.html
index a694881..c487368 100644
--- a/content/incremental_processing.html
+++ b/content/incremental_processing.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="In this page, we will discuss some available
tools for ingesting data incrementally & consuming the changes.">
-<meta name="keywords" content=" incremental processing">
+<meta name="keywords" content="hudi, incremental, batch, stream, processing,
Hive, ETL, Spark SQL">
<title>Incremental Processing | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -349,7 +353,7 @@ discusses a few tools that can be used to achieve these on
different contexts.</
<h2 id="incremental-ingestion">Incremental Ingestion</h2>
-<p>Following means can be used to apply a delta or an incremental change to a
Hudi dataset. For e.g, the incremental changes could be from a Kafka topic or
files uploaded to HDFS or
+<p>The following means can be used to apply a delta or an incremental change
to a Hudi dataset. For example, the incremental changes could be from a Kafka
topic, or files uploaded to DFS, or
even changes pulled from another Hudi dataset.</p>
<h4 id="deltastreamer-tool">DeltaStreamer Tool</h4>
@@ -360,9 +364,10 @@ from different sources such as DFS or Kafka.</p>
<p>The tool is a Spark job (part of hoodie-utilities) that provides the
following functionality</p>
<ul>
- <li>Ability to consume new events from Kafka, incremental imports from Sqoop
or output of <code class="highlighter-rouge">HiveIncrementalPuller</code> or
files under a folder on HDFS</li>
+ <li>Ability to consume new events from Kafka, incremental imports from Sqoop
or output of <code class="highlighter-rouge">HiveIncrementalPuller</code> or
files under a folder on DFS</li>
<li>Support for json, avro or custom payload types for the incoming data</li>
- <li>New data is written to a Hudi dataset, with support for checkpointing
& schemas and registered onto Hive</li>
+ <li>Pick up avro schemas from DFS or Confluent <a
href="https://github.com/confluentinc/schema-registry">schema registry</a>.</li>
+ <li>New data is written to a Hudi dataset, with support for checkpointing
and registered onto Hive</li>
</ul>
<p>Command line options describe capabilities in more detail (first build
hoodie-utilities using <code class="highlighter-rouge">mvn clean
package</code>).</p>
@@ -423,10 +428,10 @@ Usage: <main class> [options]
* --target-table
name of the target table in Hive
--transformer-class
- subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to
- transform raw source dataset to a target dataset (conforming to target
- schema) before writing. Default : Not set. E:g -
- com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
+ subclass of com.uber.hoodie.utilities.transform.Transformer. UDF to
+ transform raw source dataset to a target dataset (conforming to target
+ schema) before writing. Default : Not set. E:g -
+ com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
allows a SQL query template to be passed as a transformation function)
</code></pre>
@@ -453,7 +458,7 @@ provided under <code
class="highlighter-rouge">hoodie-utilities/src/test/resourc
</code></pre>
</div>
-<p>In some cases, you may want to convert your existing dataset into Hoodie,
before you can begin ingesting new data. This can be accomplished using the
<code class="highlighter-rouge">hdfsparquetimport</code> command on the <code
class="highlighter-rouge">hoodie-cli</code>.
+<p>In some cases, you may want to convert your existing dataset into Hudi,
before you can begin ingesting new data. This can be accomplished using the
<code class="highlighter-rouge">hdfsparquetimport</code> command on the <code
class="highlighter-rouge">hoodie-cli</code>.
Currently, there is support for converting parquet datasets.</p>
<h4 id="via-custom-spark-job">Via Custom Spark Job</h4>
@@ -503,8 +508,6 @@ Usage: <main class> [options]
</code></pre>
</div>
-<div class="bs-callout bs-callout-info">Note that for now, due to jar
mismatches between Spark & Hive, its recommended to run this as a separate
Java task in your workflow manager/cron. This is getting fix <a
href="https://github.com/uber/hoodie/issues/123">here</a></div>
-
<h2 id="incrementally-pulling">Incrementally Pulling</h2>
<p>Hudi datasets can be pulled incrementally, which means you can get ALL and
ONLY the updated & new rows since a specified commit timestamp.
@@ -530,7 +533,7 @@ This class can be used within existing Spark jobs and
offers the following funct
<p>Please refer to <a href="configurations.html">configurations</a> section,
to view all datasource options.</p>
-<p>Additionally, <code class="highlighter-rouge">HoodieReadClient</code>
offers the following functionality using Hoodie’s implicit indexing.</p>
+<p>Additionally, <code class="highlighter-rouge">HoodieReadClient</code>
offers the following functionality using Hudi’s implicit indexing.</p>
<table>
<tbody>
@@ -540,7 +543,7 @@ This class can be used within existing Spark jobs and
offers the following funct
</tr>
<tr>
<td>read(keys)</td>
- <td>Read out the data corresponding to the keys as a DataFrame, using
Hoodie’s own index for faster lookup</td>
+ <td>Read out the data corresponding to the keys as a DataFrame, using
Hudi’s own index for faster lookup</td>
</tr>
<tr>
<td>filterExists()</td>
@@ -590,7 +593,7 @@ e.g: <code
class="highlighter-rouge">/app/incremental-hql/intermediate/{source_t
</tr>
<tr>
<td>tmp</td>
- <td>Directory where the temporary delta data is stored in HDFS. The
directory structure will follow conventions. Please see the below section.</td>
+ <td>Directory where the temporary delta data is stored in DFS. The
directory structure will follow conventions. Please see the below section.</td>
<td> </td>
</tr>
<tr>
@@ -610,12 +613,12 @@ e.g: <code
class="highlighter-rouge">/app/incremental-hql/intermediate/{source_t
</tr>
<tr>
<td>sourceDataPath</td>
- <td>Source HDFS Base Path. This is where the Hudi metadata will be
read.</td>
+ <td>Source DFS Base Path. This is where the Hudi metadata will be
read.</td>
<td> </td>
</tr>
<tr>
<td>targetDataPath</td>
- <td>Target HDFS Base path. This is needed to compute the fromCommitTime.
This is not needed if fromCommitTime is specified explicitly.</td>
+ <td>Target DFS Base path. This is needed to compute the fromCommitTime.
This is not needed if fromCommitTime is specified explicitly.</td>
<td> </td>
</tr>
<tr>
@@ -647,7 +650,6 @@ it will automatically use the backfill configuration, since
applying the last 24
is the lack of support for self-joining the same table in mixed mode (normal
and incremental modes).</p>
-
<div class="tags">
</div>
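The incremental pull contract described in this file's changes (get ALL and ONLY the updated & new rows since a specified commit timestamp) can be sketched with a toy model; the record layout and the `incremental_pull` helper below are illustrative stand-ins, not Hudi's actual API:

```python
# Toy model of Hudi's incremental pull: each record carries the commit
# timestamp that last wrote it, mirroring the _hoodie_commit_time column.
def incremental_pull(records, from_commit_time):
    """Return ALL and ONLY rows written after `from_commit_time`."""
    return [r for r in records if r["commit_time"] > from_commit_time]

dataset = [
    {"key": "a", "commit_time": "20190301", "value": 1},
    {"key": "b", "commit_time": "20190305", "value": 2},
    {"key": "b", "commit_time": "20190309", "value": 3},  # 'b' updated later
    {"key": "c", "commit_time": "20190309", "value": 4},
]
# Pull changes since the 20190305 commit: only the newer writes come back.
changes = incremental_pull(dataset, "20190305")
print([r["key"] for r in changes])  # ['b', 'c']
```

In the real datasource, the begin commit time is passed as a read option and the filtering happens inside Hudi, but the semantics are the same.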
diff --git a/content/index.html b/content/index.html
index bd31b4d..1a1c5ff 100644
--- a/content/index.html
+++ b/content/index.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Hudi brings stream processing to big data,
providing fresh data while being an order of magnitude efficient over
traditional batch processing.">
-<meta name="keywords" content="getting_started, homepage">
+<meta name="keywords" content="big data, stream processing, cloud, hdfs,
storage, upserts, change capture">
<title>What is Hudi? | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -366,7 +370,7 @@ $('#toc').on('click', 'a', function() {
- <p>Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical datasets on <a
href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">HDFS</a>
or cloud stores and provides three logical views for query access.</p>
+ <p>Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical datasets over DFS (<a
href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">HDFS</a>
or cloud stores) and provides three logical views for query access.</p>
<ul>
<li><strong>Read Optimized View</strong> - Provides excellent query
performance on pure columnar storage, much like plain <a
href="https://parquet.apache.org/">Parquet</a> tables.</li>
@@ -374,7 +378,7 @@ $('#toc').on('click', 'a', function() {
<li><strong>Near-Real time Table</strong> - Provides queries on real-time
data, using a combination of columnar & row based storage (e.g Parquet + <a
href="http://avro.apache.org/docs/current/mr.html">Avro</a>)</li>
</ul>
-<figure><img class="docimage" src="images/hoodie_intro_1.png"
alt="hoodie_intro_1.png" /></figure>
+<figure><img class="docimage" src="images/hudi_intro_1.png"
alt="hudi_intro_1.png" /></figure>
<p>By carefully managing how data is laid out in storage & how it’s
exposed to queries, Hudi is able to power a rich data ecosystem where external
sources can be ingested in near real-time and made available for interactive
SQL Engines like <a href="https://prestodb.io">Presto</a> & <a
href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable
of being consumed incrementally from processing/ETL frameworks like <a
href="https://hive.apache.org/">Hive</a> & [...]
diff --git a/content/js/mydoc_scroll.html b/content/js/mydoc_scroll.html
index b23a6ad..ee70719 100644
--- a/content/js/mydoc_scroll.html
+++ b/content/js/mydoc_scroll.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="This page demonstrates how you the
integration of a script called ScrollTo, which is used here to link definitions
of a JSON code sample to a list of definit...">
-<meta name="keywords" content="special_layouts, json, scrolling, scrollto,
jquery plugin">
+<meta name="keywords" content="json, scrolling, scrollto, jquery plugin">
<title>Scroll layout | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
diff --git a/content/migration_guide.html b/content/migration_guide.html
index 7bcfa1d..03ea8a1 100644
--- a/content/migration_guide.html
+++ b/content/migration_guide.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="In this page, we will discuss some available
tools for migrating your existing dataset into a Hudi dataset">
-<meta name="keywords" content=" migration guide">
+<meta name="keywords" content="hudi, migration, use case">
<title>Migration Guide | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -362,7 +366,7 @@ Take this approach if your dataset is an append only type
of dataset and you do
<p>Import your existing dataset into a Hudi managed dataset. Since all the
data is Hudi managed, none of the limitations
of Approach 1 apply here. Updates spanning any partitions can be applied to
this dataset and Hudi will efficiently
- make the update available to queries. Note that not only do you get to use
all Hoodie primitives on this dataset,
+ make the update available to queries. Note that not only do you get to use
all Hudi primitives on this dataset,
there are other additional advantages of doing this. Hudi automatically
manages file sizes of a Hudi managed dataset.
You can define the desired file size when converting this dataset and Hudi
will ensure it writes out files
adhering to the config. It will also ensure that smaller files later get
corrected by routing some new inserts into
@@ -371,9 +375,8 @@ Take this approach if your dataset is an append only type
of dataset and you do
<p>There are a few options when choosing this approach.</p>
<h4 id="option-1">Option 1</h4>
-<p>Use the HDFSParquetImporter tool. As the name suggests, this only works if
your existing dataset is in
-parquet file
-format. This tool essentially starts a Spark Job to read the existing parquet
dataset and converts it into a HUDI managed dataset by re-writing all the
data.</p>
+<p>Use the HDFSParquetImporter tool. As the name suggests, this only works if
your existing dataset is in parquet file format.
+This tool essentially starts a Spark job to read the existing parquet dataset
and converts it into a Hudi-managed dataset by re-writing all the data.</p>
<h4 id="option-2">Option 2</h4>
<p>For huge datasets, this could be as simple as: for partition in [list of
partitions in source dataset] {
@@ -385,7 +388,7 @@ format. This tool essentially starts a Spark Job to read
the existing parquet da
<p>Write your own custom logic of how to load an existing dataset into a Hudi
managed one. Please read about the RDD API
<a href="quickstart.html">here</a>.</p>
-<div class="highlighter-rouge"><pre class="highlight"><code>Using the
HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean install
-DskipTests`, the shell can be
+<div class="highlighter-rouge"><pre class="highlight"><code>Using the
HDFSParquetImporter Tool. Once Hudi has been built via `mvn clean install
-DskipTests`, the shell can be
fired via `cd hoodie-cli && ./hoodie-cli.sh`.
hoodie->hdfsparquetimport
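Option 2 above ("for partition in [list of partitions in source dataset]") amounts to a partition-by-partition bulk import; a minimal sketch, where `bulk_insert` is a hypothetical stand-in for a real Hudi write:

```python
# Sketch of migration "Option 2": walk the source dataset partition by
# partition and bulk-insert each one into the Hudi-managed target.
# `migrate` and `bulk_insert` are illustrative names, not Hudi's API.
def migrate(source, bulk_insert):
    migrated = []
    for partition in sorted(source):     # list of partitions in source dataset
        rows = source[partition]
        bulk_insert(partition, rows)     # one Hudi bulk-insert per partition
        migrated.append(partition)
    return migrated

source = {"2019/03/08": [{"k": 1}], "2019/03/09": [{"k": 2}, {"k": 3}]}
written = {}
done = migrate(source, lambda p, rows: written.setdefault(p, rows))
print(done)  # partitions migrated, in order
```

For huge datasets this loop keeps each Spark job bounded to one partition's worth of data, at the cost of more jobs.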
diff --git a/content/news.html b/content/news.html
index 645bae0..43d92a3 100644
--- a/content/news.html
+++ b/content/news.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" news, blog, updates, release notes,
announcements">
+<meta name="keywords" content="apache, hudi, news, blog, updates, release
notes, announcements">
<title>News | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -266,7 +270,7 @@
<a href="tag_news.html">news</a>
</span>
- <p> We will be presenting Hoodie & general concepts around how
incremental processing works at Uber.
+ <p> We will be presenting Hudi & general concepts around how
incremental processing works at Uber.
Catch our talk “Incremental Processing on Hadoop At Uber”
</p>
diff --git a/content/news_archive.html b/content/news_archive.html
index 4d80715..d1986b5 100644
--- a/content/news_archive.html
+++ b/content/news_archive.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" news, blog, updates, release notes,
announcements">
+<meta name="keywords" content="news, blog, updates, release notes,
announcements">
<title>News | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
diff --git a/content/powered_by.html b/content/powered_by.html
index 8f4b0d4..99991ca 100644
--- a/content/powered_by.html
+++ b/content/powered_by.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" talks">
+<meta name="keywords" content="hudi, talks, presentation">
<title>Talks & Powered By | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -383,7 +387,6 @@ October 2018, Spark+AI Summit Europe, London, UK</p>
</ol>
-
<div class="tags">
</div>
diff --git a/content/privacy.html b/content/privacy.html
index 704bd3d..1804b9f 100644
--- a/content/privacy.html
+++ b/content/privacy.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content=" privacy">
+<meta name="keywords" content="hudi, privacy">
<title>Privacy Policy | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
diff --git a/content/quickstart.html b/content/quickstart.html
index a73534d..b7781b3 100644
--- a/content/quickstart.html
+++ b/content/quickstart.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content="quickstart, quickstart">
+<meta name="keywords" content="hudi, quickstart">
<title>Quickstart | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -362,7 +366,8 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
<h2 id="version-compatibility">Version Compatibility</h2>
-<p>Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions.
We have verified that Hudi works with the following combination of
Hadoop/Hive/Spark.</p>
+<p>Hudi requires Java 8 to be installed on a *nix system. Hudi works with
Spark-2.x versions.
+Further, we have verified that Hudi works with the following combination of
Hadoop/Hive/Spark.</p>
<table>
<thead>
@@ -395,8 +400,9 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
</tbody>
</table>
-<p>If your environment has other versions of hadoop/hive/spark, please try out
Hudi and let us know if there are any issues. We are limited by our bandwidth
to certify other combinations.
-It would be of great help if you can reach out to us with your setup and
experience with hoodie.</p>
+<p>If your environment has other versions of hadoop/hive/spark, please try out
Hudi and let us know if there are any issues.
+We are limited by our bandwidth to certify other combinations (e.g. Docker on
Windows).
+It would be of great help if you could reach out to us with your setup and
experience with Hudi.</p>
<h2 id="generate-a-hudi-dataset">Generate a Hudi Dataset</h2>
@@ -424,7 +430,7 @@ Use the RDD API to perform more involved actions on a Hudi
dataset</p>
<h4 id="datasource-api">DataSource API</h4>
-<p>Run <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong> class,
to place a two commits (commit 1 => 100 inserts, commit 2 => 100 updates
to previously inserted 100 records) onto your HDFS/local filesystem. Use the
wrapper script
+<p>Run the <strong>hoodie-spark/src/test/java/HoodieJavaApp.java</strong>
class, to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates
to previously inserted 100 records) onto your DFS/local filesystem. Use the
wrapper script
to run from command-line</p>
<div class="highlighter-rouge"><pre class="highlight"><code>cd hoodie-spark
@@ -679,9 +685,9 @@ data infrastructure is brought up in a local docker cluster
within your computer
<h3 id="setting-up-docker-cluster">Setting up Docker Cluster</h3>
-<h4 id="build-hoodie">Build Hoodie</h4>
+<h4 id="build-hudi">Build Hudi</h4>
-<p>The first step is to build hoodie
+<p>The first step is to build Hudi
<code class="highlighter-rouge">
cd <HUDI_WORKSPACE>
mvn package -DskipTests
@@ -801,7 +807,7 @@ automatically initializes the datasets in the file-system
if they do not exist y
<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it
adhoc-2 /bin/bash
# Run the following spark-submit command to execute the delta-streamer and
ingest to stock_ticks_cow dataset in HDFS
-spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
+spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table
stock_ticks_cow --props /var/demo/config/kafka-source.properties
--schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider
....
....
2018-09-24 22:20:00 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 -
OutputCommitCoordinator stopped!
@@ -1329,7 +1335,7 @@ scala> spark.sql("select `_hoodie_commit_time`,
symbol, ts, volume, open, clo
Again, you can use the Hudi CLI to manually schedule and run compaction</p>
<div class="highlighter-rouge"><pre class="highlight"><code>docker exec -it
adhoc-1 /bin/bash
-^[[Aroot@adhoc-1:/opt# /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
+root@adhoc-1:/opt# /var/hoodie/ws/hoodie-cli/hoodie-cli.sh
============================================
* *
* _ _ _ _ *
@@ -1514,7 +1520,7 @@ scala> spark.sql("select `_hoodie_commit_time`,
symbol, ts, volume, open, clo
<h2 id="testing-hudi-in-local-docker-environment">Testing Hudi in Local Docker
environment</h2>
-<p>You can bring up a hadoop docker environment containing Hadoop, Hive and
Spark services with support for hoodie.
+<p>You can bring up a local docker environment containing Hadoop, Hive and
Spark services with support for Hudi.
<code class="highlighter-rouge">
$ mvn pre-integration-test -DskipTests
</code>
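The HoodieJavaApp example above (commit 1 => 100 inserts, commit 2 => 100 updates to the same records) relies on upsert semantics, which a toy key-value model captures; the `upsert` helper here is illustrative, not Hudi's write API:

```python
# Toy upsert mirroring HoodieJavaApp's two commits: commit 1 inserts 100
# records, commit 2 upserts 100 updates to the same keys. After both
# commits the dataset still holds 100 rows, with commit-2 values winning.
def upsert(table, records, commit_time):
    for r in records:
        table[r["key"]] = {"value": r["value"], "commit_time": commit_time}
    return table

table = {}
upsert(table, [{"key": i, "value": 0} for i in range(100)], "c1")  # inserts
upsert(table, [{"key": i, "value": 1} for i in range(100)], "c2")  # updates
print(len(table), table[0]["value"])  # still 100 rows; updated value wins
```

Unlike a plain append, the second commit replaces rows in place, which is why a subsequent query sees 100 records rather than 200.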
diff --git a/content/s3_hoodie.html b/content/s3_hoodie.html
index 217005c..0366721 100644
--- a/content/s3_hoodie.html
+++ b/content/s3_hoodie.html
@@ -4,8 +4,8 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="In this page, we go over how to configure
Hudi with S3 filesystem.">
-<meta name="keywords" content=" sql hive s3 spark presto">
-<title>S3 Filesystem (experimental) | Hudi</title>
+<meta name="keywords" content="hudi, hive, aws, s3, spark, presto">
+<title>S3 Filesystem | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -158,7 +162,7 @@
- <a class="email" title="Submit feedback" href="#"
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi
Documentation feedback&body=I have some feedback about the S3 Filesystem
(experimental) page: ' + window.location.href;"><i class="fa
fa-envelope-o"></i> Feedback</a>
+ <a class="email" title="Submit feedback" href="#"
onclick="javascript:window.location='mailto:[email protected]?subject=Hudi
Documentation feedback&body=I have some feedback about the S3 Filesystem page:
' + window.location.href;"><i class="fa fa-envelope-o"></i> Feedback</a>
<li>
@@ -176,7 +180,7 @@
searchInput:
document.getElementById('search-input'),
resultsContainer:
document.getElementById('results-container'),
dataSource: 'search.json',
- searchResultTemplate: '<li><a href="{url}"
title="S3 Filesystem (experimental)">{title}</a></li>',
+ searchResultTemplate: '<li><a href="{url}"
title="S3 Filesystem">{title}</a></li>',
noResultsText: 'No results found.',
limit: 10,
fuzzy: true,
@@ -327,7 +331,7 @@
<!-- Content Column -->
<div class="col-md-9">
<div class="post-header">
- <h1 class="post-title-main">S3 Filesystem (experimental)</h1>
+ <h1 class="post-title-main">S3 Filesystem</h1>
</div>
@@ -343,11 +347,11 @@
- <p>Hudi works with HDFS by default. There is an experimental work going on
Hoodie-S3 compatibility.</p>
+ <p>In this page, we explain how to configure your Hudi Spark job to write to
AWS S3.</p>
<h2 id="aws-configs">AWS configs</h2>
-<p>There are two configurations required for Hoodie-S3 compatibility:</p>
+<p>There are two configurations required for Hudi-S3 compatibility:</p>
<ul>
<li>Adding AWS Credentials for Hudi</li>
@@ -415,7 +419,6 @@ export
HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
</ul>
-
<div class="tags">
</div>
diff --git a/content/search.json b/content/search.json
index 3f7eb15..0473b34 100644
--- a/content/search.json
+++ b/content/search.json
@@ -6,7 +6,7 @@
{
"title": "Admin Guide",
"tags": "",
-"keywords": "admin",
+"keywords": "hudi, administration, operation, devops",
"url": "admin_guide.html",
"summary": "This section offers an overview of tools available to operate an
ecosystem of Hudi datasets"
}
@@ -17,7 +17,7 @@
{
"title": "Community",
"tags": "",
-"keywords": "usecases",
+"keywords": "hudi, use cases, big data, apache",
"url": "community.html",
"summary": ""
}
@@ -28,7 +28,7 @@
{
"title": "Comparison",
"tags": "",
-"keywords": "usecases",
+"keywords": "apache, hudi, kafka, kudu, hive, hbase, stream processing",
"url": "comparison.html",
"summary": ""
}
@@ -39,7 +39,7 @@
{
"title": "Concepts",
"tags": "",
-"keywords": "concepts",
+"keywords": "hudi, design, storage, views, timeline",
"url": "concepts.html",
"summary": "Here we introduce some basic concepts & give a broad technical
overview of Hudi"
}
@@ -50,7 +50,7 @@
{
"title": "Configurations",
"tags": "",
-"keywords": "configurations",
+"keywords": "garbage collection, hudi, jvm, configs, tuning",
"url": "configurations.html",
"summary": "Here we list all possible configurations and what they mean"
}
@@ -61,7 +61,7 @@
{
"title": "Developer Setup",
"tags": "",
-"keywords": "developer setup",
+"keywords": "hudi, ide, developer, setup",
"url": "contributing.html",
"summary": ""
}
@@ -72,9 +72,9 @@
{
-"title": "GCS Filesystem (experimental)",
+"title": "GCS Filesystem",
"tags": "",
-"keywords": "sql hive gcs spark presto",
+"keywords": "hudi, hive, google cloud, storage, spark, presto",
"url": "gcs_hoodie.html",
"summary": "In this page, we go over how to configure hudi with Google Cloud
Storage."
}
@@ -85,7 +85,7 @@
{
"title": "Implementation",
"tags": "",
-"keywords": "implementation",
+"keywords": "hudi, index, storage, compaction, cleaning, implementation",
"url": "implementation.html",
"summary": ""
}
@@ -96,7 +96,7 @@
{
"title": "Incremental Processing",
"tags": "",
-"keywords": "incremental processing",
+"keywords": "hudi, incremental, batch, stream, processing, Hive, ETL, Spark
SQL",
"url": "incremental_processing.html",
"summary": "In this page, we will discuss some available tools for ingesting
data incrementally & consuming the changes."
}
@@ -107,7 +107,7 @@
{
"title": "What is Hudi?",
"tags": "getting_started",
-"keywords": "homepage",
+"keywords": "big data, stream processing, cloud, hdfs, storage, upserts,
change capture",
"url": "index.html",
"summary": "Hudi brings stream processing to big data, providing fresh data
while being an order of magnitude efficient over traditional batch processing."
}
@@ -118,7 +118,7 @@
{
"title": "Migration Guide",
"tags": "",
-"keywords": "migration guide",
+"keywords": "hudi, migration, use case",
"url": "migration_guide.html",
"summary": "In this page, we will discuss some available tools for migrating
your existing dataset into a Hudi dataset"
}
@@ -140,7 +140,7 @@
{
"title": "News",
"tags": "",
-"keywords": "news, blog, updates, release notes, announcements",
+"keywords": "apache, hudi, news, blog, updates, release notes, announcements",
"url": "news.html",
"summary": ""
}
@@ -162,7 +162,7 @@
{
"title": "Talks & Powered By",
"tags": "",
-"keywords": "talks",
+"keywords": "hudi, talks, presentation",
"url": "powered_by.html",
"summary": ""
}
@@ -173,7 +173,7 @@
{
"title": "Privacy Policy",
"tags": "",
-"keywords": "privacy",
+"keywords": "hudi, privacy",
"url": "privacy.html",
"summary": ""
}
@@ -184,7 +184,7 @@
{
"title": "Quickstart",
"tags": "quickstart",
-"keywords": "quickstart",
+"keywords": "hudi, quickstart",
"url": "quickstart.html",
"summary": ""
}
@@ -193,9 +193,9 @@
{
-"title": "S3 Filesystem (experimental)",
+"title": "S3 Filesystem",
"tags": "",
-"keywords": "sql hive s3 spark presto",
+"keywords": "hudi, hive, aws, s3, spark, presto",
"url": "s3_hoodie.html",
"summary": "In this page, we go over how to configure Hudi with S3 filesystem."
}
@@ -210,7 +210,7 @@
{
"title": "SQL Queries",
"tags": "",
-"keywords": "sql hive spark presto",
+"keywords": "hudi, hive, spark, sql, presto",
"url": "sql_queries.html",
"summary": "In this page, we go over how to enable SQL queries on Hudi built
tables."
}
@@ -221,7 +221,7 @@
{
"title": "Use Cases",
"tags": "",
-"keywords": "usecases",
+"keywords": "hudi, data ingestion, etl, real time, use cases",
"url": "use_cases.html",
"summary": "Following are some sample use-cases for Hudi, which illustrate the
benefits in terms of faster processing & increased efficiency"
}
diff --git a/content/sql_queries.html b/content/sql_queries.html
index 6936191..d7fa8cc 100644
--- a/content/sql_queries.html
+++ b/content/sql_queries.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="In this page, we go over how to enable SQL
queries on Hudi built tables.">
-<meta name="keywords" content=" sql hive spark presto">
+<meta name="keywords" content="hudi, hive, spark, sql, presto">
<title>SQL Queries | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -368,8 +372,6 @@ to using the Hive Serde to read the data
(planning/executions is still Spark). T
towards Parquet reading, which we will address in the next method based on
path filters.
However benchmarks have not revealed any real performance degradation with
Hudi & SparkSQL, compared to native support.</p>
-<div class="bs-callout bs-callout-info">Get involved to improve this
integration <a href="https://github.com/uber/hoodie/issues/7">here</a> and <a
href="https://issues.apache.org/jira/browse/SPARK-19351">here</a> </div>
-
<p>Sample command is provided below to spin up Spark Shell</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ spark-shell
--jars hoodie-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path
/etc/hive/conf --packages com.databricks:spark-avro_2.11:4.0.0 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
diff --git a/content/strata-talk.html b/content/strata-talk.html
index 13a8375..58b6f8a 100644
--- a/content/strata-talk.html
+++ b/content/strata-talk.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
-<meta name="keywords" content="news, ">
+<meta name="keywords" content="">
<title>Hudi entered Apache Incubator | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
diff --git a/content/use_cases.html b/content/use_cases.html
index 6df8c34..dcdf403 100644
--- a/content/use_cases.html
+++ b/content/use_cases.html
@@ -4,7 +4,7 @@
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Following are some sample use-cases for
Hudi, which illustrate the benefits in terms of faster processing & increased
efficiency">
-<meta name="keywords" content=" usecases">
+<meta name="keywords" content="hudi, data ingestion, etl, real time, use
cases">
<title>Use Cases | Hudi</title>
<link rel="stylesheet" href="css/syntax.css">
@@ -149,6 +149,10 @@
<li><a
href="https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI"
target="_blank">Blog</a></li>
+
+ <li><a
href="https://projects.apache.org/project.html?incubator-hudi"
target="_blank">Team</a></li>
+
+
</ul>
</li>
@@ -350,7 +354,7 @@ In most (if not all) Hadoop deployments, it is
unfortunately solved in a pieceme
even though this data is arguably the most valuable for the entire
organization.</p>
<p>For RDBMS ingestion, Hudi provides <strong>faster loads via
Upserts</strong>, as opposed to costly & inefficient bulk loads. For e.g., you
can read the MySQL BIN log or <a
href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports">Sqoop
Incremental Import</a> and apply them to an
-equivalent Hudi table on HDFS. This would be much faster/efficient than a <a
href="https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457">bulk
merge job</a>
+equivalent Hudi table on DFS. This would be much faster/efficient than a <a
href="https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457">bulk
merge job</a>
or <a
href="http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/">complicated
handcrafted merge workflows</a></p>
<p>For NoSQL datastores like <a
href="http://cassandra.apache.org/">Cassandra</a> / <a
href="http://www.project-voldemort.com/voldemort/">Voldemort</a> / <a
href="https://hbase.apache.org/">HBase</a>, even moderately big installations
store billions of rows.
@@ -367,13 +371,13 @@ This is absolutely perfect for lower scale (<a
href="https://blog.twitter.com/20
But typically, these systems also end up getting abused for less interactive
queries since data on Hadoop is intolerably stale. This leads to
under-utilization & wasteful hardware/license costs.</p>
<p>On the other hand, interactive SQL solutions on Hadoop such as Presto &
SparkSQL excel in <strong>queries that finish within few seconds</strong>.
-By bringing <strong>data freshness to a few minutes</strong>, Hudi can provide
a much efficient alternative, as well unlock real-time analytics on
<strong>several magnitudes larger datasets</strong> stored in HDFS.
+By bringing <strong>data freshness to a few minutes</strong>, Hudi can provide
a much more efficient alternative, as well as unlock real-time analytics on
<strong>several magnitudes larger datasets</strong> stored in DFS.
Also, Hudi has no external dependencies (like a dedicated HBase cluster,
purely used for real-time analytics) and thus enables faster analytics on much
fresher data, without increasing the operational overhead.</p>
<h2 id="incremental-processing-pipelines">Incremental Processing Pipelines</h2>
<p>One fundamental ability Hadoop provides is to build a chain of datasets
derived from each other via DAGs expressed as workflows.
-Workflows often depend on new data being output by multiple upstream workflows
and traditionally, availability of new data is indicated by a new HDFS
Folder/Hive Partition.
+Workflows often depend on new data being output by multiple upstream workflows
and traditionally, availability of new data is indicated by a new DFS
Folder/Hive Partition.
Let’s take a concrete example to illustrate this. An upstream workflow <code
class="highlighter-rouge">U</code> can create a Hive partition for every hour,
with data for that hour (event_time) at the end of each hour (processing_time),
providing effective freshness of 1 hour.
Then, a downstream workflow <code class="highlighter-rouge">D</code>, kicks
off immediately after <code class="highlighter-rouge">U</code> finishes, and
does its own processing for the next hour, increasing the effective latency to
2 hours.</p>
@@ -388,19 +392,18 @@ like 15 mins, and providing an end-end latency of 30 mins
at <code class="highli
<div class="bs-callout bs-callout-info">To achieve this, Hudi has embraced
similar concepts from stream processing frameworks like <a
href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations">Spark
Streaming</a> , Pub/Sub systems like <a
href="http://kafka.apache.org/documentation/#theconsumer">Kafka</a>
or database replication technologies like <a
href="https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187">Oracle
XStream</a>.
-For the more curious, a more detailed explanation of the benefits of
Incremetal Processing (compared to Stream Processing & Batch Processing)
can be found <a
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">here</a></div>
+For the more curious, a more detailed explanation of the benefits of
Incremental Processing (compared to Stream Processing & Batch Processing)
can be found <a
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">here</a></div>
-<h2 id="data-dispersal-from-hadoop">Data Dispersal From Hadoop</h2>
+<h2 id="data-dispersal-from-dfs">Data Dispersal From DFS</h2>
<p>A popular use-case for Hadoop is to crunch data and then disperse it back
to an online serving store, to be used by an application.
For e.g., a Spark Pipeline can <a
href="https://eng.uber.com/telematics/">determine hard braking events on
Hadoop</a> and load them into a serving store like ElasticSearch, to be used by
the Uber application to increase safe driving. Typical architectures for this
employ a <code class="highlighter-rouge">queue</code> between Hadoop and
serving store, to prevent overwhelming the target serving store.
-A popular choice for this queue is Kafka and this model often results in
<strong>redundant storage of same data on HDFS (for offline analysis on
computed results) and Kafka (for dispersal)</strong></p>
+A popular choice for this queue is Kafka and this model often results in
<strong>redundant storage of the same data on DFS (for offline analysis on
computed results) and Kafka (for dispersal)</strong></p>
<p>Once again Hudi can efficiently solve this problem, by having the Spark
Pipeline upsert output from
each run into a Hudi dataset, which can then be incrementally tailed (just
like a Kafka topic) for new data & written into the serving store.</p>
-
<div class="tags">
</div>
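The dispersal pattern described above (upsert each run's output into a Hudi dataset, then incrementally tail it like a Kafka topic) can be sketched as follows; `ToyTable` and its method names are illustrative only, not Hudi's API:

```python
# Sketch of dispersal without a separate queue: each pipeline run upserts
# into a Hudi-like table, and a downstream loader "tails" it incrementally
# (like a Kafka topic) by remembering the last commit it consumed.
class ToyTable:
    def __init__(self):
        self.rows = {}      # key -> (commit number, value)
        self.commits = 0
    def upsert(self, batch):
        self.commits += 1
        for key, value in batch.items():
            self.rows[key] = (self.commits, value)
        return self.commits
    def tail(self, since_commit):
        """New/updated rows after `since_commit`, i.e. an incremental pull."""
        return {k: v for k, (c, v) in self.rows.items() if c > since_commit}

t = ToyTable()
checkpoint = 0
t.upsert({"trip1": "hard_brake"})                 # pipeline run #1
batch = t.tail(checkpoint)                        # loader ships run #1 output
checkpoint = t.commits
t.upsert({"trip1": "ok", "trip2": "hard_brake"})  # pipeline run #2
print(t.tail(checkpoint))  # only the second commit's changes
```

The same table thus serves both offline analysis and the serving-store loader, avoiding the redundant DFS + Kafka copies called out above.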