This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 28ee7ae Travis CI build asf-site
28ee7ae is described below
commit 28ee7aea456a9e2413c5e909981152e6f9a7a075
Author: CI <[email protected]>
AuthorDate: Thu Jul 2 13:54:34 2020 +0000
Travis CI build asf-site
---
content/docs/querying_data.html | 12 +++++++
content/docs/quick-start-guide.html | 50 ++++++++++++++++------------
content/docs/writing_data.html | 65 ++++++++++++++++++++++++++++++-------
3 files changed, 95 insertions(+), 32 deletions(-)
diff --git a/content/docs/querying_data.html b/content/docs/querying_data.html
index ad7dbf7..86f195b 100644
--- a/content/docs/querying_data.html
+++ b/content/docs/querying_data.html
@@ -369,6 +369,7 @@
<li><a href="#spark-sql">Spark SQL</a></li>
<li><a href="#spark-datasource">Spark Datasource</a>
<ul>
+ <li><a href="#spark-snap-query">Snapshot query</a></li>
<li><a href="#spark-incr-query">Incremental query</a></li>
</ul>
</li>
@@ -639,6 +640,17 @@ If using spark’s built in support, additionally a path filter needs to be pushed
datasources work (e.g: <code class="highlighter-rouge">spark.read.parquet</code>). Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding <code class="highlighter-rouge">--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar</code> to the classpath of drivers
and executors. Alternatively, hudi-spark-bundle can also be fetched via the <code class="highlighter-rouge">--packages</code> option (e.g: <code class="highlighter-rouge">--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3</code>).</p>
+<h3 id="spark-snap-query">Snapshot query</h3>
+<p>This method can be used to retrieve the data table at the present point in time.
+Note: The file path must be suffixed with a number of wildcard asterisks (<code class="highlighter-rouge">/*</code>) one greater than the number of partition levels. E.g.: with table file path “tablePath” partitioned by columns “a”, “b”, and “c”, the load path must be <code class="highlighter-rouge">tablePath + "/*/*/*/*"</code></p>
+
+
+<div class="language-scala highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="k">val</span> <span
class="nv">hudiIncQueryDF</span> <span class="k">=</span> <span
class="n">spark</span>
+ <span class="o">.</span><span class="py">read</span><span
class="o">()</span>
+ <span class="o">.</span><span class="py">format</span><span
class="o">(</span><span class="s">"org.apache.hudi"</span><span
class="o">)</span>
+ <span class="o">.</span><span class="py">option</span><span
class="o">(</span><span class="nv">DataSourceReadOptions</span><span
class="o">.</span><span class="py">QUERY_TYPE_OPT_KEY</span><span
class="o">(),</span> <span class="nv">DataSourceReadOptions</span><span
class="o">.</span><span class="py">QUERY_TYPE_SNAPSHOT_OPT_VAL</span><span
class="o">())</span>
+ <span class="o">.</span><span class="py">load</span><span
class="o">(</span><span class="n">tablePath</span> <span class="o">+</span>
<span class="s">"/*"</span><span class="o">)</span> <span class="c1">//The
number of wildcard asterisks here must be one greater than the number of
partition
+</span></code></pre></div></div>
+
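A quick sanity check of the snapshot read above (a sketch, not part of this commit; the view name is illustrative):

    // register the snapshot frame as a temp view and count rows
    hudiSnapshotQueryDF.createOrReplaceTempView("hudi_snapshot_view")
    spark.sql("select count(*) from hudi_snapshot_view").show()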
<h3 id="spark-incr-query">Incremental query</h3>
<p>Of special interest to spark pipelines is Hudi’s ability to support incremental queries. A sample incremental query, obtaining all records written since <code class="highlighter-rouge">beginInstantTime</code>, looks like below.
Thanks to Hudi’s support for record-level change streams, these incremental pipelines often offer 10x efficiency over their batch counterparts, by processing only the changed records.
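The sample query referenced above falls outside this hunk’s context; a minimal sketch of such an incremental read (not part of this commit; option names as defined in Hudi 0.5.3’s DataSourceReadOptions):

    // read only records committed after beginInstantTime;
    // incremental reads take the table base path, no wildcards
    val hudiIncQueryDF = spark
      .read
      .format("org.apache.hudi")
      .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
      .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, beginInstantTime)
      .load(tablePath)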
diff --git a/content/docs/quick-start-guide.html b/content/docs/quick-start-guide.html
index 78c30fb..85bfb64 100644
--- a/content/docs/quick-start-guide.html
+++ b/content/docs/quick-start-guide.html
@@ -549,34 +549,42 @@ specific commit time and beginTime to “000” (denoting earliest possible commit time)
 <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// spark-shell
 // fetch total records count
-spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 // fetch two records to be deleted
-val ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)
+val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
 // issue deletes
 val deletes = dataGen.generateDeletes(ds.collectAsList())
-val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
-df.write.format("hudi").
-  options(getQuickstartWriteConfigs).
-  option(OPERATION_OPT_KEY, "delete").
-  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-  option(TABLE_NAME, tableName).
-  mode(Append).
-  save(basePath)
+val df = spark
+  .read
+  .json(spark.sparkContext.parallelize(deletes, 2))
+
+df
+  .write
+  .format("hudi")
+  .options(getQuickstartWriteConfigs)
+  .option(OPERATION_OPT_KEY, "delete")
+  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
+  .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
+  .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
+  .option(TABLE_NAME, tableName)
+  .mode(Append)
+  .save(basePath)
 // run the same read query as above.
-val roAfterDeleteViewDF = spark.
-  read.
-  format("hudi").
-  load(basePath + "/*/*/*/*")
+val roAfterDeleteViewDF = spark
+  .read
+  .format("hudi")
+  .load(basePath + "/*/*/*/*")
+
 roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
 // fetch should return (total - 2) records
-spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 </code></pre></div></div>
 <p>Note: Only <code class="highlighter-rouge">Append</code> mode is supported for delete operation.</p>
+<p>See the <a href="/docs/writing_data.html#deletes">deletion section</a> of the writing data page for more details.</p>
+
<h1 id="pyspark-example">Pyspark example</h1>
<h2 id="setup-1">Setup</h2>
@@ -749,9 +757,9 @@ specific commit time and beginTime to “000” (denoting earliest possible commit time)
 <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pyspark
 # fetch total records count
-spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 # fetch two records to be deleted
-ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)
+ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
 # issue deletes
 hudi_delete_options = {
@@ -780,9 +788,11 @@ specific commit time and beginTime to “000” (denoting earliest possible commit time)
   load(basePath + "/*/*/*/*")
 roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
 # fetch should return (total - 2) records
-spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 </code></pre></div></div>
+<p>See the <a href="/docs/writing_data.html#deletes">deletion section</a> of the writing data page for more details.</p>
+
<h2 id="where-to-go-from-here">Where to go from here?</h2>
<p>You can also do the quickstart by <a
href="https://github.com/apache/hudi#building-apache-hudi-from-source">building
hudi yourself</a>,
diff --git a/content/docs/writing_data.html b/content/docs/writing_data.html
index 89aa8b7..6452694 100644
--- a/content/docs/writing_data.html
+++ b/content/docs/writing_data.html
@@ -374,8 +374,8 @@ speeding up large Spark jobs via upserts using the <a href="#datasource-writer">
 can be chosen/changed across each commit/deltacommit issued against the table.</p>
 <ul>
-  <li><strong>UPSERT</strong> : This is the default operation where the input records are first tagged as inserts or updates by looking up the index and
-  the records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing.
+  <li><strong>UPSERT</strong> : This is the default operation where the input records are first tagged as inserts or updates by looking up the index.
+  The records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing.
   This operation is recommended for use-cases like database change capture where the input almost certainly contains updates.</li>
   <li><strong>INSERT</strong> : This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts
   for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the table can tolerate duplicates, but just
@@ -532,13 +532,45 @@ provided under <code class="highlighter-rouge">hudi-utilities/src/test/resources
 <h2 id="datasource-writer">Datasource Writer</h2>
-<p>The <code class="highlighter-rouge">hudi-spark</code> module offers the DataSource API to write (and also read) any data frame into a Hudi table.
-Following is how we can upsert a dataframe, while specifying the field names that need to be used
-for <code class="highlighter-rouge">recordKey => _row_key</code>, <code class="highlighter-rouge">partitionPath => partition</code> and <code class="highlighter-rouge">precombineKey => timestamp</code></p>
+<p>The <code class="highlighter-rouge">hudi-spark</code> module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:</p>
+
+<p><strong><code class="highlighter-rouge">HoodieWriteConfig</code></strong>:</p>
+
+<p><strong>TABLE_NAME</strong> (Required)<br /></p>
+
+<p><strong><code class="highlighter-rouge">DataSourceWriteOptions</code></strong>:</p>
+
+<p><strong>RECORDKEY_FIELD_OPT_KEY</strong> (Required): Primary key field(s). Nested fields can be specified using dot notation, eg: <code class="highlighter-rouge">a.b.c</code>. When using multiple columns as the primary key, use comma-separated notation, eg: <code class="highlighter-rouge">"col1,col2,col3,etc"</code>. Whether a single column or multiple columns form the primary key is specified via the <code class="highlighter-rouge">KEYGENERATOR_CLASS_OPT_KEY</code> property.<br />
+Default value: <code class="highlighter-rouge">"uuid"</code><br /></p>
+
+<p><strong>PARTITIONPATH_FIELD_OPT_KEY</strong> (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, eg: <code class="highlighter-rouge">""</code>. Specify partitioning/no partitioning using <code class="highlighter-rouge">KEYGENERATOR_CLASS_OPT_KEY</code>. If synchronizing to hive, also specify using <code class="highlighter-rouge">HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY</code>.<br />
+Default value: <code class="highlighter-rouge">"partitionpath"</code><br /></p>
+
+<p><strong>PRECOMBINE_FIELD_OPT_KEY</strong> (Required): When two records have the same key value, the record with the largest value in the field specified will be chosen.<br />
+Default value: <code class="highlighter-rouge">"ts"</code><br /></p>
+
+<p><strong>OPERATION_OPT_KEY</strong>: The <a href="#write-operations">write operation</a> to use.<br />
+Available values:<br />
+<code class="highlighter-rouge">UPSERT_OPERATION_OPT_VAL</code> (default), <code class="highlighter-rouge">BULK_INSERT_OPERATION_OPT_VAL</code>, <code class="highlighter-rouge">INSERT_OPERATION_OPT_VAL</code>, <code class="highlighter-rouge">DELETE_OPERATION_OPT_VAL</code></p>
+
+<p><strong>TABLE_TYPE_OPT_KEY</strong>: The <a href="/docs/concepts.html#table-types">type of table</a> to write to. Note: After the initial creation of a table, this value must stay consistent when writing to (updating) the table using the Spark <code class="highlighter-rouge">SaveMode.Append</code> mode.<br />
+Available values:<br />
+<a href="/docs/concepts.html#copy-on-write-table"><code class="highlighter-rouge">COW_TABLE_TYPE_OPT_VAL</code></a> (default), <a href="/docs/concepts.html#merge-on-read-table"><code class="highlighter-rouge">MOR_TABLE_TYPE_OPT_VAL</code></a></p>
+
+<p><strong>KEYGENERATOR_CLASS_OPT_KEY</strong>: Key generator class that will extract the key out of the incoming record. For a single-column key use <code class="highlighter-rouge">SimpleKeyGenerator</code>; for multiple-column keys use <code class="highlighter-rouge">ComplexKeyGenerator</code>. Note: A custom key generator class can be written/provided here as well. Primary key columns should be provided via the <code class="highlighter-rouge">RECORDKEY_FIELD_OPT_KEY</code> option.<br />
+Available values:<br />
+<code class="highlighter-rouge">classOf[SimpleKeyGenerator].getName</code> (default), <code class="highlighter-rouge">classOf[NonpartitionedKeyGenerator].getName</code> (non-partitioned tables can currently only have a single key column, <a href="https://issues.apache.org/jira/browse/HUDI-1053">HUDI-1053</a>), <code class="highlighter-rouge">classOf[ComplexKeyGenerator].getName</code></p>
+
+<p><strong>HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY</strong>: If using hive, specify whether the table should or should not be partitioned.<br />
+Available values:<br />
+<code class="highlighter-rouge">classOf[SlashEncodedDayPartitionValueExtractor].getCanonicalName</code> (default), <code class="highlighter-rouge">classOf[MultiPartKeysValueExtractor].getCanonicalName</code>, <code class="highlighter-rouge">classOf[TimestampBasedKeyGenerator].getCanonicalName</code>, <code class="highlighter-rouge">classOf[NonPartitionedExtractor].getCanonicalName</code>, <code class="highlighter-rouge">classOf[GlobalDeleteKeyGenerator].getCanonicalName</code> (to be use [...]
+
+<p>Example:
+Upsert a DataFrame, specifying the necessary field names for <code class="highlighter-rouge">recordKey => _row_key</code>, <code class="highlighter-rouge">partitionPath => partition</code>, and <code class="highlighter-rouge">precombineKey => timestamp</code></p>
 <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inputDF.write()
   .format("org.apache.hudi")
-  .options(clientOpts) // any of the Hudi client opts can be passed in as well
+  .options(clientOpts) // Where clientOpts is of type Map[String, String]. clientOpts can include any other options necessary.
   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
@@ -557,8 +589,7 @@ once you have built the hudi-hive module. Following is how we sync the above Dataset
 ./run_sync_tool.sh --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive [...]
 </code></pre></div></div>
-<p>Starting with Hudi 0.5.1 version read optimized version of merge-on-read tables are suffixed ‘_ro’ by default. For backwards compatibility with older Hudi versions,
-an optional HiveSyncConfig - <code class="highlighter-rouge">--skip-ro-suffix</code>, has been provided to turn off ‘_ro’ suffixing if desired. Explore other hive sync options using the following command:</p>
+<p>Starting with Hudi 0.5.1, the read optimized version of merge-on-read tables is suffixed ‘_ro’ by default. For backwards compatibility with older Hudi versions, an optional HiveSyncConfig - <code class="highlighter-rouge">--skip-ro-suffix</code> - has been provided to turn off ‘_ro’ suffixing if desired. Explore other hive sync options using the following command:</p>
 <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd hudi-hive
 ./run_sync_tool.sh
@@ -571,12 +602,22 @@ an optional HiveSyncConfig - <code class="highlighter-rouge">--skip-ro-suffix</code>
 For more info refer to <a href="https://cwiki.apache.org/confluence/x/6IqvC">Delete support in Hudi</a>.</p>
 <ul>
-  <li><strong>Soft Deletes</strong> : With soft deletes, user wants to retain the key but just null out the values for all other fields.
-  This can be simply achieved by ensuring the appropriate fields are nullable in the table schema and simply upserting the table after setting these fields to null.</li>
-  <li><strong>Hard Deletes</strong> : A stronger form of delete is to physically remove any trace of the record from the table. This can be achieved by issuing an upsert with a custom payload implementation
-  via either DataSource or DeltaStreamer which always returns Optional.Empty as the combined value. Hudi ships with a built-in <code class="highlighter-rouge">org.apache.hudi.EmptyHoodieRecordPayload</code> class that does exactly this.</li>
+  <li>
+    <p><strong>Soft Deletes</strong> : Retain the record key and just null out the values for all the other fields.
+    This can be achieved by ensuring the appropriate fields are nullable in the table schema and simply upserting the table after setting these fields to null.</p>
+  </li>
+  <li>
+    <p><strong>Hard Deletes</strong> : A stronger form of deletion is to physically remove any trace of the record from the table. This can be achieved in 3 different ways.</p>
+
+    <p>1) Using DataSource, set <code class="highlighter-rouge">OPERATION_OPT_KEY</code> to <code class="highlighter-rouge">DELETE_OPERATION_OPT_VAL</code>. This will remove all the records in the DataSet being submitted.</p>
+
+    <p>2) Using DataSource, set <code class="highlighter-rouge">PAYLOAD_CLASS_OPT_KEY</code> to <code class="highlighter-rouge">"org.apache.hudi.EmptyHoodieRecordPayload"</code>. This will remove all the records in the DataSet being submitted.</p>
+
+    <p>3) Using DataSource or DeltaStreamer, add a column named <code class="highlighter-rouge">_hoodie_is_deleted</code> to the DataSet. The value of this column must be set to <code class="highlighter-rouge">true</code> for all the records to be deleted and either <code class="highlighter-rouge">false</code> or left null for any records which are to be upserted.</p>
+  </li>
 </ul>
+<p>Example using hard delete method 2, removing all the records from the table that exist in the DataSet <code class="highlighter-rouge">deleteDF</code>:</p>
 <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> deleteDF // dataframe containing just records to be deleted
   .write().format("org.apache.hudi")
   .option(...) // Add HUDI options like record-key, partition-path and others as needed for your setup
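A companion sketch for hard delete method 3 (not part of this commit; hudiWriteOptions is a hypothetical map holding the table's usual record-key/partition-path/precombine settings):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.lit

    // flag every record in deleteDF for deletion, then upsert; flagged rows are removed
    deleteDF.withColumn("_hoodie_is_deleted", lit(true))
      .write.format("org.apache.hudi")
      .options(hudiWriteOptions) // hypothetical Map[String, String] with the table's write options
      .mode(SaveMode.Append)
      .save(basePath)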