This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e995bd2 Travis CI build asf-site
e995bd2 is described below
commit e995bd2819c7df640826bada28f9dd87c0110a62
Author: CI <[email protected]>
AuthorDate: Mon Aug 24 19:11:45 2020 +0000
Travis CI build asf-site
---
content/assets/js/lunr/lunr-store.js | 2 +-
content/cn/docs/querying_data.html | 12 ++---
content/docs/comparison.html | 4 +-
content/docs/configurations.html | 88 ++++++++++++++++++++++++++++++++----
content/docs/deployment.html | 6 +--
content/docs/powered_by.html | 4 ++
content/docs/querying_data.html | 49 ++++++++++++++++----
content/docs/structure.html | 2 +-
8 files changed, 135 insertions(+), 32 deletions(-)
diff --git a/content/assets/js/lunr/lunr-store.js b/content/assets/js/lunr/lunr-store.js
index 5e4b619..c07019e 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -825,7 +825,7 @@ var store = [{
"url": "https://hudi.apache.org/docs/writing_data.html",
"teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
"title": "Querying Hudi Datasets",
- "excerpt":"Conceptually, Hudi physically stores data once on DFS, while providing three logical views on top of it, as explained before. Once the dataset is synced to the Hive Metastore, it provides external Hive tables backed by Hudi's custom input formats. Once the proper Hudi bundle is provided, the dataset can be queried by popular query engines like Hive, Spark and Presto. Specifically, two Hive tables named after the table name are registered during the write. For example, if table name = hudi_tbl, we get hudi_tbl, which realizes the read optimized view of the dataset backed by HoodieParquetInputFormat, serving purely columnar data, and hudi_tbl_rt, which realizes the real-time view of the dataset backed by HoodieParquetRealtimeInputFormat, serving a merged view of base and log data. As explained in the concepts section, a key primitive needed for incremental processing is incremental pull (to obtain a change stream/log from the dataset). You can pull a Hudi dataset incrementally, meaning you get all of, and only, the updated and new rows since a specified instant time. Used together with upserts, this is useful for building certain [...]
+ "excerpt":"Conceptually, Hudi physically stores data once on DFS, while providing three logical views on top of it, as explained before. Once the dataset is synced to the Hive Metastore, it provides external Hive tables backed by Hudi's custom input formats. Once the proper Hudi bundle is provided, the dataset can be queried by popular query engines like Hive, Spark and Presto. Specifically, two Hive tables named after the table name are registered during the write. For example, if table name = hudi_tbl, we get hudi_tbl, which realizes the read optimized view of the dataset backed by HoodieParquetInputFormat, serving purely columnar data, and hudi_tbl_rt, which realizes the real-time view of the dataset backed by HoodieParquetRealtimeInputFormat, serving a merged view of base and log data. As explained in the concepts section, a key primitive needed for incremental processing is incremental pull (to obtain a change stream/log from the dataset). You can pull a Hudi dataset incrementally, meaning you get all of, and only, the updated and new rows since a specified instant time. Used together with upserts, this is useful for building certain [...]
"tags": [],
"url": "https://hudi.apache.org/cn/docs/querying_data.html",
"teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
diff --git a/content/cn/docs/querying_data.html b/content/cn/docs/querying_data.html
index e808682..3129f6d 100644
--- a/content/cn/docs/querying_data.html
+++ b/content/cn/docs/querying_data.html
@@ -375,7 +375,7 @@
<li><a href="#spark-incr-pull">Incremental pull</a></li>
</ul>
</li>
- <li><a href="#presto">Presto</a></li>
+ <li><a href="#prestodb">PrestoDB</a></li>
<li><a href="#impala-34-or-later">Impala (3.4 or later)</a>
<ul>
<li><a href="#读优化表-1">Read optimized table</a></li>
@@ -434,7 +434,7 @@
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
+ <td><strong>PrestoDB</strong></td>
<td>Y</td>
<td>N</td>
</tr>
@@ -477,8 +477,8 @@
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
- <td>N</td>
+ <td><strong>PrestoDB</strong></td>
+ <td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
@@ -703,9 +703,9 @@ The Upsert utility (<code class="highlighter-rouge">HoodieDeltaStreamer</code>
</tbody>
</table>
-<h2 id="presto">Presto</h2>
+<h2 id="prestodb">PrestoDB</h2>
-<p>Presto is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
+<p>PrestoDB is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
This requires placing the <code class="highlighter-rouge">hudi-presto-bundle</code>
jar into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>, across the installation.</p>
<h2 id="impala-34-or-later">Impala (3.4 or later)</h2>
diff --git a/content/docs/comparison.html b/content/docs/comparison.html
index 3fd1343..2a50a60 100644
--- a/content/docs/comparison.html
+++ b/content/docs/comparison.html
@@ -392,7 +392,7 @@ we expect Hudi to be positioned at something that ingests
parquet with superior per
Hive transactions do not offer the read-optimized storage option or the
incremental pulling that Hudi does. In terms of implementation choices, Hudi
leverages
the full power of a processing framework like Spark, while Hive transactions
feature is implemented underneath by Hive tasks/queries kicked off by user or
the Hive metastore.
Based on our production experience, embedding Hudi as a library into existing
Spark pipelines was much easier and less operationally heavy, compared with the
other approach.
-Hudi is also designed to work with non-hive enginers like Presto/Spark and
will incorporate file formats other than parquet over time.</p>
+Hudi is also designed to work with non-hive engines like PrestoDB/Spark and
will incorporate file formats other than parquet over time.</p>
<h2 id="hbase">HBase</h2>
@@ -410,7 +410,7 @@ integration of Hudi library with Spark/Spark streaming
DAGs. In case of Non-Spar
and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In
more conceptual level, data processing
pipelines just consist of three components : <code
class="highlighter-rouge">source</code>, <code
class="highlighter-rouge">processing</code>, <code
class="highlighter-rouge">sink</code>, with users ultimately running queries
against the sink to use the results of the pipeline.
Hudi can act as either a source or sink, that stores data on DFS.
Applicability of Hudi to a given stream processing pipeline ultimately boils
down to suitability
-of Presto/SparkSQL/Hive for your queries.</p>
+of PrestoDB/SparkSQL/Hive for your queries.</p>
<p>More advanced use cases revolve around the concepts of <a
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">incremental
processing</a>, which effectively
uses Hudi even inside the <code class="highlighter-rouge">processing</code>
engine to speed up typical batch pipelines. For e.g: Hudi can be used as a
state store inside a processing DAG (similar
diff --git a/content/docs/configurations.html b/content/docs/configurations.html
index 5f5783e..c274063 100644
--- a/content/docs/configurations.html
+++ b/content/docs/configurations.html
@@ -507,6 +507,10 @@ This is useful to store checkpointing information, in a
consistent way with the
<p>Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.assume_date_partitioning</code>,
Default: <code class="highlighter-rouge">false</code> <br />
<span style="color:grey">Assume partitioning is yyyy/mm/dd</span></p>
+<h4 id="HIVE_USE_JDBC_OPT_KEY">HIVE_USE_JDBC_OPT_KEY</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.use_jdbc</code>, Default:
<code class="highlighter-rouge">true</code> <br />
+ <span style="color:grey">Use JDBC when hive synchronization is
enabled</span></p>
+
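As a sketch of how the hive sync options above are typically passed, here is a plain-Python option map. The `use_jdbc` and `assume_date_partitioning` keys are documented on this page; `hoodie.datasource.hive_sync.enable` and the datasource-write usage in the comment are assumptions from the wider Hudi docs, not stated here.

```python
# Hypothetical option map for hive sync during a Hudi datasource write.
hive_sync_opts = {
    # Assumed from the wider Hudi docs: turns hive sync on.
    "hoodie.datasource.hive_sync.enable": "true",
    # Documented above, default true; false syncs via metastore APIs instead of JDBC.
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    # Documented above, default false; only for yyyy/mm/dd partition layouts.
    "hoodie.datasource.hive_sync.assume_date_partitioning": "true",
}

# These would typically be passed to a Spark datasource write, e.g.
# df.write.format("hudi").options(**hive_sync_opts)...
for key in hive_sync_opts:
    assert key.startswith("hoodie.datasource.hive_sync.")
```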
<h3 id="read-options">Read Options</h3>
<p>Options useful for reading tables via <code
class="highlighter-rouge">read.format.option(...)</code></p>
@@ -563,6 +567,18 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.bulkinsert.shuffle.parallelism</code><br />
<span style="color:grey">Bulk insert is meant to be used for large initial
imports and this parallelism determines the initial number of files in your
table. Tune this to achieve a desired optimal size during initial
import.</span></p>
+<h4
id="withUserDefinedBulkInsertPartitionerClass">withUserDefinedBulkInsertPartitionerClass(className
= x.y.z.UserDefinedPartitionerClass)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bulkinsert.user.defined.partitioner.class</code><br
/>
+<span style="color:grey">If specified, this class will be used to re-partition
input records before they are inserted.</span></p>
+
+<h4 id="withBulkInsertSortMode">withBulkInsertSortMode(mode =
BulkInsertSortMode.GLOBAL_SORT)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bulkinsert.sort.mode</code><br />
+<span style="color:grey">Sorting modes to use for sorting records for bulk
insert. This is leveraged when a user defined partitioner is not configured.
Default is GLOBAL_SORT.
+  Available values are - <strong>GLOBAL_SORT</strong>: ensures best file
sizes, with lowest memory overhead, at the cost of sorting.
+  <strong>PARTITION_SORT</strong>: strikes a balance by only sorting within a
partition, keeping the memory overhead of writing low with best effort file
sizing.
+  <strong>NONE</strong>: no sorting. Fastest and matches <code
class="highlighter-rouge">spark.write.parquet()</code> in terms of number of
files and overheads.
+</span></p>
+
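To make the three sort modes concrete, here is a small plain-Python sketch of how each mode orders `(partition_path, record_key)` records before writing. This is illustrative only, not Hudi's implementation; `bulk_insert_order` is a hypothetical helper.

```python
# Records are (partition_path, record_key) pairs, as Hudi would see them.
records = [("2020/08/21", "k3"), ("2020/08/20", "k1"), ("2020/08/21", "k2")]

def bulk_insert_order(recs, mode):
    """Illustrate record ordering for each hoodie.bulkinsert.sort.mode value."""
    if mode == "GLOBAL_SORT":
        # One global sort across all partitions: best file sizing.
        return sorted(recs)
    if mode == "PARTITION_SORT":
        # Sort only within each partition, partitions kept in arrival order.
        out = []
        for part in dict.fromkeys(p for p, _ in recs):
            out.extend(sorted(r for r in recs if r[0] == part))
        return out
    if mode == "NONE":
        # No sorting: input order preserved, fastest.
        return list(recs)
    raise ValueError(mode)
```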
<h4 id="withParallelism">withParallelism(insert_shuffle_parallelism = 1500,
upsert_shuffle_parallelism = 1500)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.insert.shuffle.parallelism</code>, <code
class="highlighter-rouge">hoodie.upsert.shuffle.parallelism</code><br />
<span style="color:grey">Once data has been initially imported, this
parallelism controls initial parallelism for reading input records. Ensure this
value is high enough say: 1 partition for 1 GB of input data</span></p>
@@ -587,10 +603,22 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.consistency.check.enabled</code><br />
<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files are listable on the underlying filesystem/storage. Set
this to true, to workaround S3’s eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span></p>
+<h4 id="withRollbackParallelism">withRollbackParallelism(rollbackParallelism =
100)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.rollback.parallelism</code><br />
+<span style="color:grey">Determines the parallelism for rollback of
commits.</span></p>
+
+<h4
id="withRollbackUsingMarkers">withRollbackUsingMarkers(rollbackUsingMarkers =
false)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.rollback.using.markers</code><br />
+<span style="color:grey">Enables a more efficient mechanism for rollbacks
based on the marker files generated during the writes. Turned off by
default.</span></p>
+
+<h4 id="withMarkersDeleteParallelism">withMarkersDeleteParallelism(parallelism
= 100)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.markers.delete.parallelism</code><br />
+<span style="color:grey">Determines the parallelism for deleting marker
files.</span></p>
+
<h3 id="index-configs">Index configs</h3>
<p>Following configs control indexing behavior, which tags incoming records as
either inserts or updates to older records.</p>
-<p><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) <br />
+<p><a href="#index-configs">withIndexConfig</a> (HoodieIndexConfig) <br />
<span style="color:grey">This is pluggable to have a external index (HBase) or
use the default bloom filter stored in the Parquet files</span></p>
<h4 id="withIndexClass">withIndexClass(indexClass =
“x.y.z.UserDefinedIndex”)</h4>
@@ -599,7 +627,9 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<h4 id="withIndexType">withIndexType(indexType = BLOOM)</h4>
<p>Property: <code class="highlighter-rouge">hoodie.index.type</code> <br />
-<span style="color:grey">Type of index to use. Default is Bloom filter.
Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the
dependency on a external system and is stored in the footer of the Parquet Data
Files</span></p>
+<span style="color:grey">Type of index to use. Default is Bloom filter.
Possible options are [BLOOM | GLOBAL_BLOOM | SIMPLE | GLOBAL_SIMPLE | INMEMORY |
HBASE]. Bloom filters remove the dependency on an external system and are stored
in the footer of the Parquet data files</span></p>
+
+<h4 id="bloom-index-configs">Bloom Index configs</h4>
<h4 id="bloomFilterNumEntries">bloomFilterNumEntries(numEntries = 60000)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.index.bloom.num_entries</code> <br />
@@ -609,6 +639,10 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code class="highlighter-rouge">hoodie.index.bloom.fpp</code> <br
/>
<span style="color:grey">Only applies if index type is BLOOM. <br /> Error
rate allowed given the number of entries. This is used to calculate how many
bits should be assigned for the bloom filter and the number of hash functions.
This is usually set very low (default: 0.000000001), we like to tradeoff disk
space for lower false positives</span></p>
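The bit and hash-function counts mentioned above follow the standard Bloom filter sizing formulas. A quick sketch for the default `num_entries = 60000` and `fpp = 1e-9`; this is textbook math shown for intuition, not necessarily Hudi's exact internal sizing.

```python
import math

def bloom_sizing(num_entries: int, fpp: float):
    """Standard Bloom filter sizing: bits m = -n*ln(p)/(ln 2)^2, hashes k = (m/n)*ln 2."""
    bits = math.ceil(-num_entries * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_entries * math.log(2)))
    return bits, hashes

bits, hashes = bloom_sizing(60000, 1e-9)
# Roughly 2.6 million bits (~316 KB) and ~30 hash functions for the defaults,
# illustrating the disk-space-for-fewer-false-positives tradeoff.
```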
+<h4 id="bloomIndexParallelism">bloomIndexParallelism(0)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
+<span style="color:grey">Only applies if index type is BLOOM. <br /> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload
characteristics</span></p>
+
<h4 id="bloomIndexPruneByRanges">bloomIndexPruneByRanges(pruneRanges =
true)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.prune.by.ranges</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, range information from files is leveraged to speed up index lookups.
Particularly helpful if the key has a monotonically increasing prefix, such as a
timestamp.</span></p>
@@ -625,13 +659,27 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.bucketized.checking</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, bucketized bloom filtering is enabled. This reduces skew seen in sort
based bloom index lookup</span></p>
+<h4 id="bloomIndexFilterType">bloomIndexFilterType(filterType =
BloomFilterTypeCode.SIMPLE)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.filter.type</code> <br />
+<span style="color:grey">Filter type used. Default is
BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE,
BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto size themselves
based on the number of keys</span></p>
+
+<h4
id="bloomIndexFilterDynamicMaxEntries">bloomIndexFilterDynamicMaxEntries(maxNumberOfEntries
= 100000)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.filter.dynamic.max.entries</code>
<br />
+<span style="color:grey">The threshold for the maximum number of keys to
record in a dynamic Bloom filter row. Only applies if filter type is
BloomFilterTypeCode.DYNAMIC_V0.</span></p>
+
<h4 id="bloomIndexKeysPerBucket">bloomIndexKeysPerBucket(keysPerBucket =
10000000)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.keys.per.bucket</code> <br />
<span style="color:grey">Only applies if bloomIndexBucketizedChecking is
enabled and index type is bloom. <br /> This configuration controls the
“bucket” size which tracks the number of record-key checks made against a
single file and is the unit of work allocated to each partition performing
bloom filter lookup. A higher value would amortize the fixed cost of reading a
bloom filter to memory. </span></p>
-<h4 id="bloomIndexParallelism">bloomIndexParallelism(0)</h4>
-<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
-<span style="color:grey">Only applies if index type is BLOOM. <br /> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload
characteristics</span></p>
+<h5 id="withBloomIndexInputStorageLevel">withBloomIndexInputStorageLevel(level
= MEMORY_AND_DISK_SER)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.input.storage.level</code> <br />
+<span style="color:grey">Only applies when <a
href="#bloomIndexUseCaching">#bloomIndexUseCaching</a> is set. Determines the
level of persistence used to cache input RDDs.<br /> Refer to
org.apache.spark.storage.StorageLevel for different values</span></p>
+
+<h5
id="bloomIndexUpdatePartitionPath">bloomIndexUpdatePartitionPath(updatePartitionPath
= false)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.update.partition.path</code> <br />
+<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br
/>When set to true, an update including the partition path of a record that
already exists will result in inserting the incoming record into the new
partition and deleting the original record in the old partition. When set to
false, the original record will only be updated in the old partition.</span></p>
+
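The two behaviors of `hoodie.bloom.index.update.partition.path` can be sketched in plain Python. This shows illustrative semantics only; `apply_update` and the record dicts are hypothetical, not Hudi code.

```python
def apply_update(existing, incoming, update_partition_path):
    """existing/incoming: dicts with 'key', 'partition', 'value'.

    Models a GLOBAL_BLOOM update whose incoming record may carry a
    different partition path than the stored record.
    """
    if existing["partition"] == incoming["partition"]:
        return [incoming]  # normal in-place update
    if update_partition_path:
        # true: delete from the old partition, insert into the new one
        return [
            {"key": existing["key"], "partition": existing["partition"],
             "value": None},  # tombstone in the old partition
            incoming,
        ]
    # false (default): update the record in place in the old partition
    return [{**incoming, "partition": existing["partition"]}]
```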
+<h4 id="hbase-index-configs">HBase Index configs</h4>
<h4 id="hbaseZkQuorum">hbaseZkQuorum(zkString) [Required]</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.index.hbase.zkquorum</code> <br />
@@ -649,9 +697,23 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code>
<br />
<span style="color:grey">Only applies if index type is HBASE. HBase Table name
to use as the index. Hudi stores the row_key and [partition_path, fileID,
commitTime] mapping in the table.</span></p>
-<h5
id="bloomIndexUpdatePartitionPath">bloomIndexUpdatePartitionPath(updatePartitionPath
= false)</h5>
-<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.update.partition.path</code> <br />
-<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br
/>When set to true, an update including the partition path of a record that
already exists will result in inserting the incoming record into the new
partition and deleting the original record in the old partition. When set to
false, the original record will only be updated in the old partition.</span></p>
+<h4 id="simple-index-configs">Simple Index configs</h4>
+
+<h4 id="simpleIndexUseCaching">simpleIndexUseCaching(useCaching = true)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.simple.index.use.caching</code> <br />
+<span style="color:grey">Only applies if index type is SIMPLE. <br /> When
true, the input RDD will be cached to speed up index lookup by reducing IO for
computing parallelism or affected partitions</span></p>
+
+<h5
id="withSimpleIndexInputStorageLevel">withSimpleIndexInputStorageLevel(level =
MEMORY_AND_DISK_SER)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.simple.index.input.storage.level</code> <br />
+<span style="color:grey">Only applies when <a
href="#simpleIndexUseCaching">#simpleIndexUseCaching</a> is set. Determines the
level of persistence used to cache input RDDs.<br /> Refer to
org.apache.spark.storage.StorageLevel for different values</span></p>
+
+<h4 id="withSimpleIndexParallelism">withSimpleIndexParallelism(parallelism =
50)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.simple.index.parallelism</code> <br />
+<span style="color:grey">Only applies if index type is SIMPLE. <br /> This is
the amount of parallelism for index lookup, which involves a Spark
Shuffle.</span></p>
+
+<h4
id="withGlobalSimpleIndexParallelism">withGlobalSimpleIndexParallelism(parallelism
= 100)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.global.simple.index.parallelism</code> <br />
+<span style="color:grey">Only applies if index type is GLOBAL_SIMPLE. <br />
This is the amount of parallelism for index lookup, which involves a Spark
Shuffle.</span></p>
<h3 id="storage-configs">Storage configs</h3>
<p>Controls aspects around sizing parquet and log files.</p>
@@ -706,6 +768,14 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.cleaner.commits.retained</code> <br />
<span style="color:grey">Number of commits to retain. So data will be retained
for num_of_commits * time_between_commits (scheduled). This also directly
translates into how much you can incrementally pull on this table</span></p>
+<h4 id="withAutoClean">withAutoClean(autoClean = true)</h4>
+<p>Property: <code class="highlighter-rouge">hoodie.clean.automatic</code> <br
/>
+<span style="color:grey">Whether to clean up immediately after each commit, if
there is anything to clean up</span></p>
+
+<h4 id="withAsyncClean">withAsyncClean(asyncClean = false)</h4>
+<p>Property: <code class="highlighter-rouge">hoodie.clean.async</code> <br />
+<span style="color:grey">Only applies when <a
href="#withAutoClean">#withAutoClean</a> is turned on. When turned on, runs the
cleaner asynchronously alongside writing. </span></p>
+
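The retention arithmetic above (data retained for num_of_commits * time_between_commits) is simple to sketch; the cadence value below is an example, not a Hudi default.

```python
# hoodie.cleaner.commits.retained: commits kept before the cleaner reclaims files.
commits_retained = 10
# Example scheduled ingestion cadence, in minutes (an assumption, not a default).
minutes_between_commits = 30

# Approximate incremental-pull window: how far behind a consumer can fall
# and still read incrementally before cleaned files disappear.
retention_minutes = commits_retained * minutes_between_commits  # 300 min (~5 h)
```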
<h4 id="archiveCommitsWith">archiveCommitsWith(minCommits = 96, maxCommits =
128)</h4>
<p>Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>,
<code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
<span style="color:grey">Each commit is a small file in the <code
class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does
not favor lots of small files, Hudi archives older commits into a sequential
log. A commit is published atomically by a rename of the commit file.</span></p>
@@ -724,7 +794,7 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<h4 id="autoTuneInsertSplits">autoTuneInsertSplits(true)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.copyonwrite.insert.auto.split</code> <br />
-<span style="color:grey">Should hudi dynamically compute the insertSplitSize
based on the last 24 commit’s metadata. Turned off by default. </span></p>
+<span style="color:grey">Should hudi dynamically compute the insertSplitSize
based on the last 24 commit’s metadata. Turned on by default. </span></p>
<h4 id="approxRecordSize">approxRecordSize(size = 1024)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.copyonwrite.record.size.estimate</code> <br />
diff --git a/content/docs/deployment.html b/content/docs/deployment.html
index 023675b..4db35f2 100644
--- a/content/docs/deployment.html
+++ b/content/docs/deployment.html
@@ -414,9 +414,9 @@ Specifically, we will cover the following aspects.</p>
<p>All in all, Hudi deploys with no long running servers or additional
infrastructure cost to your data lake. In fact, Hudi pioneered this model of
building a transactional distributed storage layer
using existing infrastructure, and it’s heartening to see other systems adopting
similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer
or custom Spark datasource jobs), deployed per standard Apache Spark <a
href="https://spark.apache.org/docs/latest/cluster-overview.html">recommendations</a>.
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or Presto and hence no additional infrastructure is necessary.</p>
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.</p>
-<p>A typical Hudi data ingestion can be achieved in 2 modes. In a singe run
mode, Hudi ingestion reads next batch of data, ingest them to Hudi table and
exits. In continuous mode, Hudi ingestion runs as a long-running service
executing ingestion in a loop.</p>
+<p>A typical Hudi data ingestion can be achieved in 2 modes. In a single run
mode, Hudi ingestion reads the next batch of data, ingests it into a Hudi table
and exits. In continuous mode, Hudi ingestion runs as a long-running service
executing ingestion in a loop.</p>
<p>With Merge_On_Read Table, Hudi ingestion needs to also take care of
compacting delta files. Again, compaction can be performed in an
asynchronous-mode by letting compaction run concurrently with ingestion or in a
serial fashion with one after another.</p>
@@ -893,7 +893,7 @@ consistent with the compaction plan</p>
<h2 id="troubleshooting">Troubleshooting</h2>
-<p>Section below generally aids in debugging Hudi failures. Off the bat, the
following metadata is added to every record to help triage issues easily using
standard Hadoop SQL engines (Hive/Presto/Spark)</p>
+<p>Section below generally aids in debugging Hudi failures. Off the bat, the
following metadata is added to every record to help triage issues easily using
standard Hadoop SQL engines (Hive/PrestoDB/Spark)</p>
<ul>
<li><strong>_hoodie_record_key</strong> - Treated as a primary key within
each DFS partition, basis of all updates/inserts</li>
diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html
index dcc8b71..4e5de0e 100644
--- a/content/docs/powered_by.html
+++ b/content/docs/powered_by.html
@@ -477,6 +477,9 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA</p>
<li>
<p><a href="https://www.youtube.com/watch?v=N2eDfU_rQ_U">“Apache Hudi -
Design/Code Walkthrough Session for Contributors”</a> - By Vinoth Chandar, July
2020, Hudi community.</p>
</li>
+ <li>
+ <p><a href="https://youtu.be/nA3rwOdmm3A">“PrestoDB and Apache Hudi”</a> -
By Bhavani Sudha Saktheeswaran and Brandon Scheller, Aug 2020, PrestoDB
Community Meetup.</p>
+ </li>
</ol>
<h2 id="articles">Articles</h2>
@@ -489,6 +492,7 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA</p>
<li><a
href="https://searchdatamanagement.techtarget.com/news/252484740/Apache-Hudi-grows-cloud-data-lake-maturity">“Apache
Hudi grows cloud data lake maturity”</a></li>
<li><a href="https://eng.uber.com/apache-hudi-graduation/">“Building a
Large-scale Transactional Data Lake at Uber Using Apache Hudi”</a> - Uber eng
blog by Nishith Agarwal</li>
<li><a
href="https://www.diva-portal.org/smash/get/diva2:1413103/FULLTEXT01.pdf">“Hudi
On Hops”</a> - By NETSANET GEBRETSADKAN KIDANE</li>
+ <li><a
href="https://prestodb.io/blog/2020/08/04/prestodb-and-hudi">“PrestoDB and
Apache Hudi”</a> - PrestoDB - Hudi integration blog by Bhavani Sudha
Saktheeswaran and Brandon Scheller</li>
</ol>
<h2 id="powered-by">Powered by</h2>
diff --git a/content/docs/querying_data.html b/content/docs/querying_data.html
index 3db236f..eca16b2 100644
--- a/content/docs/querying_data.html
+++ b/content/docs/querying_data.html
@@ -4,7 +4,7 @@
<meta charset="utf-8">
<!-- begin _includes/seo.html --><title>Querying Hudi Tables - Apache
Hudi</title>
-<meta name="description" content="Conceptually, Hudi stores data physically
once on DFS, while providing 3 different ways of querying, as explained before.
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi’s custom inputformats. Once the proper hudibundle has
been installed, the table can be queried by popular query engines like Hive,
Spark SQL, Spark Datasource API and Presto.">
+<meta name="description" content="Conceptually, Hudi stores data physically
once on DFS, while providing 3 different ways of querying, as explained before.
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi’s custom inputformats. Once the proper hudibundle has
been installed, the table can be queried by popular query engines like Hive,
Spark SQL, Spark Datasource API and PrestoDB.">
<meta property="og:type" content="article">
<meta property="og:locale" content="en_US">
@@ -13,7 +13,7 @@
<meta property="og:url"
content="https://hudi.apache.org/docs/querying_data.html">
- <meta property="og:description" content="Conceptually, Hudi stores data
physically once on DFS, while providing 3 different ways of querying, as
explained before. Once the table is synced to the Hive metastore, it provides
external Hive tables backed by Hudi’s custom inputformats. Once the proper
hudibundle has been installed, the table can be queried by popular query
engines like Hive, Spark SQL, Spark Datasource API and Presto.">
+ <meta property="og:description" content="Conceptually, Hudi stores data
physically once on DFS, while providing 3 different ways of querying, as
explained before. Once the table is synced to the Hive metastore, it provides
external Hive tables backed by Hudi’s custom inputformats. Once the proper
hudibundle has been installed, the table can be queried by popular query
engines like Hive, Spark SQL, Spark Datasource API and PrestoDB.">
@@ -384,7 +384,7 @@
<li><a href="#spark-incr-query">Incremental query</a></li>
</ul>
</li>
- <li><a href="#presto">Presto</a></li>
+ <li><a href="#prestodb">PrestoDB</a></li>
<li><a href="#impala-34-or-later">Impala (3.4 or later)</a>
<ul>
<li><a href="#snapshot-query">Snapshot Query</a></li>
@@ -396,7 +396,7 @@
<p>Conceptually, Hudi stores data physically once on DFS, while
providing 3 different ways of querying, as explained <a
href="/docs/concepts.html#query-types">before</a>.
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi’s custom inputformats. Once the proper hudi
-bundle has been installed, the table can be queried by popular query engines
like Hive, Spark SQL, Spark Datasource API and Presto.</p>
+bundle has been installed, the table can be queried by popular query engines
like Hive, Spark SQL, Spark Datasource API and PrestoDB.</p>
<p>Specifically, following Hive tables are registered based off <a
href="/docs/configurations.html#TABLE_NAME_OPT_KEY">table name</a>
and <a href="/docs/configurations.html#TABLE_TYPE_OPT_KEY">table type</a>
configs passed during write.</p>
@@ -449,7 +449,7 @@ with special configurations that indicates to query
planning that only increment
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
+ <td><strong>PrestoDB</strong></td>
<td>Y</td>
<td>N</td>
</tr>
@@ -494,8 +494,8 @@ with special configurations that indicates to query
planning that only increment
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
- <td>N</td>
+ <td><strong>PrestoDB</strong></td>
+ <td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
@@ -706,10 +706,39 @@ Please refer to <a
href="/docs/configurations.html#spark-datasource">configurati
</tbody>
</table>
-<h2 id="presto">Presto</h2>
+<h2 id="prestodb">PrestoDB</h2>
+
+<p>PrestoDB is a popular query engine, providing interactive query
performance. PrestoDB currently supports snapshot querying on COPY_ON_WRITE
tables, while both snapshot and read optimized queries are supported on
MERGE_ON_READ Hudi tables. Since the PrestoDB-Hudi integration has evolved over
time, the installation instructions vary by version. Please check the table
below for the query types supported and the installation instructions for
different versions of PrestoDB.</p>
-<p>Presto is a popular query engine, providing interactive query performance.
Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized
queries
-on MERGE_ON_READ Hudi tables. This requires the <code
class="highlighter-rouge">hudi-presto-bundle</code> jar to be placed into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>,
across the installation.</p>
+<table>
+ <thead>
+ <tr>
+ <th><strong>PrestoDB Version</strong></th>
+ <th><strong>Installation description</strong></th>
+ <th><strong>Query types supported</strong></th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>< 0.233</td>
+ <td>Requires the <code
class="highlighter-rouge">hudi-presto-bundle</code> jar to be placed into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>,
across the installation.</td>
+ <td>Snapshot querying on COW tables. Read optimized querying on MOR
tables.</td>
+ </tr>
+ <tr>
+ <td>>= 0.233</td>
+ <td>No action needed. Hudi (0.5.1-incubating) is a compile time
dependency.</td>
+ <td>Snapshot querying on COW tables. Read optimized querying on MOR
tables.</td>
+ </tr>
+ <tr>
+ <td>>= 0.240</td>
+ <td>No action needed. Hudi 0.5.3 version is a compile time
dependency.</td>
+ <td>Snapshot querying on both COW and MOR tables</td>
+ </tr>
+ </tbody>
+</table>
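The version table above amounts to a small dispatch on the PrestoDB version. A sketch in plain Python; the `hudi_support` helper and its return shape are illustrative, not part of Hudi or PrestoDB.

```python
def hudi_support(presto_version: tuple):
    """Given a PrestoDB version as a tuple, e.g. (0, 233), return the
    installation step needed and the query types supported per table type,
    per the version table above."""
    if presto_version < (0, 233):
        return ("place hudi-presto-bundle jar in <presto_install>/plugin/hive-hadoop2/",
                {"COW": ["snapshot"], "MOR": ["read_optimized"]})
    if presto_version < (0, 240):
        return ("none (Hudi 0.5.1-incubating is a compile time dependency)",
                {"COW": ["snapshot"], "MOR": ["read_optimized"]})
    return ("none (Hudi 0.5.3 is a compile time dependency)",
            {"COW": ["snapshot"], "MOR": ["snapshot"]})
```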
<h2 id="impala-34-or-later">Impala (3.4 or later)</h2>
diff --git a/content/docs/structure.html b/content/docs/structure.html
index e68514f..9495489 100644
--- a/content/docs/structure.html
+++ b/content/docs/structure.html
@@ -380,7 +380,7 @@
<img class="docimage" src="/assets/images/hudi_intro_1.png"
alt="hudi_intro_1.png" />
</figure>
-<p>By carefully managing how data is laid out in storage & how it’s
exposed to queries, Hudi is able to power a rich data ecosystem where external
sources can be ingested in near real-time and made available for interactive
SQL Engines like <a href="https://prestodb.io">Presto</a> & <a
href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable
of being consumed incrementally from processing/ETL frameworks like <a
href="https://hive.apache.org/">Hive</a> & [...]
+<p>By carefully managing how data is laid out in storage & how it’s
exposed to queries, Hudi is able to power a rich data ecosystem where external
sources can be ingested in near real-time and made available for interactive
SQL Engines like <a href="https://prestodb.io">PrestoDB</a> & <a
href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable
of being consumed incrementally from processing/ETL frameworks like <a
href="https://hive.apache.org/">Hive</a> & [...]
<p>Hudi broadly consists of a self contained Spark library to build tables and
integrations with existing query engines for data access. See <a
href="/docs/quick-start-guide">quickstart</a> for a demo.</p>