This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e995bd2 Travis CI build asf-site
e995bd2 is described below
commit e995bd2819c7df640826bada28f9dd87c0110a62
Author: CI <[email protected]>
AuthorDate: Mon Aug 24 19:11:45 2020 +0000
Travis CI build asf-site
---
content/assets/js/lunr/lunr-store.js | 2 +-
content/cn/docs/querying_data.html | 12 ++---
content/docs/comparison.html | 4 +-
content/docs/configurations.html | 88 ++++++++++++++++++++++++++++++++----
content/docs/deployment.html | 6 +--
content/docs/powered_by.html | 4 ++
content/docs/querying_data.html | 49 ++++++++++++++++----
content/docs/structure.html | 2 +-
8 files changed, 135 insertions(+), 32 deletions(-)
diff --git a/content/assets/js/lunr/lunr-store.js b/content/assets/js/lunr/lunr-store.js
index 5e4b619..c07019e 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -825,7 +825,7 @@ var store = [{
"url": "https://hudi.apache.org/docs/writing_data.html",
"teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
"title": "Querying Hudi Datasets",
- "excerpt":"Conceptually, Hudi physically stores data once on DFS, while providing three logical views on top of it, as explained before. Once the dataset is synced to the Hive Metastore, it provides external Hive tables backed by Hudi's custom input formats. Once the proper Hudi bundle is provided, the dataset can be queried by popular query engines like Hive, Spark and Presto. Specifically, two Hive tables named after the table name are registered during the write. For example, if table name = hudi_tbl, we get hudi_tbl, which realizes the read optimized view of the dataset backed by HoodieParquetInputFormat, serving purely columnar data, and hudi_tbl_rt, which realizes the real-time view of the dataset backed by HoodieParquetRealtimeInputFormat, serving a merged view of base and log data. As explained in the concepts section, a key primitive needed for incremental processing is incremental pull (to obtain a change stream/log from the dataset). You can pull a Hudi dataset incrementally, meaning you get all of, and only, the updated and new rows since a specified instant time. Used together with upserts, this is useful for building certain [...]
+ "excerpt":"Conceptually, Hudi physically stores data once on DFS, while providing three logical views on top of it, as explained before. Once the dataset is synced to the Hive Metastore, it provides external Hive tables backed by Hudi's custom input formats. Once the proper Hudi bundle is provided, the dataset can be queried by popular query engines like Hive, Spark and Presto. Specifically, two Hive tables named after the table name are registered during the write. For example, if table name = hudi_tbl, we get hudi_tbl, which realizes the read optimized view of the dataset backed by HoodieParquetInputFormat, serving purely columnar data, and hudi_tbl_rt, which realizes the real-time view of the dataset backed by HoodieParquetRealtimeInputFormat, serving a merged view of base and log data. As explained in the concepts section, a key primitive needed for incremental processing is incremental pull (to obtain a change stream/log from the dataset). You can pull a Hudi dataset incrementally, meaning you get all of, and only, the updated and new rows since a specified instant time. Used together with upserts, this is useful for building certain [...]
"tags": [],
"url": "https://hudi.apache.org/cn/docs/querying_data.html",
"teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
diff --git a/content/cn/docs/querying_data.html b/content/cn/docs/querying_data.html
index e808682..3129f6d 100644
--- a/content/cn/docs/querying_data.html
+++ b/content/cn/docs/querying_data.html
@@ -375,7 +375,7 @@
<li><a href="#spark-incr-pull">Incremental pull</a></li>
</ul>
</li>
- <li><a href="#presto">Presto</a></li>
+ <li><a href="#prestodb">PrestoDB</a></li>
<li><a href="#impala-34-or-later">Impala (3.4 or later)</a>
<ul>
<li><a href="#读优化表-1">Read optimized table</a></li>
@@ -434,7 +434,7 @@
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
+ <td><strong>PrestoDB</strong></td>
<td>Y</td>
<td>N</td>
</tr>
@@ -477,8 +477,8 @@
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
- <td>N</td>
+ <td><strong>PrestoDB</strong></td>
+ <td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
@@ -703,9 +703,9 @@ The Upsert utility (<code class="highlighter-rouge">HoodieDeltaStreamer</code>
</tbody>
</table>
-<h2 id="presto">Presto</h2>
+<h2 id="prestodb">PrestoDB</h2>
-<p>Presto is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
+<p>PrestoDB is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
This requires placing the <code class="highlighter-rouge">hudi-presto-bundle</code>
jar into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>, across the installation.</p>
<h2 id="impala-34-or-later">Impala (3.4 or later)</h2>
diff --git a/content/docs/comparison.html b/content/docs/comparison.html
index 3fd1343..2a50a60 100644
--- a/content/docs/comparison.html
+++ b/content/docs/comparison.html
@@ -392,7 +392,7 @@ we expect Hudi to be positioned at something that ingests
parquet with superior per
Hive transactions do not offer the read-optimized storage option or the
incremental pulling that Hudi does. In terms of implementation choices, Hudi
leverages
the full power of a processing framework like Spark, while Hive transactions
feature is implemented underneath by Hive tasks/queries kicked off by user or
the Hive metastore.
Based on our production experience, embedding Hudi as a library into existing
Spark pipelines was much easier and less operationally heavy, compared with the
other approach.
-Hudi is also designed to work with non-hive enginers like Presto/Spark and
will incorporate file formats other than parquet over time.</p>
+Hudi is also designed to work with non-hive engines like PrestoDB/Spark and
will incorporate file formats other than parquet over time.</p>
<h2 id="hbase">HBase</h2>
@@ -410,7 +410,7 @@ integration of Hudi library with Spark/Spark streaming
DAGs. In case of Non-Spar
and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In
more conceptual level, data processing
pipelines just consist of three components : <code
class="highlighter-rouge">source</code>, <code
class="highlighter-rouge">processing</code>, <code
class="highlighter-rouge">sink</code>, with users ultimately running queries
against the sink to use the results of the pipeline.
Hudi can act as either a source or sink, that stores data on DFS.
Applicability of Hudi to a given stream processing pipeline ultimately boils
down to suitability
-of Presto/SparkSQL/Hive for your queries.</p>
+of PrestoDB/SparkSQL/Hive for your queries.</p>
<p>More advanced use cases revolve around the concepts of <a
href="https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop">incremental
processing</a>, which effectively
uses Hudi even inside the <code class="highlighter-rouge">processing</code>
engine to speed up typical batch pipelines. For e.g: Hudi can be used as a
state store inside a processing DAG (similar
diff --git a/content/docs/configurations.html b/content/docs/configurations.html
index 5f5783e..c274063 100644
--- a/content/docs/configurations.html
+++ b/content/docs/configurations.html
@@ -507,6 +507,10 @@ This is useful to store checkpointing information, in a
consistent way with the
<p>Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.assume_date_partitioning</code>,
Default: <code class="highlighter-rouge">false</code> <br />
<span style="color:grey">Assume partitioning is yyyy/mm/dd</span></p>
+<h4 id="HIVE_USE_JDBC_OPT_KEY">HIVE_USE_JDBC_OPT_KEY</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.datasource.hive_sync.use_jdbc</code>, Default:
<code class="highlighter-rouge">true</code> <br />
+ <span style="color:grey">Use JDBC when hive synchronization is
enabled</span></p>
+
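As a sketch of how the hive sync options above are typically passed, here is a plain-Python option map. The `use_jdbc` and `assume_date_partitioning` keys are documented on this page; `hoodie.datasource.hive_sync.enable` and the datasource-write usage in the comment are assumptions from the wider Hudi docs, not stated here.

```python
# Hypothetical option map for hive sync during a Hudi datasource write.
hive_sync_opts = {
    # Assumed from the wider Hudi docs: turns hive sync on.
    "hoodie.datasource.hive_sync.enable": "true",
    # Documented above, default true; false syncs via metastore APIs instead of JDBC.
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    # Documented above, default false; only for yyyy/mm/dd partition layouts.
    "hoodie.datasource.hive_sync.assume_date_partitioning": "true",
}

# These would typically be passed to a Spark datasource write, e.g.
# df.write.format("hudi").options(**hive_sync_opts)...
for key in hive_sync_opts:
    assert key.startswith("hoodie.datasource.hive_sync.")
```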
<h3 id="read-options">Read Options</h3>
<p>Options useful for reading tables via <code
class="highlighter-rouge">read.format.option(...)</code></p>
@@ -563,6 +567,18 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.bulkinsert.shuffle.parallelism</code><br />
<span style="color:grey">Bulk insert is meant to be used for large initial
imports and this parallelism determines the initial number of files in your
table. Tune this to achieve a desired optimal size during initial
import.</span></p>
+<h4
id="withUserDefinedBulkInsertPartitionerClass">withUserDefinedBulkInsertPartitionerClass(className
= x.y.z.UserDefinedPartitionerClass)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bulkinsert.user.defined.partitioner.class</code><br
/>
+<span style="color:grey">If specified, this class will be used to re-partition
input records before they are inserted.</span></p>
+
+<h4 id="withBulkInsertSortMode">withBulkInsertSortMode(mode =
BulkInsertSortMode.GLOBAL_SORT)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bulkinsert.sort.mode</code><br />
+<span style="color:grey">Sorting modes to use for sorting records for bulk
insert. This is leveraged when a user defined partitioner is not configured.
Default is GLOBAL_SORT.
+  Available values are - <strong>GLOBAL_SORT</strong>: ensures best file
sizes, with lowest memory overhead, at the cost of sorting.
+  <strong>PARTITION_SORT</strong>: strikes a balance by only sorting within a
partition, keeping the memory overhead of writing low with best effort file
sizing.
+  <strong>NONE</strong>: no sorting. Fastest and matches <code
class="highlighter-rouge">spark.write.parquet()</code> in terms of number of
files and overheads.
+</span></p>
+
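To make the three sort modes concrete, here is a small plain-Python sketch of how each mode orders `(partition_path, record_key)` records before writing. This is illustrative only, not Hudi's implementation; `bulk_insert_order` is a hypothetical helper.

```python
# Records are (partition_path, record_key) pairs, as Hudi would see them.
records = [("2020/08/21", "k3"), ("2020/08/20", "k1"), ("2020/08/21", "k2")]

def bulk_insert_order(recs, mode):
    """Illustrate record ordering for each hoodie.bulkinsert.sort.mode value."""
    if mode == "GLOBAL_SORT":
        # One global sort across all partitions: best file sizing.
        return sorted(recs)
    if mode == "PARTITION_SORT":
        # Sort only within each partition, partitions kept in arrival order.
        out = []
        for part in dict.fromkeys(p for p, _ in recs):
            out.extend(sorted(r for r in recs if r[0] == part))
        return out
    if mode == "NONE":
        # No sorting: input order preserved, fastest.
        return list(recs)
    raise ValueError(mode)
```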
<h4 id="withParallelism">withParallelism(insert_shuffle_parallelism = 1500,
upsert_shuffle_parallelism = 1500)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.insert.shuffle.parallelism</code>, <code
class="highlighter-rouge">hoodie.upsert.shuffle.parallelism</code><br />
<span style="color:grey">Once data has been initially imported, this
parallelism controls initial parallelism for reading input records. Ensure this
value is high enough say: 1 partition for 1 GB of input data</span></p>
@@ -587,10 +603,22 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.consistency.check.enabled</code><br />
<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files are listable on the underlying filesystem/storage. Set
this to true, to workaround S3’s eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span></p>
+<h4 id="withRollbackParallelism">withRollbackParallelism(rollbackParallelism =
100)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.rollback.parallelism</code><br />
+<span style="color:grey">Determines the parallelism for rollback of
commits.</span></p>
+
+<h4
id="withRollbackUsingMarkers">withRollbackUsingMarkers(rollbackUsingMarkers =
false)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.rollback.using.markers</code><br />
+<span style="color:grey">Enables a more efficient mechanism for rollbacks
based on the marker files generated during the writes. Turned off by
default.</span></p>
+
+<h4 id="withMarkersDeleteParallelism">withMarkersDeleteParallelism(parallelism
= 100)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.markers.delete.parallelism</code><br />
+<span style="color:grey">Determines the parallelism for deleting marker
files.</span></p>
+
<h3 id="index-configs">Index configs</h3>
<p>Following configs control indexing behavior, which tags incoming records as
either inserts or updates to older records.</p>
-<p><a href="#withIndexConfig">withIndexConfig</a> (HoodieIndexConfig) <br />
+<p><a href="#index-configs">withIndexConfig</a> (HoodieIndexConfig) <br />
<span style="color:grey">This is pluggable to have a external index (HBase) or
use the default bloom filter stored in the Parquet files</span></p>
<h4 id="withIndexClass">withIndexClass(indexClass =
“x.y.z.UserDefinedIndex”)</h4>
@@ -599,7 +627,9 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<h4 id="withIndexType">withIndexType(indexType = BLOOM)</h4>
<p>Property: <code class="highlighter-rouge">hoodie.index.type</code> <br />
-<span style="color:grey">Type of index to use. Default is Bloom filter.
Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the
dependency on a external system and is stored in the footer of the Parquet Data
Files</span></p>
+<span style="color:grey">Type of index to use. Default is Bloom filter.
Possible options are [BLOOM | GLOBAL_BLOOM | SIMPLE | GLOBAL_SIMPLE | INMEMORY |
HBASE]. Bloom filters remove the dependency on an external system and are stored
in the footer of the Parquet data files</span></p>
+
+<h4 id="bloom-index-configs">Bloom Index configs</h4>
<h4 id="bloomFilterNumEntries">bloomFilterNumEntries(numEntries = 60000)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.index.bloom.num_entries</code> <br />
@@ -609,6 +639,10 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code class="highlighter-rouge">hoodie.index.bloom.fpp</code> <br
/>
<span style="color:grey">Only applies if index type is BLOOM. <br /> Error
rate allowed given the number of entries. This is used to calculate how many
bits should be assigned for the bloom filter and the number of hash functions.
This is usually set very low (default: 0.000000001), we like to tradeoff disk
space for lower false positives</span></p>
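The bit and hash-function counts mentioned above follow the standard Bloom filter sizing formulas. A quick sketch for the default `num_entries = 60000` and `fpp = 1e-9`; this is textbook math shown for intuition, not necessarily Hudi's exact internal sizing.

```python
import math

def bloom_sizing(num_entries: int, fpp: float):
    """Standard Bloom filter sizing: bits m = -n*ln(p)/(ln 2)^2, hashes k = (m/n)*ln 2."""
    bits = math.ceil(-num_entries * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_entries * math.log(2)))
    return bits, hashes

bits, hashes = bloom_sizing(60000, 1e-9)
# Roughly 2.6 million bits (~316 KB) and ~30 hash functions for the defaults,
# illustrating the disk-space-for-fewer-false-positives tradeoff.
```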
+<h4 id="bloomIndexParallelism">bloomIndexParallelism(0)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
+<span style="color:grey">Only applies if index type is BLOOM. <br /> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload
characteristics</span></p>
+
<h4 id="bloomIndexPruneByRanges">bloomIndexPruneByRanges(pruneRanges =
true)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.prune.by.ranges</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, range information from files is leveraged to speed up index lookups.
Particularly helpful if the key has a monotonically increasing prefix, such as a
timestamp.</span></p>
@@ -625,13 +659,27 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.bucketized.checking</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, bucketized bloom filtering is enabled. This reduces skew seen in sort
based bloom index lookup</span></p>
+<h4 id="bloomIndexFilterType">bloomIndexFilterType(filterType =
BloomFilterTypeCode.SIMPLE)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.filter.type</code> <br />
+<span style="color:grey">Filter type used. Default is
BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE,
BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto size themselves
based on the number of keys</span></p>
+
+<h4
id="bloomIndexFilterDynamicMaxEntries">bloomIndexFilterDynamicMaxEntries(maxNumberOfEntries
= 100000)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.filter.dynamic.max.entries</code>
<br />
+<span style="color:grey">The threshold for the maximum number of keys to
record in a dynamic Bloom filter row. Only applies if filter type is
BloomFilterTypeCode.DYNAMIC_V0.</span></p>
+
<h4 id="bloomIndexKeysPerBucket">bloomIndexKeysPerBucket(keysPerBucket =
10000000)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.keys.per.bucket</code> <br />
<span style="color:grey">Only applies if bloomIndexBucketizedChecking is
enabled and index type is bloom. <br /> This configuration controls the
“bucket” size which tracks the number of record-key checks made against a
single file and is the unit of work allocated to each partition performing
bloom filter lookup. A higher value would amortize the fixed cost of reading a
bloom filter to memory. </span></p>
-<h4 id="bloomIndexParallelism">bloomIndexParallelism(0)</h4>
-<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
-<span style="color:grey">Only applies if index type is BLOOM. <br /> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload
characteristics</span></p>
+<h5 id="withBloomIndexInputStorageLevel">withBloomIndexInputStorageLevel(level
= MEMORY_AND_DISK_SER)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.input.storage.level</code> <br />
+<span style="color:grey">Only applies when <a
href="#bloomIndexUseCaching">#bloomIndexUseCaching</a> is set. Determines the
level of persistence used to cache input RDDs.<br /> Refer to
org.apache.spark.storage.StorageLevel for different values</span></p>
+
+<h5
id="bloomIndexUpdatePartitionPath">bloomIndexUpdatePartitionPath(updatePartitionPath
= false)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.update.partition.path</code> <br />
+<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br
/>When set to true, an update including the partition path of a record that
already exists will result in inserting the incoming record into the new
partition and deleting the original record in the old partition. When set to
false, the original record will only be updated in the old partition.</span></p>
+
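The two behaviors of `hoodie.bloom.index.update.partition.path` can be sketched in plain Python. This shows illustrative semantics only; `apply_update` and the record dicts are hypothetical, not Hudi code.

```python
def apply_update(existing, incoming, update_partition_path):
    """existing/incoming: dicts with 'key', 'partition', 'value'.

    Models a GLOBAL_BLOOM update whose incoming record may carry a
    different partition path than the stored record.
    """
    if existing["partition"] == incoming["partition"]:
        return [incoming]  # normal in-place update
    if update_partition_path:
        # true: delete from the old partition, insert into the new one
        return [
            {"key": existing["key"], "partition": existing["partition"],
             "value": None},  # tombstone in the old partition
            incoming,
        ]
    # false (default): update the record in place in the old partition
    return [{**incoming, "partition": existing["partition"]}]
```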
+<h4 id="hbase-index-configs">HBase Index configs</h4>
<h4 id="hbaseZkQuorum">hbaseZkQuorum(zkString) [Required]</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.index.hbase.zkquorum</code> <br />
@@ -649,9 +697,23 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code>
<br />
<span style="color:grey">Only applies if index type is HBASE. HBase Table name
to use as the index. Hudi stores the row_key and [partition_path, fileID,
commitTime] mapping in the table.</span></p>
-<h5
id="bloomIndexUpdatePartitionPath">bloomIndexUpdatePartitionPath(updatePartitionPath
= false)</h5>
-<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.update.partition.path</code> <br />
-<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br
/>When set to true, an update including the partition path of a record that
already exists will result in inserting the incoming record into the new
partition and deleting the original record in the old partition. When set to
false, the original record will only be updated in the old partition.</span></p>
+<h4 id="simple-index-configs">Simple Index configs</h4>
+
+<h4 id="simpleIndexUseCaching">simpleIndexUseCaching(useCaching = true)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.simple.index.use.caching</code> <br />
+<span style="color:grey">Only applies if index type is SIMPLE. <br /> When
true, the input RDD will be cached to speed up index lookup by reducing IO for
computing parallelism or affected partitions</span></p>
+
+<h5
id="withSimpleIndexInputStorageLevel">withSimpleIndexInputStorageLevel(level =
MEMORY_AND_DISK_SER)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.simple.index.input.storage.level</code> <br />
+<span style="color:grey">Only applies when <a
href="#simpleIndexUseCaching">#simpleIndexUseCaching</a> is set. Determines the
level of persistence used to cache input RDDs.<br /> Refer to
org.apache.spark.storage.StorageLevel for different values</span></p>
+
+<h4 id="withSimpleIndexParallelism">withSimpleIndexParallelism(parallelism =
50)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.simple.index.parallelism</code> <br />
+<span style="color:grey">Only applies if index type is SIMPLE. <br /> This is
the amount of parallelism for index lookup, which involves a Spark
Shuffle.</span></p>
+
+<h4
id="withGlobalSimpleIndexParallelism">withGlobalSimpleIndexParallelism(parallelism
= 100)</h4>
+<p>Property: <code
class="highlighter-rouge">hoodie.global.simple.index.parallelism</code> <br />
+<span style="color:grey">Only applies if index type is GLOBAL_SIMPLE. <br />
This is the amount of parallelism for index lookup, which involves a Spark
Shuffle.</span></p>
<h3 id="storage-configs">Storage configs</h3>
<p>Controls aspects around sizing parquet and log files.</p>
@@ -706,6 +768,14 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.cleaner.commits.retained</code> <br />
<span style="color:grey">Number of commits to retain. So data will be retained
for num_of_commits * time_between_commits (scheduled). This also directly
translates into how much you can incrementally pull on this table</span></p>
+<h4 id="withAutoClean">withAutoClean(autoClean = true)</h4>
+<p>Property: <code class="highlighter-rouge">hoodie.clean.automatic</code> <br
/>
+<span style="color:grey">Whether to clean up immediately after each commit, if
there is anything to clean up</span></p>
+
+<h4 id="withAsyncClean">withAsyncClean(asyncClean = false)</h4>
+<p>Property: <code class="highlighter-rouge">hoodie.clean.async</code> <br />
+<span style="color:grey">Only applies when <a
href="#withAutoClean">#withAutoClean</a> is turned on. When turned on, runs the
cleaner asynchronously alongside writing. </span></p>
+
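The retention arithmetic above (data retained for num_of_commits * time_between_commits) is simple to sketch; the cadence value below is an example, not a Hudi default.

```python
# hoodie.cleaner.commits.retained: commits kept before the cleaner reclaims files.
commits_retained = 10
# Example scheduled ingestion cadence, in minutes (an assumption, not a default).
minutes_between_commits = 30

# Approximate incremental-pull window: how far behind a consumer can fall
# and still read incrementally before cleaned files disappear.
retention_minutes = commits_retained * minutes_between_commits  # 300 min (~5 h)
```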
<h4 id="archiveCommitsWith">archiveCommitsWith(minCommits = 96, maxCommits =
128)</h4>
<p>Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>,
<code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
<span style="color:grey">Each commit is a small file in the <code
class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does
not favor lots of small files, Hudi archives older commits into a sequential
log. A commit is published atomically by a rename of the commit file.</span></p>
@@ -724,7 +794,7 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<h4 id="autoTuneInsertSplits">autoTuneInsertSplits(true)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.copyonwrite.insert.auto.split</code> <br />
-<span style="color:grey">Should hudi dynamically compute the insertSplitSize
based on the last 24 commit’s metadata. Turned off by default. </span></p>
+<span style="color:grey">Should hudi dynamically compute the insertSplitSize
based on the last 24 commit’s metadata. Turned on by default. </span></p>
<h4 id="approxRecordSize">approxRecordSize(size = 1024)</h4>
<p>Property: <code
class="highlighter-rouge">hoodie.copyonwrite.record.size.estimate</code> <br />
diff --git a/content/docs/deployment.html b/content/docs/deployment.html
index 023675b..4db35f2 100644
--- a/content/docs/deployment.html
+++ b/content/docs/deployment.html
@@ -414,9 +414,9 @@ Specifically, we will cover the following aspects.</p>
<p>All in all, Hudi deploys with no long running servers or additional
infrastructure cost to your data lake. In fact, Hudi pioneered this model of
building a transactional distributed storage layer
using existing infrastructure, and it’s heartening to see other systems adopting
similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer
or custom Spark datasource jobs), deployed per standard Apache Spark <a
href="https://spark.apache.org/docs/latest/cluster-overview.html">recommendations</a>.
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or Presto and hence no additional infrastructure is necessary.</p>
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.</p>
-<p>A typical Hudi data ingestion can be achieved in 2 modes. In a singe run
mode, Hudi ingestion reads next batch of data, ingest them to Hudi table and
exits. In continuous mode, Hudi ingestion runs as a long-running service
executing ingestion in a loop.</p>
+<p>A typical Hudi data ingestion can be achieved in 2 modes. In a single run
mode, Hudi ingestion reads the next batch of data, ingests it into a Hudi table
and exits. In continuous mode, Hudi ingestion runs as a long-running service
executing ingestion in a loop.</p>
<p>With Merge_On_Read Table, Hudi ingestion needs to also take care of
compacting delta files. Again, compaction can be performed in an
asynchronous-mode by letting compaction run concurrently with ingestion or in a
serial fashion with one after another.</p>
@@ -893,7 +893,7 @@ consistent with the compaction plan</p>
<h2 id="troubleshooting">Troubleshooting</h2>
-<p>Section below generally aids in debugging Hudi failures. Off the bat, the
following metadata is added to every record to help triage issues easily using
standard Hadoop SQL engines (Hive/Presto/Spark)</p>
+<p>Section below generally aids in debugging Hudi failures. Off the bat, the
following metadata is added to every record to help triage issues easily using
standard Hadoop SQL engines (Hive/PrestoDB/Spark)</p>
<ul>
<li><strong>_hoodie_record_key</strong> - Treated as a primary key within
each DFS partition, basis of all updates/inserts</li>
diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html
index dcc8b71..4e5de0e 100644
--- a/content/docs/powered_by.html
+++ b/content/docs/powered_by.html
@@ -477,6 +477,9 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA</p>
<li>
<p><a href="https://www.youtube.com/watch?v=N2eDfU_rQ_U">“Apache Hudi -
Design/Code Walkthrough Session for Contributors”</a> - By Vinoth Chandar, July
2020, Hudi community.</p>
</li>
+ <li>
+ <p><a href="https://youtu.be/nA3rwOdmm3A">“PrestoDB and Apache Hudi”</a> -
By Bhavani Sudha Saktheeswaran and Brandon Scheller, Aug 2020, PrestoDB
Community Meetup.</p>
+ </li>
</ol>
<h2 id="articles">Articles</h2>
@@ -489,6 +492,7 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA</p>
<li><a
href="https://searchdatamanagement.techtarget.com/news/252484740/Apache-Hudi-grows-cloud-data-lake-maturity">“Apache
Hudi grows cloud data lake maturity”</a></li>
<li><a href="https://eng.uber.com/apache-hudi-graduation/">“Building a
Large-scale Transactional Data Lake at Uber Using Apache Hudi”</a> - Uber eng
blog by Nishith Agarwal</li>
<li><a
href="https://www.diva-portal.org/smash/get/diva2:1413103/FULLTEXT01.pdf">“Hudi
On Hops”</a> - By NETSANET GEBRETSADKAN KIDANE</li>
+ <li><a
href="https://prestodb.io/blog/2020/08/04/prestodb-and-hudi">“PrestoDB and
Apache Hudi”</a> - PrestoDB - Hudi integration blog by Bhavani Sudha
Saktheeswaran and Brandon Scheller</li>
</ol>
<h2 id="powered-by">Powered by</h2>
diff --git a/content/docs/querying_data.html b/content/docs/querying_data.html
index 3db236f..eca16b2 100644
--- a/content/docs/querying_data.html
+++ b/content/docs/querying_data.html
@@ -4,7 +4,7 @@
<meta charset="utf-8">
<!-- begin _includes/seo.html --><title>Querying Hudi Tables - Apache
Hudi</title>
-<meta name="description" content="Conceptually, Hudi stores data physically
once on DFS, while providing 3 different ways of querying, as explained before.
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi’s custom inputformats. Once the proper hudibundle has
been installed, the table can be queried by popular query engines like Hive,
Spark SQL, Spark Datasource API and Presto.">
+<meta name="description" content="Conceptually, Hudi stores data physically
once on DFS, while providing 3 different ways of querying, as explained before.
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi’s custom inputformats. Once the proper hudibundle has
been installed, the table can be queried by popular query engines like Hive,
Spark SQL, Spark Datasource API and PrestoDB.">
<meta property="og:type" content="article">
<meta property="og:locale" content="en_US">
@@ -13,7 +13,7 @@
<meta property="og:url"
content="https://hudi.apache.org/docs/querying_data.html">
- <meta property="og:description" content="Conceptually, Hudi stores data
physically once on DFS, while providing 3 different ways of querying, as
explained before. Once the table is synced to the Hive metastore, it provides
external Hive tables backed by Hudi’s custom inputformats. Once the proper
hudibundle has been installed, the table can be queried by popular query
engines like Hive, Spark SQL, Spark Datasource API and Presto.">
+ <meta property="og:description" content="Conceptually, Hudi stores data
physically once on DFS, while providing 3 different ways of querying, as
explained before. Once the table is synced to the Hive metastore, it provides
external Hive tables backed by Hudi’s custom inputformats. Once the proper
hudibundle has been installed, the table can be queried by popular query
engines like Hive, Spark SQL, Spark Datasource API and PrestoDB.">
@@ -384,7 +384,7 @@
<li><a href="#spark-incr-query">Incremental query</a></li>
</ul>
</li>
- <li><a href="#presto">Presto</a></li>
+ <li><a href="#prestodb">PrestoDB</a></li>
<li><a href="#impala-34-or-later">Impala (3.4 or later)</a>
<ul>
<li><a href="#snapshot-query">Snapshot Query</a></li>
@@ -396,7 +396,7 @@
<p>Conceptually, Hudi stores data physically once on DFS, while
providing 3 different ways of querying, as explained <a
href="/docs/concepts.html#query-types">before</a>.
Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi’s custom inputformats. Once the proper hudi
-bundle has been installed, the table can be queried by popular query engines
like Hive, Spark SQL, Spark Datasource API and Presto.</p>
+bundle has been installed, the table can be queried by popular query engines
like Hive, Spark SQL, Spark Datasource API and PrestoDB.</p>
<p>Specifically, following Hive tables are registered based off <a
href="/docs/configurations.html#TABLE_NAME_OPT_KEY">table name</a>
and <a href="/docs/configurations.html#TABLE_TYPE_OPT_KEY">table type</a>
configs passed during write.</p>
@@ -449,7 +449,7 @@ with special configurations that indicates to query
planning that only increment
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
+ <td><strong>PrestoDB</strong></td>
<td>Y</td>
<td>N</td>
</tr>
@@ -494,8 +494,8 @@ with special configurations that indicates to query
planning that only increment
<td>Y</td>
</tr>
<tr>
- <td><strong>Presto</strong></td>
- <td>N</td>
+ <td><strong>PrestoDB</strong></td>
+ <td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
@@ -706,10 +706,39 @@ Please refer to <a
href="/docs/configurations.html#spark-datasource">configurati
</tbody>
</table>
-<h2 id="presto">Presto</h2>
+<h2 id="prestodb">PrestoDB</h2>
+
+<p>PrestoDB is a popular query engine, providing interactive query
performance. PrestoDB currently supports snapshot querying on COPY_ON_WRITE
tables, while both snapshot and read optimized queries are supported on
MERGE_ON_READ Hudi tables. Since the PrestoDB-Hudi integration has evolved over
time, the installation instructions vary by version. Please check the table
below for the query types supported and the installation instructions for
different versions of PrestoDB.</p>
-<p>Presto is a popular query engine, providing interactive query performance.
Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized
queries
-on MERGE_ON_READ Hudi tables. This requires the <code
class="highlighter-rouge">hudi-presto-bundle</code> jar to be placed into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>,
across the installation.</p>
+<table>
+ <thead>
+ <tr>
+ <th><strong>PrestoDB Version</strong></th>
+ <th><strong>Installation description</strong></th>
+ <th><strong>Query types supported</strong></th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>< 0.233</td>
+ <td>Requires the <code
class="highlighter-rouge">hudi-presto-bundle</code> jar to be placed into <code
class="highlighter-rouge"><presto_install>/plugin/hive-hadoop2/</code>,
across the installation.</td>
+ <td>Snapshot querying on COW tables. Read optimized querying on MOR
tables.</td>
+ </tr>
+ <tr>
+ <td>>= 0.233</td>
+ <td>No action needed. Hudi (0.5.1-incubating) is a compile time
dependency.</td>
+ <td>Snapshot querying on COW tables. Read optimized querying on MOR
tables.</td>
+ </tr>
+ <tr>
+ <td>>= 0.240</td>
+ <td>No action needed. Hudi 0.5.3 version is a compile time
dependency.</td>
+ <td>Snapshot querying on both COW and MOR tables</td>
+ </tr>
+ </tbody>
+</table>
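The version table above amounts to a small dispatch on the PrestoDB version. A sketch in plain Python; the `hudi_support` helper and its return shape are illustrative, not part of Hudi or PrestoDB.

```python
def hudi_support(presto_version: tuple):
    """Given a PrestoDB version as a tuple, e.g. (0, 233), return the
    installation step needed and the query types supported per table type,
    per the version table above."""
    if presto_version < (0, 233):
        return ("place hudi-presto-bundle jar in <presto_install>/plugin/hive-hadoop2/",
                {"COW": ["snapshot"], "MOR": ["read_optimized"]})
    if presto_version < (0, 240):
        return ("none (Hudi 0.5.1-incubating is a compile time dependency)",
                {"COW": ["snapshot"], "MOR": ["read_optimized"]})
    return ("none (Hudi 0.5.3 is a compile time dependency)",
            {"COW": ["snapshot"], "MOR": ["snapshot"]})
```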
<h2 id="impala-34-or-later">Impala (3.4 or later)</h2>
diff --git a/content/docs/structure.html b/content/docs/structure.html
index e68514f..9495489 100644
--- a/content/docs/structure.html
+++ b/content/docs/structure.html
@@ -380,7 +380,7 @@
<img class="docimage" src="/assets/images/hudi_intro_1.png"
alt="hudi_intro_1.png" />
</figure>
-<p>By carefully managing how data is laid out in storage & how it’s
exposed to queries, Hudi is able to power a rich data ecosystem where external
sources can be ingested in near real-time and made available for interactive
SQL Engines like <a href="https://prestodb.io">Presto</a> & <a
href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable
of being consumed incrementally from processing/ETL frameworks like <a
href="https://hive.apache.org/">Hive</a> & [...]
+<p>By carefully managing how data is laid out in storage & how it’s
exposed to queries, Hudi is able to power a rich data ecosystem where external
sources can be ingested in near real-time and made available for interactive
SQL Engines like <a href="https://prestodb.io">PrestoDB</a> & <a
href="https://spark.apache.org/sql/">Spark</a>, while at the same time capable
of being consumed incrementally from processing/ETL frameworks like <a
href="https://hive.apache.org/">Hive</a> & [...]
<p>Hudi broadly consists of a self contained Spark library to build tables and
integrations with existing query engines for data access. See <a
href="/docs/quick-start-guide">quickstart</a> for a demo.</p>