This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 5976398  Include index performance numbers + updating site
5976398 is described below

commit 597639811c73db0570628cff463e44070793b673
Author: Vinoth Chandar <[email protected]>
AuthorDate: Tue May 14 05:04:21 2019 -0700

    Include index performance numbers + updating site
---
 content/configurations.html | 33 +++++++++++++++++++--
 content/feed.xml            |  4 +--
 content/performance.html    | 71 ++++++++++----------------------------------
 content/writing_data.html   | 17 +++++++++--
 docs/performance.md         | 72 ++++++++++-----------------------------------
 5 files changed, 77 insertions(+), 120 deletions(-)

diff --git a/content/configurations.html b/content/configurations.html
index 76e8cb7..6f76321 100644
--- a/content/configurations.html
+++ b/content/configurations.html
@@ -593,21 +593,37 @@ HoodieWriteConfig can be built using a builder pattern as 
below.</p>
 <p>Property: <code 
class="highlighter-rouge">hoodie.bloom.index.use.caching</code> <br />
 <span style="color:grey">Only applies if index type is BLOOM. <br /> When 
true, the input RDD will be cached to speed up index lookup by reducing IO for 
computing parallelism or affected partitions</span></p>
 
+<h5 id="bloomIndexTreebasedFilter">bloomIndexTreebasedFilter(useTreeFilter = 
true)</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.bloom.index.use.treebased.filter</code> <br />
+<span style="color:grey">Only applies if index type is BLOOM. <br /> When 
true, interval tree based file pruning optimization is enabled. This mode 
speeds up file pruning based on key ranges, compared with the brute-force 
mode.</span></p>
+
+<h5 
id="bloomIndexBucketizedChecking">bloomIndexBucketizedChecking(bucketizedChecking
 = true)</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.bloom.index.bucketized.checking</code> <br />
+<span style="color:grey">Only applies if index type is BLOOM. <br /> When 
true, bucketized bloom filtering is enabled. This reduces skew seen in sort 
based bloom index lookup</span></p>
+
+<h5 id="bloomIndexKeysPerBucket">bloomIndexKeysPerBucket(keysPerBucket = 
10000000)</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.bloom.index.keys.per.bucket</code> <br />
+<span style="color:grey">Only applies if bloomIndexBucketizedChecking is 
enabled and index type is BLOOM. <br /> This configuration controls the 
“bucket” size, which tracks the number of record-key checks made against a 
single file and is the unit of work allocated to each partition performing 
bloom filter lookup. A higher value amortizes the fixed cost of reading a 
bloom filter into memory.</span></p>
+
 <h5 id="bloomIndexParallelism">bloomIndexParallelism(0)</h5>
 <p>Property: <code 
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
 <span style="color:grey">Only applies if index type is BLOOM. <br /> This is 
the amount of parallelism for index lookup, which involves a Spark Shuffle. By 
default, this is auto computed based on input workload 
characteristics</span></p>
 
 <h5 id="hbaseZkQuorum">hbaseZkQuorum(zkString) [Required]</h5>
 <p>Property: <code 
class="highlighter-rouge">hoodie.index.hbase.zkquorum</code> <br />
-<span style="color:grey">Only application if index type is HBASE. HBase ZK 
Quorum url to connect to.</span></p>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum 
url to connect to.</span></p>
 
 <h5 id="hbaseZkPort">hbaseZkPort(port) [Required]</h5>
 <p>Property: <code class="highlighter-rouge">hoodie.index.hbase.zkport</code> 
<br />
-<span style="color:grey">Only application if index type is HBASE. HBase ZK 
Quorum port to connect to.</span></p>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum 
port to connect to.</span></p>
+
+<h5 id="hbaseZkZnodeParent">hbaseZkZnodeParent(zkZnodeParent)  [Required]</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.index.hbase.zknode.path</code> <br />
+<span style="color:grey">Only applies if index type is HBASE. This is the root 
znode that will contain all the znodes created/used by HBase.</span></p>
 
 <h5 id="hbaseTableName">hbaseTableName(tableName)  [Required]</h5>
 <p>Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code> 
<br />
-<span style="color:grey">Only application if index type is HBASE. HBase Table 
name to use as the index. Hudi stores the row_key and [partition_path, fileID, 
commitTime] mapping in the table.</span></p>
+<span style="color:grey">Only applies if index type is HBASE. HBase Table name 
to use as the index. Hudi stores the row_key and [partition_path, fileID, 
commitTime] mapping in the table.</span></p>
 
 <h4 id="storage-configs">Storage configs</h4>
 <p>Controls aspects around sizing parquet and log files.</p>
@@ -646,6 +662,10 @@ HoodieWriteConfig can be built using a builder pattern as 
below.</p>
 <p>Property: <code 
class="highlighter-rouge">hoodie.logfile.to.parquet.compression.ratio</code> 
<br />
 <span style="color:grey">Expected additional compression as records move from 
log files to parquet. Used for merge_on_read storage to send inserts into log 
files &amp; control the size of compacted parquet file.</span></p>
 
+<h5 
id="parquetCompressionCodec">parquetCompressionCodec(parquetCompressionCodec = 
gzip)</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.parquet.compression.codec</code> <br />
+<span style="color:grey">Compression codec for parquet files.</span></p>
+
 <h4 id="compaction-configs">Compaction configs</h4>
 <p>Configs that control compaction (merging of log files onto a new parquet 
base file), cleaning (reclamation of older/unused file groups).
 <a href="#withCompactionConfig">withCompactionConfig</a> 
(HoodieCompactionConfig) <br /></p>
@@ -662,6 +682,10 @@ HoodieWriteConfig can be built using a builder pattern as 
below.</p>
 <p>Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>, 
<code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
 <span style="color:grey">Each commit is a small file in the <code 
class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does 
not favor lots of small files, Hudi archives older commits into a sequential 
log. A commit is published atomically by a rename of the commit file.</span></p>
 
+<h5 id="withCommitsArchivalBatchSize">withCommitsArchivalBatchSize(batch = 
10)</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.commits.archival.batch</code> <br />
+<span style="color:grey">This controls the number of commit instants read in 
memory as a batch and archived together.</span></p>
+
 <h5 id="compactionSmallFileSize">compactionSmallFileSize(size = 0)</h5>
 <p>Property: <code 
class="highlighter-rouge">hoodie.parquet.small.file.limit</code> <br />
 <span style="color:grey">This should be less &lt; maxFileSize and setting it 
to 0, turns off this feature. Small files can always happen because of the 
number of insert records in a partition in a batch. Hudi has an option to 
auto-resolve small files by masking inserts into this partition as updates to 
existing small files. The size here is the minimum file size considered as a 
“small file size”.</span></p>
@@ -752,6 +776,9 @@ HoodieWriteConfig can be built using a builder pattern as 
below.</p>
 <p>Property: <code 
class="highlighter-rouge">hoodie.memory.compaction.fraction</code> <br />
 <span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts 
records to HoodieRecords and then merges these log blocks and records. At any 
point, the number of entries in a log block can be less than or equal to the 
number of entries in the corresponding parquet file. This can lead to OOM in 
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use 
this config to set the max allowable inMemory footprint of the spillable 
map.</span></p>
 
+<h5 
id="withWriteStatusFailureFraction">withWriteStatusFailureFraction(failureFraction
 = 0.1)</h5>
+<p>Property: <code 
class="highlighter-rouge">hoodie.memory.writestatus.failure.fraction</code> <br 
/>
+<span style="color:grey">This property controls what fraction of failed 
records and exceptions are reported back to the driver.</span></p>
 
 
     <div class="tags">
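To make the new index options above concrete, here is a hedged sketch (plain Python, not the Hudi API) that assembles the documented `hoodie.bloom.index.*` property names into the kind of options map one would hand to a Spark datasource write. The property names come from the docs above; the helper name and values are illustrative only.

```python
# Sketch only: collect the BLOOM index properties documented above into a
# plain options dict. Property names are from the docs; values are examples.
def bloom_index_options(keys_per_bucket=10_000_000, bucketized=True):
    """Build a map of Hudi bloom-index write options (hypothetical helper)."""
    return {
        "hoodie.index.type": "BLOOM",
        "hoodie.bloom.index.use.caching": "true",
        "hoodie.bloom.index.use.treebased.filter": "true",
        "hoodie.bloom.index.bucketized.checking": str(bucketized).lower(),
        "hoodie.bloom.index.keys.per.bucket": str(keys_per_bucket),
    }

# The defaults mirror the documented defaults (treebased filter and
# bucketized checking on, 10M keys per bucket).
print(bloom_index_options()["hoodie.bloom.index.keys.per.bucket"])  # → 10000000
```

In practice these keys would be passed as string options on the writer; the dict form just makes the defaults discussed above easy to see in one place.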
diff --git a/content/feed.xml b/content/feed.xml
index af8b991..6e0b82a 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
        <description>Apache Hudi (pronounced “Hoodie”) provides upserts and 
incremental processing capabilities on Big Data</description>
         <link>http://0.0.0.0:4000/</link>
        <atom:link href="http://0.0.0.0:4000/feed.xml" rel="self" 
type="application/rss+xml"/>
-        <pubDate>Mon, 06 May 2019 21:51:11 +0000</pubDate>
-        <lastBuildDate>Mon, 06 May 2019 21:51:11 +0000</lastBuildDate>
+        <pubDate>Tue, 14 May 2019 12:07:03 +0000</pubDate>
+        <lastBuildDate>Tue, 14 May 2019 12:07:03 +0000</lastBuildDate>
         <generator>Jekyll v3.3.1</generator>
         
         <item>
diff --git a/content/performance.html b/content/performance.html
index 0442986..9d4c20a 100644
--- a/content/performance.html
+++ b/content/performance.html
@@ -336,8 +336,8 @@ the conventional alternatives for achieving these tasks.</p>
 
 <h2 id="upserts">Upserts</h2>
 
-<p>Following shows the speed up obtained for NoSQL ingestion, 
-by switching from bulk loads off HBase to Parquet to incrementally upserting 
on a Hudi dataset, on 5 tables ranging from small to huge.</p>
+<p>The following shows the speed up obtained for NoSQL database ingestion, from 
incrementally upserting on a Hudi dataset on the copy-on-write storage,
+on 5 tables ranging from small to huge (as opposed to bulk loading the 
tables).</p>
 
 <figure><img class="docimage" src="images/hudi_upsert_perf1.png" 
alt="hudi_upsert_perf1.png" style="max-width: 1000px" /></figure>
 
@@ -346,64 +346,23 @@ significant savings on the overall compute cost.</p>
 
 <figure><img class="docimage" src="images/hudi_upsert_perf2.png" 
alt="hudi_upsert_perf2.png" style="max-width: 1000px" /></figure>
 
-<p>Hudi upserts have been stress tested upto 4TB in a single commit across the 
t1 table.</p>
+<p>Hudi upserts have been stress tested up to 4TB in a single commit across the 
t1 table. 
+See <a 
href="https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide">here</a> 
for some tuning tips.</p>
 
-<h2 id="tuning">Tuning</h2>
+<h2 id="indexing">Indexing</h2>
 
-<p>Writing data via Hudi happens as a Spark job and thus general rules of 
spark debugging applies here too. Below is a list of things to keep in mind, if 
you are looking to improving performance or reliability.</p>
+<p>To efficiently upsert data, Hudi needs to classify records in a 
write batch into inserts &amp; updates (tagged with the file group 
+they belong to). To speed up this operation, Hudi employs a pluggable 
index mechanism that stores a mapping between each recordKey and 
+the file group id it belongs to. By default, Hudi uses a built-in index that 
uses file ranges and bloom filters to accomplish this, with
+up to 10x speed up over a Spark join doing the same.</p>
 
-<p><strong>Input Parallelism</strong> : By default, Hudi tends to 
over-partition input (i.e <code 
class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark 
partition stays within the 2GB limit for inputs upto 500GB. Bump this up 
accordingly if you have larger inputs. We recommend having shuffle parallelism 
<code 
class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code>
 such that its atleast input_data_size/500MB</p>
+<p>Hudi provides the best indexing performance when you model the recordKey to 
be monotonically increasing (e.g. a timestamp prefix), which lets range pruning 
+filter out a lot of files from comparison. Even for UUID based keys, there are <a 
href="https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/">known 
techniques</a> to achieve this.
+For example, with 100M timestamp prefixed keys (5% updates, 95% inserts) on an 
event table with 80B keys/3 partitions/11416 files/10TB data, the Hudi index 
achieves a 
+<strong>~7X (2880 secs vs 440 secs) speed up</strong> over a vanilla Spark join. 
Even for a challenging workload like a ‘100% update’ database ingestion 
workload spanning 
+3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers 
an <strong>80-100% speedup</strong>.</p>
 
-<p><strong>Off-heap memory</strong> : Hudi writes parquet files and that needs 
good amount of off-heap memory proportional to schema width. Consider setting 
something like <code 
class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code 
class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are 
running into such failures.</p>
-
-<p><strong>Spark Memory</strong> : Typically, hudi needs to be able to read a 
single file into memory to perform merges or compactions and thus the executor 
memory should be sufficient to accomodate this. In addition, Hoodie caches the 
input to be able to intelligently place data and thus leaving some <code 
class="highlighter-rouge">spark.storage.memoryFraction</code> will generally 
help boost performance.</p>
-
-<p><strong>Sizing files</strong> : Set <code 
class="highlighter-rouge">limitFileSize</code> above judiciously, to balance 
ingest/write latency vs number of files &amp; consequently metadata overhead 
associated with it.</p>
-
-<p><strong>Timeseries/Log data</strong> : Default configs are tuned for 
database/nosql changelogs where individual record sizes are large. Another very 
popular class of data is timeseries/event/log data that tends to be more 
volumnious with lot more records per partition. In such cases
-    - Consider tuning the bloom filter accuracy via <code 
class="highlighter-rouge">.bloomFilterFPP()/bloomFilterNumEntries()</code> to 
achieve your target index look up time
-    - Consider making a key that is prefixed with time of the event, which 
will enable range pruning &amp; significantly speeding up index lookup.</p>
-
-<p><strong>GC Tuning</strong> : Please be sure to follow garbage collection 
tuning tips from Spark tuning guide to avoid OutOfMemory errors
-[Must] Use G1/CMS Collector. Sample CMS Flags to add to 
spark.executor.extraJavaOptions :</p>
-
-<div class="highlighter-rouge"><pre class="highlight"><code>-XX:NewSize=1g 
-XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
-</code></pre>
-</div>
-
-<p>If it keeps OOMing still, reduce spark memory conservatively: <code 
class="highlighter-rouge">spark.memory.fraction=0.2, 
spark.memory.storageFraction=0.2</code> allowing it to spill rather than OOM. 
(reliably slow vs crashing intermittently)</p>
-
-<p>Below is a full working production config</p>
-
-<div class="highlighter-rouge"><pre class="highlight"><code> 
spark.driver.extraClassPath    /etc/hive/conf
- spark.driver.extraJavaOptions    -XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.driver.maxResultSize    2g
- spark.driver.memory    4g
- spark.executor.cores    1
- spark.executor.extraJavaOptions    -XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.executor.id    driver
- spark.executor.instances    300
- spark.executor.memory    6g
- spark.rdd.compress true
-
- spark.kryoserializer.buffer.max    512m
- spark.serializer    org.apache.spark.serializer.KryoSerializer
- spark.shuffle.memoryFraction    0.2
- spark.shuffle.service.enabled    true
- spark.sql.hive.convertMetastoreParquet    false
- spark.storage.memoryFraction    0.6
- spark.submit.deployMode    cluster
- spark.task.cpus    1
- spark.task.maxFailures    4
-
- spark.yarn.driver.memoryOverhead    1024
- spark.yarn.executor.memoryOverhead    3072
- spark.yarn.max.executor.failures    100
-
-</code></pre>
-</div>
-
-<h2 id="read-optimized-query-performance">Read Optimized Query Performance</h2>
+<h2 id="read-optimized-queries">Read Optimized Queries</h2>
 
 <p>The major design goal for read optimized view is to achieve the latency 
reduction &amp; efficiency gains in previous section,
 with no impact on queries. Following charts compare the Hudi vs non-Hudi 
datasets across Hive/Presto/Spark queries and demonstrate this.</p>
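The range-pruning idea in the indexing section above can be sketched in a few lines: when record keys are monotonically increasing, each file's [min_key, max_key] range rules out most files before any bloom filter is even consulted. This is an illustrative toy (file names and key ranges are invented), not Hudi's actual implementation.

```python
# Toy illustration of key-range pruning: keep only files whose
# [min_key, max_key] range could possibly contain the lookup key.
def candidate_files(key, file_ranges):
    """Return the files whose key range could contain `key`."""
    return [f for f, (lo, hi) in file_ranges.items() if lo <= key <= hi]

# Hypothetical per-file key ranges for timestamp-prefixed keys.
file_ranges = {
    "f1.parquet": ("20190501_000", "20190501_999"),
    "f2.parquet": ("20190502_000", "20190502_999"),
    "f3.parquet": ("20190503_000", "20190503_999"),
}

# A timestamp-prefixed key narrows the lookup to a single file,
# whereas a random UUID key would match nothing by range and force
# bloom-filter checks against many files.
print(candidate_files("20190502_417", file_ranges))  # → ['f2.parquet']
```

With UUID keys the ranges of all files overlap the whole key space, which is why the docs recommend timestamp-prefixed (or otherwise ordered) keys where possible.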
diff --git a/content/writing_data.html b/content/writing_data.html
index f1fa4b0..061abb7 100644
--- a/content/writing_data.html
+++ b/content/writing_data.html
@@ -353,8 +353,22 @@ speeding up large Spark jobs via upserts using the <a 
href="#datasource-writer">
 <div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$ 
spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls 
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
 Usage: &lt;main class&gt; [options]
   Options:
+    --commit-on-errors
+        Commit even when some records failed to be written
+      Default: false
+    --enable-hive-sync
+          Enable syncing to hive
+       Default: false
+    --filter-dupes
+          Should duplicate records from source be dropped/filtered out before 
+          insert/bulk-insert 
+      Default: false
     --help, -h
-
+    --hoodie-conf
+          Any configuration that can be set in the properties file (using the 
CLI 
+          parameter "--propsFilePath") can also be passed command line using 
this 
+          parameter 
+          Default: []
     --key-generator-class
       Subclass of com.uber.hoodie.KeyGenerator to generate a HoodieKey from
       the given avro record. Built in: SimpleKeyGenerator (uses provided field
@@ -411,7 +425,6 @@ Usage: &lt;main class&gt; [options]
       schema) before writing. Default : Not set. E:g -
       com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
       allows a SQL query template to be passed as a transformation function)
-
 </code></pre>
 </div>
 
diff --git a/docs/performance.md b/docs/performance.md
index a4c0f27..665f9dc 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -11,8 +11,8 @@ the conventional alternatives for achieving these tasks.
 
 ## Upserts
 
-Following shows the speed up obtained for NoSQL ingestion, 
-by switching from bulk loads off HBase to Parquet to incrementally upserting 
on a Hudi dataset, on 5 tables ranging from small to huge.
+The following shows the speed up obtained for NoSQL database ingestion, from 
incrementally upserting on a Hudi dataset on the copy-on-write storage,
+on 5 tables ranging from small to huge (as opposed to bulk loading the tables).
 
 {% include image.html file="hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" 
max-width="1000" %}
 
@@ -21,65 +21,23 @@ significant savings on the overall compute cost.
 
 {% include image.html file="hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" 
max-width="1000" %}
 
-Hudi upserts have been stress tested upto 4TB in a single commit across the t1 
table.
+Hudi upserts have been stress tested up to 4TB in a single commit across the t1 
table. 
+See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for 
some tuning tips.
 
-## Tuning
+## Indexing
 
-Writing data via Hudi happens as a Spark job and thus general rules of spark 
debugging applies here too. Below is a list of things to keep in mind, if you 
are looking to improving performance or reliability.
+To efficiently upsert data, Hudi needs to classify records in a write 
batch into inserts & updates (tagged with the file group 
+they belong to). To speed up this operation, Hudi employs a pluggable 
index mechanism that stores a mapping between each recordKey and 
+the file group id it belongs to. By default, Hudi uses a built-in index that 
uses file ranges and bloom filters to accomplish this, with
+up to 10x speed up over a Spark join doing the same. 
 
-**Input Parallelism** : By default, Hudi tends to over-partition input (i.e 
`withParallelism(1500)`), to ensure each Spark partition stays within the 2GB 
limit for inputs upto 500GB. Bump this up accordingly if you have larger 
inputs. We recommend having shuffle parallelism 
`hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast 
input_data_size/500MB
+Hudi provides the best indexing performance when you model the recordKey to be 
monotonically increasing (e.g. a timestamp prefix), which lets range pruning 
+filter out a lot of files from comparison. Even for UUID based keys, there are [known 
techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/) 
to achieve this.
+For example, with 100M timestamp prefixed keys (5% updates, 95% inserts) on an 
event table with 80B keys/3 partitions/11416 files/10TB data, the Hudi index 
achieves a 
+**~7X (2880 secs vs 440 secs) speed up** over a vanilla Spark join. Even for a 
challenging workload like a '100% update' database ingestion workload spanning 
+3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers 
an **80-100% speedup**.
 
-**Off-heap memory** : Hudi writes parquet files and that needs good amount of 
off-heap memory proportional to schema width. Consider setting something like 
`spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if 
you are running into such failures.
-
-**Spark Memory** : Typically, hudi needs to be able to read a single file into 
memory to perform merges or compactions and thus the executor memory should be 
sufficient to accomodate this. In addition, Hoodie caches the input to be able 
to intelligently place data and thus leaving some 
`spark.storage.memoryFraction` will generally help boost performance.
-
-**Sizing files** : Set `limitFileSize` above judiciously, to balance 
ingest/write latency vs number of files & consequently metadata overhead 
associated with it.
-
-**Timeseries/Log data** : Default configs are tuned for database/nosql 
changelogs where individual record sizes are large. Another very popular class 
of data is timeseries/event/log data that tends to be more volumnious with lot 
more records per partition. In such cases
-    - Consider tuning the bloom filter accuracy via 
`.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look 
up time
-    - Consider making a key that is prefixed with time of the event, which 
will enable range pruning & significantly speeding up index lookup.
-
-**GC Tuning** : Please be sure to follow garbage collection tuning tips from 
Spark tuning guide to avoid OutOfMemory errors
-[Must] Use G1/CMS Collector. Sample CMS Flags to add to 
spark.executor.extraJavaOptions :
-
-```
--XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
-````
-
-If it keeps OOMing still, reduce spark memory conservatively: 
`spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to 
spill rather than OOM. (reliably slow vs crashing intermittently)
-
-Below is a full working production config
-
-```
- spark.driver.extraClassPath    /etc/hive/conf
- spark.driver.extraJavaOptions    -XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.driver.maxResultSize    2g
- spark.driver.memory    4g
- spark.executor.cores    1
- spark.executor.extraJavaOptions    -XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.executor.id    driver
- spark.executor.instances    300
- spark.executor.memory    6g
- spark.rdd.compress true
-
- spark.kryoserializer.buffer.max    512m
- spark.serializer    org.apache.spark.serializer.KryoSerializer
- spark.shuffle.memoryFraction    0.2
- spark.shuffle.service.enabled    true
- spark.sql.hive.convertMetastoreParquet    false
- spark.storage.memoryFraction    0.6
- spark.submit.deployMode    cluster
- spark.task.cpus    1
- spark.task.maxFailures    4
-
- spark.yarn.driver.memoryOverhead    1024
- spark.yarn.executor.memoryOverhead    3072
- spark.yarn.max.executor.failures    100
-
-````
-
-
-## Read Optimized Query Performance
+## Read Optimized Queries
 
 The major design goal for read optimized view is to achieve the latency 
reduction & efficiency gains in previous section,
 with no impact on queries. Following charts compare the Hudi vs non-Hudi 
datasets across Hive/Presto/Spark queries and demonstrate this.
