This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 5976398 Include index performance numbers + updating site
5976398 is described below
commit 597639811c73db0570628cff463e44070793b673
Author: Vinoth Chandar <[email protected]>
AuthorDate: Tue May 14 05:04:21 2019 -0700
Include index performance numbers + updating site
---
content/configurations.html | 33 +++++++++++++++++++--
content/feed.xml | 4 +--
content/performance.html | 71 ++++++++++----------------------------------
content/writing_data.html | 17 +++++++++--
docs/performance.md | 72 ++++++++++-----------------------------------
5 files changed, 77 insertions(+), 120 deletions(-)
diff --git a/content/configurations.html b/content/configurations.html
index 76e8cb7..6f76321 100644
--- a/content/configurations.html
+++ b/content/configurations.html
@@ -593,21 +593,37 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.use.caching</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, the input RDD will cached to speed up index lookup by reducing IO for
computing parallelism or affected partitions</span></p>
+<h5 id="bloomIndexTreebasedFilter">bloomIndexTreebasedFilter(useTreeFilter =
true)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.use.treebased.filter</code> <br />
+<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, interval tree based file pruning optimization is enabled. This mode
speeds-up file-pruning based on key ranges when compared with the brute-force
mode</span></p>
+
+<h5
id="bloomIndexBucketizedChecking">bloomIndexBucketizedChecking(bucketizedChecking
= true)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.bucketized.checking</code> <br />
+<span style="color:grey">Only applies if index type is BLOOM. <br /> When
true, bucketized bloom filtering is enabled. This reduces skew seen in sort
based bloom index lookup</span></p>
+
+<h5 id="bloomIndexKeysPerBucket">bloomIndexKeysPerBucket(keysPerBucket =
10000000)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.keys.per.bucket</code> <br />
+<span style="color:grey">Only applies if bloomIndexBucketizedChecking is
enabled and index type is bloom. <br /> This configuration controls the
“bucket” size which tracks the number of record-key checks made against a
single file and is the unit of work allocated to each partition performing
bloom filter lookup. A higher value would amortize the fixed cost of reading a
bloom filter to memory. </span></p>
+
<h5 id="bloomIndexParallelism">bloomIndexParallelism(0)</h5>
<p>Property: <code
class="highlighter-rouge">hoodie.bloom.index.parallelism</code> <br />
<span style="color:grey">Only applies if index type is BLOOM. <br /> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload
characteristics</span></p>
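+<p>As an illustrative sketch only (builder method names follow the headings
+above; package and class names are assumed from the com.uber.hoodie config
+classes), the bloom index knobs could be wired together like this:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Sketch: tuning the bloom index via the builder methods documented above
+import com.uber.hoodie.config.HoodieIndexConfig;
+import com.uber.hoodie.config.HoodieWriteConfig;
+import com.uber.hoodie.index.HoodieIndex;
+
+HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")
+    .withIndexConfig(HoodieIndexConfig.newBuilder()
+        .withIndexType(HoodieIndex.IndexType.BLOOM)
+        .bloomIndexTreebasedFilter(true)      // hoodie.bloom.index.use.treebased.filter
+        .bloomIndexBucketizedChecking(true)   // hoodie.bloom.index.bucketized.checking
+        .bloomIndexKeysPerBucket(10000000)    // hoodie.bloom.index.keys.per.bucket
+        .bloomIndexParallelism(0)             // 0 lets Hudi auto-compute parallelism
+        .build())
+    .build();
+</code></pre>
+</div>
+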
<h5 id="hbaseZkQuorum">hbaseZkQuorum(zkString) [Required]</h5>
<p>Property: <code
class="highlighter-rouge">hoodie.index.hbase.zkquorum</code> <br />
-<span style="color:grey">Only application if index type is HBASE. HBase ZK
Quorum url to connect to.</span></p>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum
url to connect to.</span></p>
<h5 id="hbaseZkPort">hbaseZkPort(port) [Required]</h5>
<p>Property: <code class="highlighter-rouge">hoodie.index.hbase.zkport</code>
<br />
-<span style="color:grey">Only application if index type is HBASE. HBase ZK
Quorum port to connect to.</span></p>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum
port to connect to.</span></p>
+
+<h5 id="hbaseTableName">hbaseZkZnodeParent(zkZnodeParent) [Required]</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.index.hbase.zknode.path</code> <br />
+<span style="color:grey">Only applies if index type is HBASE. This is the root
znode that will contain all the znodes created/used by HBase.</span></p>
<h5 id="hbaseTableName">hbaseTableName(tableName) [Required]</h5>
<p>Property: <code class="highlighter-rouge">hoodie.index.hbase.table</code>
<br />
-<span style="color:grey">Only application if index type is HBASE. HBase Table
name to use as the index. Hudi stores the row_key and [partition_path, fileID,
commitTime] mapping in the table.</span></p>
+<span style="color:grey">Only applies if index type is HBASE. HBase Table name
to use as the index. Hudi stores the row_key and [partition_path, fileID,
commitTime] mapping in the table.</span></p>
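+<p>Putting the HBASE index options together (again a sketch only, using the
+builder method names above; the quorum, port, znode and table values below are
+placeholders):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Sketch: configuring the HBase backed index via the methods documented above
+HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")
+    .withIndexConfig(HoodieIndexConfig.newBuilder()
+        .withIndexType(HoodieIndex.IndexType.HBASE)
+        .hbaseZkQuorum("zk1.example.com,zk2.example.com")  // hoodie.index.hbase.zkquorum
+        .hbaseZkPort(2181)                                 // hoodie.index.hbase.zkport
+        .hbaseZkZnodeParent("/hbase")                      // hoodie.index.hbase.zknode.path
+        .hbaseTableName("hudi_record_index")               // hoodie.index.hbase.table
+        .build())
+    .build();
+</code></pre>
+</div>
+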
<h4 id="storage-configs">Storage configs</h4>
<p>Controls aspects around sizing parquet and log files.</p>
@@ -646,6 +662,10 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.logfile.to.parquet.compression.ratio</code>
<br />
<span style="color:grey">Expected additional compression as records move from
log files to parquet. Used for merge_on_read storage to send inserts into log
files & control the size of compacted parquet file.</span></p>
+<h5
id="parquetCompressionCodec">parquetCompressionCodec(parquetCompressionCodec =
gzip)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.parquet.compression.codec</code> <br />
+<span style="color:grey">Compression Codec for parquet files </span></p>
+
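+<p>As a small sketch (assuming a HoodieStorageConfig builder analogous to the
+other config builders on this page), the codec could be switched like so:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Sketch: choosing a different parquet codec; "snappy" shown as an example value
+HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")
+    .withStorageConfig(HoodieStorageConfig.newBuilder()
+        .parquetCompressionCodec("snappy")  // hoodie.parquet.compression.codec, default gzip
+        .build())
+    .build();
+</code></pre>
+</div>
+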
<h4 id="compaction-configs">Compaction configs</h4>
<p>Configs that control compaction (merging of log files onto a new parquet
base file), cleaning (reclamation of older/unused file groups).
<a href="#withCompactionConfig">withCompactionConfig</a>
(HoodieCompactionConfig) <br /></p>
@@ -662,6 +682,10 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code class="highlighter-rouge">hoodie.keep.min.commits</code>,
<code class="highlighter-rouge">hoodie.keep.max.commits</code> <br />
<span style="color:grey">Each commit is a small file in the <code
class="highlighter-rouge">.hoodie</code> directory. Since DFS typically does
not favor lots of small files, Hudi archives older commits into a sequential
log. A commit is published atomically by a rename of the commit file.</span></p>
+<h5 id="withCommitsArchivalBatchSize">withCommitsArchivalBatchSize(batch =
10)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.commits.archival.batch</code> <br />
+<span style="color:grey">This controls the number of commit instants read in
memory as a batch and archived together.</span></p>
+
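+<p>For instance (sketch only, using the builder method names documented on
+this page), archival batching above and the small file limit described just
+below can be set together:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>// Sketch: compaction/archival settings via the documented builder methods
+HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
+    .withPath("/path/to/hudi/dataset")
+    .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+        .withCommitsArchivalBatchSize(10)      // hoodie.commits.archival.batch
+        .compactionSmallFileSize(104857600)    // hoodie.parquet.small.file.limit, in bytes
+        .build())
+    .build();
+</code></pre>
+</div>
+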
<h5 id="compactionSmallFileSize">compactionSmallFileSize(size = 0)</h5>
<p>Property: <code
class="highlighter-rouge">hoodie.parquet.small.file.limit</code> <br />
<span style="color:grey">This should be less < maxFileSize and setting it
to 0, turns off this feature. Small files can always happen because of the
number of insert records in a partition in a batch. Hudi has an option to
auto-resolve small files by masking inserts into this partition as updates to
existing small files. The size here is the minimum file size considered as a
“small file size”.</span></p>
@@ -752,6 +776,9 @@ HoodieWriteConfig can be built using a builder pattern as
below.</p>
<p>Property: <code
class="highlighter-rouge">hoodie.memory.compaction.fraction</code> <br />
<span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts
records to HoodieRecords and then merges these log blocks and records. At any
point, the number of entries in a log block can be less than or equal to the
number of entries in the corresponding parquet file. This can lead to OOM in
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use
this config to set the max allowable inMemory footprint of the spillable
map.</span></p>
+<h5
id="withWriteStatusFailureFraction">withWriteStatusFailureFraction(failureFraction
= 0.1)</h5>
+<p>Property: <code
class="highlighter-rouge">hoodie.memory.writestatus.failure.fraction</code> <br
/>
+<span style="color:grey">This property controls what fraction of the failed
record, exceptions we report back to driver</span></p>
<div class="tags">
diff --git a/content/feed.xml b/content/feed.xml
index af8b991..6e0b82a 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -5,8 +5,8 @@
<description>Apache Hudi (pronounced “Hoodie”) provides upserts and
incremental processing capabilities on Big Data</description>
<link>http://0.0.0.0:4000/</link>
<atom:link href="http://0.0.0.0:4000/feed.xml" rel="self"
type="application/rss+xml"/>
- <pubDate>Mon, 06 May 2019 21:51:11 +0000</pubDate>
- <lastBuildDate>Mon, 06 May 2019 21:51:11 +0000</lastBuildDate>
+ <pubDate>Tue, 14 May 2019 12:07:03 +0000</pubDate>
+ <lastBuildDate>Tue, 14 May 2019 12:07:03 +0000</lastBuildDate>
<generator>Jekyll v3.3.1</generator>
<item>
diff --git a/content/performance.html b/content/performance.html
index 0442986..9d4c20a 100644
--- a/content/performance.html
+++ b/content/performance.html
@@ -336,8 +336,8 @@ the conventional alternatives for achieving these tasks.</p>
<h2 id="upserts">Upserts</h2>
-<p>Following shows the speed up obtained for NoSQL ingestion,
-by switching from bulk loads off HBase to Parquet to incrementally upserting
on a Hudi dataset, on 5 tables ranging from small to huge.</p>
+<p>The following shows the speed up obtained for NoSQL database ingestion, by
incrementally upserting on a Hudi dataset using copy-on-write storage,
+on 5 tables ranging from small to huge (as opposed to bulk loading the
tables).</p>
<figure><img class="docimage" src="images/hudi_upsert_perf1.png"
alt="hudi_upsert_perf1.png" style="max-width: 1000px" /></figure>
@@ -346,64 +346,23 @@ significant savings on the overall compute cost.</p>
<figure><img class="docimage" src="images/hudi_upsert_perf2.png"
alt="hudi_upsert_perf2.png" style="max-width: 1000px" /></figure>
-<p>Hudi upserts have been stress tested upto 4TB in a single commit across the
t1 table.</p>
+<p>Hudi upserts have been stress tested up to 4TB in a single commit across
the t1 table.
+See <a
href="https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide">here</a>
for some tuning tips.</p>
-<h2 id="tuning">Tuning</h2>
+<h2 id="indexing">Indexing</h2>
-<p>Writing data via Hudi happens as a Spark job and thus general rules of
spark debugging applies here too. Below is a list of things to keep in mind, if
you are looking to improving performance or reliability.</p>
+<p>In order to efficiently upsert data, Hudi needs to classify records in a
write batch into inserts & updates (tagged with the file group
+they belong to). To speed up this operation, Hudi employs a pluggable index
mechanism that stores a mapping between the recordKey and
+the file group id it belongs to. By default, Hudi uses a built-in index that
uses file ranges and bloom filters to accomplish this, with
+up to 10x speed up over a Spark join to do the same.</p>
-<p><strong>Input Parallelism</strong> : By default, Hudi tends to
over-partition input (i.e <code
class="highlighter-rouge">withParallelism(1500)</code>), to ensure each Spark
partition stays within the 2GB limit for inputs upto 500GB. Bump this up
accordingly if you have larger inputs. We recommend having shuffle parallelism
<code
class="highlighter-rouge">hoodie.[insert|upsert|bulkinsert].shuffle.parallelism</code>
such that its atleast input_data_size/500MB</p>
+<p>Hudi provides best indexing performance when you model the recordKey to be
monotonically increasing (e.g. a timestamp prefix), which lets range pruning
filter
+out a lot of files for comparison. Even for UUID-based keys, there are <a
href="https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/">known
techniques</a> to achieve this.
+For example, with 100M timestamp prefixed keys (5% updates, 95% inserts) on an
event table with 80B keys/3 partitions/11416 files/10TB data, the Hudi index
achieves a
+<strong>~7X (2880 secs vs 440 secs) speed up</strong> over a vanilla Spark
join. Even for a challenging workload like a ‘100% update’ database ingestion
workload spanning
+3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
an <strong>80-100% speedup</strong>.</p>
-<p><strong>Off-heap memory</strong> : Hudi writes parquet files and that needs
good amount of off-heap memory proportional to schema width. Consider setting
something like <code
class="highlighter-rouge">spark.yarn.executor.memoryOverhead</code> or <code
class="highlighter-rouge">spark.yarn.driver.memoryOverhead</code>, if you are
running into such failures.</p>
-
-<p><strong>Spark Memory</strong> : Typically, hudi needs to be able to read a
single file into memory to perform merges or compactions and thus the executor
memory should be sufficient to accomodate this. In addition, Hoodie caches the
input to be able to intelligently place data and thus leaving some <code
class="highlighter-rouge">spark.storage.memoryFraction</code> will generally
help boost performance.</p>
-
-<p><strong>Sizing files</strong> : Set <code
class="highlighter-rouge">limitFileSize</code> above judiciously, to balance
ingest/write latency vs number of files & consequently metadata overhead
associated with it.</p>
-
-<p><strong>Timeseries/Log data</strong> : Default configs are tuned for
database/nosql changelogs where individual record sizes are large. Another very
popular class of data is timeseries/event/log data that tends to be more
volumnious with lot more records per partition. In such cases
- - Consider tuning the bloom filter accuracy via <code
class="highlighter-rouge">.bloomFilterFPP()/bloomFilterNumEntries()</code> to
achieve your target index look up time
- - Consider making a key that is prefixed with time of the event, which
will enable range pruning & significantly speeding up index lookup.</p>
-
-<p><strong>GC Tuning</strong> : Please be sure to follow garbage collection
tuning tips from Spark tuning guide to avoid OutOfMemory errors
-[Must] Use G1/CMS Collector. Sample CMS Flags to add to
spark.executor.extraJavaOptions :</p>
-
-<div class="highlighter-rouge"><pre class="highlight"><code>-XX:NewSize=1g
-XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC
-XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
-</code></pre>
-</div>
-
-<p>If it keeps OOMing still, reduce spark memory conservatively: <code
class="highlighter-rouge">spark.memory.fraction=0.2,
spark.memory.storageFraction=0.2</code> allowing it to spill rather than OOM.
(reliably slow vs crashing intermittently)</p>
-
-<p>Below is a full working production config</p>
-
-<div class="highlighter-rouge"><pre class="highlight"><code>
spark.driver.extraClassPath /etc/hive/conf
- spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.driver.maxResultSize 2g
- spark.driver.memory 4g
- spark.executor.cores 1
- spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+PrintReferenceGC
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.executor.id driver
- spark.executor.instances 300
- spark.executor.memory 6g
- spark.rdd.compress true
-
- spark.kryoserializer.buffer.max 512m
- spark.serializer org.apache.spark.serializer.KryoSerializer
- spark.shuffle.memoryFraction 0.2
- spark.shuffle.service.enabled true
- spark.sql.hive.convertMetastoreParquet false
- spark.storage.memoryFraction 0.6
- spark.submit.deployMode cluster
- spark.task.cpus 1
- spark.task.maxFailures 4
-
- spark.yarn.driver.memoryOverhead 1024
- spark.yarn.executor.memoryOverhead 3072
- spark.yarn.max.executor.failures 100
-
-</code></pre>
-</div>
-
-<h2 id="read-optimized-query-performance">Read Optimized Query Performance</h2>
+<h2 id="read-optimized-queries">Read Optimized Queries</h2>
<p>The major design goal for read optimized view is to achieve the latency
reduction & efficiency gains in previous section,
with no impact on queries. Following charts compare the Hudi vs non-Hudi
datasets across Hive/Presto/Spark queries and demonstrate this.</p>
diff --git a/content/writing_data.html b/content/writing_data.html
index f1fa4b0..061abb7 100644
--- a/content/writing_data.html
+++ b/content/writing_data.html
@@ -353,8 +353,22 @@ speeding up large Spark jobs via upserts using the <a
href="#datasource-writer">
<div class="highlighter-rouge"><pre class="highlight"><code>[hoodie]$
spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
Usage: <main class> [options]
Options:
+ --commit-on-errors
+ Commit even when some records failed to be written
+ Default: false
+ --enable-hive-sync
+ Enable syncing to hive
+ Default: false
+ --filter-dupes
+ Should duplicate records from source be dropped/filtered out before
+ insert/bulk-insert
+ Default: false
--help, -h
-
+ --hoodie-conf
+ Any configuration that can be set in the properties file (using the
CLI
+ parameter "--propsFilePath") can also be passed command line using
this
+ parameter
+ Default: []
--key-generator-class
Subclass of com.uber.hoodie.KeyGenerator to generate a HoodieKey from
the given avro record. Built in: SimpleKeyGenerator (uses provided field
@@ -411,7 +425,6 @@ Usage: <main class> [options]
schema) before writing. Default : Not set. E:g -
com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
allows a SQL query template to be passed as a transformation function)
-
</code></pre>
</div>
diff --git a/docs/performance.md b/docs/performance.md
index a4c0f27..665f9dc 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -11,8 +11,8 @@ the conventional alternatives for achieving these tasks.
## Upserts
-Following shows the speed up obtained for NoSQL ingestion,
-by switching from bulk loads off HBase to Parquet to incrementally upserting
on a Hudi dataset, on 5 tables ranging from small to huge.
+The following shows the speed up obtained for NoSQL database ingestion, by
incrementally upserting on a Hudi dataset using copy-on-write storage,
+on 5 tables ranging from small to huge (as opposed to bulk loading the tables).
{% include image.html file="hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png"
max-width="1000" %}
@@ -21,65 +21,23 @@ significant savings on the overall compute cost.
{% include image.html file="hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png"
max-width="1000" %}
-Hudi upserts have been stress tested upto 4TB in a single commit across the t1
table.
+Hudi upserts have been stress tested up to 4TB in a single commit across the
t1 table.
+See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for
some tuning tips.
-## Tuning
+## Indexing
-Writing data via Hudi happens as a Spark job and thus general rules of spark
debugging applies here too. Below is a list of things to keep in mind, if you
are looking to improving performance or reliability.
+In order to efficiently upsert data, Hudi needs to classify records in a write
batch into inserts & updates (tagged with the file group
+they belong to). To speed up this operation, Hudi employs a pluggable index
mechanism that stores a mapping between the recordKey and
+the file group id it belongs to. By default, Hudi uses a built-in index that
uses file ranges and bloom filters to accomplish this, with
+up to 10x speed up over a Spark join to do the same.
-**Input Parallelism** : By default, Hudi tends to over-partition input (i.e
`withParallelism(1500)`), to ensure each Spark partition stays within the 2GB
limit for inputs upto 500GB. Bump this up accordingly if you have larger
inputs. We recommend having shuffle parallelism
`hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast
input_data_size/500MB
+Hudi provides best indexing performance when you model the recordKey to be
monotonically increasing (e.g. a timestamp prefix), which lets range pruning
filter
+out a lot of files for comparison. Even for UUID-based keys, there are [known
techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)
to achieve this.
+For example, with 100M timestamp prefixed keys (5% updates, 95% inserts) on an
event table with 80B keys/3 partitions/11416 files/10TB data, the Hudi index
achieves a
+**~7X (2880 secs vs 440 secs) speed up** over a vanilla Spark join. Even for a
challenging workload like a '100% update' database ingestion workload spanning
+3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers
an **80-100% speedup**.
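+
+As a rough sketch of this key modeling idea (the event fields below are made
+up for illustration), prefixing the business key with the event time yields
+keys that sort by time and therefore prune well:
+
+```java
+// Illustrative only: a recordKey whose prefix is the event timestamp, so the
+// per-file key ranges allow aggressive range pruning during index lookup.
+String recordKey = String.format("%d_%s", event.getEventTimeMillis(), event.getEventId());
+// e.g. "1557813861000_9f2c..." rather than a purely random UUID key
+```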
-**Off-heap memory** : Hudi writes parquet files and that needs good amount of
off-heap memory proportional to schema width. Consider setting something like
`spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if
you are running into such failures.
-
-**Spark Memory** : Typically, hudi needs to be able to read a single file into
memory to perform merges or compactions and thus the executor memory should be
sufficient to accomodate this. In addition, Hoodie caches the input to be able
to intelligently place data and thus leaving some
`spark.storage.memoryFraction` will generally help boost performance.
-
-**Sizing files** : Set `limitFileSize` above judiciously, to balance
ingest/write latency vs number of files & consequently metadata overhead
associated with it.
-
-**Timeseries/Log data** : Default configs are tuned for database/nosql
changelogs where individual record sizes are large. Another very popular class
of data is timeseries/event/log data that tends to be more volumnious with lot
more records per partition. In such cases
- - Consider tuning the bloom filter accuracy via
`.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look
up time
- - Consider making a key that is prefixed with time of the event, which
will enable range pruning & significantly speeding up index lookup.
-
-**GC Tuning** : Please be sure to follow garbage collection tuning tips from
Spark tuning guide to avoid OutOfMemory errors
-[Must] Use G1/CMS Collector. Sample CMS Flags to add to
spark.executor.extraJavaOptions :
-
-```
--XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
-````
-
-If it keeps OOMing still, reduce spark memory conservatively:
`spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to
spill rather than OOM. (reliably slow vs crashing intermittently)
-
-Below is a full working production config
-
-```
- spark.driver.extraClassPath /etc/hive/conf
- spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.driver.maxResultSize 2g
- spark.driver.memory 4g
- spark.executor.cores 1
- spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+PrintReferenceGC
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.executor.id driver
- spark.executor.instances 300
- spark.executor.memory 6g
- spark.rdd.compress true
-
- spark.kryoserializer.buffer.max 512m
- spark.serializer org.apache.spark.serializer.KryoSerializer
- spark.shuffle.memoryFraction 0.2
- spark.shuffle.service.enabled true
- spark.sql.hive.convertMetastoreParquet false
- spark.storage.memoryFraction 0.6
- spark.submit.deployMode cluster
- spark.task.cpus 1
- spark.task.maxFailures 4
-
- spark.yarn.driver.memoryOverhead 1024
- spark.yarn.executor.memoryOverhead 3072
- spark.yarn.max.executor.failures 100
-
-````
-
-
-## Read Optimized Query Performance
+## Read Optimized Queries
The major design goal for read optimized view is to achieve the latency
reduction & efficiency gains in previous section,
with no impact on queries. Following charts compare the Hudi vs non-Hudi
datasets across Hive/Presto/Spark queries and demonstrate this.