[incubator-hudi-site] 18/19: Simplifying site navigation

vinoth Wed, 13 Mar 2019 15:41:26 -0700

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi-site.git


commit aa361cc190f23805b5b4b6cd69a6a49e62c45b81
Author: Vinoth Chandar <[email protected]>
AuthorDate: Sat Mar 9 23:02:18 2019 -0800

    Simplifying site navigation
    
     - Removed overview and migration guide from sidebar
     - Migration guide linked off quickstart
     - Moved detailed implementation notes to cwiki
     - Merged performance, tuning into a new "performance" page
     - Rename "incremental processing" to simply "writing data"
     - Rename "SQL Queries" to simply "Querying data"
     - Internal linking for Admin guide page
---
 docs/_data/sidebars/mydoc_sidebar.yml              |  35 ++-
 docs/admin_guide.md                                |  18 +-
 docs/configurations.md                             |  54 ----
 docs/implementation.md                             | 278 ---------------------
 docs/performance.md                                |  94 +++++++
 docs/{sql_queries.md => querying_data.md}          |   4 +-
 docs/quickstart.md                                 |  12 +-
 .../{incremental_processing.md => writing_data.md} |   4 +-
 8 files changed, 127 insertions(+), 372 deletions(-)

diff --git a/docs/_data/sidebars/mydoc_sidebar.yml 
b/docs/_data/sidebars/mydoc_sidebar.yml
index 118bd89..8cf59af 100644
--- a/docs/_data/sidebars/mydoc_sidebar.yml
+++ b/docs/_data/sidebars/mydoc_sidebar.yml
@@ -10,11 +10,6 @@ entries:
     output: web
     folderitems:
 
-    - title: Overview
-      url: /index.html
-      output: web
-      type: homepage
-
     - title: Quickstart
       url: /quickstart.html
       output: web
@@ -27,6 +22,10 @@ entries:
       url: /powered_by.html
       output: web
 
+    - title: Comparison
+      url: /comparison.html
+      output: web
+
   - title: Documentation
     output: web
     folderitems:
@@ -35,30 +34,24 @@ entries:
       url: /concepts.html
       output: web
 
-    - title: Implementation
-      url: /implementation.html
-      output: web
-
-    - title: Configurations
-      url: /configurations.html
+    - title: Writing Data
+      url: /writing_data.html
       output: web
 
-    - title: SQL Queries
-      url: /sql_queries.html
+    - title: Querying Data
+      url: /querying_data.html
       output: web
 
-    - title: Migration Guide
-      url: /migration_guide.html
+    - title: Configuration
+      url: /configurations.html
       output: web
 
-    - title: Incremental Processing
-      url: /incremental_processing.html
+    - title: Performance
+      url: /performance.html
       output: web
 
-    - title: Admin Guide
+    - title: Administering
       url: /admin_guide.html
       output: web
 
-    - title: Comparison
-      url: /comparison.html
-      output: web
+
diff --git a/docs/admin_guide.md b/docs/admin_guide.md
index 58dbd15..3fbf1a2 100644
--- a/docs/admin_guide.md
+++ b/docs/admin_guide.md
@@ -1,5 +1,5 @@
 ---
-title: Admin Guide
+title: Administering Hudi Pipelines
 keywords: hudi, administration, operation, devops
 sidebar: mydoc_sidebar
 permalink: admin_guide.html
@@ -9,13 +9,13 @@ summary: This section offers an overview of tools available 
to operate an ecosys
 
 Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
 
- - Administering via the Admin CLI
- - Graphite metrics
- - Spark UI of the Hudi Application
+ - [Administering via the Admin CLI](#admin-cli)
+ - [Graphite metrics](#metrics)
+ - [Spark UI of the Hudi Application](#spark-ui)
 
-This section provides a glimpse into each of these, with some general guidance 
on troubleshooting
+This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
 
-## Admin CLI
+## Admin CLI {#admin-cli}
 
 Once hudi has been built, the shell can be fired by via  `cd hoodie-cli && 
./hoodie-cli.sh`.
 A hudi dataset resides on DFS, in a location referred to as the **basePath** 
and we would need this location in order to connect to a Hudi dataset.
@@ -385,7 +385,7 @@ Compaction successfully repaired
 ```
 
 
-## Metrics
+## Metrics {#metrics}
 
 Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
 
@@ -400,7 +400,7 @@ These metrics can then be plotted on a standard tool like 
grafana. Below is a sa
 {% include image.html file="hudi_commit_duration.png" 
alt="hudi_commit_duration.png" max-width="1000" %}
 
 
-## Troubleshooting Failures
+## Troubleshooting Failures {#troubleshooting}
 
 Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
 
@@ -425,7 +425,7 @@ First of all, please confirm if you do indeed have 
duplicates **AFTER** ensuring
  - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
  - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
 
-#### Spark failures
+#### Spark failures {#spark-ui}
 
 Typical upsert() DAG looks like below. Note that Hudi client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
 Also Spark UI shows sortByKey twice due to the probe job also being shown, 
nonetheless its just a single sort.
diff --git a/docs/configurations.md b/docs/configurations.md
index dd2fa8a..d86d0ce 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -331,57 +331,3 @@ Controls memory usage for compaction and merges, performed 
internally by Hudi
     <span style="color:grey">HoodieCompactedLogScanner reads logblocks, 
converts records to HoodieRecords and then merges these log blocks and records. 
At any point, the number of entries in a log block can be less than or equal to 
the number of entries in the corresponding parquet file. This can lead to OOM 
in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use 
this config to set the max allowable inMemory footprint of the spillable 
map.</span>
 
 
-### Tuning
-
-Writing data via Hudi happens as a Spark job and thus general rules of spark 
debugging applies here too. Below is a list of things to keep in mind, if you 
are looking to improving performance or reliability.
-
-**Input Parallelism** : By default, Hudi tends to over-partition input (i.e 
`withParallelism(1500)`), to ensure each Spark partition stays within the 2GB 
limit for inputs upto 500GB. Bump this up accordingly if you have larger 
inputs. We recommend having shuffle parallelism 
`hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast 
input_data_size/500MB
-
-**Off-heap memory** : Hudi writes parquet files and that needs good amount of 
off-heap memory proportional to schema width. Consider setting something like 
`spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if 
you are running into such failures.
-
-**Spark Memory** : Typically, hudi needs to be able to read a single file into 
memory to perform merges or compactions and thus the executor memory should be 
sufficient to accomodate this. In addition, Hoodie caches the input to be able 
to intelligently place data and thus leaving some 
`spark.storage.memoryFraction` will generally help boost performance.
-
-**Sizing files** : Set `limitFileSize` above judiciously, to balance 
ingest/write latency vs number of files & consequently metadata overhead 
associated with it.
-
-**Timeseries/Log data** : Default configs are tuned for database/nosql 
changelogs where individual record sizes are large. Another very popular class 
of data is timeseries/event/log data that tends to be more volumnious with lot 
more records per partition. In such cases
-    - Consider tuning the bloom filter accuracy via 
`.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look 
up time
-    - Consider making a key that is prefixed with time of the event, which 
will enable range pruning & significantly speeding up index lookup.
-
-**GC Tuning** : Please be sure to follow garbage collection tuning tips from 
Spark tuning guide to avoid OutOfMemory errors
-[Must] Use G1/CMS Collector. Sample CMS Flags to add to 
spark.executor.extraJavaOptions :
-
-```
--XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
-````
-
-If it keeps OOMing still, reduce spark memory conservatively: 
`spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to 
spill rather than OOM. (reliably slow vs crashing intermittently)
-
-Below is a full working production config
-
-```
- spark.driver.extraClassPath    /etc/hive/conf
- spark.driver.extraJavaOptions    -XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.driver.maxResultSize    2g
- spark.driver.memory    4g
- spark.executor.cores    1
- spark.executor.extraJavaOptions    -XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
- spark.executor.id    driver
- spark.executor.instances    300
- spark.executor.memory    6g
- spark.rdd.compress true
-
- spark.kryoserializer.buffer.max    512m
- spark.serializer    org.apache.spark.serializer.KryoSerializer
- spark.shuffle.memoryFraction    0.2
- spark.shuffle.service.enabled    true
- spark.sql.hive.convertMetastoreParquet    false
- spark.storage.memoryFraction    0.6
- spark.submit.deployMode    cluster
- spark.task.cpus    1
- spark.task.maxFailures    4
-
- spark.yarn.driver.memoryOverhead    1024
- spark.yarn.executor.memoryOverhead    3072
- spark.yarn.max.executor.failures    100
-
-````
diff --git a/docs/implementation.md b/docs/implementation.md
deleted file mode 100644
index 54966e2..0000000
--- a/docs/implementation.md
+++ /dev/null
@@ -1,278 +0,0 @@
----
-title: Implementation
-keywords: hudi, index, storage, compaction, cleaning, implementation
-sidebar: mydoc_sidebar
-toc: false
-permalink: implementation.html
----
-
-Hudi (pronounced “Hoodie”) is implemented as a Spark library, which makes it 
easy to integrate into existing data pipelines or ingestion
-libraries (which we will refer to as `Hudi clients`). Hudi Clients prepare an 
`RDD[HoodieRecord]` that contains the data to be upserted and
-Hudi upsert/insert is merely a Spark DAG, that can be broken into two big 
pieces.
-
- - **Indexing** :  A big part of Hudi's efficiency comes from indexing the 
mapping from record keys to the file ids, to which they belong to.
- This index also helps the `HoodieWriteClient` separate upserted records into 
inserts and updates, so they can be treated differently.
- `HoodieReadClient` supports operations such as `filterExists` (used for 
de-duplication of table) and an efficient batch `read(keys)` api, that
- can read out the records corresponding to the keys using the index much 
quickly, than a typical scan via a query. The index is also atomically
- updated each commit, and is also rolled back when commits are rolled back.
-
- - **Storage** : The storage part of the DAG is responsible for taking an 
`RDD[HoodieRecord]`, that has been tagged as
- an insert or update via index lookup, and writing it out efficiently onto 
storage.
-
-## Index
-
-Hudi currently provides two choices for indexes : `BloomIndex` and 
`HBaseIndex` to map a record key into the file id to which it belongs to. This 
enables
-us to speed up upserts significantly, without scanning over every record in 
the dataset. Hudi Indices can be classified based on
-their ability to lookup records across partition. A `global` index does not 
need partition information for finding the file-id for a record key
-but a `non-global` does.
-
-#### HBase Index (global)
-
-Here, we just use HBase in a straightforward way to store the mapping above. 
The challenge with using HBase (or any external key-value store
- for that matter) is performing rollback of a commit and handling partial 
index updates.
- Since the HBase table is indexed by record key and not commit Time, we would 
have to scan all the entries which will be prohibitively expensive.
- Insteead, we store the commit time with the value and discard its value if it 
does not belong to a valid commit.
-
-#### Bloom Index (non-global)
-
-This index is built by adding bloom filters with a very high false positive 
tolerance (e.g: 1/10^9), to the parquet file footers.
-The advantage of this index over HBase is the obvious removal of a big 
external dependency, and also nicer handling of rollbacks & partial updates
-since the index is part of the data file itself.
-
-At runtime, checking the Bloom Index for a given set of record keys 
effectively amounts to checking all the bloom filters within a given
-partition, against the incoming records, using a Spark join. Much of the 
engineering effort towards the Bloom index has gone into scaling this join
-by caching the incoming RDD[HoodieRecord] and dynamically tuning join 
parallelism, to avoid hitting Spark limitations like 2GB maximum
-for partition size. As a result, Bloom Index implementation has been able to 
handle single upserts upto 5TB, in a reliable manner.
-
-
-## Storage
-
-The implementation specifics of the two storage types, introduced in 
[concepts](concepts.html) section, are detailed below.
-
-
-#### Copy On Write
-
-The Spark DAG for this storage, is relatively simpler. The key goal here is to 
group the tagged Hudi record RDD, into a series of
-updates and inserts, by using a partitioner. To achieve the goals of 
maintaining file sizes, we first sample the input to obtain a `workload profile`
-that understands the spread of inserts vs updates, their distribution among 
the partitions etc. With this information, we bin-pack the
-records such that
-
- - For updates, the latest version of the that file id, is rewritten once, 
with new values for all records that have changed
- - For inserts, the records are first packed onto the smallest file in each 
partition path, until it reaches the configured maximum size.
-   Any remaining records after that, are again packed into new file id groups, 
again meeting the size requirements.
-
-In this storage, index updation is a no-op, since the bloom filters are 
already written as a part of committing data.
-
-In the case of Copy-On-Write, a single parquet file constitutes one `file 
slice` which contains one complete version of
-the file
-
-{% include image.html file="hudi_log_format_v2.png" 
alt="hudi_log_format_v2.png" max-width="1000" %}
-
-#### Merge On Read
-
-In the Merge-On-Read storage model, there are 2 logical components - one for 
ingesting data (both inserts/updates) into the dataset
- and another for creating compacted views. The former is hereby referred to as 
`Writer` while the later
- is referred as `Compactor`.
-
-##### Merge On Read Writer
-
- At a high level, Merge-On-Read Writer goes through same stages as 
Copy-On-Write writer in ingesting data.
- The key difference here is that updates are appended to latest log (delta) 
file belonging to the latest file slice
- without merging. For inserts, Hudi supports 2 modes:
-
-   1. Inserts to Log Files - This is done for datasets that have an indexable 
log files (for eg global index)
-   2. Inserts to parquet files - This is done for datasets that do not have 
indexable log files, for eg bloom index
-      embedded in parquer files. Hudi treats writing new records in the same 
way as inserting to Copy-On-Write files.
-
-As in the case of Copy-On-Write, the input tagged records are partitioned such 
that all upserts destined to
-a `file id` are grouped together. This upsert-batch is written as one or more 
log-blocks written to log-files.
-Hudi allows clients to control log file sizes (See [Storage 
Configs](../configurations))
-
-The WriteClient API is same for both Copy-On-Write and Merge-On-Read writers.
-
-With Merge-On-Read, several rounds of data-writes would have resulted in 
accumulation of one or more log-files.
-All these log-files along with base-parquet (if exists) constitute a `file 
slice` which represents one complete version
-of the file.
-
-#### Compactor
-
-Realtime Readers will perform in-situ merge of these delta log-files to 
provide the most recent (committed) view of
-the dataset. To keep the query-performance in check and eventually achieve 
read-optimized performance, Hudi supports
-compacting these log-files asynchronously to create read-optimized views.
-
-Asynchronous Compaction involves 2 steps:
-
-  * `Compaction Schedule` : Hudi Write Client exposes API to create Compaction 
plans which contains the list of `file slice`
-    to be compacted atomically in a single compaction commit. Hudi allows 
pluggable strategies for choosing
-    file slices for each compaction runs. This step is typically done inline 
by Writer process as Hudi expects
-    only one schedule is being generated at a time which allows Hudi to 
enforce the constraint that pending compaction
-    plans do not step on each other file-slices. This constraint allows for 
multiple concurrent `Compactors` to run at
-    the same time. Some of the common strategies used for choosing `file 
slice` for compaction are:
-    * BoundedIO - Limit the number of file slices chosen for a compaction plan 
by expected total IO (read + write)
-    needed to complete compaction run
-    * Log File Size - Prefer file-slices with larger amounts of delta log data 
to be merged
-    * Day Based - Prefer file slice belonging to latest day partitions
-
-  * `Compactor` : Hudi provides a separate API in Write Client to execute a 
compaction plan. The compaction
-    plan (just like a commit) is identified by a timestamp. Most of the design 
and implementation complexities for Async
-    Compaction is for guaranteeing snapshot isolation to readers and writer 
when
-    multiple concurrent compactors are running. Typical compactor deployment 
involves launching a separate
-    spark application which executes pending compactions when they become 
available. The core logic of compacting
-    file slices in the Compactor is very similar to that of merging updates in 
a Copy-On-Write table. The only
-    difference being in the case of compaction, there is an additional step of 
merging the records in delta log-files.
-
-Here are the main API to lookup and execute a compaction plan.
-
-```
-   Main API in HoodieWriteClient for running Compaction:
-   /**
-    * Performs Compaction corresponding to instant-time
-    * @param compactionInstantTime   Compaction Instant Time
-    * @return
-    * @throws IOException
-    */
-  public JavaRDD<WriteStatus> compact(String compactionInstantTime) throws 
IOException;
-
-  To lookup all pending compactions, use the API defined in HoodieReadClient
-
-  /**
-   * Return all pending compactions with instant time for clients to decide 
what to compact next.
-   * @return
-   */
-   public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
-```
-API for scheduling compaction
-
-```
-
-          /**
-           * Schedules a new compaction instant
-           * @param extraMetadata
-           * @return Compaction Instant timestamp if a new compaction plan is 
scheduled
-           */
-           Optional<String> scheduleCompaction(Optional<Map<String, String>> 
extraMetadata) throws IOException;
-```
-
-Refer to  __hoodie-client/src/test/java/HoodieClientExample.java__ class for 
an example of how compaction
-is scheduled and executed.
-
-##### Deployment Models
-
-These are typical Hudi Writer and Compaction deployment models
-
-  * `Inline Compaction` : At each round, a single spark application ingests 
new batch to dataset. It then optionally decides to schedule
-   a compaction run and executes it in sequence.
-  * `Single Dedicated Async Compactor` :  The Spark application which brings 
in new changes to dataset (writer) periodically
-     schedules compaction. The Writer application does not run compaction 
inline. A separate spark applications periodically
-     probes for pending compaction and executes the compaction.
-  * ` Multi Async Compactors` : This mode is similar to `Single Dedicated 
Async Compactor` mode. The main difference being
-     now there can be more than one spark application picking different 
compactions and executing them in parallel.
-     In order to ensure compactors do not step on each other, they use 
coordination service like zookeeper to pickup unique
-     pending compaction instants and run them.
-
-The Compaction process requires one executor per file-slice in the compaction 
plan. So, the best resource allocation
-strategy (both in terms of speed and resource usage) for clusters supporting 
dynamic allocation is to lookup the compaction
-plan to be run to figure out the number of file slices being compacted and 
choose that many number of executors.
-
-## Async Compaction Design Deep-Dive (Optional)
-
-For the purpose of this section, it is important to distinguish between 2 
types of commits as pertaining to the file-group:
-
-A commit which generates a merged and read-optimized file-slice is called 
`snapshot commit` (SC) with respect to that file-group.
-A commit which merely appended the new/updated records assigned to the 
file-group into a new log block is called `delta commit` (DC)
-with respect to that file-group.
-
-### Algorithm
-
-The algorithm is described with an illustration. Let us assume a scenario 
where there are commits SC1, DC2, DC3 that have
-already completed on a data-set. Commit DC4 is currently ongoing with the 
writer (ingestion) process using it to upsert data.
-Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set 
whose latest version (`File-Slice`)
-contains the base file created by commit SC1 (snapshot-commit in columnar 
format) and a log file containing row-based
-log blocks of 2 delta-commits (DC2 and DC3).
-
-{% include image.html file="async_compac_1.png" alt="async_compac_1.png" 
max-width="1000" %}
-
- * Writer (Ingestion) that is going to commit "DC4" starts. The record updates 
in this batch are grouped by file-groups
-   and appended in row formats to the corresponding log file as delta commit. 
Let us imagine a subset of file-groups has
-   this new log block (delta commit) DC4 added.
- * Before the writer job completes, it runs the compaction strategy to decide 
which file-group to compact by compactor
-   and creates a new compaction-request commit SC5. This commit file is marked 
as “requested” with metadata denoting
-   which fileIds to compact (based on selection policy). Writer completes 
without running compaction (will be run async).
-
-   {% include image.html file="async_compac_2.png" alt="async_compac_2.png" 
max-width="1000" %}
-
- * Writer job runs again ingesting next batch. It starts with commit DC6. It 
reads the earliest inflight compaction
-   request marker commit in timeline order and collects the (fileId, 
Compaction Commit Id “CcId” ) pairs from meta-data.
-   Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets 
allocated for the file-group.
-   The Writer will simply append records in row-format to the first log-file 
(as delta-commit) assuming the
-   base-file (“Phantom-Base-File”) will be created eventually by the compactor.
-
-   {% include image.html file="async_compac_3.png" alt="async_compac_3.png" 
max-width="1000" %}
-
- * Compactor runs at some time  and commits at “Tc” (concurrently or 
before/after Ingestion DC6). It reads the commit-timeline
-   and finds the first unprocessed compaction request marker commit. Compactor 
reads the commit’s metadata finding the
-   file-slices to be compacted. It compacts the file-slice and creates the 
missing base-file (“Phantom-Base-File”)
-   with “CCId” as the commit-timestamp. Compactor then marks the compaction 
commit timestamp as completed.
-   It is important to realize that at data-set level, there could be different 
file-groups requesting compaction at
-   different commit timestamps.
-
-    {% include image.html file="async_compac_4.png" alt="async_compac_4.png" 
max-width="1000" %}
-
- * Near Real-time reader interested in getting the latest snapshot will have 2 
cases. Let us assume that the
-   incremental ingestion (writer at DC6) happened before the compaction (some 
time “Tc”’).  
-   The below description is with regards to compaction from file-group 
perspective.
-   * `Reader querying at time between ingestion completion time for DC6 and 
compaction finish “Tc”`:
-     Hudi’s implementation will be changed to become aware of file-groups 
currently waiting for compaction and
-     merge log-files corresponding to DC2-DC6 with the base-file corresponding 
to SC1. In essence, Hudi will create
-     a pseudo file-slice by combining the 2 file-slices starting at 
base-commits SC1 and SC5 to one.
-     For file-groups not waiting for compaction, the reader behavior is 
essentially the same - read latest file-slice
-     and merge on the fly.
-   * `Reader querying at time after compaction finished (> “Tc”)` : In this 
case, reader will not find any pending
-     compactions in the timeline and will simply have the current behavior of 
reading the latest file-slice and
-     merging on-the-fly.
-
- * Read-Optimized View readers will query against the latest columnar 
base-file for each file-groups.
-
-The above algorithm explains Async compaction w.r.t a single compaction run on 
a single file-group. It is important
-to note that multiple compaction plans can be run concurrently as they are 
essentially operating on different
-file-groups.
-
-## Performance
-
-In this section, we go over some real world performance numbers for Hudi 
upserts, incremental pull and compare them against
-the conventional alternatives for achieving these tasks.
-
-#### Upsert vs Bulk Loading
-
-Following shows the speed up obtained for NoSQL ingestion, by switching from 
bulk loads off HBase to Parquet to incrementally upserting
-on a Hudi dataset, on 5 tables ranging from small to huge.
-
-{% include image.html file="hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" 
max-width="1000" %}
-
-
-Given Hudi can build the dataset incrementally, it opens doors for also 
scheduling ingesting more frequently thus reducing latency, with
-significant savings on the overall compute cost.
-
-
-{% include image.html file="hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" 
max-width="1000" %}
-
-Hudi upserts have been stress tested upto 4TB in a single commit across the t1 
table.
-
-
-
-#### Copy On Write Regular Query Performance
-
-The major design goal for copy-on-write storage was to achieve the latency 
reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi 
datasets across Hive/Presto/Spark queries.
-
-**Hive**
-
-{% include image.html file="hudi_query_perf_hive.png" 
alt="hudi_query_perf_hive.png" max-width="800" %}
-
-**Spark**
-
-{% include image.html file="hudi_query_perf_spark.png" 
alt="hudi_query_perf_spark.png" max-width="1000" %}
-
-**Presto**
-
-{% include image.html file="hudi_query_perf_presto.png" 
alt="hudi_query_perf_presto.png" max-width="1000" %}
diff --git a/docs/performance.md b/docs/performance.md
new file mode 100644
index 0000000..6795171
--- /dev/null
+++ b/docs/performance.md
@@ -0,0 +1,94 @@
+---
+title: Implementation
+keywords: hudi, index, storage, compaction, cleaning, implementation
+sidebar: mydoc_sidebar
+toc: false
+permalink: performance.html
+---
+## Performance
+
+In this section, we go over some real world performance numbers for Hudi 
upserts, incremental pull and compare them against
+the conventional alternatives for achieving these tasks. Following shows the 
speed up obtained for NoSQL ingestion, 
+by switching from bulk loads off HBase to Parquet to incrementally upserting 
on a Hudi dataset, on 5 tables ranging from small to huge.
+
+{% include image.html file="hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" 
max-width="1000" %}
+
+Given Hudi can build the dataset incrementally, it opens doors for also 
scheduling ingesting more frequently thus reducing latency, with
+significant savings on the overall compute cost.
+
+{% include image.html file="hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" 
max-width="1000" %}
+
+Hudi upserts have been stress tested upto 4TB in a single commit across the t1 
table.
+
+### Tuning
+
+Writing data via Hudi happens as a Spark job and thus general rules of spark 
debugging applies here too. Below is a list of things to keep in mind, if you 
are looking to improving performance or reliability.
+
+**Input Parallelism** : By default, Hudi tends to over-partition input (i.e 
`withParallelism(1500)`), to ensure each Spark partition stays within the 2GB 
limit for inputs upto 500GB. Bump this up accordingly if you have larger 
inputs. We recommend having shuffle parallelism 
`hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast 
input_data_size/500MB
+
+**Off-heap memory** : Hudi writes parquet files and that needs good amount of 
off-heap memory proportional to schema width. Consider setting something like 
`spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if 
you are running into such failures.
+
+**Spark Memory** : Typically, hudi needs to be able to read a single file into 
memory to perform merges or compactions and thus the executor memory should be 
sufficient to accomodate this. In addition, Hoodie caches the input to be able 
to intelligently place data and thus leaving some 
`spark.storage.memoryFraction` will generally help boost performance.
+
+**Sizing files** : Set `limitFileSize` above judiciously, to balance 
ingest/write latency vs number of files & consequently metadata overhead 
associated with it.
+
+**Timeseries/Log data** : Default configs are tuned for database/nosql 
changelogs where individual record sizes are large. Another very popular class 
of data is timeseries/event/log data that tends to be more volumnious with lot 
more records per partition. In such cases
+    - Consider tuning the bloom filter accuracy via 
`.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look 
up time
+    - Consider making a key that is prefixed with time of the event, which 
will enable range pruning & significantly speeding up index lookup.
+
+**GC Tuning** : Please be sure to follow garbage collection tuning tips from 
Spark tuning guide to avoid OutOfMemory errors
+[Must] Use G1/CMS Collector. Sample CMS Flags to add to 
spark.executor.extraJavaOptions :
+
+```
+-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
+````
+
+If it keeps OOMing still, reduce spark memory conservatively: 
`spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to 
spill rather than OOM. (reliably slow vs crashing intermittently)
+
+Below is a full working production config
+
+```
+ spark.driver.extraClassPath    /etc/hive/conf
+ spark.driver.extraJavaOptions    -XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
+ spark.driver.maxResultSize    2g
+ spark.driver.memory    4g
+ spark.executor.cores    1
+ spark.executor.extraJavaOptions    -XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
+ spark.executor.id    driver
+ spark.executor.instances    300
+ spark.executor.memory    6g
+ spark.rdd.compress true
+
+ spark.kryoserializer.buffer.max    512m
+ spark.serializer    org.apache.spark.serializer.KryoSerializer
+ spark.shuffle.memoryFraction    0.2
+ spark.shuffle.service.enabled    true
+ spark.sql.hive.convertMetastoreParquet    false
+ spark.storage.memoryFraction    0.6
+ spark.submit.deployMode    cluster
+ spark.task.cpus    1
+ spark.task.maxFailures    4
+
+ spark.yarn.driver.memoryOverhead    1024
+ spark.yarn.executor.memoryOverhead    3072
+ spark.yarn.max.executor.failures    100
+
+````
+
+
+#### Read Optimized Query Performance
+
+The major design goal for read optimized view is to achieve the latency 
reduction & efficiency gains in previous section,
+with no impact on queries. Following charts compare the Hudi vs non-Hudi 
datasets across Hive/Presto/Spark queries and demonstrate this.
+
+**Hive**
+
+{% include image.html file="hudi_query_perf_hive.png" 
alt="hudi_query_perf_hive.png" max-width="800" %}
+
+**Spark**
+
+{% include image.html file="hudi_query_perf_spark.png" 
alt="hudi_query_perf_spark.png" max-width="1000" %}
+
+**Presto**
+
+{% include image.html file="hudi_query_perf_presto.png" 
alt="hudi_query_perf_presto.png" max-width="1000" %}
diff --git a/docs/sql_queries.md b/docs/querying_data.md
similarity index 98%
rename from docs/sql_queries.md
rename to docs/querying_data.md
index 4dc7493..452c92d 100644
--- a/docs/sql_queries.md
+++ b/docs/querying_data.md
@@ -1,8 +1,8 @@
 ---
-title: SQL Queries
+title: Querying Hudi Datasets
 keywords: hudi, hive, spark, sql, presto
 sidebar: mydoc_sidebar
-permalink: sql_queries.html
+permalink: querying_data.html
 toc: false
 summary: In this page, we go over how to enable SQL queries on Hudi built 
tables.
 ---
diff --git a/docs/quickstart.md b/docs/quickstart.md
index e19bedf..ff33184 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -8,11 +8,10 @@ permalink: quickstart.html
 ---
 
 
-## Download Hudi
 
-Check out code and pull it into Intellij as a normal maven project.
+## Download Hudi
 
-Normally build the maven project, from command line
+Check out code and pull it into Intellij as a normal maven project. Normally 
build the maven project, from command line
 
 ```
 $ mvn clean install -DskipTests -DskipITs
@@ -23,10 +22,11 @@ To work with older version of Hive (pre Hive-1.2.1), use
 $ mvn clean install -DskipTests -DskipITs -Dhive11
 ```
 
-{% include callout.html content="You might want to add your spark jars folder 
to project dependencies under 'Module Setttings', to be able to run Spark from 
IDE" type="info" %}
-
-{% include note.html content="Setup your local hadoop/hive test environment, 
so you can play with entire ecosystem. See 
[this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html)
 for reference" %}
+{% include callout.html content="You might want to add your spark jars folder 
to project dependencies under 'Module Setttings', to be able to run Spark from 
IDE. 
+Setup your local hadoop/hive test environment, so you can play with entire 
ecosystem. 
+See 
[this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html)
 for reference" type="info" %}
 
+<br/>Please refer to [migration guide](migration_guide.html), for recommended 
ways to migrate your existing dataset to Hudi.
 
 ## Version Compatibility
 
diff --git a/docs/incremental_processing.md b/docs/writing_data.md
similarity index 99%
rename from docs/incremental_processing.md
rename to docs/writing_data.md
index 63f4f39..28fd03e 100644
--- a/docs/incremental_processing.md
+++ b/docs/writing_data.md
@@ -1,8 +1,8 @@
 ---
-title: Incremental Processing
+title: Writing Hudi Datasets
 keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
 sidebar: mydoc_sidebar
-permalink: incremental_processing.html
+permalink: writing_data.html
 toc: false
 summary: In this page, we will discuss some available tools for ingesting data 
incrementally & consuming the changes.
 ---

[incubator-hudi-site] 18/19: Simplifying site navigation

Reply via email to