This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi-site.git

commit 94f64a7c7e8c8ec0307fbd78c5ec13ec0a7b9175
Author: Vinoth Chandar <[email protected]>
AuthorDate: Mon Feb 25 07:01:53 2019 -0800

    Revised community, contributing pages
    
     - Community engagement instructions
     - Strawman contribution guide, to get us going
     - Fixed broken image urls from the hudi renames
     - Fixed broken code formatting on couple pages
     - Removed api_setup, roadmap pages and cleaned up structure
---
 .gitignore                                         |   1 +
 docs/README.md                                     |   5 +
 docs/_config.yml                                   |   2 +-
 docs/_data/topnav.yml                              |  24 ++-
 docs/_includes/footer.html                         |   6 +
 docs/_posts/2019-01-18-asf-incubation.md           |  10 ++
 docs/admin_guide.md                                |  22 ++-
 docs/api_docs.md                                   |  10 --
 docs/code_and_design.md                            |  38 -----
 docs/community.md                                  |  38 +++--
 docs/concepts.md                                   |  28 ++--
 docs/configurations.md                             |  38 +++--
 docs/contributing.md                               | 101 +++++++++++++
 docs/dev_setup.md                                  |  13 --
 docs/images/hoodie_cow.png                         | Bin 31136 -> 0 bytes
 docs/images/hoodie_mor.png                         | Bin 56002 -> 0 bytes
 docs/images/hudi_cow.png                           | Bin 0 -> 48994 bytes
 docs/images/hudi_mor.png                           | Bin 0 -> 92073 bytes
 .../{hoodie_timeline.png => hudi_timeline.png}     | Bin
 docs/implementation.md                             | 165 +++++++++++----------
 docs/index.md                                      |   7 +-
 docs/migration_guide.md                            |  70 ++++-----
 docs/quickstart.md                                 |  89 +++++------
 docs/roadmap.md                                    |  14 --
 docs/sql_queries.md                                |   5 +-
 25 files changed, 383 insertions(+), 303 deletions(-)

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..e43b0f9
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+.DS_Store
diff --git a/docs/README.md b/docs/README.md
index 0995250..8593206 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -11,6 +11,11 @@ The site is based on a [Jekyll](https://jekyllrb.com/) theme 
hosted [here](idrat
 
 Simply run `docker-compose build --no-cache && docker-compose up` from the 
`docs` folder and the site should be up & running at `http://localhost:4000`
 
+To see your edits reflected on the site, you may have to bounce the container
+
+ - Stop the existing container by `ctrl+c`-ing the docker-compose program
+ - (or) alternatively via `docker stop docs_server_1`
+ - Bring the container up again using `docker-compose up`
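The bounce steps above can be sketched as a pair of commands (this assumes the default compose container name `docs_server_1` mentioned above; adjust for your compose project):

```shell
# Stop the running docs container (same effect as ctrl+c on docker-compose)
docker stop docs_server_1

# Bring the container back up; your edits should now be reflected
docker-compose up
```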
 
 #### Host OS
 
diff --git a/docs/_config.yml b/docs/_config.yml
index 781bdb6..9f0effd 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -77,7 +77,7 @@ defaults:
 sidebars:
 - mydoc_sidebar
 
-description: "Apache Hudi (pronounced “Hoodie”) is a Spark Library, that 
provides upserts and incremental processing capaibilities on Hadoop datasets"
+description: "Apache Hudi (pronounced “Hoodie”) provides upserts and incremental processing capabilities on Big Data"
 # the description is used in the feed.xml file
 
 # needed for sitemap.xml file only
diff --git a/docs/_data/topnav.yml b/docs/_data/topnav.yml
index 190573a..0042feb 100644
--- a/docs/_data/topnav.yml
+++ b/docs/_data/topnav.yml
@@ -7,24 +7,22 @@ topnav:
       url: /news
     - title: Community
       url: /community.html
-    - title: Github
+    - title: Code
       external_url: https://github.com/uber/hoodie
 
 #Topnav dropdowns
 topnav_dropdowns:
 - title: Topnav dropdowns
   folders:
-    - title: Developer Resources
+    - title: Developers
       folderitems:
-          - title: Setup
-            url: /dev_setup.html
-            output: web
-          - title: API Docs
-            url: /api_docs.html
-            output: web
-          - title: Code Structure
-            url: /code_and_design.html
-            output: web
-          - title: Roadmap
-            url: /roadmap.html
+          - title: Contributing
+            url: /contributing.html
             output: web
+          - title: Wiki/Designs
+            external_url: https://cwiki.apache.org/confluence/display/HUDI
+          - title: Issues
+            external_url: https://issues.apache.org/jira/projects/HUDI/summary
+          - title: Blog
+            external_url: 
https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI
+      
diff --git a/docs/_includes/footer.html b/docs/_includes/footer.html
index 00605db..c920c5c 100755
--- a/docs/_includes/footer.html
+++ b/docs/_includes/footer.html
@@ -8,6 +8,12 @@
                  <a class="footer-link-img" href="https://apache.org">
                    <img src="images/asf_logo.svg" alt="The Apache Software Foundation" height="100px" width="50px"></a>
                   </p>
+                  <p>
+                  Apache Hudi is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the <a href="http://incubator.apache.org/">Apache Incubator</a>.
+                  Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have
+                  stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a
+                  reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
+                  </p>
                 </div>
             </div>
 </footer>
diff --git a/docs/_posts/2019-01-18-asf-incubation.md 
b/docs/_posts/2019-01-18-asf-incubation.md
new file mode 100644
index 0000000..79de37c
--- /dev/null
+++ b/docs/_posts/2019-01-18-asf-incubation.md
@@ -0,0 +1,10 @@
+---
+title:  "Hudi entered Apache Incubator"
+categories:  update
+permalink: strata-talk.html
+tags: [news]
+---
+
+In the coming weeks, we will be moving into our new home at the Apache Incubator.
+
+{% include links.html %}
diff --git a/docs/admin_guide.md b/docs/admin_guide.md
index 7f7e610..3d37d22 100644
--- a/docs/admin_guide.md
+++ b/docs/admin_guide.md
@@ -43,7 +43,9 @@ hoodie->create --path /user/hive/warehouse/table1 --tableName 
hoodie_table_1 --t
 ```
 
 To see the description of hoodie table, use the command:
+
 ```
+
 hoodie:hoodie_table_1->desc
 18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
     _________________________________________________________
@@ -55,6 +57,7 @@ hoodie:hoodie_table_1->desc
     | hoodie.table.name       | hoodie_table_1               |
     | hoodie.table.type       | COPY_ON_WRITE                |
     | hoodie.archivelog.folder|                              |
+
 ```
 
Following is a sample command to connect to a Hoodie dataset that contains uber trips.
@@ -183,7 +186,7 @@ order (See Concepts). The below commands allow users to 
view the file-slices for
  | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta 
Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size - 
compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To 
Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta 
Files - compaction unscheduled|
  
|==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
 [...]
  | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet|
 432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile 
{hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]|
 [] |
- 
+
  hoodie:stock_ticks_mor->
 ```
 
@@ -224,7 +227,7 @@ This is a sequence file that contains a mapping from 
commitNumber => json with r
 
 #### Compactions
 
-To get an idea of the lag between compaction and writer applications, use the 
below command to list down all 
+To get an idea of the lag between compaction and writer applications, use the below command to list all
 pending compactions.
 
 ```
@@ -316,7 +319,7 @@ hoodie:stock_ticks_mor->compaction validate --instant 
20181005222611
 ...
 
    COMPACTION PLAN VALID
-   
+
     
___________________________________________________________________________________________________________________________________________________________________________________________________________________________
     | File Id                             | Base Instant Time| Base Data File  
                                                                                
                                 | Num Delta Files| Valid| Error|
     
|==========================================================================================================================================================================================================================|
@@ -340,14 +343,15 @@ hoodie:stock_ticks_mor->compaction validate --instant 
20181005222601
 
 The following commands must be executed without any other writer/ingestion 
application running.
 
-Sometimes, it becomes necessary to remove a fileId from a compaction-plan 
inorder to speed-up or unblock compaction 
-operation. Any new log-files that happened on this file after the compaction 
got scheduled will be safely renamed 
+Sometimes, it becomes necessary to remove a fileId from a compaction-plan in order to speed up or unblock the compaction
+operation. Any new log-files written to this file after the compaction was scheduled will be safely renamed
so that they are preserved. Hudi provides the following CLI to support it
 
 
 ##### UnScheduling Compaction
 
 ```
+
 hoodie:trips->compaction unscheduleFileId --fileId <FileUUID>
 ....
 No File renames needed to unschedule file from pending compaction. Operation 
successful.
@@ -356,24 +360,28 @@ No File renames needed to unschedule file from pending 
compaction. Operation suc
 
 In other cases, an entire compaction plan needs to be reverted. This is 
supported by the following CLI
 ```
+
 hoodie:trips->compaction unschedule --compactionInstant <compactionInstant>
 .....
 No File renames needed to unschedule pending compaction. Operation successful.
+
 ```
-  
+
 ##### Repair Compaction
 
 The above compaction unscheduling operations could sometimes fail partially 
(e:g -> HDFS temporarily unavailable). With
-partial failures, the compaction operation could become inconsistent with the 
state of file-slices. When you run 
+partial failures, the compaction operation could become inconsistent with the 
state of file-slices. When you run
`compaction validate`, you can notice invalid compaction operations if there are any. In these cases, the repair
command comes to the rescue: it will rearrange the file-slices so that there is no loss and the file-slices are
consistent with the compaction plan.
 
 ```
+
 hoodie:stock_ticks_mor->compaction repair --instant 20181005222611
 ......
 Compaction successfully repaired
 .....
+
 ```
 
 
diff --git a/docs/api_docs.md b/docs/api_docs.md
deleted file mode 100644
index 24bfd6b..0000000
--- a/docs/api_docs.md
+++ /dev/null
@@ -1,10 +0,0 @@
----
-title: API Docs
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: api_docs.html
----
-
-Work In Progress
-
-
diff --git a/docs/code_and_design.md b/docs/code_and_design.md
deleted file mode 100644
index 3baaa97..0000000
--- a/docs/code_and_design.md
+++ /dev/null
@@ -1,38 +0,0 @@
----
-title: Code Structure
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: code_and_design.html
----
-
-## Code & Project Structure
-
- * hoodie-client     : Spark client library to take a bunch of inserts + 
updates and apply them to a Hoodie table
- * hoodie-common     : Common code shared between different artifacts of Hoodie
-
- ## HoodieLogFormat
-
- The following diagram depicts the LogFormat for Hoodie MergeOnRead. Each 
logfile consists of one or more log blocks.
- Each logblock follows the format shown below.
-
- | Field  | Description |
- |-------------- |------------------|
- | MAGIC    | A magic header that marks the start of a block |
- | VERSION  | The version of the LogFormat, this helps define how to switch 
between different log format as it evolves |
- | TYPE     | The type of the log block |
- | HEADER LENGTH | The length of the headers, 0 if no headers |
- | HEADER        | Metadata needed for a log block. For eg. INSTANT_TIME, 
TARGET_INSTANT_TIME, SCHEMA etc. |
- | CONTENT LENGTH |  The length of the content of the log block |
- | CONTENT        | The content of the log block, for example, for a 
DATA_BLOCK, the content is (number of records + actual records) in byte [] |
- | FOOTER LENGTH  | The length of the footers, 0 if no footers |
- | FOOTER         | Metadata needed for a log block. For eg. index entries, a 
bloom filter for records in a DATA_BLOCK etc. |
- | LOGBLOCK LENGTH | The total number of bytes written for a log block, 
typically the SUM(everything_above). This is a LONG. This acts as a reverse 
pointer to be able to traverse the log in reverse.|
-
-
- {% include image.html file="hoodie_log_format_v2.png" 
alt="hoodie_log_format_v2.png" %}
-
-
-
-
-
-
diff --git a/docs/community.md b/docs/community.md
index c508191..c16dc92 100644
--- a/docs/community.md
+++ b/docs/community.md
@@ -6,17 +6,35 @@ toc: false
 permalink: community.html
 ---
 
+## Engage with us
+
+There are several ways to get in touch with the Hudi community.
+
+| When? | Channel to use |
+|-------|--------|
+| For any general questions, user support, development discussions | Dev 
Mailing list ([Subscribe](mailto:[email protected]), 
[Unsubscribe](mailto:[email protected]), 
[Archives](https://lists.apache.org/[email protected])). Empty 
email works for subscribe/unsubscribe |
+| For reporting bugs or issues, or discovering known issues | Please use [ASF Hudi JIRA](https://issues.apache.org/jira/projects/HUDI/summary) |
+| For quick pings & 1-1 chats | Join our [slack 
group](https://join.slack.com/t/apache-hudi/signup) |
+| For proposing large features, changes | Start a Hudi Improvement Process 
(HIP). Instructions coming soon.|
+| For stream of commits, pull requests etc | Commits Mailing list 
([Subscribe](mailto:[email protected]), 
[Unsubscribe](mailto:[email protected]), 
[Archives](https://lists.apache.org/[email protected])) |
+
+If you wish to report a security vulnerability, please contact 
[[email protected]](mailto:[email protected]).
+Apache Hudi follows the typical Apache vulnerability handling 
[process](https://apache.org/security/committers.html#vulnerability-handling).
+
 ## Contributing
-We :heart: contributions. If you find a bug in the library or would like to 
add new features, go ahead and open
-issues or pull requests against this repo. Before you do so, please sign the
-[Apache CLA](https://www.apache.org/licenses/icla.pdf).
-Also, be sure to write unit tests for your bug fix or feature to show that it 
works as expected.
-If the reviewer feels this contributions needs to be in the release notes, 
please add it to CHANGELOG.md as well.
 
-If you want to participate in day-day conversations, please join our [slack 
group](https://join.slack.com/t/apache-hudi/signup).
-If you are from select pre-listed email domains, you can self signup. Others, 
please subscribe to [email protected]
+Apache Hudi community welcomes contributions from anyone!
+
+Here are a few ways you can get involved.
+
+ - Ask (and/or) answer questions on our support channels listed above.
+ - Review code or HIPs
+ - Help improve documentation
+ - Testing; Improving out-of-box experience by reporting bugs
+ - Share new ideas/directions to pursue or propose a new HIP
+ - Contributing code to the project
 
-## Becoming a Committer
+#### Code Contributions
 
-Hoodie has adopted a lot of guidelines set forth in [Google Chromium 
project](https://www.chromium.org/getting-involved/become-a-committer), to 
determine committership proposals. However, given this is a much younger 
project, we would have the contribution bar to be 10-15 non-trivial patches 
instead.
-Additionally, we expect active engagement with the community over a few 
months, in terms of conference/meetup talks, helping out with issues/questions 
on slack/github.
+Useful resources for contributing can be found under the "Developers" top menu.
+Specifically, please refer to the detailed [contribution 
guide](contributing.html).
diff --git a/docs/concepts.md b/docs/concepts.md
index 5ce3fc6..845228a 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -20,7 +20,7 @@ Such key activities include
  * `COMMITS` - A single commit captures information about an **atomic write** 
of a batch of records into a dataset.
        Commits are identified by a monotonically increasing timestamp, 
denoting the start of the write operation.
  * `CLEANS` - Background activity that gets rid of older versions of files in 
the dataset, that are no longer needed.
- * `DELTA_COMMITS` - A single commit captures information about an **atomic 
write** of a batch of records into a 
+ * `DELTA_COMMITS` - A single commit captures information about an **atomic 
write** of a batch of records into a
  MergeOnRead storage type of dataset
  * `COMPACTIONS` - Background activity to reconcile differential data 
structures within Hudi e.g: moving updates from row based log files to columnar 
formats.
 
@@ -37,15 +37,15 @@ only the changed files without say scanning all the time 
buckets > 07:00.
 
 ## Terminologies
 
- * `Hudi Dataset` 
-    A structured hive/spark dataset managed by Hudi. Hudi supports both 
partitioned and non-partitioned Hive tables. 
- * `Commit` 
-    A commit marks a new batch of data applied to a dataset. Hudi maintains  
monotonically increasing timestamps to track commits and guarantees that a 
commit is atomically 
+ * `Hudi Dataset`
+    A structured hive/spark dataset managed by Hudi. Hudi supports both 
partitioned and non-partitioned Hive tables.
+ * `Commit`
+    A commit marks a new batch of data applied to a dataset. Hudi maintains  
monotonically increasing timestamps to track commits and guarantees that a 
commit is atomically
     published.
  * `Commit Timeline`
-    Commit Timeline refers to the sequence of Commits that was applied in 
order on a dataset over its lifetime. 
- * `File Slice` 
-    Hudi provides efficient handling of updates by having a fixed mapping 
between record key to a logical file Id. 
+    Commit Timeline refers to the sequence of Commits that were applied in order on a dataset over its lifetime.
+ * `File Slice`
+    Hudi provides efficient handling of updates by having a fixed mapping from record key to a logical file Id.
     Hudi uses MVCC to provide atomicity and isolation of readers from a 
writer. This means that a logical fileId will
    have many physical versions of it. Each of these physical versions of a file represents a complete view of the
    file as of a commit and is called a File Slice
@@ -69,8 +69,6 @@ Hudi (will) supports the following storage types.
   - Copy On Write : A heavily read optimized storage type, that simply creates 
new versions of files corresponding to the records that changed.
  - Merge On Read : Also provides near-real time datasets, in the order of 5 mins, by shifting some of the write cost to the reads and merging incoming and on-disk data on-the-fly
 
-{% include callout.html content="Hudi is a young project. merge-on-read is 
currently underway. Get involved 
[here](https://github.com/uber/Hudi/projects/1)" type="info" %}
-
Regardless of the storage type, Hudi organizes datasets into a directory structure under a `basepath`,
very similar to Hive tables. A dataset is broken up into partitions, which are folders containing files for that partition.
Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
@@ -92,12 +90,12 @@ commit, such that only columnar data exists. As a result, 
the write amplificatio
Following illustrates how this works conceptually, when data is written into copy-on-write storage and two queries are running on top of it.
 
 
-{% include image.html file="Hudi_cow.png" alt="Hudi_cow.png" %}
+{% include image.html file="hudi_cow.png" alt="hudi_cow.png" %}
 
 
 As data gets written, updates to existing file ids, produce a new version for 
that file id stamped with the commit and
 inserts allocate a new file id and write its first version for that file id. 
These file versions and their commits are color coded above.
-Normal SQL queries running against such dataset (eg: select count(*) counting 
the total records in that partition), first checks the timeline for latest 
commit
+Normal SQL queries running against such a dataset (eg: `select count(*)` counting the total records in that partition), first check the timeline for the latest commit
and filter all but the latest versions of each file id. As you can see, an old query does not see the current inflight commit's files colored in pink,
but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
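The query behavior described above can be sketched as a toy Python model (purely illustrative; the file ids, commit timestamps, and filtering helper are made up, not Hudi's actual code):

```python
def latest_committed_versions(file_versions, committed_commits):
    """Keep, per file id, only the newest version belonging to a committed
    (not inflight/failed) commit -- mirroring how queries filter files."""
    latest = {}
    for file_id, commit_ts, path in file_versions:
        if commit_ts not in committed_commits:
            continue  # inflight/failed commits are invisible to queries
        if file_id not in latest or commit_ts > latest[file_id][0]:
            latest[file_id] = (commit_ts, path)
    return {fid: path for fid, (_, path) in latest.items()}

versions = [
    ("f1", "001", "f1_001.parquet"),
    ("f1", "002", "f1_002.parquet"),  # newer committed version wins
    ("f2", "003", "f2_003.parquet"),  # commit 003 still inflight below
]
print(latest_committed_versions(versions, {"001", "002"}))
# -> {'f1': 'f1_002.parquet'}
```

Queries starting before commit `003` completes simply never see `f2`'s file, which is the write-failure immunity the paragraph above describes.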
 
@@ -118,7 +116,7 @@ their columnar base data, to keep the query performance in 
check (larger append
 
 Following illustrates how the storage works, and shows queries on both 
near-real time table and read optimized table.
 
-{% include image.html file="Hudi_mor.png" alt="Hudi_mor.png" max-width="1000" 
%}
+{% include image.html file="hudi_mor.png" alt="hudi_mor.png" max-width="1000" 
%}
 
 
There are a lot of interesting things happening in this example, which bring out the subtleties in the approach.
@@ -135,8 +133,6 @@ There are lot of interesting things happening in this 
example, which bring out t
  strategy, where we aggressively compact the latest partitions compared to 
older partitions, we could ensure the RO Table sees data
  published within X minutes in a consistent fashion.
 
-{% include callout.html content="Hudi is a young project. merge-on-read is 
currently underway. Get involved 
[here](https://github.com/uber/hoodie/projects/1)" type="info" %}
-
The intention of merge on read storage is to enable near real-time processing directly on top of Hadoop, as opposed to copying
data out to specialized systems, which may not be able to handle the data volume.
 
@@ -156,4 +152,4 @@ data out to specialized systems, which may not be able to 
handle the data volume
 | Trade-off | ReadOptimized | RealTime |
 |-------------- |------------------| ------------------|
 | Data Latency | Higher   | Lower |
-| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + 
row based delta) |
\ No newline at end of file
+| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + 
row based delta) |
diff --git a/docs/configurations.md b/docs/configurations.md
index 50a7e5f..e6602e6 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -136,7 +136,7 @@ summary: "Here we list all possible configurations and what 
they mean"
         Actual value ontained by invoking .toString()</span>
         - [KEYGENERATOR_CLASS_OPT_KEY](#KEYGENERATOR_CLASS_OPT_KEY) (Default: 
com.uber.hoodie.SimpleKeyGenerator) <br/>
        <span style="color:grey">Key generator class that will extract the key out of the incoming `Row` object</span>
-        - 
[COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) 
(Default: _) <br/>
+        - 
[COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) 
(Default: `_`) <br/>
         <span style="color:grey">Option keys beginning with this prefix, are 
automatically added to the commit/deltacommit metadata.
         This is useful to store checkpointing information, in a consistent way 
with the hoodie timeline</span>
 
@@ -160,22 +160,33 @@ summary: "Here we list all possible configurations and 
what they mean"
 
Writing data via Hudi happens as a Spark job and thus general rules of spark debugging apply here too. Below is a list of things to keep in mind, if you are looking to improve performance or reliability.
 
- - **Write operations** : Use `bulkinsert` to load new data into a table, and 
there on use `upsert`/`insert`. 
+**Write operations** : Use `bulkinsert` to load new data into a table, and thereafter use `upsert`/`insert`.
The difference between them is that bulk insert uses a disk-based write path to scale to loading large inputs without the need to cache them.
- - **Input Parallelism** : By default, Hoodie tends to over-partition input 
(i.e `withParallelism(1500)`), to ensure each Spark partition stays within the 
2GB limit for inputs upto 500GB. Bump this up accordingly if you have larger 
inputs. We recommend having shuffle parallelism 
`hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast 
input_data_size/500MB
- - **Off-heap memory** : Hoodie writes parquet files and that needs good 
amount of off-heap memory proportional to schema width. Consider setting 
something like `spark.yarn.executor.memoryOverhead` or 
`spark.yarn.driver.memoryOverhead`, if you are running into such failures.
- - **Spark Memory** : Typically, hoodie needs to be able to read a single file 
into memory to perform merges or compactions and thus the executor memory 
should be sufficient to accomodate this. In addition, Hoodie caches the input 
to be able to intelligently place data and thus leaving some 
`spark.storage.memoryFraction` will generally help boost performance.
- - **Sizing files** : Set `limitFileSize` above judiciously, to balance 
ingest/write latency vs number of files & consequently metadata overhead 
associated with it.
- - **Timeseries/Log data** : Default configs are tuned for database/nosql 
changelogs where individual record sizes are large. Another very popular class 
of data is timeseries/event/log data that tends to be more volumnious with lot 
more records per partition. In such cases
+
+**Input Parallelism** : By default, Hoodie tends to over-partition input (i.e. `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs up to 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that it is at least input_data_size/500MB
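The sizing rule above (parallelism of at least input_data_size/500MB) can be computed with a quick sketch (the function name and the 1 TB example are illustrative, not part of Hudi):

```python
import math

def recommended_shuffle_parallelism(input_bytes, target_bytes=500 * 1024 ** 2):
    # Rule of thumb from above: at least input_data_size / 500MB partitions
    return max(1, math.ceil(input_bytes / target_bytes))

print(recommended_shuffle_parallelism(1024 ** 4))  # 1 TB input -> 2098
```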
+
+**Off-heap memory** : Hoodie writes parquet files and that needs a good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
+
+**Spark Memory** : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accommodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
+
+**Sizing files** : Set `limitFileSize` above judiciously, to balance 
ingest/write latency vs number of files & consequently metadata overhead 
associated with it.
+
+**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more voluminous with a lot more records per partition. In such cases
     - Consider tuning the bloom filter accuracy via 
`.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look 
up time
    - Consider making a key that is prefixed with the time of the event, which will enable range pruning & significantly speed up index lookup.
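The time-prefixed key idea in the last bullet can be sketched as follows (the key format and helper function are hypothetical illustrations, not a Hudi API):

```python
from datetime import datetime

def time_prefixed_key(event_time, natural_key):
    # A leading time component makes key ranges align with event time,
    # so an index lookup for a time window can prune whole key ranges
    return f"{event_time.strftime('%Y%m%d%H%M%S')}_{natural_key}"

print(time_prefixed_key(datetime(2019, 2, 25, 7, 1, 53), "event-42"))
# -> 20190225070153_event-42
```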
- - **GC Tuning** : Please be sure to follow garbage collection tuning tips 
from Spark tuning guide to avoid OutOfMemory errors
-    - [Must] Use G1/CMS Collector. Sample CMS Flags to add to 
spark.executor.extraJavaOptions : ``-XX:NewSize=1g -XX:SurvivorRatio=2 
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/ho [...]
-    - If it keeps OOMing still, reduce spark memory conservatively: 
`spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to 
spill rather than OOM. (reliably slow vs crashing intermittently)
 
- Below is a full working production config
+**GC Tuning** : Please be sure to follow garbage collection tuning tips from the Spark tuning guide to avoid OutOfMemory errors.
+[Must] Use the G1/CMS collector. Sample CMS flags to add to `spark.executor.extraJavaOptions`:
 
- ```
+```
+-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
+```
+
+If it still keeps OOMing, reduce spark memory conservatively: `spark.memory.fraction=0.2, spark.memory.storageFraction=0.2`, allowing it to spill rather than OOM (reliably slow vs crashing intermittently).
+
+Below is a full working production config
+
+```
  spark.driver.extraClassPath    /etc/hive/conf
  spark.driver.extraJavaOptions    -XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
  spark.driver.maxResultSize    2g
@@ -200,4 +211,5 @@ Writing data via Hudi happens as a Spark job and thus 
general rules of spark deb
  spark.yarn.driver.memoryOverhead    1024
  spark.yarn.executor.memoryOverhead    3072
  spark.yarn.max.executor.failures    100
- ```
+
+```
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 0000000..a93ba54
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,101 @@
+---
+title: Developer Setup
+keywords: developer setup
+sidebar: mydoc_sidebar
+toc: false
+permalink: contributing.html
+---
+## Pre-requisites
+
+To contribute code, you need
+
+ - a GitHub account
+ - a Linux (or) macOS development environment with Java JDK 8, Apache Maven 
(3.x+) installed
+ - [Docker](https://www.docker.com/) installed for running demo, integ tests 
or building website
+ - for large contributions, a signed [Individual Contributor License
+   Agreement](https://www.apache.org/licenses/icla.pdf) (ICLA) to the Apache
+   Software Foundation (ASF).
+ - (Recommended) Create an account on [JIRA](https://issues.apache.org/jira/projects/HUDI/summary) to open issues or find similar existing issues.
+ - (Recommended) Join our dev mailing list & Slack channel, listed on the [community](community.html) page.
+
+
+## IDE Setup
+
+To contribute, fork the Hudi code on GitHub and clone your own fork locally. Once cloned, we recommend building per the instructions on [quickstart](quickstart.html).
+
+We have adopted a code style largely based on the [Google Java style guide](https://google.github.io/styleguide/javaguide.html). Please set up your IDE with the style files from [here](../style/).
+These instructions have been tested on IntelliJ. We also recommend setting up the [Save Actions plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format & organize imports on save. The Maven compile lifecycle will fail if there are checkstyle violations.
+
+
+## Lifecycle
+
+Here's a typical lifecycle of events to contribute to Hudi.
+
+ - [Recommended] Share your intent on the mailing list, so that the community can provide early feedback and point out any similar JIRAs or HIPs.
+ - [Optional] If you want to get involved but don't have a project in mind, please check JIRA for small quick-starter tasks.
+ - [Optional] Familiarize yourself with the internals of Hudi using the content on this page, as well as the [wiki](https://cwiki.apache.org/confluence/display/HUDI)
+ - Once you settle on a project/task, please open a new JIRA or assign an existing one to yourself. (If you don't have permissions to do this, please email the dev mailing list with your JIRA id and a short introduction; we'd be happy to add you as a contributor)
+ - Make your code change
+   - Every source file needs to include the Apache license header. Every new 
dependency needs to
+     have an open source license 
[compatible](https://www.apache.org/legal/resolved.html#criteria) with Apache.
+   - Get existing tests to pass using `mvn clean install -DskipITs`
+   - Add adequate tests for your new functionality
+   - [Optional] For involved changes, it's best to also run the entire integration test suite using `mvn clean install`
+   - For website changes, please build the site locally & test navigation, 
formatting & links thoroughly
+ - Format commit messages and the pull request title like `[HUDI-XXX] Fixes 
bug in Spark Datasource`,
+   where you replace HUDI-XXX with the appropriate JIRA issue.
+ - Push your commit to your own fork/branch & create a pull request (PR) 
against the Hudi repo.
+ - If you don't hear back on the PR within 3 days, please send an email to the dev mailing list.
+ - Address code review comments & keep pushing changes to your fork/branch, 
which automatically updates the PR
+ - Before your change can be merged, it should be squashed into a single 
commit for cleaner commit history.
+
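To illustrate the commit message convention above, here is a quick sanity check one could run on a message before pushing (an illustrative snippet, not part of Hudi's tooling):

```python
import re

# Expected shape: "[HUDI-XXX] <summary>", where XXX is the JIRA issue number.
COMMIT_MSG_RE = re.compile(r"^\[HUDI-\d+\] \S.*$")

def valid_commit_msg(msg: str) -> bool:
    """Return True if msg follows the [HUDI-XXX] convention."""
    return COMMIT_MSG_RE.match(msg) is not None

print(valid_commit_msg("[HUDI-123] Fixes bug in Spark Datasource"))  # True
print(valid_commit_msg("fixes a bug"))                               # False
```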
+
+## Releases
+
+ - The Apache Hudi community plans to do minor version releases every 6 weeks or so.
+ - If your contribution is merged onto the `master` branch after the last release, it will become part of the next release.
+ - Website changes are regenerated once a week (until automation is in place to reflect changes immediately)
+
+
+## Accounts and Permissions
+
+ - [Hudi issue tracker 
(JIRA)](https://issues.apache.org/jira/projects/HUDI/issues):
+   Anyone can access it and browse issues. Anyone can register an account and log in
+   to create issues or add comments. Only contributors can be assigned issues. 
If
+   you want to be assigned issues, a PMC member can add you to the project 
contributor
+   group.  Email the dev mailing list to ask to be added as a contributor, and 
include your ASF Jira username.
+
+ - [Hudi Wiki Space](https://cwiki.apache.org/confluence/display/HUDI):
+   Anyone has read access. If you wish to contribute changes, please create an 
account and
+   request edit access on the dev@ mailing list (include your Wiki account 
user ID).
+
+ - Pull requests can only be merged by a HUDI committer, listed 
[here](https://incubator.apache.org/projects/hudi.html)
+
+ - [Voting on a release](https://www.apache.org/foundation/voting.html): 
Everyone can vote.
+   Only Hudi PMC members should mark their votes as binding.
+
+## Communication
+
+All communication is expected to align with the [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct).
+Discussion about contributing code to Hudi happens on the [dev@ mailing 
list](community.html). Introduce yourself!
+
+
+## Code & Project Structure
+
+  * `docker` : Docker containers used by demo and integration tests. Brings up 
a mini data ecosystem locally
+  * `hoodie-cli` : CLI to inspect, manage and administer datasets
+  * `hoodie-client` : Spark client library to take a bunch of inserts + 
updates and apply them to a Hoodie table
+  * `hoodie-common` : Common classes used across modules
+  * `hoodie-hadoop-mr` : InputFormat implementations for ReadOptimized, 
Incremental, Realtime views
+  * `hoodie-hive` : Manage hive tables off Hudi datasets and houses the 
HiveSyncTool
+  * `hoodie-integ-test` : Longer running integration test processes
+  * `hoodie-spark` : Spark datasource for writing and reading Hudi datasets. 
Streaming sink.
+  * `hoodie-utilities` : Houses tools like DeltaStreamer, SnapshotCopier
+  * `packaging` : Poms for building bundles for easier drop-in to Spark, Hive, Presto
+  * `style`  : Code formatting, checkstyle files
+
+
+## Website
+
+[Apache Hudi site](https://hudi.apache.org) is hosted on a special `asf-site` 
branch. Please follow the `README` file under `docs` on that branch for
+instructions on making changes to the website.
diff --git a/docs/dev_setup.md b/docs/dev_setup.md
deleted file mode 100644
index 1bdeec7..0000000
--- a/docs/dev_setup.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-title: Developer Setup
-keywords: developer setup
-sidebar: mydoc_sidebar
-permalink: dev_setup.html
----
-
-### Code Style
-
- We have embraced the code style largely based on [google 
format](https://google.github.io/styleguide/javaguide.html).
- Please setup your IDE with style files from [here](../style/)
- We also recommend setting up the [Save Action 
Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format 
& organize imports on save.
- The Maven Compilation life-cycle will fail if there are checkstyle violations.
diff --git a/docs/images/hoodie_cow.png b/docs/images/hoodie_cow.png
deleted file mode 100644
index bad15a8..0000000
Binary files a/docs/images/hoodie_cow.png and /dev/null differ
diff --git a/docs/images/hoodie_mor.png b/docs/images/hoodie_mor.png
deleted file mode 100644
index 8d7d902..0000000
Binary files a/docs/images/hoodie_mor.png and /dev/null differ
diff --git a/docs/images/hudi_cow.png b/docs/images/hudi_cow.png
new file mode 100644
index 0000000..40aca71
Binary files /dev/null and b/docs/images/hudi_cow.png differ
diff --git a/docs/images/hudi_mor.png b/docs/images/hudi_mor.png
new file mode 100644
index 0000000..100b8f0
Binary files /dev/null and b/docs/images/hudi_mor.png differ
diff --git a/docs/images/hoodie_timeline.png b/docs/images/hudi_timeline.png
similarity index 100%
rename from docs/images/hoodie_timeline.png
rename to docs/images/hudi_timeline.png
diff --git a/docs/implementation.md b/docs/implementation.md
index 6215155..e87a541 100644
--- a/docs/implementation.md
+++ b/docs/implementation.md
@@ -23,7 +23,7 @@ Hudi upsert/insert is merely a Spark DAG, that can be broken 
into two big pieces
 
 Hudi currently provides two choices for indexes : `BloomIndex` and 
`HBaseIndex` to map a record key into the file id to which it belongs to. This 
enables
 us to speed up upserts significantly, without scanning over every record in 
the dataset. Hudi Indices can be classified based on
-their ability to lookup records across partition. A `global` index does not 
need partition information for finding the file-id for a record key 
+their ability to look up records across partitions. A `global` index does not need partition information for finding the file-id for a record key
 but a `non-global` does.
 
 #### HBase Index (global)
@@ -63,8 +63,8 @@ records such that
 
 In this storage, index updation is a no-op, since the bloom filters are 
already written as a part of committing data.
 
-In the case of Copy-On-Write, a single parquet file constitutes one `file 
slice` which contains one complete version of 
-the file 
+In the case of Copy-On-Write, a single parquet file constitutes one `file slice`, which contains one complete version of
+the file.
 
 {% include image.html file="hoodie_log_format_v2.png" 
alt="hoodie_log_format_v2.png" max-width="1000" %}
 
@@ -73,27 +73,27 @@ the file
 In the Merge-On-Read storage model, there are 2 logical components - one for 
ingesting data (both inserts/updates) into the dataset
  and another for creating compacted views. The former is hereby referred to as 
`Writer` while the later
  is referred as `Compactor`.
- 
+
 ##### Merge On Read Writer
- 
+
  At a high level, Merge-On-Read Writer goes through same stages as 
Copy-On-Write writer in ingesting data.
- The key difference here is that updates are appended to latest log (delta) 
file belonging to the latest file slice 
+ The key difference here is that updates are appended to the latest log (delta) file belonging to the latest file slice
  without merging. For inserts, Hudi supports 2 modes:
 
   1. Inserts to Log Files - This is done for datasets that have indexable log files (e.g. global index)
   2. Inserts to parquet files - This is done for datasets that do not have indexable log files, e.g. bloom index
      embedded in parquet files. Hudi treats writing new records in the same way as inserting to Copy-On-Write files.
 
-As in the case of Copy-On-Write, the input tagged records are partitioned such 
that all upserts destined to 
+As in the case of Copy-On-Write, the input tagged records are partitioned such 
that all upserts destined to
 a `file id` are grouped together. This upsert-batch is written as one or more 
log-blocks written to log-files.
 Hudi allows clients to control log file sizes (See [Storage 
Configs](../configurations))
 
 The WriteClient API is same for both Copy-On-Write and Merge-On-Read writers.
- 
+
 With Merge-On-Read, several rounds of data-writes would have resulted in 
accumulation of one or more log-files.
 All these log-files along with base-parquet (if exists) constitute a `file 
slice` which represents one complete version
-of the file. 
-  
+of the file.
+
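Conceptually, reading such a file slice in realtime means merging the base file's records with its delta log blocks, with the latest version of each record key winning. A toy illustration of that merge (purely conceptual, not Hudi's actual implementation):

```python
def merge_file_slice(base_records, log_blocks):
    """Merge base-file records with delta log blocks, applied in commit order."""
    merged = {r["key"]: r for r in base_records}
    for block in log_blocks:
        for record in block:
            merged[record["key"]] = record  # upsert: latest write wins
    return sorted(merged.values(), key=lambda r: r["key"])

base = [{"key": "a", "val": 1}, {"key": "b", "val": 2}]   # written by base commit
logs = [[{"key": "a", "val": 10}],                        # delta commit 1
        [{"key": "c", "val": 3}]]                         # delta commit 2
print(merge_file_slice(base, logs))
```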
 #### Compactor
 
 Realtime Readers will perform in-situ merge of these delta log-files to 
provide the most recent (committed) view of
@@ -106,48 +106,52 @@ Asynchronous Compaction involves 2 steps:
     to be compacted atomically in a single compaction commit. Hudi allows 
pluggable strategies for choosing
     file slices for each compaction runs. This step is typically done inline 
by Writer process as Hudi expects
     only one schedule is being generated at a time which allows Hudi to 
enforce the constraint that pending compaction
-    plans do not step on each other file-slices. This constraint allows for 
multiple concurrent `Compactors` to run at 
+    plans do not step on each other's file-slices. This constraint allows for multiple concurrent `Compactors` to run at
     the same time. Some of the common strategies used for choosing `file 
slice` for compaction are:
-    * BoundedIO - Limit the number of file slices chosen for a compaction plan 
by expected total IO (read + write) 
-    needed to complete compaction run 
+    * BoundedIO - Limit the number of file slices chosen for a compaction plan 
by expected total IO (read + write)
+    needed to complete compaction run
     * Log File Size - Prefer file-slices with larger amounts of delta log data 
to be merged
     * Day Based - Prefer file slice belonging to latest day partitions
-    ```
-        API for scheduling compaction
-          /**
-           * Schedules a new compaction instant
-           * @param extraMetadata
-           * @return Compaction Instant timestamp if a new compaction plan is 
scheduled
-           */
-           Optional<String> scheduleCompaction(Optional<Map<String, String>> 
extraMetadata) throws IOException;
-     ```
+
   * `Compactor` : Hudi provides a separate API in Write Client to execute a 
compaction plan. The compaction
     plan (just like a commit) is identified by a timestamp. Most of the design 
and implementation complexities for Async
     Compaction is for guaranteeing snapshot isolation to readers and writer 
when
     multiple concurrent compactors are running. Typical compactor deployment 
involves launching a separate
     spark application which executes pending compactions when they become 
available. The core logic of compacting
     file slices in the Compactor is very similar to that of merging updates in 
a Copy-On-Write table. The only
-    difference being in the case of compaction, there is an additional step of 
merging the records in delta log-files. 
-    
-    Here are the main API to lookup and execute a compaction plan.
-    ```
-      Main API in HoodieWriteClient for running Compaction:
-       /**
-        * Performs Compaction corresponding to instant-time
-        * @param compactionInstantTime   Compaction Instant Time
-        * @return
-        * @throws IOException
-        */
-        public JavaRDD<WriteStatus> compact(String compactionInstantTime) 
throws IOException;
-    
-      To lookup all pending compactions, use the API defined in 
HoodieReadClient
-    
-      /**
-       * Return all pending compactions with instant time for clients to 
decide what to compact next.
-       * @return
-       */
-      public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
-    ```
+    difference being in the case of compaction, there is an additional step of 
merging the records in delta log-files.
+
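As a rough illustration of the BoundedIO selection strategy described earlier, a selection pass might cap the compaction plan by an expected IO budget. This is a hypothetical, simplified sketch; the ordering by largest delta first is an assumption of the sketch, not necessarily Hudi's behavior:

```python
def bounded_io_select(file_slices, io_budget):
    """Choose file slices for a compaction plan, capping expected total IO.

    file_slices: list of (slice_id, expected_io) pairs, where expected_io
    approximates the read + write cost of compacting that slice.
    """
    plan, used_io = [], 0
    # Assumption of this sketch: prefer slices with the most expected IO first.
    for slice_id, expected_io in sorted(file_slices, key=lambda s: -s[1]):
        if used_io + expected_io <= io_budget:
            plan.append(slice_id)
            used_io += expected_io
    return plan

print(bounded_io_select([("fg1", 5), ("fg2", 3), ("fg3", 4)], io_budget=8))
```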
+Here are the main APIs to look up and execute a compaction plan. The main API in HoodieWriteClient for running
+compaction:
+
+```
+  /**
+   * Performs Compaction corresponding to instant-time
+   * @param compactionInstantTime   Compaction Instant Time
+   * @return
+   * @throws IOException
+   */
+  public JavaRDD<WriteStatus> compact(String compactionInstantTime) throws IOException;
+```
+
+To look up all pending compactions, use the API defined in HoodieReadClient:
+
+```
+  /**
+   * Return all pending compactions with instant time for clients to decide what to compact next.
+   * @return
+   */
+  public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
+```
+The API for scheduling compaction:
+
+```
+  /**
+   * Schedules a new compaction instant
+   * @param extraMetadata
+   * @return Compaction Instant timestamp if a new compaction plan is scheduled
+   */
+  Optional<String> scheduleCompaction(Optional<Map<String, String>> extraMetadata) throws IOException;
+```
 
 Refer to  __hoodie-client/src/test/java/HoodieClientExample.java__ class for 
an example of how compaction
 is scheduled and executed.
@@ -172,65 +176,65 @@ plan to be run to figure out the number of file slices 
being compacted and choos
 
 ## Async Compaction Design Deep-Dive (Optional)
 
-For the purpose of this section, it is important to distinguish between 2 
types of commits as pertaining to the file-group: 
+For the purpose of this section, it is important to distinguish between 2 
types of commits as pertaining to the file-group:
 
 A commit which generates a merged and read-optimized file-slice is called 
`snapshot commit` (SC) with respect to that file-group.
-A commit which merely appended the new/updated records assigned to the 
file-group into a new log block is called `delta commit` (DC) 
+A commit which merely appended the new/updated records assigned to the 
file-group into a new log block is called `delta commit` (DC)
 with respect to that file-group.
 
 ### Algorithm
 
 The algorithm is described with an illustration. Let us assume a scenario 
where there are commits SC1, DC2, DC3 that have
-already completed on a data-set. Commit DC4 is currently ongoing with the 
writer (ingestion) process using it to upsert data. 
-Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set 
whose latest version (`File-Slice`) 
-contains the base file created by commit SC1 (snapshot-commit in columnar 
format) and a log file containing row-based 
-log blocks of 2 delta-commits (DC2 and DC3). 
+already completed on a data-set. Commit DC4 is currently ongoing with the 
writer (ingestion) process using it to upsert data.
+Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set 
whose latest version (`File-Slice`)
+contains the base file created by commit SC1 (snapshot-commit in columnar 
format) and a log file containing row-based
+log blocks of 2 delta-commits (DC2 and DC3).
 
 {% include image.html file="async_compac_1.png" alt="async_compac_1.png" 
max-width="1000" %}
 
- * Writer (Ingestion) that is going to commit "DC4" starts. The record updates 
in this batch are grouped by file-groups 
-   and appended in row formats to the corresponding log file as delta commit. 
Let us imagine a subset of file-groups has 
+ * Writer (Ingestion) that is going to commit "DC4" starts. The record updates 
in this batch are grouped by file-groups
+   and appended in row formats to the corresponding log file as delta commit. 
Let us imagine a subset of file-groups has
    this new log block (delta commit) DC4 added.
- * Before the writer job completes, it runs the compaction strategy to decide 
which file-group to compact by compactor 
-   and creates a new compaction-request commit SC5. This commit file is marked 
as “requested” with metadata denoting 
-   which fileIds to compact (based on selection policy). Writer completes 
without running compaction (will be run async). 
- 
+ * Before the writer job completes, it runs the compaction strategy to decide 
which file-group to compact by compactor
+   and creates a new compaction-request commit SC5. This commit file is marked 
as “requested” with metadata denoting
+   which fileIds to compact (based on selection policy). Writer completes 
without running compaction (will be run async).
+
    {% include image.html file="async_compac_2.png" alt="async_compac_2.png" 
max-width="1000" %}
- 
- * Writer job runs again ingesting next batch. It starts with commit DC6. It 
reads the earliest inflight compaction 
-   request marker commit in timeline order and collects the (fileId, 
Compaction Commit Id “CcId” ) pairs from meta-data. 
-   Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets 
allocated for the file-group. 
-   The Writer will simply append records in row-format to the first log-file 
(as delta-commit) assuming the 
+
+ * Writer job runs again ingesting next batch. It starts with commit DC6. It 
reads the earliest inflight compaction
+   request marker commit in timeline order and collects the (fileId, 
Compaction Commit Id “CcId” ) pairs from meta-data.
+   Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets 
allocated for the file-group.
+   The Writer will simply append records in row-format to the first log-file 
(as delta-commit) assuming the
    base-file (“Phantom-Base-File”) will be created eventually by the compactor.
-   
+
    {% include image.html file="async_compac_3.png" alt="async_compac_3.png" 
max-width="1000" %}
- 
- * Compactor runs at some time  and commits at “Tc” (concurrently or 
before/after Ingestion DC6). It reads the commit-timeline 
-   and finds the first unprocessed compaction request marker commit. Compactor 
reads the commit’s metadata finding the 
-   file-slices to be compacted. It compacts the file-slice and creates the 
missing base-file (“Phantom-Base-File”) 
-   with “CCId” as the commit-timestamp. Compactor then marks the compaction 
commit timestamp as completed. 
-   It is important to realize that at data-set level, there could be different 
file-groups requesting compaction at 
+
+ * Compactor runs at some time and commits at “Tc” (concurrently or before/after Ingestion DC6). It reads the commit-timeline
+   and finds the first unprocessed compaction request marker commit. Compactor 
reads the commit’s metadata finding the
+   file-slices to be compacted. It compacts the file-slice and creates the 
missing base-file (“Phantom-Base-File”)
+   with “CCId” as the commit-timestamp. Compactor then marks the compaction 
commit timestamp as completed.
+   It is important to realize that at data-set level, there could be different 
file-groups requesting compaction at
    different commit timestamps.
- 
+
     {% include image.html file="async_compac_4.png" alt="async_compac_4.png" 
max-width="1000" %}
 
- * Near Real-time reader interested in getting the latest snapshot will have 2 
cases. Let us assume that the 
+ * Near Real-time reader interested in getting the latest snapshot will have 2 
cases. Let us assume that the
   incremental ingestion (writer at DC6) happened before the compaction (some time “Tc”).
-   The below description is with regards to compaction from file-group 
perspective. 
-   * `Reader querying at time between ingestion completion time for DC6 and 
compaction finish “Tc”`: 
-     Hoodie’s implementation will be changed to become aware of file-groups 
currently waiting for compaction and 
-     merge log-files corresponding to DC2-DC6 with the base-file corresponding 
to SC1. In essence, Hudi will create 
-     a pseudo file-slice by combining the 2 file-slices starting at 
base-commits SC1 and SC5 to one. 
-     For file-groups not waiting for compaction, the reader behavior is 
essentially the same - read latest file-slice 
+   The below description is with regards to compaction from file-group 
perspective.
+   * `Reader querying at time between ingestion completion time for DC6 and 
compaction finish “Tc”`:
+     Hoodie’s implementation will be changed to become aware of file-groups 
currently waiting for compaction and
+     merge log-files corresponding to DC2-DC6 with the base-file corresponding 
to SC1. In essence, Hudi will create
+     a pseudo file-slice by combining the 2 file-slices starting at 
base-commits SC1 and SC5 to one.
+     For file-groups not waiting for compaction, the reader behavior is 
essentially the same - read latest file-slice
      and merge on the fly.
-   * `Reader querying at time after compaction finished (> “Tc”)` : In this 
case, reader will not find any pending 
-     compactions in the timeline and will simply have the current behavior of 
reading the latest file-slice and 
+   * `Reader querying at time after compaction finished (> “Tc”)` : In this 
case, reader will not find any pending
+     compactions in the timeline and will simply have the current behavior of 
reading the latest file-slice and
      merging on-the-fly.
-     
- * Read-Optimized View readers will query against the latest columnar 
base-file for each file-groups. 
+
+ * Read-Optimized View readers will query against the latest columnar base-file for each file-group.
 
 The above algorithm explains Async compaction w.r.t a single compaction run on 
a single file-group. It is important
-to note that multiple compaction plans can be run concurrently as they are 
essentially operating on different 
+to note that multiple compaction plans can be run concurrently as they are 
essentially operating on different
 file-groups.
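The reader-side case analysis above, where two file slices are combined into one pseudo file-slice while a compaction is pending, can be sketched as follows (a conceptual toy, not Hudi code):

```python
def slice_to_read(file_slices, compaction_pending):
    """file_slices: list of (base_commit, base_file, log_files), oldest first.

    While a compaction is pending, the newest slice's base file is a
    "phantom" that does not exist yet, so the reader merges the last two
    slices into one pseudo file-slice spanning both.
    """
    if compaction_pending and len(file_slices) >= 2:
        base_commit, base_file, logs = file_slices[-2]
        _, _phantom_base, newer_logs = file_slices[-1]
        return (base_commit, base_file, logs + newer_logs)
    return file_slices[-1]

slices = [("SC1", "base_sc1.parquet", ["DC2.log", "DC3.log"]),
          ("SC5", None, ["DC6.log"])]  # SC5's base file not yet compacted
print(slice_to_read(slices, compaction_pending=True))
```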
 
 ## Performance
@@ -272,4 +276,3 @@ with no impact on queries. Following charts compare the 
Hudi vs non-Hudi dataset
 **Presto**
 
 {% include image.html file="hoodie_query_perf_presto.png" 
alt="hoodie_query_perf_presto.png" max-width="1000" %}
-
diff --git a/docs/index.md b/docs/index.md
index b5b9da7..ad87933 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,12 +4,9 @@ keywords: homepage
 tags: [getting_started]
 sidebar: mydoc_sidebar
 permalink: index.html
-summary: "Hudi lowers data latency across the board, while simultaneously 
achieving orders of magnitude of efficiency over traditional batch processing."
+summary: "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing."
 ---
 
-
-
-
 Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical 
datasets on 
[HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
 or cloud stores and provides three logical views for query access.
 
  * **Read Optimized View** - Provides excellent query performance on pure 
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
@@ -21,4 +18,4 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large 
analytical dat
 
 By carefully managing how data is laid out in storage & how it’s exposed to 
queries, Hudi is able to power a rich data ecosystem where external sources can 
be ingested in near real-time and made available for interactive SQL Engines 
like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), 
while at the same time capable of being consumed incrementally from 
processing/ETL frameworks like [Hive](https://hive.apache.org/) & 
[Spark](https://spark.apache.org/docs/latest/) t [...]
 
+Hudi broadly consists of a self-contained Spark library to build datasets and integrations with existing query engines for data access. See [quickstart](quickstart.html) for a demo.
diff --git a/docs/migration_guide.md b/docs/migration_guide.md
index a5d5506..13c27ac 100644
--- a/docs/migration_guide.md
+++ b/docs/migration_guide.md
@@ -4,9 +4,8 @@ keywords: migration guide
 sidebar: mydoc_sidebar
 permalink: migration_guide.html
 toc: false
-summary: In this page, we will discuss some available tools for migrating your 
existing dataset into a Hudi managed 
-dataset
-
+summary: This page discusses some available tools for migrating your existing dataset into a Hudi dataset
+---
 
 Hudi maintains metadata such as commit timeline and indexes to manage a 
dataset. The commit timelines helps to understand the actions happening on a 
dataset as well as the current state of a dataset. Indexes are used by Hudi to 
maintain a record key to file id mapping to efficiently locate a record. At the 
moment, Hudi supports writing only parquet columnar formats.
 To be able to start using Hudi for your existing dataset, you will need to 
migrate your existing dataset into a Hudi managed dataset. There are a couple 
of ways to achieve this.
@@ -15,57 +14,60 @@ To be able to start using Hudi for your existing dataset, 
you will need to migra
 ## Approaches
 
 
-### Approach 1
+#### Use Hudi for new partitions alone
 
-Hudi can be used to manage an existing dataset without affecting/altering the 
historical data already present in the 
-dataset. Hudi has been implemented to be compatible with such a mixed dataset 
with a caveat that either the complete 
-Hive partition is Hudi managed or not. Thus the lowest granularity at which 
Hudi manages a dataset is a Hive 
-partition. Start using the datasource API or the WriteClient to write to the 
dataset and make sure you start writing 
+Hudi can be used to manage an existing dataset without affecting/altering the 
historical data already present in the
+dataset. Hudi has been implemented to be compatible with such a mixed dataset, with the caveat that a given
+Hive partition is either completely Hudi managed or not at all. Thus the lowest granularity at which Hudi manages a dataset is a Hive
+partition. Start using the datasource API or the WriteClient to write to the 
dataset and make sure you start writing
 to a new partition or convert your last N partitions into Hudi instead of the 
entire table. Note, since the historical
- partitions are not managed by HUDI, none of the primitives provided by HUDI 
work on the data in those partitions. More concretely, one cannot perform 
upserts or incremental pull on such older partitions not managed by the HUDI 
dataset. 
+ partitions are not managed by HUDI, none of the primitives provided by HUDI 
work on the data in those partitions. More concretely, one cannot perform 
upserts or incremental pull on such older partitions not managed by the HUDI 
dataset.
 Take this approach if your dataset is an append only type of dataset and you 
do not expect to perform any updates to existing (or non Hudi managed) 
partitions.
 
 
-### Approach 2
+#### Convert existing dataset to Hudi
 
 Import your existing dataset into a Hudi managed dataset. Since all the data 
is Hudi managed, none of the limitations
- of Approach 1 apply here. Updates spanning any partitions can be applied to 
this dataset and Hudi will efficiently 
- make the update available to queries. Note that not only do you get to use 
all Hoodie primitives on this dataset, 
+ of Approach 1 apply here. Updates spanning any partitions can be applied to 
this dataset and Hudi will efficiently
+ make the update available to queries. Note that not only do you get to use 
all Hoodie primitives on this dataset,
  there are other additional advantages of doing this. Hudi automatically 
manages file sizes of a Hudi managed dataset
- . You can define the desired file size when converting this dataset and Hudi 
will ensure it writes out files 
- adhering to the config. It will also ensure that smaller files later get 
corrected by routing some new inserts into 
+ . You can define the desired file size when converting this dataset and Hudi 
will ensure it writes out files
+ adhering to the config. It will also ensure that smaller files later get 
corrected by routing some new inserts into
  small files rather than writing new small ones thus maintaining the health of 
your cluster.
 
 There are a few options when choosing this approach.
+
 #### Option 1
-Use the HDFSParquetImporter tool. As the name suggests, this only works if 
your existing dataset is in 
-parquet file 
-format. This tool essentially starts a Spark Job to read the existing parquet 
dataset and converts it into a HUDI managed dataset by re-writing all the data. 
-#### Option 2 
+Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in the
+parquet file format. This tool essentially starts a Spark job to read the existing parquet dataset and
+converts it into a HUDI managed dataset by re-writing all the data.
+
+#### Option 2
 For huge datasets, this could be as simple as:

 ```
 for partition in [list of partitions in source dataset] {
     val inputDF = spark.read.format("any_input_format").load("partition_path")
     inputDF.write.format("com.uber.hoodie").option()....save("basePath")
 }
 ```
+
 #### Option 3
 Write your own custom logic of how to load an existing dataset into a Hudi 
managed one. Please read about the RDD API
- [here](quickstart.md).
+ [here](quickstart.html).
 
 ```
-Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean 
install -DskipTests`, the shell can be 
+Using the HDFSParquetImporter tool. Once hoodie has been built via `mvn clean install -DskipTests`, the shell can be
 fired up via `cd hoodie-cli && ./hoodie-cli.sh`.
 
-hoodie->hdfsparquetimport 
-        --upsert false 
-        --srcPath /user/parquet/dataset/basepath 
-        --targetPath 
-        /user/hoodie/dataset/basepath 
-        --tableName hoodie_table 
-        --tableType COPY_ON_WRITE 
-        --rowKeyField _row_key 
-        --partitionPathField partitionStr 
-        --parallelism 1500 
-        --schemaFilePath /user/table/schema 
-        --format parquet 
-        --sparkMemory 6g 
+hoodie->hdfsparquetimport
+        --upsert false
+        --srcPath /user/parquet/dataset/basepath
+        --targetPath
+        /user/hoodie/dataset/basepath
+        --tableName hoodie_table
+        --tableType COPY_ON_WRITE
+        --rowKeyField _row_key
+        --partitionPathField partitionStr
+        --parallelism 1500
+        --schemaFilePath /user/table/schema
+        --format parquet
+        --sparkMemory 6g
         --retry 2
-```
\ No newline at end of file
+```
diff --git a/docs/quickstart.md b/docs/quickstart.md
index f1516ae..1e6fa49 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -13,13 +13,14 @@ permalink: quickstart.html
 Check out code and pull it into IntelliJ as a normal maven project.
 
 Normally build the maven project, from command line
+
 ```
 $ mvn clean install -DskipTests -DskipITs
+```
 
 To work with older version of Hive (pre Hive-1.2.1), use
-
+```
 $ mvn clean install -DskipTests -DskipITs -Dhive11
-
 ```
 
 {% include callout.html content="You might want to add your spark jars folder to project dependencies under 'Module Settings', to be able to run Spark from IDE" type="info" %}
@@ -31,13 +32,13 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
 
 Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. We 
have verified that Hudi works with the following combination of 
Hadoop/Hive/Spark.
 
-| Hadoop | Hive  | Spark | Instructions to Build Hudi | 
+| Hadoop | Hive  | Spark | Instructions to Build Hudi |
 | ---- | ----- | ---- | ---- |
 | 2.6.0-cdh5.7.2 | 1.1.0-cdh5.7.2 | spark-2.[1-3].x | Use “mvn clean install 
-DskipTests -Dhadoop.version=2.6.0-cdh5.7.2 -Dhive.version=1.1.0-cdh5.7.2” |
 | Apache hadoop-2.8.4 | Apache hive-2.3.3 | spark-2.[1-3].x | Use "mvn clean 
install -DskipTests" |
 | Apache hadoop-2.7.3 | Apache hive-1.2.1 | spark-2.[1-3].x | Use "mvn clean 
install -DskipTests" |
 
-If your environment has other versions of hadoop/hive/spark, please try out 
Hudi and let us know if there are any issues. We are limited by our bandwidth 
to certify other combinations. 
+If your environment has other versions of hadoop/hive/spark, please try out 
Hudi and let us know if there are any issues. We are limited by our bandwidth 
to certify other combinations.
 It would be of great help if you can reach out to us with your setup and 
experience with hoodie.
 
 ## Generate a Hudi Dataset
@@ -60,7 +61,7 @@ export 
PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$P
 
 ### Supported API's
 
-Use the DataSource API to quickly start reading or writing Hudi datasets in few lines of code. Ideal for most 
+Use the DataSource API to quickly start reading or writing Hudi datasets in a few lines of code. Ideal for most
 ingestion use-cases.
 Use the RDD API to perform more involved actions on a Hudi dataset
 
@@ -132,11 +133,11 @@ This can be run as frequently as the ingestion pipeline 
to make sure new partiti
 cd hoodie-hive
 ./run_sync_tool.sh
   --user hive
-  --pass hive 
-  --database default 
-  --jdbc-url "jdbc:hive2://localhost:10010/" 
-  --base-path tmp/hoodie/sample-table/ 
-  --table hoodie_test 
+  --pass hive
+  --database default
+  --jdbc-url "jdbc:hive2://localhost:10010/"
+  --base-path tmp/hoodie/sample-table/
+  --table hoodie_test
   --partitioned-by field1,field2
 
 ```
@@ -304,7 +305,7 @@ hive>
 ## A Demo using docker containers
 
 Let's use a real world example to see how Hudi works end to end. For this purpose, a self-contained
-data infrastructure is brought up in a local docker cluster within your 
computer. 
+data infrastructure is brought up in a local docker cluster within your 
computer.
 
 The steps assume you are using a Mac laptop
 
@@ -313,7 +314,7 @@ The steps assume you are using Mac laptop
  * Docker Setup :  For Mac, please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure at least 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, Spark-SQL queries could be killed because of memory issues.
   * kafkacat : A command-line utility to publish/consume from kafka topics. 
Use `brew install kafkacat` to install kafkacat
   * /etc/hosts : The demo references many services running in container by the 
hostname. Add the following settings to /etc/hosts
-  
+
   ```
    127.0.0.1 adhoc-1
    127.0.0.1 adhoc-2
@@ -378,15 +379,15 @@ At this point, the docker cluster will be up and running. 
The demo cluster bring
    * HDFS Services (NameNode, DataNode)
    * Spark Master and Worker
    * Hive Services (Metastore, HiveServer2 along with PostgresDB)
-   * Kafka Broker and a Zookeeper Node (Kakfa will be used as upstream source 
for the demo) 
+   * Kafka Broker and a Zookeeper Node (Kafka will be used as upstream source for the demo)
    * Adhoc containers to run Hudi/Hive CLI commands
 
 ### Demo
 
-Stock Tracker data will be used to showcase both different Hudi Views and the 
effects of Compaction. 
+Stock Tracker data will be used to showcase the different Hudi Views and the effects of Compaction.
 
-Take a look at the directory `docker/demo/data`. There are 2 batches of stock 
data - each at 1 minute granularity. 
-The first batch contains stocker tracker data for some stock symbols during 
the first hour of trading window 
+Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
+The first batch contains stock tracker data for some stock symbols during the first hour of the trading window
(9:30 a.m to 10:30 a.m). The second batch contains tracker data for the next 30 mins (10:30 - 11 a.m). Hudi will
 be used to ingest these batches to a dataset which will contain the latest 
stock tracker data at hour level granularity.
 The batches are windowed intentionally so that the second batch contains 
updates to some of the rows in the first batch.
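The upsert effect of the two windowed batches can be sketched abstractly (a pure-Python stand-in, not the Hudi API): keyed by (symbol, hour), rows from the second batch overwrite matching rows from the first, so the dataset ends up holding the latest data per hour.

```python
# Abstract sketch of the demo's upsert semantics: records are keyed by
# (symbol, hour); a later batch overwrites earlier rows with the same key.
# The records below are illustrative, mirroring the demo's timestamps.
batch_1 = [{"symbol": "GOOG", "hour": "09", "ts": "09:59"},
           {"symbol": "GOOG", "hour": "10", "ts": "10:29"}]
batch_2 = [{"symbol": "GOOG", "hour": "10", "ts": "10:59"}]  # updates hour 10

dataset = {}
for batch in (batch_1, batch_2):
    for rec in batch:
        dataset[(rec["symbol"], rec["hour"])] = rec  # upsert by key

latest = max(r["ts"] for r in dataset.values())
```

After both batches, the hour-10 row carries "10:59", which is exactly the value the realtime view returns later in the demo.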
@@ -396,7 +397,7 @@ The batches are windowed intentionally so that the second 
batch contains updates
 Upload the first batch to Kafka topic 'stock_ticks'
 
 ```
-cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P 
+cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
 
 To check if the new topic shows up, use
 kafkacat -b kafkabroker -L -J | jq .
@@ -443,7 +444,7 @@ kafkacat -b kafkabroker -L -J | jq .
 
 Hudi comes with a tool named DeltaStreamer. This tool can connect to a variety of data sources (including Kafka) to
 pull changes and apply them to a Hudi dataset using upsert/insert primitives. Here, we will use the tool to download
-json data from kafka topic and ingest to both COW and MOR tables we 
initialized in the previous step. This tool 
+json data from the kafka topic and ingest into both COW and MOR tables we initialized in the previous step. This tool
 automatically initializes the datasets in the file-system if they do not exist 
yet.
 
 ```
@@ -468,8 +469,8 @@ spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
 exit
 ```
 
-You can use HDFS web-browser to look at the datasets 
-`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`. 
+You can use the HDFS web-browser to look at the datasets
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
 
 You can explore the new partition folder created in the dataset along with a 
"deltacommit"
 file under .hoodie which signals a successful commit.
@@ -501,7 +502,7 @@ docker exec -it adhoc-2 /bin/bash
 ....
 exit
 ```
-After executing the above command, you will notice 
+After executing the above command, you will notice
 
 1. A hive table named `stock_ticks_cow` was created, which provides the Read-Optimized view for the Copy On Write dataset.
 2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` were created for the Merge On Read dataset. The former
@@ -511,7 +512,7 @@ provides the ReadOptimized view for the Hudi dataset and 
the later provides the
 #### Step 4 (a): Run Hive Queries
 
 Run a hive query to find the latest timestamp ingested for stock symbol 
'GOOG'. You will notice that both read-optimized
-(for both COW and MOR dataset)and realtime views (for MOR dataset)give the 
same value "10:29 a.m" as Hudi create a 
+(for both COW and MOR datasets) and realtime views (for MOR dataset) give the same value "10:29 a.m" as Hudi creates a
 parquet file for the first batch of data.
 
 ```
@@ -565,7 +566,7 @@ Now, run a projection query:
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. Lets look at both 
+Let's run similar queries against the M-O-R dataset. Let's look at both
 ReadOptimized and Realtime views supported by M-O-R dataset
 
 # Run against ReadOptimized View. Notice that the latest timestamp is 10:29
@@ -670,7 +671,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, 
volume, open, close
 # Merge-On-Read Queries:
 ==========================
 
-Lets run similar queries against M-O-R dataset. Lets look at both 
+Let's run similar queries against the M-O-R dataset. Let's look at both
 ReadOptimized and Realtime views supported by M-O-R dataset
 
 # Run against ReadOptimized View. Notice that the latest timestamp is 10:29
@@ -718,7 +719,7 @@ Upload the second batch of data and ingest this batch using 
delta-streamer. As t
 partitions, there is no need to run hive-sync
 
 ```
-cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P 
+cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
 
 # Within Docker container, run the ingestion command
 docker exec -it adhoc-2 /bin/bash
@@ -734,15 +735,15 @@ exit
 With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a 
new version of Parquet file getting created.
 See 
`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
 
-With Merge-On-Read table, the second ingestion merely appended the batch to an 
unmerged delta (log) file. 
+With Merge-On-Read table, the second ingestion merely appended the batch to an 
unmerged delta (log) file.
 Take a look at the HDFS filesystem to get an idea: 
`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
 
 #### Step 6(a): Run Hive Queries
 
-With Copy-On-Write table, the read-optimized view immediately sees the changes 
as part of second batch once the batch 
-got committed as each ingestion creates newer versions of parquet files. 
+With Copy-On-Write table, the read-optimized view immediately sees the changes as part of the second batch once the batch
+got committed, as each ingestion creates newer versions of parquet files.
 
-With Merge-On-Read table, the second ingestion merely appended the batch to an 
unmerged delta (log) file. 
+With Merge-On-Read table, the second ingestion merely appended the batch to an 
unmerged delta (log) file.
 This is the time when ReadOptimized and Realtime views will provide different results. ReadOptimized view will still
 return "10:29 am" as it will only read from the Parquet file. Realtime View 
will do on-the-fly merge and return
 latest committed data which is "10:59 a.m".
@@ -773,7 +774,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be 
available in the futu
 As you can notice, the above queries now reflect the changes that came as part 
of ingesting second batch.
 
 
-# Merge On Read Table: 
+# Merge On Read Table:
 
 # Read Optimized View
 0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor 
group by symbol HAVING symbol = 'GOOG';
@@ -843,7 +844,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, 
volume, open, close
 As you can notice, the above queries now reflect the changes that came as part 
of ingesting second batch.
 
 
-# Merge On Read Table: 
+# Merge On Read Table:
 
 # Read Optimized View
 scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol 
HAVING symbol = 'GOOG'").show(100, false)
@@ -909,8 +910,8 @@ To show the effects of incremental-query, let us assume 
that a reader has alread
 ingesting first batch. Now, for the reader to see effect of the second batch, 
he/she has to keep the start timestamp to
 the commit time of the first batch (20180924064621) and run incremental query
 
-`Hudi incremental mode` provides efficient scanning for incremental queries by 
filtering out files that do not have any 
-candidate rows using hudi-managed metadata. 
+`Hudi incremental mode` provides efficient scanning for incremental queries by 
filtering out files that do not have any
+candidate rows using hudi-managed metadata.
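The filtering idea can be sketched abstractly: given the consumer's begin instant, only files carrying commits newer than that instant need to be scanned. A minimal Python illustration (the file-level commit metadata here is hypothetical; Hudi tracks this in its own managed metadata):

```python
# Sketch of incremental-mode file filtering: skip files whose latest
# commit is not newer than the consumer's begin instant.
# File entries are hypothetical stand-ins for Hudi's file-level metadata.
files = [
    {"path": "part-0001.parquet", "latest_commit": "20180924064621"},
    {"path": "part-0002.parquet", "latest_commit": "20180924070031"},
]
begin_instant = "20180924064621"  # commit time of the first batch

# Commit instants are fixed-width timestamps, so string comparison works.
candidates = [f["path"] for f in files if f["latest_commit"] > begin_instant]
```

Only the second file has candidate rows for the incremental query; the first is filtered out without being read.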
 
 ```
 docker exec -it adhoc-2 /bin/bash
@@ -1008,7 +1009,7 @@ hoodie:stock_ticks_mor->compactions show all
     ___________________________________________________________________
     | Compaction Instant Time| State    | Total FileIds to be Compacted|
     |==================================================================|
-    
+
 # Schedule a compaction. This will use Spark Launcher to schedule compaction
 hoodie:stock_ticks_mor->compaction schedule
 ....
@@ -1028,7 +1029,7 @@ hoodie:stock_ticks_mor->compactions show all
     ___________________________________________________________________
     | Compaction Instant Time| State    | Total FileIds to be Compacted|
     |==================================================================|
-    | 20180924070031         | REQUESTED| 1                            | 
+    | 20180924070031         | REQUESTED| 1                            |
 
 # Execute the compaction. The compaction instant value passed below must be 
the one displayed in the above "compactions show all" query
 hoodie:stock_ticks_mor->compaction run --compactionInstant  20180924070031 
--parallelism 2 --sparkMemory 1G  --schemaFilePath /var/demo/config/schema.avsc 
--retry 1  
@@ -1052,7 +1053,7 @@ hoodie:stock_ticks->compactions show all
     |==================================================================|
     | 20180924070031         | COMPLETED| 1                            |
 
-``` 
+```
 
 #### Step 9: Run Hive Queries including incremental queries
 
@@ -1169,9 +1170,9 @@ You can bring up a hadoop docker environment containing 
Hadoop, Hive and Spark s
 ```
 $ mvn pre-integration-test -DskipTests
 ```
-The above command builds docker images for all the services with 
-current Hudi source installed at /var/hoodie/ws and also brings up the 
services using a compose file. We 
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker 
images. 
+The above command builds docker images for all the services with
+current Hudi source installed at /var/hoodie/ws and also brings up the 
services using a compose file. We
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker 
images.
 
 To bring down the containers
 ```
@@ -1185,9 +1186,9 @@ $ cd hoodie-integ-test
 $  mvn docker-compose:up -DdetachedMode=true
 ```
 
-Hudi is a library that is operated in a broader data analytics/ingestion 
environment 
+Hudi is a library that is operated in a broader data analytics/ingestion 
environment
 involving Hadoop, Hive and Spark. Interoperability with all these systems is a 
key objective for us. We are
-actively adding integration-tests under __hoodie-integ-test/src/test/java__ 
that makes use of this 
+actively adding integration-tests under __hoodie-integ-test/src/test/java__ that make use of this
 docker environment (See 
__hoodie-integ-test/src/test/java/com/uber/hoodie/integ/ITTestHoodieSanity.java__
 )
 
 
@@ -1202,10 +1203,10 @@ and compose scripts are carefully implemented so that 
they serve dual-purpose
    inbuilt jars by mounting local HUDI workspace over the docker location
 
 This helps avoid maintaining separate docker images and avoids the costly step 
of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth, 
they can still build local images 
-run the script 
+But if users want to test hudi from locations with lower network bandwidth, they can still build local images and
+run the script
 `docker/build_local_docker_images.sh` to build local docker images before 
running `docker/setup_demo.sh`
- 
+
 Here are the commands:
 
 ```
diff --git a/docs/roadmap.md b/docs/roadmap.md
deleted file mode 100644
index c65c3a9..0000000
--- a/docs/roadmap.md
+++ /dev/null
@@ -1,14 +0,0 @@
----
-title: Roadmap
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: roadmap.html
----
-
-## Planned Features
-
-* Support for Self Joins - As of now, you cannot incrementally consume the 
same table more than once, since the InputFormat does not understand the 
QueryPlan.
-* Hudi Spark Datasource -  Allows for reading and writing data back using 
Apache Spark natively (without falling back to InputFormat), which can be more 
performant
-* Hudi Presto Connector - Allows for querying data managed by Hudi using 
Presto natively, which can again boost 
[performance](https://prestodb.io/docs/current/release/release-0.138.html)
-
-
diff --git a/docs/sql_queries.md b/docs/sql_queries.md
index 955e794..44848eb 100644
--- a/docs/sql_queries.md
+++ b/docs/sql_queries.md
@@ -62,7 +62,4 @@ 
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.clas
 
 ## Presto
 
-Presto requires a [patch](https://github.com/prestodb/presto/pull/7002) (until 
the PR is merged) and the hoodie-hadoop-mr-bundle jar to be placed
-into `<presto_install>/plugin/hive-hadoop2/`.
-
-{% include callout.html content="Get involved to improve this integration 
[here](https://github.com/uber/hoodie/issues/81)" type="info" %}
+Presto requires the `hoodie-presto-bundle` jar to be placed into 
`<presto_install>/plugin/hive-hadoop2/`, across the installation.
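As a sketch, the placement step might look like the following shell commands (the install path and the bundle jar name/version are assumptions; the copy has to be repeated on every node of the Presto installation):

```shell
# Illustrative only: PRESTO_HOME and the bundle jar name/version are assumptions.
PRESTO_HOME="${PRESTO_HOME:-/tmp/presto}"
BUNDLE_JAR="hoodie-presto-bundle-0.4.7.jar"

mkdir -p "$PRESTO_HOME/plugin/hive-hadoop2"
touch "$BUNDLE_JAR"   # stand-in for the jar produced by the hudi build

# Repeat this copy on every node, then restart the Presto server.
cp "$BUNDLE_JAR" "$PRESTO_HOME/plugin/hive-hadoop2/"
```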
