This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a3135e7 [HUDI-510] Update site documentation in sync with cWiki
a3135e7 is described below
commit a3135e7be641d3936b3d301443fe9f489925dfc8
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Mon Jan 20 17:12:44 2020 -0800
[HUDI-510] Update site documentation in sync with cWiki
---
docs/_docs/0_3_migration_guide.md | 45 +++++++-------
docs/_docs/1_1_quick_start_guide.md | 64 +++++++++----------
docs/_docs/1_2_structure.md | 12 ++--
docs/_docs/1_3_use_cases.md | 12 ++--
docs/_docs/1_5_comparison.md | 2 +-
docs/_docs/2_1_concepts.md | 121 ++++++++++++++++++------------------
docs/_docs/2_2_writing_data.md | 68 +++++++++-----------
docs/_docs/2_3_querying_data.md | 76 +++++++++++-----------
docs/_docs/2_4_configurations.md | 38 +++++------
docs/_docs/2_5_performance.md | 8 +--
10 files changed, 222 insertions(+), 224 deletions(-)
diff --git a/docs/_docs/0_3_migration_guide.md
b/docs/_docs/0_3_migration_guide.md
index 053dcf4..25c70f6 100644
--- a/docs/_docs/0_3_migration_guide.md
+++ b/docs/_docs/0_3_migration_guide.md
@@ -2,12 +2,12 @@
title: Migration Guide
keywords: hudi, migration, use case
permalink: /docs/migration_guide.html
-summary: In this page, we will discuss some available tools for migrating your
existing dataset into a Hudi dataset
+summary: In this page, we will discuss some available tools for migrating your
existing table into a Hudi table
last_modified_at: 2019-12-30T15:59:57-04:00
---
-Hudi maintains metadata such as commit timeline and indexes to manage a
dataset. The commit timelines helps to understand the actions happening on a
dataset as well as the current state of a dataset. Indexes are used by Hudi to
maintain a record key to file id mapping to efficiently locate a record. At the
moment, Hudi supports writing only parquet columnar formats.
-To be able to start using Hudi for your existing dataset, you will need to
migrate your existing dataset into a Hudi managed dataset. There are a couple
of ways to achieve this.
+Hudi maintains metadata such as commit timeline and indexes to manage a table.
The commit timeline helps to understand the actions happening on a table as
well as the current state of a table. Indexes are used by Hudi to maintain a
record key to file id mapping to efficiently locate a record. At the moment,
Hudi supports writing only the parquet columnar format.
+To be able to start using Hudi for your existing table, you will need to
migrate your existing table into a Hudi managed table. There are a couple of
ways to achieve this.
## Approaches
@@ -15,51 +15,50 @@ To be able to start using Hudi for your existing dataset,
you will need to migra
### Use Hudi for new partitions alone
-Hudi can be used to manage an existing dataset without affecting/altering the
historical data already present in the
-dataset. Hudi has been implemented to be compatible with such a mixed dataset
with a caveat that either the complete
-Hive partition is Hudi managed or not. Thus the lowest granularity at which
Hudi manages a dataset is a Hive
-partition. Start using the datasource API or the WriteClient to write to the
dataset and make sure you start writing
+Hudi can be used to manage an existing table without affecting/altering the
historical data already present in the
+table. Hudi has been implemented to be compatible with such a mixed table with
the caveat that a given
+Hive partition must be either completely Hudi managed or not at all. Thus the lowest granularity at which
Hudi manages a table is a Hive
+partition. Start using the datasource API or the WriteClient to write to the
table and make sure you start writing
to a new partition or convert your last N partitions into Hudi instead of the
entire table. Note, since the historical
- partitions are not managed by HUDI, none of the primitives provided by HUDI
work on the data in those partitions. More concretely, one cannot perform
upserts or incremental pull on such older partitions not managed by the HUDI
dataset.
-Take this approach if your dataset is an append only type of dataset and you
do not expect to perform any updates to existing (or non Hudi managed)
partitions.
+ partitions are not managed by Hudi, none of the primitives provided by Hudi
work on the data in those partitions. More concretely, one cannot perform
upserts or incremental pull on such older partitions not managed by the Hudi
table.
+Take this approach if your table is an append-only table and you do
not expect to perform any updates to existing (or non Hudi managed) partitions.
-### Convert existing dataset to Hudi
+### Convert existing table to Hudi
-Import your existing dataset into a Hudi managed dataset. Since all the data
is Hudi managed, none of the limitations
- of Approach 1 apply here. Updates spanning any partitions can be applied to
this dataset and Hudi will efficiently
- make the update available to queries. Note that not only do you get to use
all Hudi primitives on this dataset,
- there are other additional advantages of doing this. Hudi automatically
manages file sizes of a Hudi managed dataset
- . You can define the desired file size when converting this dataset and Hudi
will ensure it writes out files
+Import your existing table into a Hudi managed table. Since all the data is
Hudi managed, none of the limitations
+ of Approach 1 apply here. Updates spanning any partitions can be applied to
this table and Hudi will efficiently
+ make the update available to queries. Note that not only do you get to use
all Hudi primitives on this table,
+ there are also additional advantages to doing this. Hudi automatically
manages file sizes of a Hudi managed table.
+ You can define the desired file size when converting this table and Hudi
will ensure it writes out files
adhering to the config. It will also ensure that smaller files later get
corrected by routing some new inserts into
small files rather than writing new small ones thus maintaining the health of
your cluster.
There are a few options when choosing this approach.
**Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if
your existing dataset is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet dataset
and converts it into a HUDI managed dataset by re-writing all the data.
+Use the HDFSParquetImporter tool. As the name suggests, this only works if
your existing table is in parquet file format.
+This tool essentially starts a Spark job to read the existing parquet table
and converts it into a Hudi managed table by re-writing all the data.
**Option 2**
-For huge datasets, this could be as simple as :
+For huge tables, this could be as simple as:
```java
-for partition in [list of partitions in source dataset] {
+for partition in [list of partitions in source table] {
val inputDF =
spark.read.format("any_input_format").load("partition_path")
inputDF.write.format("org.apache.hudi").option()....save("basePath")
}
```
**Option 3**
-Write your own custom logic of how to load an existing dataset into a Hudi
managed one. Please read about the RDD API
+Write your own custom logic of how to load an existing table into a Hudi
managed one. Please read about the RDD API
[here](/docs/quick-start-guide.html). To use the HDFSParquetImporter tool:
once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired up via `cd hudi-cli && ./hudi-cli.sh`.
```java
hudi->hdfsparquetimport
--upsert false
- --srcPath /user/parquet/dataset/basepath
- --targetPath
- /user/hoodie/dataset/basepath
+ --srcPath /user/parquet/table/basepath
+ --targetPath /user/hoodie/table/basepath
--tableName hoodie_table
--tableType COPY_ON_WRITE
--rowKeyField _row_key
diff --git a/docs/_docs/1_1_quick_start_guide.md
b/docs/_docs/1_1_quick_start_guide.md
index 876a3a2..e7c7f37 100644
--- a/docs/_docs/1_1_quick_start_guide.md
+++ b/docs/_docs/1_1_quick_start_guide.md
@@ -6,8 +6,8 @@ last_modified_at: 2019-12-30T15:59:57-04:00
---
This guide provides a quick peek at Hudi's capabilities using spark-shell.
Using Spark datasources, we will walk through
-code snippets that allows you to insert and update a Hudi dataset of default
storage type:
-[Copy on Write](/docs/concepts.html#copy-on-write-storage).
+code snippets that allow you to insert and update a Hudi table of default
table type:
+[Copy on Write](/docs/concepts.html#copy-on-write-table).
After each write operation we will also show how to read the data both
snapshot and incrementally.
## Setup spark-shell
@@ -30,8 +30,8 @@ import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
-val tableName = "hudi_cow_table"
-val basePath = "file:///tmp/hudi_cow_table"
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
```
@@ -42,7 +42,7 @@ can generate sample inserts and updates based on the the
sample trip schema [her
## Insert data
-Generate some new trips, load them into a DataFrame and write the DataFrame
into the Hudi dataset as below.
+Generate some new trips, load them into a DataFrame and write the DataFrame
into the Hudi table as below.
```scala
val inserts = convertToStringList(dataGen.generateInserts(10))
@@ -57,12 +57,12 @@ df.write.format("org.apache.hudi").
save(basePath);
```
-`mode(Overwrite)` overwrites and recreates the dataset if it already exists.
-You can check the data generated under
`/tmp/hudi_cow_table/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](#sample-schema)), partition field (`region/county/city`)
and combine logic (`ts` in
-[schema](#sample-schema)) to ensure trip records are unique within each
partition. For more info, refer to
+`mode(Overwrite)` overwrites and recreates the table if it already exists.
+You can check the data generated under
`/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
+(`uuid` in
[schema](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)),
partition field (`region/county/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58))
to ensure trip records are unique within each partition. For more info, refer
to
[Modeling data stored in
Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
-and for info on ways to ingest data into Hudi, refer to [Writing Hudi
Datasets](/docs/writing_data.html).
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi
Tables](/docs/writing_data.html).
Here we are using the default write operation: `upsert`. If you have a
workload without updates, you can also issue
`insert` or `bulk_insert` operations which could be faster. To know more,
refer to [Write operations](/docs/writing_data#write-operations)
{: .notice--info}
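The operation can be switched per write via the datasource options. A hedged
sketch only, assuming the 0.5.x constants from the `DataSourceWriteOptions`
and `HoodieWriteConfig` imports above, plus the quickstart's sample field
names:

```scala
// Sketch: same write as above, but issuing a bulk_insert instead of the
// default upsert. OPERATION_OPT_KEY / BULK_INSERT_OPERATION_OPT_VAL are
// assumed 0.5.x constant names; Append assumes the quickstart's
// org.apache.spark.sql.SaveMode._ import.
df.write.format("org.apache.hudi").
  option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```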
@@ -72,24 +72,24 @@ Here we are using the default write operation : `upsert`.
If you have a workload
Load the data files into a DataFrame.
```scala
-val roViewDF = spark.
+val tripsSnapshotDF = spark.
read.
format("org.apache.hudi").
load(basePath + "/*/*/*/*")
-roViewDF.createOrReplaceTempView("hudi_ro_table")
-spark.sql("select fare, begin_lon, begin_lat, ts from hudi_ro_table where
fare > 20.0").show()
-spark.sql("select _hoodie_commit_time, _hoodie_record_key,
_hoodie_partition_path, rider, driver, fare from hudi_ro_table").show()
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot
where fare > 20.0").show()
+spark.sql("select _hoodie_commit_time, _hoodie_record_key,
_hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
```
-This query provides a read optimized view of the ingested data. Since our
partition path (`region/country/city`) is 3 levels nested
+This query performs a snapshot query over the ingested data. Since our
partition path (`region/country/city`) is 3 levels nested
from the base path, we've used `load(basePath + "/*/*/*/*")`.
-Refer to [Storage Types and Views](/docs/concepts#storage-types--views) for
more info on all storage types and views supported.
+Refer to [Table types and queries](/docs/concepts#table-types--queries) for
more info on all table types and query types supported.
{: .notice--info}
## Update data
This is similar to inserting new data. Generate updates to existing trips
using the data generator, load into a DataFrame
-and write DataFrame into the hudi dataset.
+and write the DataFrame into the Hudi table.
```scala
val updates = convertToStringList(dataGen.generateUpdates(10))
@@ -104,15 +104,15 @@ df.write.format("org.apache.hudi").
save(basePath);
```
-Notice that the save mode is now `Append`. In general, always use append mode
unless you are trying to create the dataset for the first time.
-[Querying](#query-data) the data again will now show updated trips. Each write
operation generates a new
[commit](http://hudi.incubator.apache.org/concepts.html)
+Notice that the save mode is now `Append`. In general, always use append mode
unless you are trying to create the table for the first time.
+[Querying](#query-data) the data again will now show updated trips. Each write
operation generates a new
[commit](http://hudi.incubator.apache.org/docs/concepts.html)
denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`,
`driver` fields for the same `_hoodie_record_key`s in previous commit.
{: .notice--info}
## Incremental query
Hudi also provides the capability to obtain a stream of records that changed
since a given commit timestamp.
-This can be achieved using Hudi's incremental view and providing a begin time
from which changes need to be streamed.
+This can be achieved using Hudi's incremental querying and providing a begin
time from which changes need to be streamed.
We do not need to specify endTime if we want all changes after the given
commit (as is the common case).
```scala
@@ -121,20 +121,20 @@ spark.
read.
format("org.apache.hudi").
load(basePath + "/*/*/*/*").
- createOrReplaceTempView("hudi_ro_table")
+ createOrReplaceTempView("hudi_trips_snapshot")
-val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime
from hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
+val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime
from hudi_trips_snapshot order by commitTime").map(k =>
k.getString(0)).take(50)
val beginTime = commits(commits.length - 2) // commit time we are interested in
// incrementally query data
-val incViewDF = spark.
+val tripsIncrementalDF = spark.
read.
format("org.apache.hudi").
- option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
+ option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
load(basePath);
-incViewDF.registerTempTable("hudi_incr_table")
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_incr_table where fare > 20.0").show()
+tripsIncrementalDF.registerTempTable("hudi_trips_incremental")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_trips_incremental where fare > 20.0").show()
```
This will give all changes that happened after the beginTime commit with the
filter of fare > 20.0. The unique thing about this
@@ -151,13 +151,13 @@ val beginTime = "000" // Represents all commits > this
time.
val endTime = commits(commits.length - 2) // commit time we are interested in
//incrementally query data
-val incViewDF = spark.read.format("org.apache.hudi").
- option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
+val tripsPointInTimeDF = spark.read.format("org.apache.hudi").
+ option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath);
-incViewDF.registerTempTable("hudi_incr_table")
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_incr_table where fare > 20.0").show()
+tripsPointInTimeDF.registerTempTable("hudi_trips_point_in_time")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from
hudi_trips_point_in_time where fare > 20.0").show()
```
## Where to go from here?
@@ -166,8 +166,8 @@ You can also do the quickstart by [building hudi
yourself](https://github.com/ap
and using `--jars <path to
hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar`
in the spark-shell command above
instead of `--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating`
-Also, we used Spark here to show case the capabilities of Hudi. However, Hudi
can support multiple storage types/views and
-Hudi datasets can be queried from query engines like Hive, Spark, Presto and
much more. We have put together a
+Also, we used Spark here to showcase the capabilities of Hudi. However, Hudi
can support multiple table types/query types and
+Hudi tables can be queried from query engines like Hive, Spark, Presto and
much more. We have put together a
[demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that showcases all
of this on a docker based setup with all
dependent systems running locally. We recommend you replicate the same setup
and run the demo yourself, by following
steps [here](/docs/docker_demo.html) to get a taste for it. Also, if you are
looking for ways to migrate your existing data
diff --git a/docs/_docs/1_2_structure.md b/docs/_docs/1_2_structure.md
index bf8f373..e080fcd 100644
--- a/docs/_docs/1_2_structure.md
+++ b/docs/_docs/1_2_structure.md
@@ -6,16 +6,16 @@ summary: "Hudi brings stream processing to big data,
providing fresh data while
last_modified_at: 2019-12-30T15:59:57-04:00
---
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical
datasets over DFS
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
or cloud stores) and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical
tables over DFS
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
or cloud stores) and provides three types of queries.
- * **Read Optimized View** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
- * **Incremental View** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Near-Real time Table** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr.html))
+ * **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
+ * **Incremental query** - Provides a change stream out of the table to feed
downstream jobs/ETLs.
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr.html))
<figure>
<img class="docimage" src="/assets/images/hudi_intro_1.png"
alt="hudi_intro_1.png" />
</figure>
-By carefully managing how data is laid out in storage & how it’s exposed to
queries, Hudi is able to power a rich data ecosystem where external sources can
be ingested in near real-time and made available for interactive SQL Engines
like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/),
while at the same time capable of being consumed incrementally from
processing/ETL frameworks like [Hive](https://hive.apache.org/) &
[Spark](https://spark.apache.org/docs/latest/) t [...]
+By carefully managing how data is laid out in storage & how it’s exposed to
queries, Hudi is able to power a rich data ecosystem where external sources can
be ingested in near real-time and made available for interactive SQL Engines
like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/),
while at the same time capable of being consumed incrementally from
processing/ETL frameworks like [Hive](https://hive.apache.org/) &
[Spark](https://spark.apache.org/docs/latest/) t [...]
-Hudi broadly consists of a self contained Spark library to build datasets and
integrations with existing query engines for data access. See
[quickstart](/docs/quick-start-guide) for a demo.
+Hudi broadly consists of a self-contained Spark library to build tables and
integrations with existing query engines for data access. See
[quickstart](/docs/quick-start-guide) for a demo.
diff --git a/docs/_docs/1_3_use_cases.md b/docs/_docs/1_3_use_cases.md
index da45e81..25b35bb 100644
--- a/docs/_docs/1_3_use_cases.md
+++ b/docs/_docs/1_3_use_cases.md
@@ -20,7 +20,7 @@ or [complicated handcrafted merge
workflows](http://hortonworks.com/blog/four-st
For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) /
[Voldemort](http://www.project-voldemort.com/voldemort/) /
[HBase](https://hbase.apache.org/), even moderately big installations store
billions of rows.
It goes without saying that __full bulk loads are simply infeasible__ and more
efficient approaches are needed if ingestion is to keep up with the typically
high update volumes.
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps
__enforces a minimum file size on HDFS__, which improves NameNode health by
solving one of the [age old problems in Hadoop
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a
holistic way. This is all the more important for event streams, since typically
its higher volume (eg: click streams) and if not managed well, can cause
serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](https://kafka.apache.org), Hudi
helps __enforce a minimum file size on HDFS__, which improves NameNode health
by solving one of the [age old problems in Hadoop
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a
holistic way. This is all the more important for event streams, since they are
typically higher volume (eg: click streams) and, if not managed well, can cause
serious damage to your Hadoop cluster.
Across all sources, Hudi adds the much needed ability to atomically publish
new data to consumers via notion of commits, shielding them from partial
ingestion failures
@@ -32,13 +32,13 @@ This is absolutely perfect for lower scale ([relative to
Hadoop installations li
But, typically these systems end up getting abused for less interactive
queries also since data on Hadoop is intolerably stale. This leads to under
utilization & wasteful hardware/license costs.
On the other hand, interactive SQL solutions on Hadoop such as Presto &
SparkSQL excel in __queries that finish within few seconds__.
-By bringing __data freshness to a few minutes__, Hudi can provide a much
efficient alternative, as well unlock real-time analytics on __several
magnitudes larger datasets__ stored in DFS.
+By bringing __data freshness to a few minutes__, Hudi can provide a much more
efficient alternative, as well as unlock real-time analytics on __several
magnitudes larger tables__ stored in DFS.
Also, Hudi has no external dependencies (like a dedicated HBase cluster,
purely used for real-time analytics) and thus enables faster analytics on much
fresher data, without increasing the operational overhead.
## Incremental Processing Pipelines
-One fundamental ability Hadoop provides is to build a chain of datasets
derived from each other via DAGs expressed as workflows.
+One fundamental ability Hadoop provides is to build a chain of tables derived
from each other via DAGs expressed as workflows.
Workflows often depend on new data being output by multiple upstream workflows
and traditionally, availability of new data is indicated by a new DFS
Folder/Hive Partition.
Let's take a concrete example to illustrate this. An upstream workflow `U` can
create a Hive partition for every hour, with data for that hour (event_time) at
the end of each hour (processing_time), providing effective freshness of 1 hour.
Then, a downstream workflow `D`, kicks off immediately after `U` finishes, and
does its own processing for the next hour, increasing the effective latency to
2 hours.
@@ -48,8 +48,8 @@ Unfortunately, in today's post-mobile & pre-IoT world, __late
data from intermit
In such cases, the only remedy to guarantee correctness is to [reprocess the
last few
hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data)
worth of data,
over and over again each hour, which can significantly hurt the efficiency
across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data
every hour across hundreds of workflows.
-Hudi comes to the rescue again, by providing a way to consume new data
(including late data) from an upsteam Hudi dataset `HU` at a record granularity
(not folders/partitions),
-apply the processing logic, and efficiently update/reconcile late data with a
downstream Hudi dataset `HD`. Here, `HU` and `HD` can be continuously scheduled
at a much more frequent schedule
+Hudi comes to the rescue again, by providing a way to consume new data
(including late data) from an upstream Hudi table `HU` at a record granularity
(not folders/partitions),
+apply the processing logic, and efficiently update/reconcile late data with a
downstream Hudi table `HD`. Here, `HU` and `HD` can be continuously scheduled
at a much more frequent schedule
like 15 mins, providing an end-to-end latency of 30 mins at `HD`.
To achieve this, Hudi has embraced similar concepts from stream processing
frameworks like [Spark
Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations)
, Pub/Sub systems like
[Kafka](http://kafka.apache.org/documentation/#theconsumer)
@@ -64,4 +64,4 @@ For e.g, a Spark Pipeline can [determine hard braking events
on Hadoop](https://
A popular choice for this queue is Kafka and this model often results in
__redundant storage of same data on DFS (for offline analysis on computed
results) and Kafka (for dispersal)__
Once again Hudi can efficiently solve this problem, by having the Spark
Pipeline upsert output from
-each run into a Hudi dataset, which can then be incrementally tailed (just
like a Kafka topic) for new data & written into the serving store.
+each run into a Hudi table, which can then be incrementally tailed (just like
a Kafka topic) for new data & written into the serving store.
diff --git a/docs/_docs/1_5_comparison.md b/docs/_docs/1_5_comparison.md
index 3b7e739..78f2be2 100644
--- a/docs/_docs/1_5_comparison.md
+++ b/docs/_docs/1_5_comparison.md
@@ -44,7 +44,7 @@ just for analytics. Finally, HBase does not support
incremental processing primi
## Stream Processing
A popular question, we get is : "How does Hudi relate to stream processing
systems?", which we will try to answer here. Simply put, Hudi can integrate with
-batch (`copy-on-write storage`) and streaming (`merge-on-read storage`) jobs
of today, to store the computed results in Hadoop. For Spark apps, this can
happen via direct
+batch (`copy-on-write table`) and streaming (`merge-on-read table`) jobs of
today, to store the computed results in Hadoop. For Spark apps, this can happen
via direct
integration of Hudi library with Spark/Spark streaming DAGs. In case of
Non-Spark processing systems (eg: Flink, Hive), the processing can be done in
the respective systems
and later sent into a Hudi table via a Kafka topic/DFS intermediate file. At a
more conceptual level, data processing
pipelines just consist of three components : `source`, `processing`, `sink`,
with users ultimately running queries against the sink to use the results of
the pipeline.
diff --git a/docs/_docs/2_1_concepts.md b/docs/_docs/2_1_concepts.md
index 66205a2..c99aa41 100644
--- a/docs/_docs/2_1_concepts.md
+++ b/docs/_docs/2_1_concepts.md
@@ -1,24 +1,24 @@
---
title: "Concepts"
-keywords: hudi, design, storage, views, timeline
+keywords: hudi, design, table, queries, timeline
permalink: /docs/concepts.html
summary: "Here we introduce some basic concepts & give a broad technical
overview of Hudi"
toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives
over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives
over Hadoop compatible storage
- * Upsert (how do I change the dataset?)
- * Incremental pull (how do I fetch data that changed?)
+ * Update/Delete Records (how do I change records in a table?)
+ * Change Streams (how do I fetch records that changed?)
In this section, we will discuss key concepts & terminologies that are
important to understand, to be able to effectively use these primitives.
## Timeline
-At its core, Hudi maintains a `timeline` of all actions performed on the
dataset at different `instants` of time that helps provide instantaneous views
of the dataset,
+At its core, Hudi maintains a `timeline` of all actions performed on the table
at different `instants` of time that helps provide instantaneous views of the
table,
while also efficiently supporting retrieval of data in the order of arrival. A
Hudi instant consists of the following components
- * `Action type` : Type of action performed on the dataset
+ * `Instant action` : Type of action performed on the table
* `Instant time` : Instant time is typically a timestamp (e.g:
20190117010349), which monotonically increases in the order of action's begin
time.
* `state` : current state of the instant
@@ -26,12 +26,12 @@ Hudi guarantees that the actions performed on the timeline
are atomic & timeline
Key actions performed include
- * `COMMITS` - A commit denotes an **atomic write** of a batch of records into
a dataset.
- * `CLEANS` - Background activity that gets rid of older versions of files in
the dataset, that are no longer needed.
- * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of
records into a MergeOnRead storage type of dataset, where some/all of the data
could be just written to delta logs.
+ * `COMMITS` - A commit denotes an **atomic write** of a batch of records into
a table.
+ * `CLEANS` - Background activity that gets rid of older versions of files in
the table that are no longer needed.
+ * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of
records into a MergeOnRead type table, where some/all of the data could be
just written to delta logs.
* `COMPACTION` - Background activity to reconcile differential data
structures within Hudi e.g: moving updates from row based log files to columnar
formats. Internally, compaction manifests as a special commit on the timeline
* `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled
back, removing any partial files produced during such a write
- * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will
not delete them. It helps restore the dataset to a point on the timeline, in
case of disaster/data recovery scenarios.
+ * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will
not delete them. It helps restore the table to a point on the timeline, in case
of disaster/data recovery scenarios.
Any given instant can be
in one of the following states
@@ -44,7 +44,7 @@ in one of the following states
<img class="docimage" src="/assets/images/hudi_timeline.png"
alt="hudi_timeline.png" />
</figure>
-Example above shows upserts happenings between 10:00 and 10:20 on a Hudi
dataset, roughly every 5 mins, leaving commit metadata on the Hudi timeline,
along
+The example above shows upserts happening between 10:00 and 10:20 on a Hudi
table, roughly every 5 mins, leaving commit metadata on the Hudi timeline, along
with other background cleaning/compactions. One key observation to make is
that the commit time indicates the `arrival time` of the data (10:20AM), while
the actual data
organization reflects the actual time or `event time` the data was intended
for (hourly buckets from 07:00). These are two key concepts when reasoning
about tradeoffs between latency and completeness of data.
@@ -53,37 +53,38 @@ With the help of the timeline, an incremental query
attempting to get all new da
only the changed files without say scanning all the time buckets > 07:00.
## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on
DFS. Dataset is broken up into partitions, which are folders containing data
files for that partition,
+Hudi organizes a table into a directory structure under a `basepath` on DFS.
The table is broken up into partitions, which are folders containing data files
for that partition,
very similar to Hive tables. Each partition is uniquely identified by its
`partitionpath`, which is relative to the basepath.
Within each partition, files are organized into `file groups`, uniquely
identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`)
produced at a certain commit/compaction instant time,
+`file slices`, where each slice contains a base file (`*.parquet`) produced at
a certain commit/compaction instant time,
along with a set of log files (`*.log.*`) that contain inserts/updates to the
base file since the base file was produced.
Hudi adopts an MVCC design, where compaction action merges logs and base files
to produce new file slices and cleaning action gets rid of
unused/older file slices to reclaim space on DFS.
-Hudi provides efficient upserts, by mapping a given hoodie key (record key +
partition path) consistently to a file group, via an indexing mechanism.
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key +
partition path) consistently to a file id, via an indexing mechanism.
This mapping between record key and file group/file id never changes once the
first version of a record has been written to a file. In short, the
mapped file group contains all versions of a group of records.
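A purely conceptual sketch of that mapping (illustrative types only, not a
Hudi API):

```scala
// Conceptual model: the index behaves like a stable map from hoodie key
// (record key + partition path) to the file id it was first written to.
case class HoodieKey(recordKey: String, partitionPath: String)

val index: Map[HoodieKey, String] = Map(
  HoodieKey("uuid-1", "americas/brazil/sao_paulo") -> "fileId-1", // assigned on first write
  HoodieKey("uuid-2", "americas/brazil/sao_paulo") -> "fileId-2"  // never remapped afterwards
)
```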
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how
the above primitives and timeline activities are implemented on top of such
organization (i.e how data is written).
-In turn, `views` define how the underlying data is exposed to the queries (i.e
how data is read).
+## Table Types & Queries
+Hudi table types define how data is indexed & laid out on the DFS and how the
above primitives and timeline activities are implemented on top of such
organization (i.e how data is written).
+In turn, `query types` define how the underlying data is exposed to the
queries (i.e how data is read).
-| Storage Type | Supported Views |
+| Table Type | Supported Query types |
|-------------- |------------------|
-| Copy On Write | Read Optimized + Incremental |
-| Merge On Read | Read Optimized + Incremental + Near Real-time |
+| Copy On Write | Snapshot Queries + Incremental Queries |
+| Merge On Read | Snapshot Queries + Incremental Queries + Read Optimized Queries |
-### Storage Types
-Hudi supports the following storage types.
+### Table Types
+Hudi supports the following table types.
- - [Copy On Write](#copy-on-write-storage) : Stores data using exclusively
columnar file formats (e.g parquet). Updates simply version & rewrite the files
by performing a synchronous merge during write.
- - [Merge On Read](#merge-on-read-storage) : Stores data using a combination
of columnar (e.g parquet) + row based (e.g avro) file formats. Updates are
logged to delta files & later compacted to produce new versions of columnar
files synchronously or asynchronously.
+ - [Copy On Write](#copy-on-write-table) : Stores data using exclusively
columnar file formats (e.g parquet). Updates simply version & rewrite the files
by performing a synchronous merge during write.
+ - [Merge On Read](#merge-on-read-table) : Stores data using a combination of
columnar (e.g parquet) + row based (e.g avro) file formats. Updates are logged
to delta files & later compacted to produce new versions of columnar files
synchronously or asynchronously.
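As a hedged sketch, the table type is chosen at write time through a
datasource option (assuming `TABLE_TYPE_OPT_KEY` and `MOR_TABLE_TYPE_OPT_VAL`
are the 0.5.1 `DataSourceWriteOptions` constants, with copy on write as the
default):

```scala
// Hypothetical write that creates a merge on read table; field names follow
// the quickstart's sample schema and are illustrative only.
inputDF.write.format("org.apache.hudi").
  option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, "hudi_trips_mor").
  mode(Append).
  save(basePath)
```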
-Following table summarizes the trade-offs between these two storage types
+The following table summarizes the trade-offs between these two table types
-| Trade-off | CopyOnWrite | MergeOnRead |
+| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | Higher | Lower |
| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log) |
@@ -91,31 +92,31 @@ Following table summarizes the trade-offs between these two
storage types
| Write Amplification | Higher | Lower (depending on compaction strategy) |
-### Views
-Hudi supports the following views of stored data
+### Query types
+Hudi supports the following query types
- - **Read Optimized View** : Queries on this view see the latest snapshot of
the dataset as of a given commit or compaction action.
- This view exposes only the base/columnar files in latest file slices to
the queries and guarantees the same columnar query performance compared to a
non-hudi columnar dataset.
- - **Incremental View** : Queries on this view only see new data written to
the dataset, since a given commit/compaction. This view effectively provides
change streams to enable incremental data pipelines.
- - **Realtime View** : Queries on this view see the latest snapshot of dataset
as of a given delta commit action. This view provides near-real time datasets
(few mins)
- by merging the base and delta files of the latest file slice on-the-fly.
+ - **Snapshot Queries** : Queries see the latest snapshot of the table as of a
given commit or compaction action. In the case of a merge on read table, it
exposes near-real time data (few mins) by merging
+ the base and delta files of the latest file slice on-the-fly. For a copy on
write table, it provides a drop-in replacement for existing parquet tables,
while providing upsert/delete and other write side features.
+ - **Incremental Queries** : Queries only see new data written to the table,
since a given commit/compaction. This effectively provides change streams to
enable incremental data pipelines.
+ - **Read Optimized Queries** : Queries see the latest snapshot of the table as
of a given commit/compaction action. Exposes only the base/columnar files in the
latest file slices and guarantees the
+ same columnar query performance compared to a non-hudi columnar table.
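On the read path, the query type is likewise a datasource option. A minimal
sketch (`QUERY_TYPE_OPT_KEY` and `QUERY_TYPE_INCREMENTAL_OPT_VAL` appear in
the quickstart above; the snapshot constant name is assumed to follow the
same pattern):

```scala
// Hypothetical explicit snapshot query (snapshot is also the default query type).
val snapshotDF = spark.read.format("org.apache.hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(basePath + "/*/*/*/*")
```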
-Following table summarizes the trade-offs between the different views.
+The following table summarizes the trade-offs between the different query types.
-| Trade-off | ReadOptimized | RealTime |
-|-------------- |------------------| ------------------|
-| Data Latency | Higher | Lower |
-| Query Latency | Lower (raw columnar performance) | Higher (merge columnar +
row based delta) |
+| Trade-off | Snapshot | Read Optimized |
+|-------------- |-------------| ------------------|
+| Data Latency | Lower | Higher |
+| Query Latency | Higher (merge base / columnar file + row based delta / log files) | Lower (raw base / columnar file performance) |
-## Copy On Write Storage
+## Copy On Write Table
-File slices in Copy-On-Write storage only contain the base/columnar file and
each commit produces new versions of base files.
+File slices in a Copy-On-Write table only contain the base/columnar file and
each commit produces new versions of base files.
In other words, we implicitly compact on every commit, such that only columnar
data exists. As a result, the write amplification
(number of bytes written for 1 byte of incoming data) is much higher, where
read amplification is zero.
This is a much desired property for analytical workloads, which are
predominantly read-heavy.
-Following illustrates how this works conceptually, when data written into
copy-on-write storage and two queries running on top of it.
+The following illustrates how this works conceptually, when data is written into a
copy-on-write table and two queries run on top of it.
<figure>
@@ -125,26 +126,26 @@ Following illustrates how this works conceptually, when
data written into copy-
As data gets written, updates to existing file groups produce a new slice for
that file group stamped with the commit instant time,
while inserts allocate a new file group and write its first slice for that
file group. These file slices and their commit instant times are color coded
above.
-SQL queries running against such a dataset (eg: `select count(*)` counting the
total records in that partition), first checks the timeline for the latest
commit
+SQL queries running against such a table (eg: `select count(*)` counting the
total records in that partition) first check the timeline for the latest
commit
and filters all but latest file slices of each file group. As you can see, an
old query does not see the current inflight commit's files color coded in pink,
but a new query starting after the commit picks up the new data. Thus queries
are immune to any write failures/partial writes and only run on committed data.
-The intention of copy on write storage, is to fundamentally improve how
datasets are managed today through
+The intention of the copy on write table is to fundamentally improve how tables
are managed today through
- First class support for atomically updating data at file-level, instead of
rewriting whole tables/partitions
 - Ability to incrementally consume changes, as opposed to wasteful scans or
fumbling with heuristics
- - Tight control file sizes to keep query performance excellent (small files
hurt query performance considerably).
+ - Tight control of file sizes to keep query performance excellent (small
files hurt query performance considerably).
-## Merge On Read Storage
+## Merge On Read Table
-Merge on read storage is a superset of copy on write, in the sense it still
provides a read optimized view of the dataset via the Read Optmized table.
-Additionally, it stores incoming upserts for each file group, onto a row based
delta log, that enables providing near real-time data to the queries
- by applying the delta log, onto the latest version of each file id on-the-fly
during query time. Thus, this storage type attempts to balance read and write
amplication intelligently, to provide near real-time queries.
-The most significant change here, would be to the compactor, which now
carefully chooses which delta logs need to be compacted onto
-their columnar base file, to keep the query performance in check (larger delta
logs would incur longer merge times with merge data on query side)
+A merge on read table is a superset of copy on write, in the sense that it still
supports read optimized queries of the table by exposing only the base/columnar
files in the latest file slices.
+Additionally, it stores incoming upserts for each file group, onto a row based
delta log, to support snapshot queries by applying the delta log,
+onto the latest version of each file id on-the-fly during query time. Thus,
this table type attempts to balance read and write amplification intelligently,
to provide near real-time data.
+The most significant change here, would be to the compactor, which now
carefully chooses which delta log files need to be compacted onto
+their columnar base file, to keep the query performance in check (larger delta
log files would incur longer merge times with merge data on query side)
-Following illustrates how the storage works, and shows queries on both
near-real time table and read optimized table.
+The following illustrates how the table works, and shows the two types of queries -
snapshot queries and read optimized queries.
<figure>
<img class="docimage" src="/assets/images/hudi_mor.png" alt="hudi_mor.png"
style="max-width: 100%" />
@@ -152,20 +153,20 @@ Following illustrates how the storage works, and shows
queries on both near-real
There are a lot of interesting things happening in this example, which bring out
the subtleties in the approach.
- - We now have commits every 1 minute or so, something we could not do in the
other storage type.
- - Within each file id group, now there is an delta log, which holds incoming
updates to records in the base columnar files. In the example, the delta logs
hold
+ - We now have commits every 1 minute or so, something we could not do in the
other table type.
+ - Within each file id group, now there is a delta log file, which holds
incoming updates to records in the base columnar files. In the example, the
delta log files hold
all the data from 10:05 to 10:10. The base columnar files are still versioned
with the commit, as before.
- Thus, if one were to simply look at base files alone, then the storage layout
looks exactly like a copy on write table.
+ Thus, if one were to simply look at base files alone, then the table layout
looks exactly like a copy on write table.
- A periodic compaction process reconciles these changes from the delta log
and produces a new version of base file, just like what happened at 10:05 in
the example.
- - There are two ways of querying the same underlying storage: ReadOptimized
(RO) Table and Near-Realtime (RT) table, depending on whether we chose query
performance or freshness of data.
- - The semantics around when data from a commit is available to a query
changes in a subtle way for the RO table. Note, that such a query
- running at 10:10, wont see data after 10:05 above, while a query on the RT
table always sees the freshest data.
+ - There are two ways of querying the same underlying table: Read Optimized
querying and Snapshot querying, depending on whether we chose query performance
or freshness of data.
+ - The semantics around when data from a commit is available to a query
change in a subtle way for a read optimized query. Note that such a query
+ running at 10:10 won't see data after 10:05 above, while a snapshot query
always sees the freshest data.
 - When we trigger compaction & what it decides to compact holds the key to
solving these hard problems. By implementing a compacting
- strategy, where we aggressively compact the latest partitions compared to
older partitions, we could ensure the RO Table sees data
+ strategy, where we aggressively compact the latest partitions compared to
older partitions, we could ensure the read optimized queries see data
published within X minutes in a consistent fashion.
-The intention of merge on read storage is to enable near real-time processing
directly on top of DFS, as opposed to copying
+The intention of the merge on read table is to enable near real-time processing
directly on top of DFS, as opposed to copying
data out to specialized systems, which may not be able to handle the data
volume. There are also a few secondary side benefits to
-this storage such as reduced write amplification by avoiding synchronous merge
of data, i.e, the amount of data written per 1 bytes of data in a batch
+this table, such as reduced write amplification by avoiding synchronous merge
of data, i.e., the amount of data written per 1 byte of data in a batch
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 832daa6..b407111 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -1,5 +1,5 @@
---
-title: Writing Hudi Datasets
+title: Writing Hudi Tables
keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
permalink: /docs/writing_data.html
summary: In this page, we will discuss some available tools for incrementally
ingesting & storing data.
@@ -7,24 +7,24 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-In this section, we will cover ways to ingest new changes from external
sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer)
tool, as well as
-speeding up large Spark jobs via upserts using the [Hudi
datasource](#datasource-writer). Such datasets can then be
[queried](/docs/querying_data.html) using various query engines.
+In this section, we will cover ways to ingest new changes from external
sources or even other Hudi tables using the [DeltaStreamer](#deltastreamer)
tool, as well as
+speeding up large Spark jobs via upserts using the [Hudi
datasource](#datasource-writer). Such tables can then be
[queried](/docs/querying_data.html) using various query engines.
## Write Operations
Before that, it may be helpful to understand the 3 different write operations
provided by the Hudi datasource or the delta streamer tool and how best to leverage
them. These operations
-can be chosen/changed across each commit/deltacommit issued against the
dataset.
+can be chosen/changed across each commit/deltacommit issued against the table.
- **UPSERT** : This is the default operation where the input records are
first tagged as inserts or updates by looking up the index and
the records are ultimately written after heuristics are run to determine how
best to pack them on storage to optimize for things like file sizing.
This operation is recommended for use-cases like database change capture
where the input almost certainly contains updates.
- **INSERT** : This operation is very similar to upsert in terms of
heuristics/file sizing but completely skips the index lookup step. Thus, it can
be a lot faster than upserts
- for use-cases like log de-duplication (in conjunction with options to filter
duplicates mentioned below). This is also suitable for use-cases where the
dataset can tolerate duplicates, but just
+ for use-cases like log de-duplication (in conjunction with options to filter
duplicates mentioned below). This is also suitable for use-cases where the
table can tolerate duplicates, but just
need the transactional writes/incremental pull/storage management
capabilities of Hudi.
- **BULK_INSERT** : Both upsert and insert operations keep input records in
memory to speed up storage heuristics computations (among other things)
and thus can be cumbersome for
- initial loading/bootstrapping a Hudi dataset at first. Bulk insert provides
the same semantics as insert, while implementing a sort-based data writing
algorithm, which can scale very well for several hundred TBs
+ initial loading/bootstrapping a Hudi table at first. Bulk insert provides the
same semantics as insert, while implementing a sort-based data writing
algorithm, which can scale very well for several hundred TBs
of initial load. However, this just does a best-effort job at sizing files vs
guaranteeing file sizes like inserts/upserts do.
@@ -100,8 +100,8 @@ Usage: <main class> [options]
spark master to use.
Default: local[2]
* --target-base-path
- base path for the target Hudi dataset. (Will be created if did not
- exist first time around. If exists, expected to be a Hudi dataset)
+ base path for the target Hudi table. (Will be created if did not
+ exist first time around. If exists, expected to be a Hudi table)
* --target-table
name of the target table in Hive
--transformer-class
@@ -129,15 +129,16 @@ and then ingest it as follows.
--schemaprovider-class
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field impresssiontime \
- --target-base-path file:///tmp/hudi-deltastreamer-op --target-table
uber.impressions \
+ --target-base-path file:\/\/\/tmp/hudi-deltastreamer-op \
+ --target-table uber.impressions \
--op BULK_INSERT
```
-In some cases, you may want to migrate your existing dataset into Hudi
beforehand. Please refer to [migration guide](/docs/migration_guide.html).
+In some cases, you may want to migrate your existing table into Hudi
beforehand. Please refer to [migration guide](/docs/migration_guide.html).
## Datasource Writer
-The `hudi-spark` module offers the DataSource API to write (and also read) any
data frame into a Hudi dataset.
+The `hudi-spark` module offers the DataSource API to write (and also read) any
data frame into a Hudi table.
Following is how we can upsert a dataframe, while specifying the field names
that need to be used
for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey
=> timestamp`
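The snippet itself falls outside this diff's context window; the call it
describes looks roughly like the following sketch, assuming the 0.5.x
`DataSourceWriteOptions` and `HoodieWriteConfig` constants are in scope:

```scala
import org.apache.spark.sql.SaveMode

// Sketch of the upsert described above, using the field names from the
// surrounding text; tableName/basePath are as defined for the table.
inputDF.write.format("org.apache.hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "timestamp").
  option(RECORDKEY_FIELD_OPT_KEY, "_row_key").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partition").
  option(TABLE_NAME, tableName).
  mode(SaveMode.Append).
  save(basePath)
```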
@@ -156,41 +157,32 @@ inputDF.write()
## Syncing to Hive
-Both tools above support syncing of the dataset's latest schema to Hive
metastore, such that queries can pick up new columns and partitions.
+Both tools above support syncing of the table's latest schema to Hive
metastore, such that queries can pick up new columns and partitions.
In case it's preferable to run this from commandline or in an independent jvm,
Hudi provides a `HiveSyncTool`, which can be invoked as below,
-once you have built the hudi-hive module.
+once you have built the hudi-hive module. The following is how we sync the
table written above via the Datasource Writer to the Hive metastore.
+
+```java
+cd hudi-hive
+./run_sync_tool.sh --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive
--pass hive --partitioned-by partition --base-path <basePath> --database
default --table <tableName>
+```
+
+Starting with Hudi version 0.5.1, the read optimized version of merge-on-read
tables is suffixed '_ro' by default. For backwards compatibility with older
Hudi versions,
+an optional HiveSyncConfig - `--skip-ro-suffix`, has been provided to turn off
'_ro' suffixing if desired. Explore other hive sync options using the following
command:
```java
cd hudi-hive
./run_sync_tool.sh
[hudi-hive]$ ./run_sync_tool.sh --help
-Usage: <main class> [options]
- Options:
- * --base-path
- Basepath of Hudi dataset to sync
- * --database
- name of the target database in Hive
- --help, -h
- Default: false
- * --jdbc-url
- Hive jdbc connect url
- * --use-jdbc
- Whether to use jdbc connection or hive metastore (via thrift)
- * --pass
- Hive password
- * --table
- name of the target table in Hive
- * --user
- Hive username
```
## Deletes
-Hudi supports implementing two types of deletes on data stored in Hudi
datasets, by enabling the user to specify a different record payload
implementation.
+Hudi supports implementing two types of deletes on data stored in Hudi tables,
by enabling the user to specify a different record payload implementation.
+For more info refer to [Delete support in
Hudi](https://cwiki.apache.org/confluence/x/6IqvC).
 - **Soft Deletes** : With soft deletes, the user wants to retain the key but just
null out the values for all other fields.
- This can be simply achieved by ensuring the appropriate fields are nullable
in the dataset schema and simply upserting the dataset after setting these
fields to null.
- - **Hard Deletes** : A stronger form of delete is to physically remove any
trace of the record from the dataset. This can be achieved by issuing an upsert
with a custom payload implementation
+ This can be achieved by ensuring the appropriate fields are nullable
in the table schema and simply upserting the table after setting these fields
to null.
+ - **Hard Deletes** : A stronger form of delete is to physically remove any
trace of the record from the table. This can be achieved by issuing an upsert
with a custom payload implementation
via either DataSource or DeltaStreamer which always returns Optional.Empty as
the combined value. Hudi ships with a built-in
`org.apache.hudi.EmptyHoodieRecordPayload` class that does exactly this.
```java
@@ -203,14 +195,14 @@ Hudi supports implementing two types of deletes on data
stored in Hudi datasets,
```
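A hedged sketch of both delete styles via the datasource (the payload option
name `PAYLOAD_CLASS_OPT_KEY`, the input frame `toDeleteDF` and the column
names below are assumptions for illustration):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

// Soft delete: null out the non-key fields (hypothetical columns), then
// upsert the result back with the usual datasource options.
val softDeleteDF = toDeleteDF.
  withColumn("rider", lit(null).cast("string")).
  withColumn("driver", lit(null).cast("string"))

// Hard delete: upsert with the empty payload so matching records are
// physically removed from the table.
toDeleteDF.write.format("org.apache.hudi").
  option(PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(SaveMode.Append).
  save(basePath)
```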
-## Storage Management
+## Optimized DFS Access
-Hudi also performs several key storage management functions on the data stored
in a Hudi dataset. A key aspect of storing data on DFS is managing file sizes
and counts
+Hudi also performs several key storage management functions on the data stored
in a Hudi table. A key aspect of storing data on DFS is managing file sizes and
counts
and reclaiming storage space. For e.g, HDFS is infamous for its handling of
small files, which exerts memory/RPC pressure on the Name Node and can
potentially destabilize
the entire cluster. In general, query engines provide much better performance
on adequately sized columnar files, since they can effectively amortize cost of
obtaining
column statistics etc. Even on some cloud data stores, there is often a cost to
listing directories with a large number of small files.
-Here are some ways to efficiently manage the storage of your Hudi datasets.
+Here are some ways to efficiently manage the storage of your Hudi tables.
- The [small file handling
feature](/docs/configurations.html#compactionSmallFileSize) in Hudi, profiles
incoming workload
and distributes inserts to existing file groups instead of creating new
file groups, which can lead to small files.
@@ -219,4 +211,4 @@ Here are some ways to efficiently manage the storage of
your Hudi datasets.
such that a sufficient number of inserts are grouped into the same file group, ultimately resulting in well-sized base files.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
once created cannot be deleted, but simply expanded as explained before.
- - For workloads with heavy updates, the [merge-on-read
storage](/docs/concepts.html#merge-on-read-storage) provides a nice mechanism
for ingesting quickly into smaller files and then later merging them into
larger base files via compaction.
+ - For workloads with heavy updates, the [merge-on-read
table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for
ingesting quickly into smaller files and then later merging them into larger
base files via compaction.
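As a concrete sketch, these knobs can be passed through the datasource at write time. The property names `hoodie.parquet.small.file.limit` and `hoodie.copyonwrite.insert.split.size` are our assumption of the keys behind the small file handling and insert split size configs linked above; the values are illustrative only:

```scala
import org.apache.spark.sql.SaveMode

// A sketch only: assumes inputDF and basePath from the earlier write examples,
// and that these write-client properties pass through the datasource unchanged.
inputDF.write.format("org.apache.hudi")
  // files below this size are considered "small" and are topped up with new inserts
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString)
  // caps how many inserts are grouped into a single new file group
  .option("hoodie.copyonwrite.insert.split.size", "500000")
  // for bulk_insert, the shuffle parallelism fixes the initial file group count
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")
  .mode(SaveMode.Append)
  .save(basePath)
```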
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index be17b21..8c6d357 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -1,5 +1,5 @@
---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
keywords: hudi, hive, spark, sql, presto
permalink: /docs/querying_data.html
summary: In this page, we go over how to enable SQL queries on Hudi built
tables.
@@ -7,41 +7,46 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-Conceptually, Hudi stores data physically once on DFS, while providing 3
logical views on top, as explained [before](/docs/concepts.html#views).
-Once the dataset is synced to the Hive metastore, it provides external Hive
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines
like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3
different ways of querying, as explained
[before](/docs/concepts.html#query-types).
+Once the table is synced to the Hive metastore, it provides external Hive
tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines
like Hive, Spark and Presto.
-Specifically, there are two Hive tables named off [table
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write.
-For e.g, if `table name = hudi_tbl`, then we get
+Specifically, the following Hive tables are registered based off the [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.
- - `hudi_tbl` realizes the read optimized view of the dataset backed by
`HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset backed by
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
+ - `hudi_trips` supports snapshot and incremental querying of the table, backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot and incremental querying (providing near-real time data) of the table, backed by `HoodieParquetRealtimeInputFormat`, exposing a merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table, backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
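For example, with the MERGE_ON_READ registration above, the two tables can be queried side by side from a Spark shell (a sketch; the `datestr` partition column follows the sample query used later on this page):

```scala
// read optimized: serves only compacted, columnar base files
spark.sql("select count(*) from hudi_trips_ro where datestr = '2016-10-02'").show()
// snapshot: merges base and log files for near-real time results
spark.sql("select count(*) from hudi_trips_rt where datestr = '2016-10-02'").show()
```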
As discussed in the concepts section, the one key primitive needed for
[incrementally
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi
datasets can be pulled incrementally, which means you can get ALL and ONLY the
updated & new rows
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi
tables can be pulled incrementally, which means you can get ALL and ONLY the
updated & new rows
since a specified instant time. This, together with upserts, is particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out
deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is
realized by querying one of the tables above,
-with special configurations that indicates to query planning that only
incremental data needs to be fetched out of the dataset.
+joined with other tables (dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. The incremental view is realized by querying one of the tables above,
+with special configurations that indicate to query planning that only incremental data needs to be fetched out of the table.
-In sections, below we will discuss in detail how to access all the 3 views on
each query engine.
+In the sections below, we will discuss how to access these query types from different query engines.
## Hive
-In order for Hive to recognize Hudi datasets and query correctly, the
HiveServer2 needs to be provided with the
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
+In order for Hive to recognize Hudi tables and query them correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
in its [aux jars
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
This will ensure the input format
classes, with their dependencies, are available for query planning & execution.
-### Read Optimized table
+### Read optimized query
In addition to the setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified name of the
inputformat, `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, the `hive.tez.input.format` additionally needs to be set
to `org.apache.hadoop.hive.ql.io.HiveInputFormat`.
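For a beeline session, those two settings translate to:

```java
set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```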
-### Real time table
+### Snapshot query
In addition to installing the hive bundle jar on the HiveServer2, it needs to
be put on the hadoop/hive installation across the cluster, so that
queries can pick up the custom RecordReader as well.
-### Incremental Pulling
-
+### Incremental query
`HiveIncrementalPuller` allows incrementally extracting changes from large fact/dimension tables via HiveQL, combining the benefits of Hive (reliable processing of complex SQL queries) and
incremental primitives (speeding up queries by pulling tables incrementally instead of scanning them fully). The tool uses Hive JDBC to run the hive query and saves its results in a temp table
that can later be upserted. The upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
@@ -67,12 +72,12 @@ The following are the configuration options for
HiveIncrementalPuller
|help| Utility Help | |
-Setting fromCommitTime=0 and maxCommits=-1 will pull in the entire source
dataset and can be used to initiate backfills. If the target dataset is a Hudi
dataset,
-then the utility can determine if the target dataset has no commits or is
behind more than 24 hour (this is configurable),
+Setting fromCommitTime=0 and maxCommits=-1 will pull in the entire source table and can be used to initiate backfills. If the target table is a Hudi table,
+then the utility can determine whether the target table has no commits or is behind by more than 24 hours (this is configurable), in which case
it will automatically use the backfill configuration, since applying the last 24 hours incrementally could take more time than doing a backfill. The current limitation of the tool
-is the lack of support for self-joining the same table in mixed mode (normal
and incremental modes).
+is the lack of support for self-joining the same table in mixed mode (snapshot
and incremental modes).
-**NOTE on Hive queries that are executed using Fetch task:**
+**NOTE on Hive incremental queries that are executed using Fetch task:**
Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie
metadata can be listed in
every such listStatus() call. In order to avoid this, it might be useful to
disable fetch tasks
using the hive session property for incremental queries: `set
hive.fetch.task.conversion=none;` This
@@ -81,16 +86,16 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
## Spark
-Spark provides much easier deployment & management of Hudi jars and bundles
into jobs/notebooks. At a high level, there are two ways to access Hudi
datasets in Spark.
+Spark provides much easier deployment & management of Hudi jars and bundles
into jobs/notebooks. At a high level, there are two ways to access Hudi tables
in Spark.
 - **Hudi DataSource** : Supports read optimized querying and incremental pulls, similar to how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three views, including the real time
view, relying on the custom Hudi input formats again like Hive.
+ - **Read as Hive tables** : Supports all three query types, including snapshot queries, relying on the custom Hudi input formats, again like Hive.
In general, your spark job needs a dependency on `hudi-spark`, or the `hudi-spark-bundle-x.y.z.jar` needs to be on the class path of the driver & executors (hint: use the `--jars` argument).
-### Read Optimized table
+### Read optimized querying
-To read RO table as a Hive table using SparkSQL, simply push a path filter
into sparkContext as follows.
+Pushing a path filter into the sparkContext as follows allows read optimized querying of a Hudi Hive table using SparkSQL.
This method retains Spark built-in optimizations for reading Parquet files
like vectorized reading on Hudi tables.
```scala
@@ -101,22 +106,23 @@ If you prefer to glob paths on DFS via the datasource,
you can simply do somethi
```java
Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
-// pass any path glob, can include hudi & non-hudi datasets
+// pass any path glob, can include hudi & non-hudi tables
.load("/glob/path/pattern");
```
-### Real time table {#spark-rt-view}
-Currently, real time table can only be queried as a Hive table in Spark. In
order to do this, set `spark.sql.hive.convertMetastoreParquet=false`, forcing
Spark to fallback
+### Snapshot querying {#spark-snapshot-querying}
+Currently, near-real time data can only be queried as a Hive table in Spark using the snapshot query mode. In order to do this, set `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fall back
to using the Hive Serde to read the data (planning/execution still happens in Spark).
```java
-$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path
/etc/hive/conf --packages com.databricks:spark-avro_2.11:4.0.0 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
+$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path
/etc/hive/conf --packages org.apache.spark:spark-avro_2.11:2.4.4 --conf
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory
7g --executor-memory 2g --master yarn-client
-scala> sqlContext.sql("select count(*) from hudi_rt where datestr =
'2016-10-02'").show()
+scala> sqlContext.sql("select count(*) from hudi_trips_rt where datestr =
'2016-10-02'").show()
```
-### Incremental Pulling {#spark-incr-pull}
-The `hudi-spark` module offers the DataSource API, a more elegant way to pull
data from Hudi dataset and process it via Spark.
+### Incremental pulling {#spark-incr-pull}
+The `hudi-spark` module offers the DataSource API, a more elegant way to pull data from a Hudi table and process it via Spark.
A sample incremental pull, which obtains all records written since `beginInstantTime`, looks like the following.
```java
@@ -126,7 +132,7 @@ A sample incremental pull, that will obtain all records
written since `beginInst
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
<beginInstantTime>)
- .load(tablePath); // For incremental view, pass in the root/base path of
dataset
+ .load(tablePath); // For incremental view, pass in the root/base path of the table
```
Please refer to the [configurations](/docs/configurations.html#spark-datasource) section to view all datasource options.
@@ -137,10 +143,10 @@ Additionally, `HoodieReadClient` offers the following
functionality using Hudi's
|-------|--------|
| read(keys) | Read out the data corresponding to the keys as a DataFrame,
using Hudi's own index for faster lookup |
| filterExists() | Filter out already existing records from the provided
RDD[HoodieRecord]. Useful for de-duplication |
-| checkExists(keys) | Check if the provided keys exist in a Hudi dataset |
+| checkExists(keys) | Check if the provided keys exist in a Hudi table |
## Presto
-Presto is a popular query engine, providing interactive query performance.
Hudi RO tables can be queries seamlessly in Presto.
+Presto is a popular query engine, providing interactive query performance.
Presto currently supports only read optimized querying on Hudi tables.
This requires the `hudi-presto-bundle` jar to be placed into
`<presto_install>/plugin/hive-hadoop2/`, across the installation.
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index e68c3e0..d62f179 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -7,14 +7,14 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-This page covers the different ways of configuring your job to write/read Hudi
datasets.
+This page covers the different ways of configuring your job to write/read Hudi
tables.
At a high level, you can control behaviour at a few levels.
-- **[Spark Datasource Configs](#spark-datasource)** : These configs control
the Hudi Spark Datasource, providing ability to define keys/partitioning, pick
out the write operation, specify how to merge records or choosing view type to
read.
+- **[Spark Datasource Configs](#spark-datasource)** : These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick out the write operation, specify how to merge records, or choose the query type to read.
- **[WriteClient Configs](#writeclient-configs)** : Internally, the Hudi datasource uses an RDD-based `HoodieWriteClient` API to actually perform writes to storage. These configs provide deep control over lower level aspects like
file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
- **[RecordPayload Config](#PAYLOAD_CLASS_OPT_KEY)** : This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on the incoming new record and
- stored old record. Hudi provides default implementations such as
`OverwriteWithLatestAvroPayload` which simply update storage with the
latest/last-written record.
+ the stored old record. Hudi provides default implementations such as `OverwriteWithLatestAvroPayload`, which simply updates the table with the latest/last-written record.
This can be overridden with a custom class extending the `HoodieRecordPayload` class, at both the datasource and WriteClient levels.
## Talking to Cloud Storage
@@ -49,20 +49,20 @@ inputDF.write()
.save(basePath);
```
-Options useful for writing datasets via `write.format.option(...)`
+Options useful for writing tables via `write.format.option(...)`
#### TABLE_NAME_OPT_KEY {#TABLE_NAME_OPT_KEY}
Property: `hoodie.datasource.write.table.name` [Required]<br/>
- <span style="color:grey">Hive table name, to register the dataset
into.</span>
+ <span style="color:grey">Hive table name under which to register the table.</span>
#### OPERATION_OPT_KEY {#OPERATION_OPT_KEY}
Property: `hoodie.datasource.write.operation`, Default: `upsert`<br/>
<span style="color:grey">whether to do upsert, insert or bulkinsert for the
write operation. Use `bulkinsert` to load new data into a table, and there on
use `upsert`/`insert`.
bulk insert uses a disk based write path to scale to load large inputs
without need to cache it.</span>
-#### STORAGE_TYPE_OPT_KEY {#STORAGE_TYPE_OPT_KEY}
- Property: `hoodie.datasource.write.storage.type`, Default: `COPY_ON_WRITE`
<br/>
- <span style="color:grey">The storage type for the underlying data, for this
write. This can't change between writes.</span>
+#### TABLE_TYPE_OPT_KEY {#TABLE_TYPE_OPT_KEY}
+ Property: `hoodie.datasource.write.table.type`, Default: `COPY_ON_WRITE`
<br/>
+ <span style="color:grey">The table type for the underlying data, for this
write. This can't change between writes.</span>
#### PRECOMBINE_FIELD_OPT_KEY {#PRECOMBINE_FIELD_OPT_KEY}
Property: `hoodie.datasource.write.precombine.field`, Default: `ts` <br/>
@@ -100,7 +100,7 @@ This is useful to store checkpointing information, in a
consistent way with the
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
- <span style="color:grey">When set to true, register/sync the dataset to
Apache Hive metastore</span>
+ <span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
#### HIVE_DATABASE_OPT_KEY {#HIVE_DATABASE_OPT_KEY}
Property: `hoodie.datasource.hive_sync.database`, Default: `default` <br/>
@@ -124,7 +124,7 @@ This is useful to store checkpointing information, in a
consistent way with the
#### HIVE_PARTITION_FIELDS_OPT_KEY {#HIVE_PARTITION_FIELDS_OPT_KEY}
Property: `hoodie.datasource.hive_sync.partition_fields`, Default: ` ` <br/>
- <span style="color:grey">field in the dataset to use for determining hive
partition columns.</span>
+ <span style="color:grey">field in the table to use for determining hive
partition columns.</span>
#### HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
{#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY}
Property: `hoodie.datasource.hive_sync.partition_extractor_class`, Default:
`org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor` <br/>
@@ -136,13 +136,13 @@ This is useful to store checkpointing information, in a
consistent way with the
### Read Options
-Options useful for reading datasets via `read.format.option(...)`
+Options useful for reading tables via `read.format.option(...)`
-#### VIEW_TYPE_OPT_KEY {#VIEW_TYPE_OPT_KEY}
-Property: `hoodie.datasource.view.type`, Default: `read_optimized` <br/>
+#### QUERY_TYPE_OPT_KEY {#QUERY_TYPE_OPT_KEY}
+Property: `hoodie.datasource.query.type`, Default: `snapshot` <br/>
<span style="color:grey">Whether data needs to be read, in incremental mode
(new data since an instantTime)
(or) Read Optimized mode (obtain latest view, based on columnar data)
-(or) Real time mode (obtain latest view, based on row & columnar data)</span>
+(or) Snapshot mode (obtain latest view, based on row & columnar data)</span>
#### BEGIN_INSTANTTIME_OPT_KEY {#BEGIN_INSTANTTIME_OPT_KEY}
Property: `hoodie.datasource.read.begin.instanttime`, [Required in incremental
mode] <br/>
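Putting these two options together, an incremental datasource read might look like the sketch below; `basePath` and the instant time are placeholders, and `incremental` is assumed to be the literal for incremental mode:

```scala
val incrDF = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  // surface only records written after this instant time
  .option("hoodie.datasource.read.begin.instanttime", "20200101000000")
  .load(basePath)
```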
@@ -182,15 +182,15 @@ Property: `hoodie.base.path` [Required] <br/>
#### withSchema(schema_str) {#withSchema}
Property: `hoodie.avro.schema` [Required]<br/>
-<span style="color:grey">This is the current reader avro schema for the
dataset. This is a string of the entire schema. HoodieWriteClient uses this
schema to pass on to implementations of HoodieRecordPayload to convert from the
source format to avro record. This is also used when re-writing records during
an update. </span>
+<span style="color:grey">This is the current reader avro schema for the table.
This is a string of the entire schema. HoodieWriteClient uses this schema to
pass on to implementations of HoodieRecordPayload to convert from the source
format to avro record. This is also used when re-writing records during an
update. </span>
#### forTable(table_name) {#forTable}
Property: `hoodie.table.name` [Required] <br/>
- <span style="color:grey">Table name for the dataset, will be used for
registering with Hive. Needs to be same across runs.</span>
+ <span style="color:grey">Table name that will be used for registering with
Hive. Needs to be same across runs.</span>
#### withBulkInsertParallelism(bulk_insert_parallelism = 1500)
{#withBulkInsertParallelism}
Property: `hoodie.bulkinsert.shuffle.parallelism`<br/>
-<span style="color:grey">Bulk insert is meant to be used for large initial
imports and this parallelism determines the initial number of files in your
dataset. Tune this to achieve a desired optimal size during initial
import.</span>
+<span style="color:grey">Bulk insert is meant to be used for large initial
imports and this parallelism determines the initial number of files in your
table. Tune this to achieve a desired optimal size during initial import.</span>
#### withParallelism(insert_shuffle_parallelism = 1500,
upsert_shuffle_parallelism = 1500) {#withParallelism}
Property: `hoodie.insert.shuffle.parallelism`,
`hoodie.upsert.shuffle.parallelism`<br/>
@@ -310,7 +310,7 @@ Property: `hoodie.logfile.data.block.max.size` <br/>
#### logFileToParquetCompressionRatio(logFileToParquetCompressionRatio = 0.35)
{#logFileToParquetCompressionRatio}
Property: `hoodie.logfile.to.parquet.compression.ratio` <br/>
-<span style="color:grey">Expected additional compression as records move from
log files to parquet. Used for merge_on_read storage to send inserts into log
files & control the size of compacted parquet file.</span>
+<span style="color:grey">Expected additional compression as records move from
log files to parquet. Used for merge_on_read table to send inserts into log
files & control the size of compacted parquet file.</span>
#### parquetCompressionCodec(parquetCompressionCodec = gzip)
{#parquetCompressionCodec}
Property: `hoodie.parquet.compression.codec` <br/>
@@ -326,7 +326,7 @@ Property: `hoodie.cleaner.policy` <br/>
#### retainCommits(no_of_commits_to_retain = 24) {#retainCommits}
Property: `hoodie.cleaner.commits.retained` <br/>
-<span style="color:grey">Number of commits to retain. So data will be retained
for num_of_commits * time_between_commits (scheduled). This also directly
translates into how much you can incrementally pull on this dataset</span>
+<span style="color:grey">Number of commits to retain. So data will be retained
for num_of_commits * time_between_commits (scheduled). This also directly
translates into how much you can incrementally pull on this table</span>
#### archiveCommitsWith(minCommits = 96, maxCommits = 128)
{#archiveCommitsWith}
Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
diff --git a/docs/_docs/2_5_performance.md b/docs/_docs/2_5_performance.md
index 3bcd69d..6f489fc 100644
--- a/docs/_docs/2_5_performance.md
+++ b/docs/_docs/2_5_performance.md
@@ -11,14 +11,14 @@ the conventional alternatives for achieving these tasks.
## Upserts
-Following shows the speed up obtained for NoSQL database ingestion, from
incrementally upserting on a Hudi dataset on the copy-on-write storage,
+The following shows the speedup obtained for NoSQL database ingestion, from incrementally upserting on a Hudi table on the copy-on-write storage,
on 5 tables ranging from small to huge (as opposed to bulk loading the tables)
<figure>
<img class="docimage" src="/assets/images/hudi_upsert_perf1.png"
alt="hudi_upsert_perf1.png" style="max-width: 1000px" />
</figure>
-Given Hudi can build the dataset incrementally, it opens doors for also
scheduling ingesting more frequently thus reducing latency, with
+Given Hudi can build the table incrementally, it opens the door to scheduling ingestion more frequently, thus reducing latency, with
significant savings on the overall compute cost.
<figure>
@@ -43,8 +43,8 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95%
inserts) on a event
## Read Optimized Queries
-The major design goal for read optimized view is to achieve the latency
reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi
datasets across Hive/Presto/Spark queries and demonstrate this.
+The major design goal for read optimized querying is to achieve the latency reduction & efficiency gains of the previous section,
+with no impact on queries. The following charts compare Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
**Hive**