[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-14 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379646458
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark datasource, Spark SQL and Presto.
 
 Review comment:
  it's not released yet though... but we should file a ticket for doc-ing that nonetheless, when an Impala release does happen 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-14 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379646458
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark datasource, Spark SQL and Presto.
 
 Review comment:
  it's not released yet though... but we should file a ticket for doc-ing that nonetheless. 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226110
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
 
 Review comment:
   own parquet reader




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226403
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
+However, for MERGE_ON_READ tables which has both parquet and avro data, this 
default setting needs to be turned off using set 
`spark.sql.hive.convertMetastoreParquet=false`. 
+This will force Spark to fallback to using the Hive Serde to read the data 
(planning/executions is still Spark). 
 
 ```java
-Dataset hoodieROViewDF = spark.read().format("org.apache.hudi")
-// pass any path glob, can include hudi & non-hudi tables
-.load("/glob/path/pattern");
+$ spark-shell --driver-class-path /etc/hive/conf  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 
--driver-memory 7g --executor-memory 2g  --master yarn-client
+
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = 
'2016-10-02'").show()
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = 
'2016-10-02'").show()
 ```
- 
-### Snapshot query {#spark-snapshot-query}
-Currently, near-real time data can only be queried as a Hive table in Spark 
using snapshot query mode. In order to do this, set 
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback 
-to using the Hive Serde to read the data (planning/executions is still Spark). 
 
-```java
-$ spark-shell --jars hudi-spark-bundle_2.11-x.y.z-SNAPSHOT.jar 
--driver-class-path /etc/hive/conf  --packages 
org.apache.spark:spark-avro_2.11:2.4.4 --conf 
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 
7g --executor-memory 2g  --master yarn-client
+For COPY_ON_WRITE tables, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
 
 Review comment:
  `turning off convertMetastoreParquet` : please use the exact and full config here within backticks, i.e. `spark.sql.hive.convertMetastoreParquet`
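
For reference, a minimal spark-shell sketch of the full config key being asked for here (a sketch only; the table name is taken from the snippet above, and depending on the Spark version the config may need to be passed at launch via `--conf`, as in the doc's command, rather than set in-session):

```scala
// Turn off Spark's built-in Parquet reader for Hive metastore tables so queries
// fall back to the Hive SerDe / Hudi input format (needed for MERGE_ON_READ _rt tables).
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")

// Table name comes from the quoted doc; adjust for your setup.
spark.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
```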


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225601
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
 
 Review comment:
   use `--jars` `--packages`




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226038
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
 
 Review comment:
  let's also spend some time setting context and explaining how this uses the Spark/Hive integration
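
One possible way to set that context (a hedged sketch, not the doc's current wording): Spark SQL resolves the Hudi table through the Hive metastore, so the session needs Hive support enabled and the table must already be synced to the metastore with Hudi's input format.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: Spark SQL goes through the Hive metastore catalog, which points
// at the external Hive table backed by Hudi's custom input format.
val spark = SparkSession.builder()
  .appName("hudi-spark-sql")   // app name is illustrative
  .enableHiveSupport()         // the Spark/Hive integration piece
  .getOrCreate()

// Table name taken from the examples elsewhere in this doc.
spark.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
```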




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225202
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
 
 Review comment:
  please point to the quick start or some example for this
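
Something along these lines could serve as that example (a sketch; paths are illustrative, and the incremental option keys are assumed from the 0.5.x-era `DataSourceReadOptions`, so they should be verified against the released version):

```scala
// Snapshot query on a COPY_ON_WRITE table via the Spark datasource,
// exactly like a standard datasource read (e.g. spark.read.parquet).
val snapshotDF = spark.read
  .format("org.apache.hudi")
  .load("/path/to/hudi_trips_cow/*/*/*/*")   // glob over partition paths
snapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

// Incremental query: only records written after the given commit time.
val incrementalDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.view.type", "incremental")                 // assumed 0.5.x key
  .option("hoodie.datasource.read.begin.instanttime", "20200213000000") // commit time to start from
  .load("/path/to/hudi_trips_cow")
```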




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226205
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
 
 Review comment:
   `By default` : are you talking about copy_on_write tables? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226624
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -145,8 +161,13 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
+### Read optimized query
+
+For read optimized queries, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
+If using spark's built in support, additionally a path filter needs to be 
pushed into sparkContext as described earlier.
 
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. 
Presto currently supports only read optimized queries on Hudi tables. 
-This requires the `hudi-presto-bundle` jar to be placed into 
`/plugin/hive-hadoop2/`, across the installation.
+Presto is a popular query engine, providing interactive query performance. 
Presto currently supports snapshot queries on
+COPY_On_WRITE and read optimized queries on MERGE_ON_READ Hudi tables. This 
requires the `hudi-presto-bundle` jar 
 
 Review comment:
  typo: `COPY_On_WRITE` should be `COPY_ON_WRITE`
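
Side note on the "path filter needs to be pushed into sparkContext" reference in the quoted hunk: the snippet it points back to (from the earlier Spark section of this doc) is reproduced below for context; a sketch only.

```scala
// Register Hudi's path filter in the Hadoop conf so Spark's built-in Parquet
// reader only picks up the latest file slices (read optimized view), while
// keeping Spark's vectorized Parquet reading.
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter])
```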




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226552
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -145,8 +161,13 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
+### Read optimized query
+
+For read optimized queries, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
 
 Review comment:
  Remove this section?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225899
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
 Review comment:
  For this section, can you take another stab... it feels short and curt... maybe set some context on things like: Spark datasources directly query the underlying DFS data without Hive (and for this reason I think we should move `SparkSQL` up and place it before this section) 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225479
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
 
 Review comment:
  can we remove this line: `Refer [building Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source) for build instructions.`? You don't have to build it yourself, per se. 

