bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972249
##########
File path: docs/_docs/2_3_querying_data.md
##########
@@ -84,55 +102,53 @@ using the hive session property for incremental queries: `set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines partitions (comma separated) and calls InputFormat.listStatus() only once with all those partitions.

-## Spark
+## Spark datasource

-Spark provides much easier deployment & management of Hudi jars and bundles into jobs/notebooks. At a high level, there are two ways to access Hudi tables in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how standard datasources work (e.g: `spark.read.parquet`).
+Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source) for build instructions.
+When using spark shell instead of --jars, --packages can also be used to fetch the hudi-spark-bundle like this: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide.html#setup-spark-shell).

- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the snapshot queries, relying on the custom Hudi input formats again like Hive.
-
- In general, your spark job needs a dependency to `hudi-spark` or `hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom Hudi input formats again like Hive.
+Typically notebook users and spark-shell users leverage spark sql for querying Hudi tables. Please add hudi-spark-bundle
+as described above via --jars or --packages.

-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized querying of a Hudi hive table using SparkSQL.
-This method retains Spark built-in optimizations for reading Parquet files like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do something like below to get a Spark dataframe to work with.
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive SerDe when reading from Hive metastore parquet tables.
+However, for MERGE_ON_READ tables which has both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`.
+This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark).

 ```java
-Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
-// pass any path glob, can include hudi & non-hudi tables
-.load("/glob/path/pattern");
+$ spark-shell --driver-class-path /etc/hive/conf --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g --master yarn-client
+
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
 ```
-
-### Snapshot query {#spark-snapshot-query}
-Currently, near-real time data can only be queried as a Hive table in Spark using snapshot query mode. In order to do this, set `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback
-to using the Hive Serde to read the data (planning/executions is still Spark).
-```java
-$ spark-shell --jars hudi-spark-bundle_2.11-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf --packages org.apache.spark:spark-avro_2.11:2.4.4 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g --master yarn-client
+For COPY_ON_WRITE tables, either Hive SerDe can be used by turning off convertMetastoreParquet as described above or Spark's built in support can be leveraged.

Review comment:
 okay sure.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services