This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 9ec0704 [DOC] Add MOR Spark datasource support to Doc (#1959)
9ec0704 is described below
commit 9ec0704d23fec4d9360ac1a600dcef82b37e0ff4
Author: Gary Li <[email protected]>
AuthorDate: Sat Aug 22 01:30:15 2020 -0700
[DOC] Add MOR Spark datasource support to Doc (#1959)
---
docs/_docs/2_3_querying_data.cn.md | 23 ++++++++++++++++++++---
docs/_docs/2_3_querying_data.md | 6 +++---
2 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/docs/_docs/2_3_querying_data.cn.md b/docs/_docs/2_3_querying_data.cn.md
index 0aeb104..c72c2b7 100644
--- a/docs/_docs/2_3_querying_data.cn.md
+++ b/docs/_docs/2_3_querying_data.cn.md
@@ -46,7 +46,7 @@ language: cn
|------------|--------|-----------|--------------|
|**Hive**|Y|Y|Y|
|**Spark SQL**|Y|Y|Y|
-|**Spark Datasource**|N|N|Y|
+|**Spark Datasource**|Y|N|Y|
|**Presto**|N|N|Y|
|**Impala**|N|N|Y|
@@ -110,7 +110,7 @@ The Upsert utility (`HoodieDeltaStreamer`) has everything the directory structure
Spark makes it easy to deploy and manage the Hudi jars and bundles into jobs/notebooks. In short, there are two ways to access Hudi datasets through Spark.
- - **Hudi DataSource**: Supports the read optimized view and incremental pulling, similar to how standard datasources (e.g. `spark.read.parquet`) work.
+ - **Hudi DataSource**: Supports the real-time view, the read optimized view and incremental pulling, similar to how standard datasources (e.g. `spark.read.parquet`) work.
- **Reading as Hive tables**: Supports all three views, including the real-time view, relying on the custom Hudi input formats (again similar to Hive).
In general, your Spark jobs need a dependency on `hudi-spark` or `hudi-spark-bundle-x.y.z.jar`,
@@ -134,7 +134,7 @@ Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
```
### Real-time table {#spark-rt-view}
-Currently, the real-time table can only be queried in Spark as a Hive table. To do so, set `spark.sql.hive.convertMetastoreParquet = false`,
+To query the real-time table in Spark as a Hive table, set `spark.sql.hive.convertMetastoreParquet = false`,
forcing Spark to fall back to reading the data using the Hive Serde (planning/execution is still Spark).
```scala
@@ -143,6 +143,23 @@ $ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /e
scala> sqlContext.sql("select count(*) from hudi_rt where datestr = '2016-10-02'").show()
```
+If you would like to use a glob path on DFS via the datasource, you can simply do something like the following to obtain a Spark DataFrame.
+
+```scala
+val hoodieRealtimeViewDF = spark.read.format("org.apache.hudi")
+  // pass any path glob, can include hudi & non-hudi datasets
+  .load("/glob/path/pattern")
+```
+
+If you would like to query only the read optimized view of the real-time table:
+
+```scala
+val hoodieRealtimeViewDF = spark.read.format("org.apache.hudi")
+  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
+  // pass any path glob, can include hudi & non-hudi datasets
+  .load("/glob/path/pattern")
+```
+
### Incremental pull {#spark-incr-pull}
The `hudi-spark` module provides the DataSource API, a more elegant way to pull data out of a Hudi dataset and process it via Spark.
Shown below is a sample incremental pull, which will fetch all records written since `beginInstantTime`.
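The incremental pull described above can be sketched with the query options in Hudi's `DataSourceReadOptions` (a minimal sketch, assuming a live `spark` session in spark-shell; the table base path and begin instant below are placeholders, not values from this patch):

```scala
// Sketch of an incremental pull via the Hudi datasource.
// Assumes spark-shell was launched with the hudi-spark bundle on the classpath.
import org.apache.hudi.DataSourceReadOptions

val beginInstantTime = "20200821000000" // placeholder commit instant
val incViewDF = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  // only records written after this instant are returned
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, beginInstantTime)
  .load("/path/to/hudi/table") // placeholder base path of the dataset
incViewDF.createOrReplaceTempView("hudi_incr")
```

Unlike the glob-path snapshot reads above, an incremental read is anchored at the dataset's base path and filtered by commit instant rather than by partition predicates.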
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 4b62a3d..33a5c13 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -52,7 +52,7 @@ Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
|------------|--------|-----------|--------------|
|**Hive**|Y|Y|Y|
|**Spark SQL**|Y|Y|Y|
-|**Spark Datasource**|N|N|Y|
+|**Spark Datasource**|Y|N|Y|
|**Presto**|N|N|Y|
|**Impala**|N|N|Y|
@@ -132,8 +132,8 @@ spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.clas
## Spark Datasource
-The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how standard
-datasources work (e.g: `spark.read.parquet`). Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to classpath of drivers
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi COPY_ON_WRITE and MERGE_ON_READ tables can be queried via Spark datasource similar to how standard
+datasources work (e.g: `spark.read.parquet`). MERGE_ON_READ tables support snapshot querying and COPY_ON_WRITE tables support both snapshot and incremental querying via Spark datasource. Typically spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to classpath of drivers
and executors. Alternatively, hudi-spark-bundle can also be fetched via the `--packages` option (e.g: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3`).
### Snapshot query {#spark-snap-query}
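The `--packages` launch described in the paragraph above might look like the following (the bundle coordinates are the ones given in the text; the serializer setting is an assumption commonly paired with Hudi, not part of this patch):

```shell
# Fetch the hudi-spark bundle from Maven Central instead of passing a local jar.
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'  # assumed, not from the patch
```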