This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new b5e0d7c Travis CI build asf-site
b5e0d7c is described below
commit b5e0d7c757dbeb49f7ab45f4d8608471eadf72a1
Author: CI <[email protected]>
AuthorDate: Sat Aug 22 08:32:13 2020 +0000
Travis CI build asf-site
---
content/cn/docs/querying_data.html | 21 ++++++++++++++++++---
content/docs/querying_data.html | 6 +++---
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/content/cn/docs/querying_data.html b/content/cn/docs/querying_data.html
index fb798b9..e808682 100644
--- a/content/cn/docs/querying_data.html
+++ b/content/cn/docs/querying_data.html
@@ -472,7 +472,7 @@
</tr>
<tr>
<td><strong>Spark Datasource</strong></td>
- <td>N</td>
+ <td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
@@ -615,7 +615,7 @@ Upsert utility (<code class="highlighter-rouge">HoodieDeltaStreamer</code>
<p>Hudi jars and bundles are easy to deploy and manage within Spark jobs/notebooks. In short, there are two ways to access Hudi datasets through Spark.</p>
<ul>
- <li><strong>Hudi DataSource</strong>: supports the read-optimized view and incremental pulls, similar to how a standard datasource (e.g. <code class="highlighter-rouge">spark.read.parquet</code>) works.</li>
+ <li><strong>Hudi DataSource</strong>: supports the real-time view, the read-optimized view and incremental pulls, similar to how a standard datasource (e.g. <code class="highlighter-rouge">spark.read.parquet</code>) works.</li>
<li><strong>Reading as a Hive table</strong>: supports all three views, including the real-time view, relying on the custom Hudi input format (again, similar to Hive).</li>
</ul>
@@ -638,7 +638,7 @@ Upsert utility (<code class="highlighter-rouge">HoodieDeltaStreamer</code>
</code></pre></div></div>
<h3 id="spark-rt-view">实时表</h3>
-<p>Currently, the real-time table can only be queried in Spark as a Hive table. To do so, set <code class="highlighter-rouge">spark.sql.hive.convertMetastoreParquet = false</code>,
+<p>To query the real-time table in Spark as a Hive table, set <code class="highlighter-rouge">spark.sql.hive.convertMetastoreParquet = false</code>,
forcing Spark to fall back to reading the data using the Hive Serde (planning/execution still happens in Spark).</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="n">$</span> <span
class="n">spark</span><span class="o">-</span><span class="n">shell</span>
<span class="o">--</span><span class="n">jars</span> <span
class="n">hudi</span><span class="o">-</span><span class="n">spark</span><span
class="o">-</span><span class="n">bundle</span><span class="o">-</span><span
class="nv">x</span><span class="o">.</span><span class="py">y</span><span [...]
@@ -646,6 +646,21 @@ Upsert utility (<code class="highlighter-rouge">HoodieDeltaStreamer</code>
<span class="n">scala</span><span class="o">></span> <span
class="nv">sqlContext</span><span class="o">.</span><span
class="py">sql</span><span class="o">(</span><span class="s">"select count(*)
from hudi_rt where datestr = '2016-10-02'"</span><span class="o">).</span><span
class="py">show</span><span class="o">()</span>
</code></pre></div></div>
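<p>Putting it together, a complete session might look like the following. This is a minimal sketch: the bundle jar name and the Hive-synced table name <code class="highlighter-rouge">hudi_rt</code> are illustrative placeholders.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ spark-shell --jars hudi-spark-bundle-x.y.z.jar --conf spark.sql.hive.convertMetastoreParquet=false

// count records in one partition of the Hive-synced real-time table;
// planning/execution still happens in Spark, only the read path uses the Hive Serde
scala> spark.sql("select count(*) from hudi_rt where datestr = '2016-10-02'").show()
</code></pre></div></div>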
+<p>If you want to use a glob path on DFS via the datasource, you can simply do something like the following to obtain a Spark DataFrame.</p>
+
+<div class="language-scala highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="nc">Dataset</span><span
class="o"><</span><span class="nc">Row</span><span class="o">></span>
<span class="n">hoodieRealtimeViewDF</span> <span class="k">=</span> <span
class="nv">spark</span><span class="o">.</span><span
class="py">read</span><span class="o">().</span><span
class="py">format</span><span class="o">(</span><span
class="s">"org.apache.hudi"</span><span class [...]
+<span class="c1">// pass any path glob, can include hudi & non-hudi
datasets
+</span><span class="o">.</span><span class="py">load</span><span
class="o">(</span><span class="s">"/glob/path/pattern"</span><span
class="o">);</span>
+</code></pre></div></div>
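<p>The returned DataFrame behaves like any other Spark DataFrame; for example (a sketch, assuming the glob above matched a Hudi dataset with a <code class="highlighter-rouge">datestr</code> partition column), it can be registered as a temporary view and queried with SQL:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// expose the DataFrame to Spark SQL; the view name is arbitrary
hoodieRealtimeViewDF.createOrReplaceTempView("hudi_rt_view")
spark.sql("select count(*) from hudi_rt_view where datestr = '2016-10-02'").show()
</code></pre></div></div>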
+
+<p>If you want to query only the read-optimized view of the real-time table:</p>
+
+<div class="language-scala highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="nc">Dataset</span><span
class="o"><</span><span class="nc">Row</span><span class="o">></span>
<span class="n">hoodieRealtimeViewDF</span> <span class="k">=</span> <span
class="nv">spark</span><span class="o">.</span><span
class="py">read</span><span class="o">().</span><span
class="py">format</span><span class="o">(</span><span
class="s">"org.apache.hudi"</span><span class [...]
+<span class="o">.</span><span class="py">option</span><span
class="o">(</span><span class="nv">DataSourceReadOptions</span><span
class="o">.</span><span class="py">QUERY_TYPE_OPT_KEY</span><span
class="o">,</span> <span class="nv">DataSourceReadOptions</span><span
class="o">.</span><span
class="py">QUERY_TYPE_READ_OPTIMIZED_OPT_VAL</span><span class="o">)</span>
+<span class="c1">// pass any path glob, can include hudi & non-hudi
datasets
+</span><span class="o">.</span><span class="py">load</span><span
class="o">(</span><span class="s">"/glob/path/pattern"</span><span
class="o">);</span>
+</code></pre></div></div>
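<p>For reference, an equivalent Scala snippet (a sketch; the import and the glob path are assumptions, with the option names taken from the block above):</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.hudi.DataSourceReadOptions

// read-optimized view: reads only the columnar base files
val hoodieROViewDF = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
  .load("/glob/path/pattern")
</code></pre></div></div>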
+
<h3 id="spark-incr-pull">增量拉取</h3>
<p>The <code class="highlighter-rouge">hudi-spark</code> module offers the DataSource API, a more elegant way to pull data out of a Hudi dataset and process it via Spark. Shown below is an example incremental pull, which obtains all records written since <code class="highlighter-rouge">beginInstantTime</code>.</p>
diff --git a/content/docs/querying_data.html b/content/docs/querying_data.html
index a280ece..3db236f 100644
--- a/content/docs/querying_data.html
+++ b/content/docs/querying_data.html
@@ -489,7 +489,7 @@ with special configurations that indicates to query planning that only increment
</tr>
<tr>
<td><strong>Spark Datasource</strong></td>
- <td>N</td>
+ <td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
@@ -647,8 +647,8 @@ If using spark’s built in support, additionally a path filter needs to be push
<h2 id="spark-datasource">Spark Datasource</h2>
-<p>The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how standard
-datasources work (e.g: <code class="highlighter-rouge">spark.read.parquet</code>). Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding <code class="highlighter-rouge">--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar</code> to classpath of drivers
+<p>The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi COPY_ON_WRITE and MERGE_ON_READ tables can be queried via Spark datasource similar to how standard
+datasources work (e.g: <code class="highlighter-rouge">spark.read.parquet</code>). MERGE_ON_READ tables support snapshot querying and COPY_ON_WRITE tables support both snapshot and incremental querying via Spark datasource. Typically spark jobs require adding <code class="highlighter-rouge">--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar</code> to classpath of drivers
and executors. Alternatively, hudi-spark-bundle can also be fetched via the <code class="highlighter-rouge">--packages</code> option (e.g: <code class="highlighter-rouge">--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3</code>).</p>
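<p>For example (a sketch; the table base path and the partition glob are placeholders), launching with <code class="highlighter-rouge">--packages</code> and reading a table back as a plain DataFrame:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3

// snapshot query of a COPY_ON_WRITE table via the datasource
scala> val df = spark.read.format("org.apache.hudi").load("/path/to/hudi/table/*/*")
scala> df.count()
</code></pre></div></div>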
<h3 id="spark-snap-query">Snapshot query</h3>