This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 4814dff [DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851)
4814dff is described below
commit 4814dff7dfc1812ba85077fc3ac1910721a81662
Author: laurieliyang <[email protected]>
AuthorDate: Mon Oct 25 06:20:56 2021 +0800
[DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851)
* Update cn doc azure_hoodie.md of current and 0.8.0
* Remove version matter of azure_hoodie of current
---
.../current/azure_hoodie.md | 35 ++--
.../current/docker_demo.md | 215 +++++++++------------
.../version-0.8.0/azure_hoodie.md | 35 ++--
3 files changed, 127 insertions(+), 158 deletions(-)
diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md
index cbda98a..f7ccb84 100644
--- a/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md
+++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md
@@ -1,41 +1,42 @@
---
-title: Azure Filesystem
+title: Azure 文件系统
keywords: [ hudi, hive, azure, spark, presto]
-summary: In this page, we go over how to configure Hudi with Azure filesystem.
+summary: 在本页中,我们讨论如何在 Azure 文件系统中配置 Hudi 。
last_modified_at: 2020-05-25T19:00:57-04:00
language: cn
---
-In this page, we explain how to use Hudi on Microsoft Azure.
+在本页中,我们解释如何在 Microsoft Azure 上使用 Hudi 。
-## Disclaimer
+## 声明
-This page is maintained by the Hudi community.
-If the information is inaccurate or you have additional information to add.
-Please feel free to create a JIRA ticket. Contribution is highly appreciated.
+本页面由 Hudi 社区维护。
+如果信息不准确,或者你有信息要补充,请尽管创建 JIRA ticket。
+对此贡献高度赞赏。
-## Supported Storage System
+## 支持的存储系统
-There are two storage systems support Hudi .
+Hudi 支持两种存储系统。
-- Azure Blob Storage
+- Azure Blob 存储
- Azure Data Lake Gen 2
-## Verified Combination of Spark and storage system
+## 经过验证的 Spark 与存储系统的组合
-#### HDInsight Spark2.4 on Azure Data Lake Storage Gen 2
+#### Azure Data Lake Storage Gen 2 上的 HDInsight Spark 2.4
This combination works out of the box. No extra config needed.
+这种组合开箱即用,不需要额外的配置。
-#### Databricks Spark2.4 on Azure Data Lake Storage Gen 2
-- Import Hudi jar to databricks workspace
+#### Azure Data Lake Storage Gen 2 上的 Databricks Spark 2.4
+- 将 Hudi jar 包导入到 databricks 工作区 。
-- Mount the file system to dbutils.
+- 将文件系统挂载到 dbutils 。
```scala
dbutils.fs.mount(
source = "abfss://[email protected]",
mountPoint = "/mountpoint",
extraConfigs = configs)
```
-- When writing Hudi dataset, use abfss URL
+- 当写入 Hudi 数据集时,使用 abfss URL
```scala
inputDF.write
.format("org.apache.hudi")
@@ -43,7 +44,7 @@ This combination works out of the box. No extra config needed.
.mode(SaveMode.Append)
.save("abfss://<<storage-account>>.dfs.core.windows.net/hudi-tables/customer")
```
-- When reading Hudi dataset, use the mounting point
+- 当读取 Hudi 数据集时,使用挂载点
```scala
spark.read
.format("org.apache.hudi")
diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md b/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md
index 3b8d1f0..eea0e88 100644
--- a/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md
+++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md
@@ -6,18 +6,17 @@ last_modified_at: 2019-12-30T15:59:57-04:00
language: cn
---
-## A Demo using docker containers
+## 一个使用 Docker 容器的 Demo
-Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your computer.
+我们来使用一个真实世界的案例,来看看 Hudi 是如何闭环运转的。 为了这个目的,在你的计算机中的本地 Docker 集群中组建了一个自包含的数据基础设施。
-The steps have been tested on a Mac laptop
+以下步骤已经在一台 Mac 笔记本电脑上测试过了。
-### Prerequisites
+### 前提条件
- * Docker Setup : For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
- * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat
- * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
+ * Docker 安装 : 对于 Mac ,请依照 [https://docs.docker.com/v17.12/docker-for-mac/install/] 当中定义的步骤。 为了运行 Spark-SQL 查询,请确保至少分配给 Docker 6 GB 和 4 个 CPU 。(参见 Docker -> Preferences -> Advanced)。否则,Spark-SQL 查询可能被因为内存问题而被杀停。
+ * kafkacat : 一个用于发布/消费 Kafka Topic 的命令行工具集。使用 `brew install kafkacat` 来安装 kafkacat 。
+ * /etc/hosts : Demo 通过主机名引用了多个运行在容器中的服务。将下列设置添加到 /etc/hosts :
```java
@@ -32,24 +31,24 @@ The steps have been tested on a Mac laptop
127.0.0.1 zookeeper
```
-Also, this has not been tested on some environments like Docker on Windows.
+此外,这未在其它一些环境中进行测试,例如 Windows 上的 Docker 。
-## Setting up Docker Cluster
+## 设置 Docker 集群
-### Build Hudi
+### 构建 Hudi
-The first step is to build hudi
+构建 Hudi 的第一步:
```java
cd <HUDI_WORKSPACE>
mvn package -DskipTests
```
-### Bringing up Demo Cluster
+### 组建 Demo 集群
-The next step is to run the docker compose script and setup configs for bringing up the cluster.
-This should pull the docker images from docker hub and setup docker cluster.
+下一步是运行 Docker 安装脚本并设置配置项以便组建集群。
+这需要从 Docker 镜像库拉取 Docker 镜像,并设置 Docker 集群。
```java
cd docker
@@ -84,29 +83,27 @@ Copying spark default config and setting up configs
$ docker ps
```
-At this point, the docker cluster will be up and running. The demo cluster brings up the following services
+至此, Docker 集群将会启动并运行。 Demo 集群提供了下列服务:
- * HDFS Services (NameNode, DataNode)
- * Spark Master and Worker
- * Hive Services (Metastore, HiveServer2 along with PostgresDB)
- * Kafka Broker and a Zookeeper Node (Kafka will be used as upstream source for the demo)
- * Adhoc containers to run Hudi/Hive CLI commands
+ * HDFS 服务( NameNode, DataNode )
+ * Spark Master 和 Worker
+ * Hive 服务( Metastore, HiveServer2 以及 PostgresDB )
+ * Kafka Broker 和一个 Zookeeper Node ( Kafka 将被用来当做 Demo 的上游数据源 )
+ * 用来运行 Hudi/Hive CLI 命令的 Adhoc 容器
## Demo
-Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.
+Stock Tracker 数据将用来展示不同的 Hudi 视图以及压缩带来的影响。
-Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
-The first batch contains stocker tracker data for some stock symbols during the first hour of trading window
-(9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30 mins (10:30 - 11 a.m). Hudi will
-be used to ingest these batches to a dataset which will contain the latest stock tracker data at hour level granularity.
-The batches are windowed intentionally so that the second batch contains updates to some of the rows in the first batch.
+看一下 `docker/demo/data` 目录。那里有 2 批股票数据——都是 1 分钟粒度的。
+第 1 批数据包含一些股票代码在交易窗口(9:30 a.m 至 10:30 a.m)的第一个小时里的行情数据数据。第 2 批包含接下来 30 分钟(10:30 - 11 a.m)的交易数据。 Hudi 将被用来将两个批次的数据采集到一个数据集中,这个数据集将会包含最新的小时级股票行情数据。
+两个批次被有意地按窗口切分,这样在第 2 批数据中包含了一些针对第 1 批数据条目的更新数据。
-### Step 1 : Publish the first batch to Kafka
+### Step 1 : 将第 1 批数据发布到 Kafka
-Upload the first batch to Kafka topic 'stock ticks' `cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P`
+将第 1 批数据上传到 Kafka 的 Topic “stock ticks” 中 `cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P`
-To check if the new topic shows up, use
+为了检查新的 Topic 是否出现,使用
```java
kafkacat -b kafkabroker -L -J | jq .
{
@@ -148,12 +145,9 @@ kafkacat -b kafkabroker -L -J | jq .
```
-### Step 2: Incrementally ingest data from Kafka topic
+### Step 2: 从 Kafka Topic 中增量采集数据
-Hudi comes with a tool named DeltaStreamer. This tool can connect to variety of data sources (including Kafka) to
-pull changes and apply to Hudi dataset using upsert/insert primitives. Here, we will use the tool to download
-json data from kafka topic and ingest to both COW and MOR tables we initialized in the previous step. This tool
-automatically initializes the datasets in the file-system if they do not exist yet.
+Hudi 自带一个名为 DeltaStreamer 的工具。 这个工具能连接多种数据源(包括 Kafka),以便拉取变更,并通过 upsert/insert 操作应用到 Hudi 数据集。此处,我们将使用这个工具从 Kafka Topic 下载 JSON 数据,并采集到前面步骤中初始化的 COW 和 MOR 表中。如果数据集不存在,这个工具将自动初始化数据集到文件系统中。
```java
docker exec -it adhoc-2 /bin/bash
@@ -172,20 +166,18 @@ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
exit
```
-You can use HDFS web-browser to look at the datasets
+你可以使用 HDFS 的 Web 浏览器来查看数据集
`http://namenode:50070/explorer#/user/hive/warehouse/stock_ticks_cow`.
-You can explore the new partition folder created in the dataset along with a "deltacommit"
-file under .hoodie which signals a successful commit.
+你可以浏览在数据集中新创建的分区文件夹,同时还有一个在 .hoodie 目录下的 deltacommit 文件。
-There will be a similar setup when you browse the MOR dataset
+在 MOR 数据集中也有类似的设置
`http://namenode:50070/explorer#/user/hive/warehouse/stock_ticks_mor`
-### Step 3: Sync with Hive
+### Step 3: 与 Hive 同步
-At this step, the datasets are available in HDFS. We need to sync with Hive to create new Hive tables and add partitions
-inorder to run Hive queries against those datasets.
+到了这一步,数据集在 HDFS 中可用。我们需要与 Hive 同步来创建新 Hive 表并添加分区,以便在那些数据集上执行 Hive 查询。
```java
docker exec -it adhoc-2 /bin/bash
@@ -205,18 +197,15 @@ docker exec -it adhoc-2 /bin/bash
....
exit
```
-After executing the above command, you will notice
+执行了以上命令后,你会发现:
-1. A hive table named `stock_ticks_cow` created which provides Read-Optimized view for the Copy On Write dataset.
-2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the Merge On Read dataset. The former
-provides the ReadOptimized view for the Hudi dataset and the later provides the realtime-view for the dataset.
+1. 一个名为 `stock_ticks_cow` 的 Hive 表被创建,它为写时复制数据集提供了读优化视图。
+2. 两个新表 `stock_ticks_mor` 和 `stock_ticks_mor_rt` 被创建用于读时合并数据集。 前者为 Hudi 数据集提供了读优化视图,而后者为数据集提供了实时视图。
-### Step 4 (a): Run Hive Queries
+### Step 4 (a): 运行 Hive 查询
-Run a hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both read-optimized
-(for both COW and MOR dataset)and realtime views (for MOR dataset)give the same value "10:29 a.m" as Hudi create a
-parquet file for the first batch of data.
+执行一个 Hive 查询来为股票 GOOG 找到采集到的最新时间戳。你会注意到读优化视图( COW 和 MOR 数据集都是如此)和实时视图(仅对 MOR 数据集)给出了相同的值 “10:29 a.m”,这是因为 Hudi 为每个批次的数据创建了一个 Parquet 文件。
```java
docker exec -it adhoc-2 /bin/bash
@@ -317,9 +306,8 @@ exit
exit
```
-### Step 4 (b): Run Spark-SQL Queries
-Hudi support Spark as query processor just like Hive. Here are the same hive queries
-running in spark-sql
+### Step 4 (b): 执行 Spark-SQL 查询
+Hudi 支持以 Spark 作为类似 Hive 的查询引擎。这是在 Spartk-SQL 中执行与 Hive 相同的查询
```java
docker exec -it adhoc-1 /bin/bash
@@ -415,9 +403,9 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
```
-### Step 4 (c): Run Presto Queries
+### Step 4 (c): 执行 Presto 查询
-Here are the Presto queries for similar Hive and Spark queries. Currently, Hudi does not support Presto queries on realtime views.
+这是 Presto 查询,它们与 Hive 和 Spark 的查询类似。目前 Hudi 的实时视图不支持 Presto 。
```java
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
@@ -506,10 +494,9 @@ Splits: 17 total, 17 done (100.00%)
presto:default> exit
```
-### Step 5: Upload second batch to Kafka and run DeltaStreamer to ingest
+### Step 5: 将第 2 批次上传到 Kafka 并运行 DeltaStreamer 进行采集
-Upload the second batch of data and ingest this batch using delta-streamer. As this batch does not bring in any new
-partitions, there is no need to run hive-sync
+上传第 2 批次数据,并使用 DeltaStreamer 采集。由于这个批次不会引入任何新分区,因此不需要执行 Hive 同步。
```java
cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
@@ -527,21 +514,17 @@ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
exit
```
-With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a new version of Parquet file getting created.
-See `http://namenode:50070/explorer#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
+使用写时复制表, DeltaStreamer 的第 2 批数据采集将导致 Parquet 文件创建一个新版本。
+参考: `http://namenode:50070/explorer#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
-With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
-Take a look at the HDFS filesystem to get an idea: `http://namenode:50070/explorer#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
+使用读时合并表, 第 2 批数据采集仅仅将数据追加到没有合并的 delta (日志) 文件中。看一下 HDFS 文件系统来了解这一点: `http://namenode:50070/explorer#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
-### Step 6(a): Run Hive Queries
+### Step 6 (a): 执行 Hive 查询
-With Copy-On-Write table, the read-optimized view immediately sees the changes as part of second batch once the batch
-got committed as each ingestion creates newer versions of parquet files.
+使用写时复制表,在每一个批次被提交采集并创建新版本的 Parquet 文件时,读优化视图会立即发现变更,这些变更被当第 2 批次的一部分。
-With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
-This is the time, when ReadOptimized and Realtime views will provide different results. ReadOptimized view will still
-return "10:29 am" as it will only read from the Parquet file. Realtime View will do on-the-fly merge and return
-latest committed data which is "10:59 a.m".
+使用读时合并表,第 2 批数据采集仅仅将数据追加到没有合并的 delta (日志) 文件中。
+此时,读优化视图和实时视图将提供不同的结果。读优化视图仍会返回“10:29 am”,因为它会只会从 Parquet 文件中读取。实时视图会做即时合并并返回最新提交的数据,即“10:59 a.m”。
```java
docker exec -it adhoc-2 /bin/bash
@@ -610,9 +593,9 @@ exit
exit
```
-### Step 6(b): Run Spark SQL Queries
+### Step 6 (b): 执行 Spark SQL 查询
-Running the same queries in Spark-SQL:
+以 Spark SQL 执行类似的查询:
```java
docker exec -it adhoc-1 /bin/bash
@@ -678,9 +661,9 @@ exit
exit
```
-### Step 6(c): Run Presto Queries
+### Step 6 (c): 执行 Presto 查询
-Running the same queries on Presto for ReadOptimized views.
+在 Presto 中为读优化视图执行类似的查询:
```java
@@ -742,11 +725,11 @@ presto:default> exit
```
-### Step 7 : Incremental Query for COPY-ON-WRITE Table
+### Step 7 : 写时复制表的增量查询
-With 2 batches of data ingested, lets showcase the support for incremental queries in Hudi Copy-On-Write datasets
+使用采集的两个批次的数据,我们展示 Hudi 写时复制数据集中支持的增量查询。
-Lets take the same projection query example
+我们使用类似的工程查询样例:
```java
docker exec -it adhoc-2 /bin/bash
@@ -761,16 +744,12 @@ beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache
+----------------------+---------+----------------------+---------+------------+-----------+--+
```
-As you notice from the above queries, there are 2 commits - 20180924064621 and 20180924065039 in timeline order.
-When you follow the steps, you will be getting different timestamps for commits. Substitute them
-in place of the above timestamps.
+正如你在上面的查询中看到的,有两个提交——按时间线排列是 20180924064621 和 20180924065039 。
+当你按照这些步骤执行后,你的提交会得到不同的时间戳。将它们替换到上面时间戳的位置。
-To show the effects of incremental-query, let us assume that a reader has already seen the changes as part of
-ingesting first batch. Now, for the reader to see effect of the second batch, he/she has to keep the start timestamp to
-the commit time of the first batch (20180924064621) and run incremental query
+为了展示增量查询的影响,我们假设有一位读者已经在第 1 批数据中一部分看到了变化。那么,为了让读者看到第 2 批数据的影响,他/她需要保留第 1 批次提交时间中的开始时间( 20180924064621 )并执行增量查询:
-Hudi incremental mode provides efficient scanning for incremental queries by filtering out files that do not have any
-candidate rows using hudi-managed metadata.
+Hudi 的增量模式为增量查询提供了高效的扫描,通过 Hudi 管理的元数据,过滤掉了那些不包含候选记录的文件。
```java
docker exec -it adhoc-2 /bin/bash
@@ -782,8 +761,8 @@ No rows affected (0.009 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
```
-With the above setting, file-ids that do not have any updates from the commit 20180924065039 is filtered out without scanning.
-Here is the incremental query :
+使用上面的设置,那些在提交 20180924065039 之后没有任何更新的文件ID将被过滤掉,不进行扫描。
+以下是增量查询:
```java
0: jdbc:hive2://hiveserver:10000>
@@ -797,7 +776,7 @@ Here is the incremental query :
0: jdbc:hive2://hiveserver:10000>
```
-### Incremental Query with Spark SQL:
+### 使用 Spark SQL 做增量查询
```java
docker exec -it adhoc-1 /bin/bash
bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1 --packages com.databricks:spark-avro_2.11:4.0.0
@@ -835,10 +814,10 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
```
-### Step 8: Schedule and Run Compaction for Merge-On-Read dataset
+### Step 8: 为读时合并数据集的调度并执行压缩
-Lets schedule and run a compaction to create a new version of columnar file so that read-optimized readers will see fresher data.
-Again, You can use Hudi CLI to manually schedule and run compaction
+我们来调度并运行一个压缩来创建一个新版本的列式文件,以便读优化读取器能看到新数据。
+再次强调,你可以使用 Hudi CLI 来人工调度并执行压缩。
```java
docker exec -it adhoc-1 /bin/bash
@@ -926,12 +905,11 @@ hoodie:stock_ticks->compactions show all
```
-### Step 9: Run Hive Queries including incremental queries
+### Step 9: 执行包含增量查询的 Hive 查询
-You will see that both ReadOptimized and Realtime Views will show the latest committed data.
-Lets also run the incremental query for MOR table.
-From looking at the below query output, it will be clear that the fist commit time for the MOR table is 20180924064636
-and the second commit time is 20180924070031
+你将看到读优化视图和实时视图都会展示最新提交的数据。
+让我们也对 MOR 表执行增量查询。
+通过查看下方的查询输出,能够明确 MOR 表的第一次提交时间是 20180924064636 而第二次提交时间是 20180924070031 。
```java
docker exec -it adhoc-2 /bin/bash
@@ -992,7 +970,7 @@ exit
exit
```
-### Step 10: Read Optimized and Realtime Views for MOR with Spark-SQL after compaction
+### Step 10: 压缩后在 MOR 的读优化视图与实时视图上使用 Spark-SQL
```java
docker exec -it adhoc-1 /bin/bash
@@ -1032,7 +1010,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close
+----------------------+---------+----------------------+---------+------------+-----------+--+
```
-### Step 11: Presto queries over Read Optimized View on MOR dataset after compaction
+### Step 11: 压缩后在 MOR 数据集的读优化视图上进行 Presto 查询
```java
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
@@ -1066,55 +1044,44 @@ presto:default>
```
-This brings the demo to an end.
+Demo 到此结束。
-## Testing Hudi in Local Docker environment
+## 在本地 Docker 环境中测试 Hudi
-You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi.
+你可以组建一个包含 Hadoop 、 Hive 和 Spark 服务的 Hadoop Docker 环境,并支持 Hudi 。
```java
$ mvn pre-integration-test -DskipTests
```
-The above command builds docker images for all the services with
-current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker images.
+上面的命令为所有的服务构建了 Docker 镜像,它带有当前安装在 /var/hoodie/ws 的 Hudi 源,并使用一个部署文件引入了这些服务。我们当前在 Docker 镜像中使用 Hadoop (v2.8.4)、 Hive (v2.3.3)和 Spark (v2.3.1)。
-To bring down the containers
+要销毁容器:
```java
$ cd hudi-integ-test
$ mvn docker-compose:down
```
-If you want to bring up the docker containers, use
+如果你想要组建 Docker 容器,使用:
```java
$ cd hudi-integ-test
$ mvn docker-compose:up -DdetachedMode=true
```
-Hudi is a library that is operated in a broader data analytics/ingestion environment
-involving Hadoop, Hive and Spark. Interoperability with all these systems is a key objective for us. We are
-actively adding integration-tests under __hudi-integ-test/src/test/java__ that makes use of this
-docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java__ )
+Hudi 是一个在包含 Hadoop 、 Hive 和 Spark 的海量数据分析/采集环境中使用的库。与这些系统的互用性是我们的一个关键目标。 我们在积极地向 __hudi-integ-test/src/test/java__ 添加集成测试,这些测试利用了这个 Docker 环境(参考: __hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java__ )。
-### Building Local Docker Containers:
+### 构建本地 Docker 容器:
-The docker images required for demo and running integration test are already in docker-hub. The docker images
-and compose scripts are carefully implemented so that they serve dual-purpose
+Demo 和执行集成测试所需要的 Docker 镜像已经在 Docker 源中。 Docker 镜像和部署脚本经过了谨慎的实现以便服务与多种目的:
-1. The docker images have inbuilt hudi jar files with environment variable pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
-2. For running integration-tests, we need the jars generated locally to be used for running services within docker. The
- docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local jars override
- inbuilt jars by mounting local HUDI workspace over the docker location
-3. As these docker containers have mounted local HUDI workspace, any changes that happen in the workspace would automatically
- reflect in the containers. This is a convenient way for developing and verifying Hudi for
- developers who do not own a distributed environment. Note that this is how integration tests are run.
+1. Docker 镜像有内建的 Hudi jar 包,它包含一些指向其他 jar 包的环境变量( HUDI_HADOOP_BUNDLE 等)
+2. 为了执行集成测试,我们需要使用本地生成的 jar 包在 Docker 中运行服务。 Docker 部署脚本(参考 `docker/compose/docker-compose_hadoop284_hive233_spark231.yml`)能确保本地 jar 包通过挂载 Docker 地址上挂载本地 Hudi 工作空间,从而覆盖了内建的 jar 包。
+3. 当这些 Docker 容器挂载到本地 Hudi 工作空间之后,任何发生在工作空间中的变更将会自动反映到容器中。这对于开发者来说是一种开发和验证 Hudi 的简便方法,这些开发者没有分布式的环境。要注意的是,这是集成测试的执行方式。
-This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth, they can still build local images
-run the script
-`docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
+这避免了维护分离的 Docker 镜像,也避免了本地构建 Docker 镜像的各个步骤的消耗。
+但是如果用户想要在有更低网络带宽的地方测试 Hudi ,他们仍可以构建本地镜像。
+在执行 `docker/setup_demo.sh` 之前执行脚本 `docker/build_local_docker_images.sh` 来构建本地 Docker 镜像。
-Here are the commands:
+以下是执行的命令:
```java
cd docker
diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/azure_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/azure_hoodie.md
index b9ff2c9..d2ce6e8 100644
--- a/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/azure_hoodie.md
+++ b/website/i18n/cn/docusaurus-plugin-content-docs/version-0.8.0/azure_hoodie.md
@@ -1,42 +1,43 @@
---
version: 0.8.0
-title: Azure Filesystem
+title: Azure 文件系统
keywords: [ hudi, hive, azure, spark, presto]
-summary: In this page, we go over how to configure Hudi with Azure filesystem.
+summary: 在本页中,我们讨论如何在 Azure 文件系统中配置 Hudi 。
last_modified_at: 2020-05-25T19:00:57-04:00
language: cn
---
-In this page, we explain how to use Hudi on Microsoft Azure.
+在本页中,我们解释如何在 Microsoft Azure 上使用 Hudi 。
-## Disclaimer
+## 声明
-This page is maintained by the Hudi community.
-If the information is inaccurate or you have additional information to add.
-Please feel free to create a JIRA ticket. Contribution is highly appreciated.
+本页面由 Hudi 社区维护。
+如果信息不准确,或者你有信息要补充,请尽管创建 JIRA ticket。
+对此贡献高度赞赏。
-## Supported Storage System
+## 支持的存储系统
-There are two storage systems support Hudi .
+Hudi 支持两种存储系统。
-- Azure Blob Storage
+- Azure Blob 存储
- Azure Data Lake Gen 2
-## Verified Combination of Spark and storage system
+## 经过验证的 Spark 与存储系统的组合
-#### HDInsight Spark2.4 on Azure Data Lake Storage Gen 2
+#### Azure Data Lake Storage Gen 2 上的 HDInsight Spark 2.4
This combination works out of the box. No extra config needed.
+这种组合开箱即用,不需要额外的配置。
-#### Databricks Spark2.4 on Azure Data Lake Storage Gen 2
-- Import Hudi jar to databricks workspace
+#### Azure Data Lake Storage Gen 2 上的 Databricks Spark 2.4
+- 将 Hudi jar 包导入到 databricks 工作区 。
-- Mount the file system to dbutils.
+- 将文件系统挂载到 dbutils 。
```scala
dbutils.fs.mount(
source = "abfss://[email protected]",
mountPoint = "/mountpoint",
extraConfigs = configs)
```
-- When writing Hudi dataset, use abfss URL
+- 当写入 Hudi 数据集时,使用 abfss URL
```scala
inputDF.write
.format("org.apache.hudi")
@@ -44,7 +45,7 @@ This combination works out of the box. No extra config needed.
.mode(SaveMode.Append)
.save("abfss://<<storage-account>>.dfs.core.windows.net/hudi-tables/customer")
```
-- When reading Hudi dataset, use the mounting point
+- 当读取 Hudi 数据集时,使用挂载点
```scala
spark.read
.format("org.apache.hudi")