garyli1019 commented on a change in pull request #977: [HUDI-221] Translate concept page
URL: https://github.com/apache/incubator-hudi/pull/977#discussion_r339904105
 
 

 ##########
 File path: docs/concepts.cn.md
 ##########
 @@ -4,168 +4,151 @@ keywords: hudi, design, storage, views, timeline
 sidebar: mydoc_sidebar
 permalink: concepts.html
 toc: false
-summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
+summary: "这里我们将介绍Hudi的一些基本概念和宽泛的技术概述"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following stream primitives over datasets on DFS
 
- * Upsert                     (how do I change the dataset?)
- * Incremental pull           (how do I fetch data that changed?)
 + * Upsert                     (how do I change the dataset?)
 + * Incremental pull           (how do I fetch data that changed?)
 
-In this section, we will discuss key concepts & terminologies that are important to understand, to be able to effectively use these primitives.
+In this section, we will discuss the key concepts & terminologies that help you understand and effectively use these primitives.
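To make these two primitives concrete, here is a minimal Spark/Scala sketch of the upsert path through the Hudi datasource; the incremental pull is sketched further below in the timeline discussion. All names (`basePath`, columns, table name) are hypothetical, and the option keys follow the 0.5.x-era datasource options, so they may differ in other releases.

```scala
// Minimal sketch: the "upsert" primitive via the Spark datasource.
// basePath, column names and table name are hypothetical placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-concepts-sketch").getOrCreate()
import spark.implicits._

val basePath = "hdfs:///tmp/hudi/trips"                  // hypothetical dataset location
val df = Seq(("uuid-1", "2019-01-17", 1547694229L, 42.0))
  .toDF("uuid", "partition", "ts", "fare")

df.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")          // record key
  .option("hoodie.datasource.write.partitionpath.field", "partition") // partition path
  .option("hoodie.datasource.write.precombine.field", "ts")           // pick latest version on duplicates
  .option("hoodie.datasource.write.operation", "upsert")              // the upsert primitive
  .option("hoodie.table.name", "trips")
  .mode(SaveMode.Append)
  .save(basePath)
```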
 
-## Timeline
-At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` of time that helps provide instantaneous views of the dataset,
-while also efficiently supporting retrieval of data in the order of arrival. A Hudi instant consists of the following components 
+## Timeline
+At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` in time, which helps provide instantaneous views of the dataset while also efficiently supporting retrieval of data in the order of arrival. A Hudi instant consists of the following components
 
- * `Action type` : Type of action performed on the dataset
- * `Instant time` : Instant time is typically a timestamp (e.g: 20190117010349), which monotonically increases in the order of action's begin time.
- * `state` : current state of the instant
- 
-Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time.
+ * `Action type` : Type of action performed on the dataset
+ * `Instant time` : Instant time is typically a timestamp (e.g: 20190117010349), which monotonically increases in the order of the action's begin time.
+ * `state` : current state of the instant
 
-Key actions performed include
+Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time.
 
 - * `COMMITS` - A commit denotes an **atomic write** of a batch of records into a dataset.
 - * `CLEANS` - Background activity that gets rid of older versions of files in the dataset, that are no longer needed.
 - * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of records into a MergeOnRead storage type of dataset, where some/all of the data could be just written to delta logs.
 - * `COMPACTION` - Background activity to reconcile differential data structures within Hudi e.g: moving updates from row based log files to columnar formats. Internally, compaction manifests as a special commit on the timeline
 - * `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled back, removing any partial files produced during such a write
 - * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will not delete them. It helps restore the dataset to a point on the timeline, in case of disaster/data recovery scenarios.
 +Key actions performed include
 
 - * `COMMITS` - A commit denotes an **atomic write** of a batch of records into a dataset.
 - * `CLEANS` - Background activity that deletes older versions of files in the dataset that are no longer needed.
 - * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of records into a MergeOnRead storage type of dataset, where some/all of the data could be written just to delta logs.
 - * `COMPACTION` - Background activity to reconcile differential data structures within Hudi, e.g: moving updates from row based log files to columnar formats. Internally, compaction manifests as a special commit on the timeline.
 - * `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled back, removing any partial files produced during such a write.
 - * `SAVEPOINT` - Marks certain file groups as "saved", such that the cleaner will not delete them. It helps restore the dataset to a point on the timeline in case of disaster/data recovery scenarios.
 
- * `REQUESTED` - Denotes an action has been scheduled, but has not initiated
- * `INFLIGHT` - Denotes that the action is currently being performed
- * `COMPLETED` - Denotes completion of an action on the timeline
+Any given instant can be in one of the following states
+
+ * `REQUESTED` - Denotes an action that has been scheduled but has not yet been initiated.
+ * `INFLIGHT` - Denotes that the action is currently being performed.
+ * `COMPLETED` - Denotes completion of the action on the timeline.
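To see how these instants surface to a consumer, a small sketch (reusing `spark` and `basePath` from the earlier upsert sketch): every record carries the instant time of the commit that produced it in the Hudi-managed `_hoodie_commit_time` metadata column.

```scala
// Minimal sketch: list the commit instant times visible on a Hudi dataset.
// Reuses spark/basePath from the first sketch; the glob depth must match the
// partition layout (one partition level assumed here).
spark.read.format("org.apache.hudi")
  .load(basePath + "/*/*")
  .select("_hoodie_commit_time")
  .distinct()
  .orderBy("_hoodie_commit_time")
  .show(false)
```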
 
 <figure>
    <img class="docimage" src="/images/hudi_timeline.png" alt="hudi_timeline.png" />
 </figure>
 
-Example above shows upserts happenings between 10:00 and 10:20 on a Hudi dataset, roughly every 5 mins, leaving commit metadata on the Hudi timeline, along
-with other background cleaning/compactions. One key observation to make is that the commit time indicates the `arrival time` of the data (10:20AM), while the actual data
-organization reflects the actual time or `event time`, the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.
+The example above shows upserts happening on a Hudi dataset roughly between 10:00 and 10:20, about every 5 mins, leaving commit metadata on the Hudi timeline along with other background cleaning/compactions.
+One key observation is that the commit time indicates the `arrival time` of the data (10:20 AM), while the actual data organization reflects the actual time, or `event time`, the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.
 
-When there is late arriving data (data intended for 9:00 arriving >1 hr late at 10:20), we can see the upsert producing new data into even older time buckets/folders.
-With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume
-only the changed files without say scanning all the time buckets > 07:00.
+When there is late-arriving data (data intended for 9:00 arriving >1 hr late at 10:20), we can see the upsert producing new data into even older time buckets/folders.
+With the help of the timeline, an incremental query attempting to get all new data committed successfully since 10:00 is able to very efficiently consume only the changed files, without scanning all the time buckets > 07:00.
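A minimal sketch of such an incremental pull through the Spark datasource (option keys as documented for Hudi 0.5.x and possibly different in other versions; the begin instant value is hypothetical):

```scala
// Minimal sketch: the "incremental pull" primitive — read only records committed
// after a given instant time, rather than rescanning whole partitions.
// Reuses spark/basePath from the first sketch.
val beginInstant = "20191020100000"   // hypothetical instant, i.e. "since 10:00"

val changed = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.view.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstant)
  .load(basePath)

changed.show(false)                   // only files changed after beginInstant are consumed
```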
 
-## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on DFS. Dataset is broken up into partitions, which are folders containing data files for that partition,
-very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
+## File management
+Hudi organizes a dataset into a directory structure under a `basepath` on DFS. The dataset is broken up into partitions, which are folders containing the data files for that partition, very similar to Hive tables.
+Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
 
-Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
- along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. 
-Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of 
-unused/older file slices to reclaim space on DFS. 
+Within each partition, files are organized into `file groups`, uniquely identified by a `file id`.
+Each file group contains several `file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced.
+Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices, and the cleaning action removes unused/older file slices to reclaim space on DFS.
 
-Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. 
-This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the 
-mapped file group contains all versions of a group of records.
+Hudi provides efficient upserts by consistently mapping a given hoodie key (record key + partition path) to a file group, via an indexing mechanism.
+Once the first version of a record has been written to a file, the mapping between record key and file group/file id never changes. In short, the mapped file group contains all versions of a group of records.
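A rough way to observe this mapping is through the Hudi-managed metadata columns (`_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_file_name`); the key value below is hypothetical and `spark`/`basePath` come from the first sketch.

```scala
// Minimal sketch: show which file (and hence file group) a given record key
// currently maps to; all versions of the record stay in the same file group.
spark.read.format("org.apache.hudi")
  .load(basePath + "/*/*")
  .select("_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name")
  .where("_hoodie_record_key = 'uuid-1'")   // hypothetical key value
  .show(false)
```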
 
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). 
-In turn, `views` define how the underlying data is exposed to the queries (i.e how data is read). 
+## Storage Types & Views
+Hudi storage types define how data is indexed & laid out on the DFS, and how the above primitives and timeline activities are implemented on top of such organization (i.e. how data is written).
+In turn, `views` define how the underlying data is exposed to queries (i.e. how data is read).
 
-| Storage Type  | Supported Views |
+| Storage Type  | Supported Views |
 |-------------- |------------------|
-| Copy On Write | Read Optimized + Incremental   |
-| Merge On Read | Read Optimized + Incremental + Near Real-time |
+| Copy On Write | Read Optimized + Incremental   |
+| Merge On Read | Read Optimized + Incremental + Near Real-time |
+
+### Storage Types
+Hudi supports the following storage types.
 
-### Storage Types
-Hudi supports the following storage types.
+  - [Copy On Write](#copy-on-write-storage) : Stores data exclusively in columnar file formats (e.g. parquet). Updates simply version & rewrite the files by performing a synchronous merge during write.
 
-  - [Copy On Write](#copy-on-write-storage) : Stores data using exclusively columnar file formats (e.g parquet). Updates simply version & rewrite the files by performing a synchronous merge during write.
-  - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g parquet) + row based (e.g avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously.
+  - [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g. parquet) + row based (e.g. avro) file formats. Updates are logged to delta files & later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
     
-Following table summarizes the trade-offs between these two storage types
+The following table summarizes the trade-offs between these two storage types
 
-| Trade-off | CopyOnWrite | MergeOnRead |
+| Trade-off | CopyOnWrite | MergeOnRead |
 |-------------- |------------------| ------------------|
-| Data Latency | Higher   | Lower |
-| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log) |
-| Parquet File Size | Smaller (high update(I/0) cost) | Larger (low update cost) |
-| Write Amplification | Higher | Lower (depending on compaction strategy) |
+| Data Latency | Higher   | Lower |
+| Update cost (I/O) | Higher (rewrites entire parquet file) | Lower (appends to delta log) |
+| Parquet File Size | Smaller (high update (I/O) cost) | Larger (low update cost) |
+| Write Amplification | Higher | Lower (depending on compaction strategy) |
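In practice the storage type is chosen at write time; a minimal sketch, assuming the 0.5.x option key and reusing `df`/`basePath` from the first sketch (apart from this one option, the write path is the same):

```scala
// Minimal sketch: same upsert write as before, but targeting a MERGE_ON_READ
// dataset; "COPY_ON_WRITE" is the default. Option key per Hudi 0.5.x, may vary.
import org.apache.spark.sql.SaveMode

df.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.storage.type", "MERGE_ON_READ")   // or "COPY_ON_WRITE"
  .option("hoodie.table.name", "trips")
  .mode(SaveMode.Append)
  .save(basePath)
```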
 
 
-### Views
-Hudi supports the following views of stored data
+### Views
+Hudi supports the following views of stored data
 
 - - **Read Optimized View** : Queries on this view see the latest snapshot of the dataset as of a given commit or compaction action. 
 -    This view exposes only the base/columnar files in latest file slices to the queries and guarantees the same columnar query performance compared to a non-hudi columnar dataset. 
 - - **Incremental View** : Queries on this view only see new data written to the dataset, since a given commit/compaction. This view effectively provides change streams to enable incremental data pipelines. 
 - - **Realtime View** : Queries on this view see the latest snapshot of dataset as of a given delta commit action. This view provides near-real time datasets (few mins) by merging the base and delta files of the latest file slice on-the-fly.
 + - **Read Optimized View** : Queries on this view see the latest snapshot of the dataset as of a given commit or compaction action.
 +    This view exposes only the base/columnar files of the latest file slices to queries and guarantees the same columnar query performance as a non-Hudi columnar dataset.
 + - **Incremental View** : Queries on this view only see new data written to the dataset since a given commit/compaction. This view effectively provides change streams to enable incremental data pipelines.
 + - **Realtime View** : Queries on this view see the latest snapshot of the dataset as of a given delta commit action. This view provides near-real time datasets (a few minutes) by merging the base and delta files of the latest file slice on-the-fly.
 
-Following table summarizes the trade-offs between the different views.
 
-| Trade-off | ReadOptimized | RealTime |
-|-------------- |------------------| ------------------|
-| Data Latency | Higher   | Lower |
-| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
+The following table summarizes the trade-offs between the different views.
 
+| Trade-off | ReadOptimized | RealTime |
+|-------------- |------------------| ------------------|
+| Data Latency | Higher   | Lower |
+| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
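One common way these views are consumed is through tables synced to the Hive metastore; a minimal sketch (table names are hypothetical, and the suffix used for the realtime table depends on the Hudi version and hive-sync configuration):

```scala
// Minimal sketch: querying different views of a MERGE_ON_READ dataset that has
// been synced to Hive. Table names/suffixes are hypothetical and version-dependent.
spark.sql("SELECT count(*) FROM trips").show()      // read optimized view: base/columnar files only
spark.sql("SELECT count(*) FROM trips_rt").show()   // realtime view: merges base files + delta logs on the fly
```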
 
-## Copy On Write Storage
 
-File slices in Copy-On-Write storage only contain the base/columnar file and each commit produces new versions of base files. 
-In other words, we implicitly compact on every commit, such that only columnar data exists. As a result, the write amplification 
-(number of bytes written for 1 byte of incoming data) is much higher, where read amplification is zero. 
-This is a much desired property for analytical workloads, which is predominantly read-heavy.
+## Copy On Write Storage
 
-Following illustrates how this works conceptually, when data written into copy-on-write storage and two queries running on top of it.
+File slices in copy-on-write storage only contain the base/columnar file, and each commit produces new versions of the base files.
+In other words, we implicitly compact on every commit, such that only columnar data exists. As a result, the write amplification (the number of bytes written for 1 byte of incoming data) is much higher, while read amplification is zero.
+This is a much desired property for analytical workloads, which are predominantly read-heavy.
 
+The following illustrates conceptually how this works when data is written into copy-on-write storage and two queries run on top of it.
 
 <figure>
     <img class="docimage" src="/images/hudi_cow.png" alt="hudi_cow.png" />
 </figure>
 
 
-As data gets written, updates to existing file groups produce a new slice for that file group stamped with the commit instant time, 
-while inserts allocate a new file group and write its first slice for that file group. These file slices and their commit instant times are color coded above.
-SQL queries running against such a dataset (eg: `select count(*)` counting the total records in that partition), first checks the timeline for the latest commit
-and filters all but latest file slices of each file group. As you can see, an old query does not see the current inflight commit's files color coded in pink,
-but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
+As data gets written, updates to existing file groups produce a new slice for that file group stamped with the commit instant time, while inserts allocate a new file group and write its first slice.
+These file slices and their commit instant times are color coded above.
+SQL queries running against such a dataset (e.g: `select count(*)` counting the total records in that partition) first check the timeline for the latest commit and filter out all but the latest file slice of each file group.
+As you can see, an old query does not see the files of the current inflight commit (color coded in pink), but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
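A small sketch of the behaviour described above, reusing `spark`/`basePath` from the first sketch: a snapshot read only sees the latest committed file slice of each file group, so an inflight commit never shows up in the counts.

```scala
// Minimal sketch: snapshot query on a COPY_ON_WRITE dataset — analogous to the
// `select count(*)` above. Only committed data is read; inflight commits are ignored.
val snapshot = spark.read.format("org.apache.hudi").load(basePath + "/*/*")
snapshot.groupBy("_hoodie_commit_time").count().show(false)   // records per committed instant
println(s"total committed records: ${snapshot.count()}")
```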
 
-The intention of copy on write storage, is to fundamentally improve how datasets are managed today through
+The intention of copy-on-write storage is to fundamentally improve how datasets are managed today through
 
-  - First class support for atomically updating data at file-level, instead of rewriting whole tables/partitions
-  - Ability to incremental consume changes, as opposed to wasteful scans or fumbling with heuristics
-  - Tight control file sizes to keep query performance excellent (small files hurt query performance considerably).
+  - First-class support for atomically updating data at the file level, instead of rewriting whole tables/partitions
+  - Ability to incrementally consume changes, as opposed to wasteful scans or fumbling with heuristics
+  - Tight control of file sizes to keep query performance excellent (small files hurt query performance considerably).
 
+## Merge On Read Storage
 
-## Merge On Read Storage
+Merge-on-read storage is a superset of copy-on-write, in the sense that it can still provide a read optimized view of the dataset via the read optimized table.
 
 Review comment:
   Merge-on-read storage is an upgraded version of copy-on-write, in the sense that it can still provide a read optimized view of the dataset via the read optimized table (the functionality of copy-on-write).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
