[GitHub] [incubator-hudi] yihua commented on a change in pull request #977: [HUDI-221] Translate concept page

GitBox Tue, 29 Oct 2019 11:54:24 -0700

yihua commented on a change in pull request #977: [HUDI-221] Translate concept page
URL: https://github.com/apache/incubator-hudi/pull/977#discussion_r340260437


 ##########
 File path: docs/concepts.cn.md
 ##########
 @@ -4,168 +4,151 @@ keywords: hudi, design, storage, views, timeline
 sidebar: mydoc_sidebar
 permalink: concepts.html
 toc: false
-summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
+summary: "这里我们将介绍Hudi的一些基本概念和宽泛的技术概述"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over datasets on DFS
+Apache Hudi(发音为“Hudi”)在DFS的数据集上提供以下流原语
 
- * Upsert                     (how do I change the dataset?)
- * Incremental pull           (how do I fetch data that changed?)
+ * 插入更新                     (如何改变数据集?)
+ * 增量拉取           (如何获取变更的数据?)
 
-In this section, we will discuss key concepts & terminologies that are important to understand, to be able to effectively use these primitives.
+在本节中，我们将讨论重要的概念和术语，这些概念和术语有助于理解并有效使用这些原语。
 
-## Timeline
-At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` of time that helps provide instantaneous views of the dataset,
-while also efficiently supporting retrieval of data in the order of arrival. A Hudi instant consists of the following components 
+## 时间轴
+在它的核心，Hudi维护在不同的`即时`时间所有对数据集操作的`时间轴`，从而提供，从不同时间点出发得到不同的视图下的数据集。Hudi瞬间包含以下组件
 
- * `Action type` : Type of action performed on the dataset
- * `Instant time` : Instant time is typically a timestamp (e.g: 20190117010349), which monotonically increases in the order of action's begin time.
- * `state` : current state of the instant
- 
-Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time.
+ * `操作类型` : 对数据集执行的操作类型
+ * `即时时间` : 即时时间通常是一个时间戳(例如：20190117010349)，该时间戳按操作开始时间的顺序单调增加。
+ * `状态` : 即时的状态
 
-Key actions performed include
+Hudi保证在时间轴上执行的操作的原子性和时间轴上即时时间一致性。
 
- * `COMMITS` - A commit denotes an **atomic write** of a batch of records into a dataset.
- * `CLEANS` - Background activity that gets rid of older versions of files in the dataset, that are no longer needed.
- * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of records into a  MergeOnRead storage type of dataset, where some/all of the data could be just written to delta logs.
- * `COMPACTION` - Background activity to reconcile differential data structures within Hudi e.g: moving updates from row based log files to columnar formats. Internally, compaction manifests as a special commit on the timeline
- * `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled back, removing any partial files produced during such a write
- * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will not delete them. It helps restore the dataset to a point on the timeline, in case of disaster/data recovery scenarios.
+执行的关键操作包括
 
-Any given instant can be 
-in one of the following states
+ * `COMMITS` - 一次提交表示将一组记录**原子写入**到数据集中。
+ * `CLEANS` - 删除数据集中不再需要的旧文件版本的后台活动。
+ * `DELTA_COMMIT` - 增量提交是指将一批记录**原子写入**到MergeOnRead存储类型的数据集中，其中一些/所有数据都可以只写到增量日志中。
+ * `COMPACTION` - 协调Hudi中的差异数据结构，例如：将更新从基于行的日志文件移动到列格式的后台活动。在内部，压缩表现为时间轴上的特殊提交。
+ * `ROLLBACK` - 表示提交/增量提交不成功且已回滚，删除在写入过程中产生的所有部分文件。
+ * `SAVEPOINT` - 将某些文件组标记为"已保存"，以便清理程序不会将其删除。在发生灾难/数据恢复的情况下，它有助于将数据集还原到时间轴上的某个点。
 
- * `REQUESTED` - Denotes an action has been scheduled, but has not initiated
- * `INFLIGHT` - Denotes that the action is currently being performed
- * `COMPLETED` - Denotes completion of an action on the timeline
+任何给定的即时都可以处于以下状态之一
+
+ * `REQUESTED` - 表示已调度但尚未启动的操作。
+ * `INFLIGHT` - 表示当前正在执行该操作。
+ * `COMPLETED` - 表示在时间轴上完成了该操作。
 
 <figure>
     <img class="docimage" src="/images/hudi_timeline.png" alt="hudi_timeline.png" />
 </figure>
 
-Example above shows upserts happenings between 10:00 and 10:20 on a Hudi dataset, roughly every 5 mins, leaving commit metadata on the Hudi timeline, along
-with other background cleaning/compactions. One key observation to make is that the commit time indicates the `arrival time` of the data (10:20AM), while the actual data
-organization reflects the actual time or `event time`, the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.
+上面的示例显示了在Hudi数据集上大约10:00到10:20之间发生的更新事件，大约每5分钟一次，将提交元数据以及其他后台清理/压缩保留在Hudi时间轴上。
+观察的关键点是：提交时间指示数据的`到达时间`（上午10:20），而实际数据组织则反映了实际时间或`事件时间`，即数据的目的（从07:00开始的每小时时段）。在权衡数据延迟和完整性时，这是两个关键概念。
 
-When there is late arriving data (data intended for 9:00 arriving >1 hr late at 10:20), we can see the upsert producing new data into even older time buckets/folders.
-With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume
-only the changed files without say scanning all the time buckets > 07:00.
+如果有延迟到达的数据（原定于9:00到达的数据在10:20达到，延迟 >1 小时），我们可以看到upsert将新数据生成到更旧的时间段/文件夹中。
+在时间轴的帮助下，增量查询可以只提取10:00以后成功提交的新数据，并非常高效地只消费更改过的文件，且无需扫描更大的文件范围，例如07:00后到达的所有文件。
 
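 Review comment:
   "07:00后到达的所有文件" ("all files that arrived after 07:00") => "07:00后的所有时间段" ("all time buckets after 07:00"), which matches the original "all the time buckets > 07:00".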
