leesf commented on a change in pull request #958: [HUDI-273] Translate the Writing Data page into Chinese documentation
URL: https://github.com/apache/incubator-hudi/pull/958#discussion_r335244374
 
 

 ##########
 File path: docs/writing_data.cn.md
 ##########
 @@ -1,44 +1,43 @@
 ---
-title: Writing Hudi Datasets
+title: 写入 Hudi 数据集
 keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
 sidebar: mydoc_sidebar
 permalink: writing_data.html
 toc: false
-summary: In this page, we will discuss some available tools for incrementally ingesting & storing data.
+summary: 这一页里,我们将讨论一些可用的工具,这些工具可用于增量摄取和存储数据。
 ---
 
-In this section, we will cover ways to ingest new changes from external sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer) tool, as well as 
-speeding up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer). Such datasets can then be [queried](querying_data.html) using various query engines.
+这一节我们将介绍使用[DeltaStreamer](#deltastreamer)工具从外部源甚至其他Hudi数据集摄取新更改的方法,
+以及使用[Hudi数据源](#datasource-writer)通过upserts加快大型Spark作业的方法。
+对于此类数据集,我们可以使用各种查询引擎[查询](querying_data.html)它们。
 
+## 写入操作
 
-## Write Operations
-
-Before that, it may be helpful to understand the 3 different write operations provided by Hudi datasource or the delta streamer tool and how best to leverage them. These operations
-can be chosen/changed across each commit/deltacommit issued against the dataset.
-
-
- - **UPSERT** : This is the default operation where the input records are first tagged as inserts or updates by looking up the index and 
- the records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. 
- This operation is recommended for use-cases like database change capture where the input almost certainly contains updates.
- - **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts 
- for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the dataset can tolerate duplicates, but just 
- need the transactional writes/incremental pull/storage management capabilities of Hudi.
- - **BULK_INSERT** : Both upsert and insert operations keep input records in memory to speed up storage heuristics computations faster (among other things) and thus can be cumbersome for 
- initial loading/bootstrapping a Hudi dataset at first. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs 
- of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do.
+在此之前,了解Hudi数据源及delta streamer工具提供的三种不同的写入操作以及如何最佳利用它们可能会有所帮助。
+这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。
 
+ - **UPSERT(索引归并)** :这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。
+ 在运行探索法以确定如何最好地将这些记录放到存储上,如优化文件大小之类后,这些记录最终会被写入。
 
 Review comment:
  在运行探索法 -> 在运行启发式方法? (i.e., translate "heuristics" as "启发式方法" rather than "探索法"?)
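For readers following the diff: the line under review describes the index-lookup step that distinguishes UPSERT from INSERT. As a conceptual sketch only (this is hypothetical Python, not Hudi's actual implementation; the function name `tag_records` is invented for illustration), the tagging step the docs describe behaves roughly like:

```python
# Hypothetical sketch, NOT Hudi source code: UPSERT tags incoming records
# as inserts or updates by consulting an index of existing keys, while
# INSERT skips the lookup and treats every record as a new insert.

def tag_records(records, index, operation):
    """Tag each (key, value) record as an 'insert' or 'update'.

    records:   iterable of (key, value) pairs
    index:     set of keys already present in the dataset
    operation: 'upsert' (consult the index) or 'insert' (skip the lookup)
    """
    tagged = []
    for key, value in records:
        if operation == "upsert" and key in index:
            tagged.append(("update", key, value))
        else:
            tagged.append(("insert", key, value))
    return tagged


existing_keys = {"k1"}
batch = [("k1", "v1-new"), ("k2", "v2")]

# With upsert, k1 is recognized as an update; k2 is a fresh insert.
print(tag_records(batch, existing_keys, "upsert"))
# With insert, the lookup is skipped, so both records are written as
# inserts, which is why duplicates become possible (and tolerable).
print(tag_records(batch, existing_keys, "insert"))
```

This also illustrates why the docs call INSERT "a lot faster": skipping the per-record index lookup removes the dominant cost for large batches.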

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services