leesf commented on a change in pull request #958: [HUDI-273] Translate the Writing Data page into Chinese documentation
URL: https://github.com/apache/incubator-hudi/pull/958#discussion_r335244517
##########
File path: docs/writing_data.cn.md
##########

```diff
@@ -1,44 +1,43 @@
 ---
-title: Writing Hudi Datasets
+title: 写入 Hudi 数据集
 keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
 sidebar: mydoc_sidebar
 permalink: writing_data.html
 toc: false
-summary: In this page, we will discuss some available tools for incrementally ingesting & storing data.
+summary: 这一页里,我们将讨论一些可用的工具,这些工具可用于增量摄取和存储数据。
 ---
-In this section, we will cover ways to ingest new changes from external sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer) tool, as well as
-speeding up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer). Such datasets can then be [queried](querying_data.html) using various query engines.
+这一节我们将介绍使用[DeltaStreamer](#deltastreamer)工具从外部源甚至其他Hudi数据集摄取新更改的方法,
+以及使用[Hudi数据源](#datasource-writer)通过upserts加快大型Spark作业的方法。
+对于此类数据集,我们可以使用各种查询引擎[查询](querying_data.html)它们。

+## 写入操作
-## Write Operations
-
-Before that, it may be helpful to understand the 3 different write operations provided by Hudi datasource or the delta streamer tool and how best to leverage them. These operations
-can be chosen/changed across each commit/deltacommit issued against the dataset.
-
-
- - **UPSERT** : This is the default operation where the input records are first tagged as inserts or updates by looking up the index and
- the records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing.
- This operation is recommended for use-cases like database change capture where the input almost certainly contains updates.
- - **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts
- for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the dataset can tolerate duplicates, but just
- need the transactional writes/incremental pull/storage management capabilities of Hudi.
- - **BULK_INSERT** : Both upsert and insert operations keep input records in memory to speed up storage heuristics computations faster (among other things) and thus can be cumbersome for
- initial loading/bootstrapping a Hudi dataset at first. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs
- of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do.
+在此之前,了解Hudi数据源及delta streamer工具提供的三种不同的写入操作以及如何最佳利用它们可能会有所帮助。
+这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。
+ - **UPSERT(索引归并)** :这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。
+ 在运行探索法以确定如何最好地将这些记录放到存储上,如优化文件大小之类后,这些记录最终会被写入。
+ 对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。
+ - **INSERT(插入)** :就使用探索法确定文件大小而言,此操作与索引归并(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。
```

Review comment:
   探索法 -> 启发式? (i.e., render "heuristics" as 启发式 rather than 探索法?)
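For readers following this thread, the three operations discussed in the hunk above are selected per write through the Hudi datasource. Below is a minimal sketch, not part of this PR, assuming Hudi's Spark datasource write option keys; the DataFrame `inputDF`, the column names, the table name, and the base path are hypothetical placeholders.

```scala
// Minimal sketch: choosing between UPSERT / INSERT / BULK_INSERT when writing
// a DataFrame through the Hudi Spark datasource. Values marked "placeholder"
// are illustrative only, not taken from the PR under review.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeHudi(inputDF: DataFrame): Unit = {
  inputDF.write
    .format("org.apache.hudi")
    // one of "upsert" (the default), "insert", or "bulk_insert"
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "uuid") // placeholder record key column
    .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder pre-combine column
    .option("hoodie.table.name", "hudi_table")                 // placeholder table name
    .mode(SaveMode.Append)
    .save("/tmp/hudi_table")                                   // placeholder base path
}
```

Changing the `hoodie.datasource.write.operation` value between writes is what the quoted docs mean by choosing/changing the operation across each commit/deltacommit issued against the dataset.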
