This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 9cd67ba  [HUDI-273] Translate the Writing Data page into Chinese documentation (#958)
9cd67ba is described below

commit 9cd67ba81875d66c05a630e2948073ed33ec0548
Author: Y Ethan Guo <[email protected]>
AuthorDate: Thu Oct 17 10:15:55 2019 -0700

    [HUDI-273] Translate the Writing Data page into Chinese documentation (#958)
    
    -  Address review comments to improve translation
    - Make the translation of terms consistent with the quickstart page
---
 docs/writing_data.cn.md | 136 ++++++++++++++++++++++++------------------------
 1 file changed, 69 insertions(+), 67 deletions(-)

diff --git a/docs/writing_data.cn.md b/docs/writing_data.cn.md
index c727266..58b6c99 100644
--- a/docs/writing_data.cn.md
+++ b/docs/writing_data.cn.md
@@ -1,44 +1,43 @@
 ---
-title: Writing Hudi Datasets
+title: Writing to Hudi Datasets
 keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
 sidebar: mydoc_sidebar
 permalink: writing_data.html
 toc: false
-summary: In this page, we will discuss some available tools for incrementally ingesting & storing data.
+summary: On this page, we discuss some of the available tools for incrementally ingesting and storing data.
 ---
 
-In this section, we will cover ways to ingest new changes from external sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer) tool, as well as 
-speeding up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer). Such datasets can then be [queried](querying_data.html) using various query engines.
+In this section, we cover how to ingest new changes from external sources, or even other Hudi datasets, using the [DeltaStreamer](#deltastreamer) tool,
+as well as how to speed up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer).
+Such datasets can then be [queried](querying_data.html) using various query engines.
 
+## Write Operations
 
-## Write Operations
-
-Before that, it may be helpful to understand the 3 different write operations provided by Hudi datasource or the delta streamer tool and how best to leverage them. These operations
-can be chosen/changed across each commit/deltacommit issued against the dataset.
-
-
- - **UPSERT** : This is the default operation where the input records are first tagged as inserts or updates by looking up the index and 
- the records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. 
- This operation is recommended for use-cases like database change capture where the input almost certainly contains updates.
- - **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts 
- for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the dataset can tolerate duplicates, but just 
- need the transactional writes/incremental pull/storage management capabilities of Hudi.
- - **BULK_INSERT** : Both upsert and insert operations keep input records in memory to speed up storage heuristics computations faster (among other things) and thus can be cumbersome for 
- initial loading/bootstrapping a Hudi dataset at first. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs 
- of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. 
+Before that, it may be helpful to understand the three different write operations provided by the Hudi datasource and the delta streamer tool, and how best to leverage them.
+These operations can be chosen/changed on each commit/deltacommit issued against the dataset.
 
+ - **UPSERT** : This is the default operation, in which the input records are first tagged as inserts or updates by looking up the index.
+ The records are ultimately written after heuristics are run to determine how best to pack them on storage, optimizing for things such as file sizing.
+ This operation is recommended for use cases like database change capture, where the input almost certainly contains updates.
+ - **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing, but it completely skips the index lookup step.
+ It can therefore be a lot faster than upsert for use cases like log de-duplication (in conjunction with the options for filtering duplicates mentioned below).
+ It is also suitable for use cases where the dataset can tolerate duplicates, but only needs Hudi's transactional write/incremental pull/storage management capabilities.
+ - **BULK_INSERT** : Both the upsert and insert operations keep input records in memory to speed up storage heuristics computations (among other things),
+ so they can be inefficient for the initial load/bootstrap of a Hudi dataset. Bulk insert provides the same semantics as insert,
+ while implementing a sort-based data writing algorithm that scales very well to several hundred TBs of initial load.
+ However, it only makes a best-effort attempt at sizing files, rather than guaranteeing file sizes the way insert and upsert do.
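The tagging step that distinguishes these operations can be sketched as follows. This is a simplified, hypothetical illustration (not Hudi's actual implementation): a plain in-memory dict stands in for Hudi's index, and `write` shows how UPSERT consults the index while INSERT/BULK_INSERT skip the lookup entirely.

```python
# Hypothetical sketch of how the write operations differ in their
# handling of the index lookup step. Not Hudi code.

def write(records, index, operation="UPSERT"):
    """Tag incoming records as inserts or updates, per the chosen operation."""
    if operation == "UPSERT":
        # Look up each key in the index to tag the record as insert or update.
        tagged = [(r, "update" if r["key"] in index else "insert") for r in records]
    elif operation in ("INSERT", "BULK_INSERT"):
        # Skip the index lookup entirely; every record is treated as an insert.
        tagged = [(r, "insert") for r in records]
    else:
        raise ValueError(f"unknown operation: {operation}")
    for record, _ in tagged:
        index[record["key"]] = record  # index maps keys to their latest values
    return tagged

index = {"a": {"key": "a", "val": 1}}
tagged = write([{"key": "a", "val": 2}, {"key": "b", "val": 3}], index, "UPSERT")
```

With UPSERT, the record for key `a` is tagged as an update and `b` as an insert; with INSERT, both would be tagged as inserts, and a duplicate for `a` could result.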
 
 ## DeltaStreamer
 
-The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
+The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides a way to ingest from different sources such as DFS or Kafka, with the following capabilities.
 
- - Exactly once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
- - Support json, avro or a custom record types for the incoming data
- - Manage checkpoints, rollback & recovery 
- - Leverage Avro schemas from DFS or Confluent [schema registry](https://github.com/confluentinc/schema-registry).
- - Support for plugging in transformations
+ - Exactly-once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) from Sqoop, the output of `HiveIncrementalPuller`, or files under a DFS folder
+ - Support for json, avro, or custom record types in the incoming data
+ - Management of checkpoints, rollback & recovery
+ - Leveraging Avro schemas from DFS or the Confluent [schema registry](https://github.com/confluentinc/schema-registry)
+ - Support for plugging in transformations
 
-Command line options describe capabilities in more detail
+The command line options describe these capabilities in more detail:
 
 ```
 [hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
@@ -112,16 +111,18 @@ Usage: <main class> [options]
       allows a SQL query template to be passed as a transformation function)
 ```
 
-The tool takes a hierarchically composed property file and has pluggable interfaces for extracting data, key generation and providing schema. Sample configs for ingesting from kafka and dfs are
-provided under `hudi-utilities/src/test/resources/delta-streamer-config`.
+The tool takes a hierarchically composed property file, and has pluggable interfaces for extracting data, generating keys, and providing the schema.
+Sample configs for ingesting from Kafka and DFS are provided under `hudi-utilities/src/test/resources/delta-streamer-config`.
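To give a feel for the shape of such a property file, here is a hypothetical sketch modeled on the bundled Kafka sample; the key names and all values below are illustrative assumptions, so consult the actual files under `hudi-utilities/src/test/resources/delta-streamer-config` for the authoritative versions.

```
# Hypothetical DeltaStreamer property file, modeled on the bundled samples.
# All key names and values here are illustrative, not copied from the repo.
include=base.properties
# fields used as the record key and partition path
hoodie.datasource.write.recordkey.field=impressionid
hoodie.datasource.write.partitionpath.field=userid
# Kafka source topic, and the schema registry URL holding the Avro schema
hoodie.deltastreamer.source.kafka.topic=impressions
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
```

The `include` line illustrates the hierarchical composition mentioned above: common settings live in a base file, and source-specific files layer on top.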
 
-For e.g: once you have Confluent Kafka, Schema registry up & running, produce some test data using ([impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html) provided by schema-registry repo)
+For example, once you have Confluent Kafka and the Schema Registry up and running, you can produce some test data with the following command
+(using [impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html), provided by the schema-registry repo):
 
 ```
 [confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro topic=impressions key=impressionid
 ```
 
-and then ingest it as follows.
+and then ingest the data with the following command.
 
 ```
 [hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
@@ -133,19 +134,18 @@ and then ingest it as follows.
   --op BULK_INSERT
 ```
 
-In some cases, you may want to migrate your existing dataset into Hudi beforehand. Please refer to [migration guide](migration_guide.html). 
+In some cases, you may want to migrate your existing dataset into Hudi beforehand. Please refer to the [migration guide](migration_guide.html).
 
 ## Datasource Writer
 
-The `hudi-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi dataset.
-Following is how we can upsert a dataframe, while specifying the field names that need to be used
-for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
-
+The `hudi-spark` module offers the DataSource API to write (and also read) any dataframe into a Hudi dataset.
+Following is how to upsert a dataframe, while specifying the field names to be used:
+`recordKey => _row_key`, `partitionPath => partition`, and `precombineKey => timestamp`.
 
 ```
 inputDF.write()
        .format("org.apache.hudi")
-       .options(clientOpts) // any of the Hudi client opts can be passed in as well
+       .options(clientOpts) // any of the Hudi client options can be passed in
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
@@ -154,11 +154,11 @@ inputDF.write()
        .save(basePath);
 ```
 
-## Syncing to Hive
+## Syncing with Hive
 
-Both tools above support syncing of the dataset's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
-In case, its preferable to run this from commandline or in an independent jvm, Hudi provides a `HiveSyncTool`, which can be invoked as below, 
-once you have built the hudi-hive module.
+Both of the tools above support syncing the dataset's latest schema to the Hive metastore, so that queries can pick up new columns and partitions.
+In case it is preferable to run this from the command line or in an independent JVM, Hudi provides a `HiveSyncTool`,
+which can be invoked as below once you have built the hudi-hive module.
 
 ```
 cd hudi-hive
@@ -182,39 +182,41 @@ Usage: <main class> [options]
        Hive username
 ```
 
-## Deletes 
+## Deleting Data
 
-Hudi supports implementing two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation. 
+Hudi supports two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation.
 
- - **Soft Deletes** : With soft deletes, user wants to retain the key but just null out the values for all other fields. 
- This can be simply achieved by ensuring the appropriate fields are nullable in the dataset schema and simply upserting the dataset after setting these fields to null.
- - **Hard Deletes** : A stronger form of delete is to physically remove any trace of the record from the dataset. This can be achieved by issuing an upsert with a custom payload implementation
- via either DataSource or DeltaStreamer which always returns Optional.Empty as the combined value. Hudi ships with a built-in `org.apache.hudi.EmptyHoodieRecordPayload` class that does exactly this.
+ - **Soft Deletes** : With a soft delete, the user wants to retain the key but null out the values of all other fields.
+ This can be achieved simply by ensuring that the appropriate fields are nullable in the dataset schema, and upserting the records after setting those fields to null.
+ - **Hard Deletes** : A stronger form of delete is to physically remove any trace of the record from the dataset.
+ This can be achieved by issuing an upsert, via either DataSource or DeltaStreamer, with a custom payload implementation that always returns Optional.Empty as the combined value.
+ Hudi ships with a built-in `org.apache.hudi.EmptyHoodieRecordPayload` class that does exactly this.
  
 ```
- deleteDF // dataframe containing just records to be deleted
+ deleteDF // 仅包含要删除的记录的数据帧
    .write().format("org.apache.hudi")
-   .option(...) // Add HUDI options like record-key, partition-path and others as needed for your setup
-   // specify record_key, partition_key, precombine_fieldkey & usual params
+   .option(...) // add Hudi options such as record key, partition path, and others, as needed for your setup
+   // specify record_key, partition_key, precombine_fieldkey & the usual params
   .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
  
 ```
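The soft-delete variant described above can be illustrated with a small, hypothetical Python sketch (not Hudi code): keep the key field, null out every other field, then upsert the resulting record back into the dataset.

```python
# Hypothetical sketch of a soft delete: retain the key, null out all other
# fields, then upsert the resulting record (the upsert itself is not shown).

def soft_delete(record, key_fields=("_row_key",)):
    """Return a copy of `record` with every non-key field set to None."""
    return {field: (value if field in key_fields else None)
            for field, value in record.items()}

row = {"_row_key": "k1", "partition": "2019/10/17", "timestamp": 42, "value": "x"}
deleted = soft_delete(row)
```

For this to work in practice, the nulled-out fields must be declared nullable in the dataset schema, as noted above.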
 
-
-## Storage Management
-
-Hudi also performs several key storage management functions on the data stored in a Hudi dataset. A key aspect of storing data on DFS is managing file sizes and counts
-and reclaiming storage space. For e.g HDFS is infamous for its handling of small files, which exerts memory/RPC pressure on the Name Node and can potentially destabilize
-the entire cluster. In general, query engines provide much better performance on adequately sized columnar files, since they can effectively amortize cost of obtaining 
-column statistics etc. Even on some cloud data stores, there is often cost to listing directories with large number of small files.
-
-Here are some ways to efficiently manage the storage of your Hudi datasets.
-
- - The [small file handling feature](configurations.html#compactionSmallFileSize) in Hudi, profiles incoming workload 
-   and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. 
- - Cleaner can be [configured](configurations.html#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
- - User can also tune the size of the [base/parquet file](configurations.html#limitFileSize), [log files](configurations.html#logFileMaxSize) & expected [compression ratio](configurations.html#parquetCompressionRatio), 
-   such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
- - Intelligently tuning the [bulk insert parallelism](configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
-   once created cannot be deleted, but simply expanded as explained before.
- - For workloads with heavy updates, the [merge-on-read storage](concepts.html#merge-on-read-storage) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+## Storage Management
+
+Hudi also performs several key storage management functions on the data stored in a Hudi dataset.
+A key aspect of storing data on DFS is managing file sizes and counts, and reclaiming storage space.
+For example, HDFS is infamous for its handling of small files, which exerts memory/RPC pressure on the NameNode and can potentially destabilize the entire cluster.
+In general, query engines provide much better performance on adequately sized columnar files, since they can effectively amortize the cost of obtaining column statistics and so on.
+Even on some cloud data stores, there is often a cost to listing directories with a large number of small files.
+
+Here are some ways to efficiently manage the storage of your Hudi datasets.
+
+ - The [small file handling feature](configurations.html#compactionSmallFileSize) in Hudi profiles the incoming workload
+ and distributes inserts to existing file groups, instead of creating new file groups, which would lead to small files.
+ - The Cleaner can be [configured](configurations.html#retainCommits) to clean up older file slices more or less aggressively,
+ depending on the maximum time queries need to run and the lookback needed for incremental pull.
+ - The user can also tune the size of the [base/parquet file](configurations.html#limitFileSize), the [log files](configurations.html#logFileMaxSize),
+ and the expected [compression ratio](configurations.html#parquetCompressionRatio),
+ so that a sufficient number of inserts are grouped into the same file group, ultimately producing well-sized base files.
+ - Intelligently tuning the [bulk insert parallelism](configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups.
+ Getting this right is in fact critical, since file groups, once created, cannot be deleted, but only expanded as explained before.
+ - For workloads with heavy updates, [merge-on-read storage](concepts.html#merge-on-read-storage) provides a nice mechanism
+ for ingesting quickly into smaller files and later merging them into larger base files via compaction.
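The small file handling idea in the first bullet can be sketched as follows. This is a hypothetical simplification, not Hudi's actual planner: the greedy fill strategy, the file-group names, and the 128 MB target size are all assumptions made for illustration.

```python
# Hypothetical sketch of small-file handling: route incoming inserts into
# existing file groups that are below a target base-file size, and only
# create a new file group for whatever overflows.

TARGET_FILE_SIZE = 128 * 1024 * 1024  # assumed 128 MB target base file size

def assign_inserts(insert_bytes, file_group_sizes):
    """Greedily fill the smallest file groups up to the target size;
    any overflow goes to a newly created file group."""
    plan = []
    remaining = insert_bytes
    for group_id, size in sorted(file_group_sizes.items(), key=lambda kv: kv[1]):
        if remaining <= 0:
            break
        headroom = TARGET_FILE_SIZE - size
        if headroom > 0:
            take = min(headroom, remaining)
            plan.append((group_id, take))
            remaining -= take
    if remaining > 0:
        plan.append(("new-file-group", remaining))  # create a new group last
    return plan

# 200 MB of inserts against two under-sized file groups of 100 MB and 120 MB.
plan = assign_inserts(200 * 1024 * 1024,
                      {"fg-1": 100 * 1024 * 1024, "fg-2": 120 * 1024 * 1024})
```

Here the existing groups absorb 28 MB and 8 MB respectively, and only the remaining 164 MB opens a new file group, which is the behavior the bullet describes: existing small files grow toward the target size instead of new small files proliferating.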
