leesf commented on a change in pull request #958: [HUDI-273] Translate the Writing Data page into Chinese documentation
URL: https://github.com/apache/incubator-hudi/pull/958#discussion_r335245974
##########
File path: docs/writing_data.cn.md
##########
@@ -182,39 +182,41 @@ Usage: <main class> [options]
Hive username
```
-## Deletes
+## 删除数据
-Hudi supports implementing two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation.
+通过允许用户指定不同的数据记录负载实现,Hudi支持对存储在Hudi数据集中的数据执行两种类型的删除。
- - **Soft Deletes** : With soft deletes, user wants to retain the key but just null out the values for all other fields.
- This can be simply achieved by ensuring the appropriate fields are nullable in the dataset schema and simply upserting the dataset after setting these fields to null.
- - **Hard Deletes** : A stronger form of delete is to physically remove any trace of the record from the dataset. This can be achieved by issuing an upsert with a custom payload implementation
- via either DataSource or DeltaStreamer which always returns Optional.Empty as the combined value. Hudi ships with a built-in `org.apache.hudi.EmptyHoodieRecordPayload` class that does exactly this.
+ - **Soft Deletes(软删除)** :使用软删除时,用户希望保留键,但仅使所有其他字段的值都为空。
+ 通过确保适当的字段在数据集模式中可以为空,并在将这些字段设置为null之后直接对数据集执行upsert,即可轻松实现这一点。
+ - **Hard Deletes(硬删除)** :这种更强形式的删除是从数据集中彻底删除记录在存储上的任何痕迹。
+ 这可以通过DataSource或DeltaStreamer发出一个带有自定义负载实现的upsert来实现,该负载实现总是返回Optional.Empty作为组合值。Hudi附带了一个内置的`org.apache.hudi.EmptyHoodieRecordPayload`类,它实现的正是这一功能。
```
- deleteDF // dataframe containing just records to be deleted
+ deleteDF // 仅包含要删除的记录的数据帧
   .write().format("org.apache.hudi")
-  .option(...) // Add HUDI options like record-key, partition-path and others as needed for your setup
-  // specify record_key, partition_key, precombine_fieldkey & usual params
+  .option(...) // 根据设置需要添加HUDI参数,例如记录键、分区路径和其他参数
+  // 指定record_key,partition_key,precombine_fieldkey和常规参数
   .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
```
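As a side note for reviewers of this hunk: the hard-delete mechanism hinges on the payload's combine step always yielding an empty value, so the merged record simply disappears from the rewritten file group. The following is a minimal Python sketch of that semantics only; the names (`empty_payload_combine`, `merge_file_group`) are hypothetical and this is not Hudi's actual implementation.

```python
from typing import Optional

# Hypothetical sketch (not Hudi's code) of how a payload whose combine step
# always returns "empty" turns an upsert into a hard delete during merge.

def empty_payload_combine(current_value: Optional[dict], new_value: dict) -> Optional[dict]:
    """Mimics EmptyHoodieRecordPayload: the combined value is always empty."""
    return None  # no value survives the merge

def merge_file_group(existing: dict, delete_payloads: dict) -> dict:
    """Rewrite a 'file group' (modeled as key -> value dict), applying delete payloads."""
    merged = dict(existing)
    for key, payload in delete_payloads.items():
        combined = empty_payload_combine(merged.get(key), payload)
        if combined is None:
            merged.pop(key, None)  # hard delete: no trace of the record remains
        else:
            merged[key] = combined
    return merged

existing = {"k1": {"name": "a"}, "k2": {"name": "b"}}
result = merge_file_group(existing, {"k1": {}})  # "deleteDF" carries only keys to delete
```

Here `result` retains only `k2`; `k1` is physically removed, which is why the upsert in the snippet above needs nothing beyond the keys of the records to delete.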
-
-## Storage Management
-
-Hudi also performs several key storage management functions on the data stored in a Hudi dataset. A key aspect of storing data on DFS is managing file sizes and counts
-and reclaiming storage space. For e.g HDFS is infamous for its handling of small files, which exerts memory/RPC pressure on the Name Node and can potentially destabilize
-the entire cluster. In general, query engines provide much better performance on adequately sized columnar files, since they can effectively amortize cost of obtaining
-column statistics etc. Even on some cloud data stores, there is often cost to listing directories with large number of small files.
-
-Here are some ways to efficiently manage the storage of your Hudi datasets.
-
- - The [small file handling feature](configurations.html#compactionSmallFileSize) in Hudi, profiles incoming workload
-   and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
- - Cleaner can be [configured](configurations.html#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
- - User can also tune the size of the [base/parquet file](configurations.html#limitFileSize), [log files](configurations.html#logFileMaxSize) & expected [compression ratio](configurations.html#parquetCompressionRatio),
-   such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
- - Intelligently tuning the [bulk insert parallelism](configurations.html#withBulkInsertParallelism), can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
-   once created cannot be deleted, but simply expanded as explained before.
- - For workloads with heavy updates, the [merge-on-read storage](concepts.html#merge-on-read-storage) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+## 存储管理
+
+Hudi还对存储在Hudi数据集中的数据执行几个关键的存储管理功能。在DFS上存储数据的关键方面是管理文件大小和数量以及回收存储空间。
+例如,HDFS在处理小文件上性能很差,这会对名称节点的内存及RPC施加很大的压力,并可能破坏整个集群的稳定性。
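For readers of this hunk, the small file handling feature being translated can be pictured as a packing step: incoming inserts are routed to existing file groups that are still below the small-file threshold before any new file group is opened. The sketch below is a simplified illustration under assumed names and a simplified packing policy; it is not Hudi's actual algorithm, and the thresholds mirror the `compactionSmallFileSize` / `limitFileSize` configs only loosely.

```python
# Simplified illustration (assumed names/policy, not Hudi's actual algorithm)
# of small file handling: top up existing small file groups toward a target
# size before spilling inserts into new file groups. Sizes are in MB.

SMALL_FILE_LIMIT = 100   # below this, a base file counts as "small"
MAX_FILE_SIZE = 120      # target size for base files

def assign_inserts(file_group_sizes, insert_mb):
    """Return a plan: list of (file_group_index, mb_assigned); -1 means a new file group."""
    plan = []
    remaining = insert_mb
    # First, route inserts into existing small file groups, up to the target size.
    for gid, size in enumerate(file_group_sizes):
        if remaining <= 0:
            break
        if size < SMALL_FILE_LIMIT:
            take = min(MAX_FILE_SIZE - size, remaining)
            plan.append((gid, take))
            remaining -= take
    # Whatever is left spills into new file groups of the target size.
    while remaining > 0:
        take = min(MAX_FILE_SIZE, remaining)
        plan.append((-1, take))
        remaining -= take
    return plan
```

For example, with existing base files of 50 MB and 110 MB and 200 MB of inserts, the 50 MB file is topped up to the target before any new file group is created, which is how the feature avoids accumulating small files.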
Review comment:
名称节点 -> Name Node, could be better to not translate?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services