yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328912878
##########
File path: docs/admin_guide.cn.md
##########
@@ -374,71 +365,69 @@ Compaction successfully repaired
```
-## Metrics {#metrics}
+## 指标 {#metrics}
-Once the Hudi Client is configured with the right datasetname and environment for metrics, it produces the following graphite metrics, that aid in debugging hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
- - **Commit Duration** - This is amount of time it took to successfully commit a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial data left over by a failed commit (happens everytime automatically after a failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
-These metrics can then be plotted on a standard tool like grafana. Below is a sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
<figure>
    <img class="docimage" src="/images/hudi_commit_duration.png" alt="hudi_commit_duration.png" style="max-width: 1000px" />
</figure>
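As background for reviewers, the metrics above only appear once the Graphite reporter is switched on. A minimal sketch, assuming the standard `hoodie.metrics.*` configuration keys; the host, port, and prefix values below are placeholders, not part of the doc:

```python
# Sketch only: Graphite reporter settings passed as Hudi write options.
# Key names assumed from the Hudi configuration reference; values are placeholders.
graphite_metrics_opts = {
    "hoodie.metrics.on": "true",
    "hoodie.metrics.reporter.type": "GRAPHITE",
    "hoodie.metrics.graphite.host": "localhost",          # your carbon/Graphite host
    "hoodie.metrics.graphite.port": "4756",               # carbon plaintext port
    "hoodie.metrics.graphite.metric.prefix": "prod.hudi", # dataset/environment prefix
}

# These would be merged into the usual datasource write, e.g.:
#   df.write.format("org.apache.hudi").options(**graphite_metrics_opts)...
```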
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the partition containing this record
+## 故障排除 {#troubleshooting}
-Note that as of now, Hudi assumes the application passes in the same deterministic partitionpath for a given recordKey. i.e the uniqueness of record key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
-#### Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
-Please check if there were any write errors using the admin commands above, during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but handed back to the application to decide what to do with it.
+#### 缺失记录
-#### Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
-First of all, please confirm if you do indeed have duplicates **AFTER** ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+#### 重复
- - If confirmed, please use the metadata fields above, to identify the physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your application is generating different partitionPaths for same recordKey, Please fix your app
- - if duplicates span multiple files within the same partitionpath, please engage with mailing list. This should not happen. You can use the `records deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
-#### Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
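For reviewers sanity-checking the translated triage steps, the logic in the bullets above can be sketched against the `_hoodie_*` metadata columns. A plain-Python stand-in for the Hive/Presto/Spark group-by; the rows here are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows as (record_key, partition_path, file_name) tuples,
# i.e. the _hoodie_* metadata columns described in the doc.
rows = [
    ("key1", "2019/09/26", "file_a.parquet"),
    ("key1", "2019/09/27", "file_b.parquet"),  # same key, different partition
    ("key2", "2019/09/26", "file_a.parquet"),
    ("key2", "2019/09/26", "file_c.parquet"),  # same key & partition, two files
]

def triage_duplicates(rows):
    """Classify each duplicated record key: app bug vs. potential Hudi issue."""
    by_key = defaultdict(list)
    for key, partition, filename in rows:
        by_key[key].append((partition, filename))
    report = {}
    for key, locs in by_key.items():
        if len(locs) < 2:
            continue  # not a duplicate
        partitions = {p for p, _ in locs}
        # Cross-partition duplicates: the app generated different partitionPaths
        # for the same recordKey -- fix the app.
        # Same-partition duplicates: should not happen; use `records deduplicate`
        # and engage the mailing list.
        report[key] = "cross-partition" if len(partitions) > 1 else "same-partition"
    return report

print(triage_duplicates(rows))
# {'key1': 'cross-partition', 'key2': 'same-partition'}
```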
-Typical upsert() DAG looks like below. Note that Hudi client also caches intermediate RDDs to intelligently profile workload and size files and spark parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, nonetheless its just a single sort.
+#### Spark故障 {#spark-ui}
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地分析工作负载和大小文件和并行度。
+另外,由于还显示了探针作业,Spark UI两次显示了sortByKey,但它只是一个排序。
<figure>
    <img class="docimage" src="/images/hudi_upsert_dag.png" alt="hudi_upsert_dag.png" style="max-width: 1000px" />
</figure>
-At a high level, there are two steps
+概括地说,有两个步骤
-**Index Lookup to identify files to be changed**
+**索引查找以标识要更改的文件**
- - Job 1 : Triggers the input data read, converts to HoodieRecord object and then stops at obtaining a spread of input records to target partition paths
- - Job 2 : Load the set of file names which we need check against
- - Job 3 & 4 : Actual lookup after smart sizing of spark join parallelism, by joining RDDs in 1 & 2 above
- - Job 5 : Have a tagged RDD of recordKeys with locations
+ - Job 1 : 触发输入数据读取,转换为HoodieRecord对象,然后停止将获取的输入记录写入到目标分区路径。
Review comment:
“然后停止将获取的输入记录写入到目标分区路径” (roughly: "then stops at writing the obtained input records to the target partition paths")
=>
“然后根据输入记录拿到目标分区路径” (roughly: "then obtains the target partition paths from the input records")