yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328914068
 
 

 ##########
 File path: docs/admin_guide.cn.md
 ##########
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment for metrics, it produces the following graphite metrics, that aid in debugging hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial data left over by a failed commit (happens everytime automatically after a failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增的文件数、文件版本数、删除(清理)的文件数
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新插入(upsert)的分区数量(对于理解提交持续时间的突然峰值非常有用)
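For context, emitting these graphite metrics is driven by the Hudi writer's metrics options; below is a minimal config sketch. The host and prefix values are placeholders, and the exact property names should be verified against the Hudi version in use:

```properties
# Sketch: enable the graphite metrics reporter on the Hudi write config
hoodie.metrics.on=true
hoodie.metrics.reporter.type=GRAPHITE
# placeholder host/port for your graphite endpoint
hoodie.metrics.graphite.host=graphite.example.com
hoodie.metrics.graphite.port=4756
# placeholder prefix under which the per-dataset metrics appear
hoodie.metrics.graphite.metric.prefix=stats.prod.hudi
```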
 
-These metrics can then be plotted on a standard tool like grafana. Below is a sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 <figure>
    <img class="docimage" src="/images/hudi_commit_duration.png" alt="hudi_commit_duration.png" style="max-width: 1000px" />
 </figure>
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same deterministic partitionpath for a given recordKey. i.e the uniqueness of record key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。首先,每条记录中都添加了以下元数据,以便使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 最后触及该记录的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - 从basePath开始、标识包含此记录的分区的路径
 
-#### Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即仅在每个分区内强制recordKey的唯一性。
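As an illustration of how these fields support triage (a plain-Python sketch over hypothetical query results, not a Hudi API), the four `_hoodie_*` columns are enough to trace a record's provenance, e.g. the latest commit and file that touched a given key:

```python
# Hypothetical rows, shaped like the result of a SQL query over a Hudi
# dataset; the four _hoodie_* columns are the metadata fields above.
rows = [
    {"_hoodie_record_key": "id-1", "_hoodie_commit_time": "20190921183000",
     "_hoodie_file_name": "f1.parquet", "_hoodie_partition_path": "2019/09/21"},
    {"_hoodie_record_key": "id-1", "_hoodie_commit_time": "20190922090000",
     "_hoodie_file_name": "f2.parquet", "_hoodie_partition_path": "2019/09/21"},
]

def provenance(rows, key):
    """Return (latest commit time, file name, partition path) for `key`."""
    hits = [r for r in rows if r["_hoodie_record_key"] == key]
    # commit times are lexicographically ordered timestamps
    latest = max(hits, key=lambda r: r["_hoodie_commit_time"])
    return (latest["_hoodie_commit_time"],
            latest["_hoodie_file_name"],
            latest["_hoodie_partition_path"])
```

In practice the same question would be asked as a `SELECT` through Hive/Presto/Spark rather than in local Python.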
 
-Please check if there were any write errors using the admin commands above, during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but handed back to the application to decide what to do with it.
+#### 缺失记录
 
-#### Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** ensuring the query is accessing the Hudi datasets [properly](sql_queries.html).
+#### 重复
 
- - If confirmed, please use the metadata fields above, to identify the physical files & partition files containing the records.
- - If duplicates span files across partitionpath, then this means your application is generating different partitionPaths for same recordKey, Please fix your app
- - if duplicates span multiple files within the same partitionpath, please engage with mailing list. This should not happen. You can use the `records deduplicate` command to fix your data.
+首先,请在确保查询可以[正确](sql_queries.html)访问Hudi数据集**之后**,再确认是否确实存在重复。
 
-#### Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复跨越不同分区路径下的文件,则意味着您的应用程序为同一recordKey生成了不同的分区路径,请修复您的应用程序。
+ - 如果重复跨越同一分区路径内的多个文件,请联系邮件列表。这本不应该发生。您可以使用`records deduplicate`命令修复数据。
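The triage logic in the bullets above can be sketched in plain Python (a hypothetical helper, not a Hudi utility): duplicates of one key across different partition paths point to an application bug, while duplicates within one partition path across files are the case for `records deduplicate`:

```python
from collections import defaultdict

def triage_duplicates(rows):
    """Classify duplicated record keys using the _hoodie_* metadata fields."""
    locations = defaultdict(set)  # record key -> {(partition_path, file_name)}
    for r in rows:
        locations[r["_hoodie_record_key"]].add(
            (r["_hoodie_partition_path"], r["_hoodie_file_name"]))
    report = {}
    for key, locs in locations.items():
        if len(locs) <= 1:
            continue  # a single location: not duplicated
        partitions = {p for p, _ in locs}
        # cross-partition duplicates: the app generated different
        # partitionPaths for one recordKey; same-partition duplicates:
        # candidate for the `records deduplicate` command
        report[key] = ("fix-application" if len(partitions) > 1
                       else "records-deduplicate")
    return report
```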
 
-Typical upsert() DAG looks like below. Note that Hudi client also caches intermediate RDDs to intelligently profile workload and size files and spark parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, nonetheless its just a single sort.
+#### Spark故障 {#spark-ui}
 
+典型的upsert() DAG如下所示。请注意,Hudi客户端还会缓存中间RDD,以便智能地分析工作负载并调整文件大小和spark并行度。
+另外,由于还显示了探针作业,Spark UI会两次显示sortByKey,但实际上只有一次排序。
 <figure>
    <img class="docimage" src="/images/hudi_upsert_dag.png" alt="hudi_upsert_dag.png" style="max-width: 1000px" />
 </figure>
 
 
-At a high level, there are two steps
+概括地说,有两个步骤
 
-**Index Lookup to identify files to be changed**
+**索引查找以标识要更改的文件**
 
- - Job 1 : Triggers the input data read, converts to HoodieRecord object and then stops at obtaining a spread of input records to target partition paths
- - Job 2 : Load the set of file names which we need check against
- - Job 3 & 4 : Actual lookup after smart sizing of spark join parallelism, by joining RDDs in 1 & 2 above
- - Job 5 : Have a tagged RDD of recordKeys with locations
+ - Job 1 : 触发输入数据读取,转换为HoodieRecord对象,并在获得输入记录到目标分区路径的分布后停止。
+ - Job 2 : 加载我们需要检查的文件名集。
+ - Job 3 & 4 : 通过连接上面1和2中的RDD,在智能调整spark连接并行度之后进行实际查找。
+ - Job 5 : 获得带有位置信息的recordKey标记RDD。
 
-**Performing the actual writing of data**
+**执行数据的实际写入**
 
- - Job 6 : Lazy join of incoming records against recordKey, location to provide a final set of HoodieRecord which now contain the information about which file/partitionpath they are found at (or null if insert). Then also profile the workload again to determine sizing of files
- - Job 7 : Actual writing of data (update + insert + insert turned to updates to maintain file size)
+ - Job 6 : 将传入记录与recordKey(位置)进行惰性连接,以提供最终的HoodieRecord集,其中包含记录所在的文件/分区路径信息(如果是插入,则为null)。然后再次分析工作负载,以确定文件大小。
+ - Job 7 : 实际写入数据(更新 + 插入 + 插入转为更新以保持文件大小)
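The two phases above can be sketched in miniature (plain Python with hypothetical names; real Hudi performs this over Spark RDDs): an index lookup tags each incoming record with its current file location, then the write phase routes tagged records as updates or inserts:

```python
def tag_locations(incoming, index):
    """Index lookup phase (Jobs 1-5, simplified): tag each incoming record
    with its existing (partition_path, file) location, or None if the key
    has never been written."""
    return [(rec, index.get(rec["key"])) for rec in incoming]

def plan_writes(tagged):
    """Write phase (Jobs 6-7, simplified): split tagged records into
    updates (known location) and inserts (no location yet)."""
    updates = [rec for rec, loc in tagged if loc is not None]
    inserts = [rec for rec, loc in tagged if loc is None]
    return updates, inserts
```

The real implementation additionally profiles the workload to size files and may turn some inserts into updates to keep file sizes in range.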
 
-Depending on the exception source (Hudi/Spark), the above knowledge of the DAG can be used to pinpoint the actual issue. The most often encountered failures result from YARN/DFS temporary failures.
-In the future, a more sophisticated debug/management UI would be added to the project, that can help automate some of this debugging.
+根据异常源(Hudi/Spark),DAG的上述知识可用于查明实际问题。最常遇到的故障是由YARN/DFS的临时故障引起的。
+将来会向项目中添加更复杂的调试/管理UI,以帮助自动化其中的部分调试工作。
 
 Review comment:
   “DAG的上述知识” => “上述关于DAG的信息”
