wusamzong commented on code in PR #244:
URL: https://github.com/apache/yunikorn-site/pull/244#discussion_r1059282163
##########
i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/troubleshooting.md:
##########
@@ -96,97 +90,100 @@ Available logging levels:
 | 4 | Panic |
 | 5 | Fatal |

-## Pods are stuck at Pending state
+## Pods卡在`Pending`状态

-If some pods are stuck at Pending state, that means the scheduler could not find a node to allocate the pod. There are
-several possibilities to cause this:
+如果Pod卡在Pending状态,则意味着调度程序找不到分配Pod的节点。造成这种情况有以下几种可能:

-### 1. Non of the nodes satisfy pod placement requirement
+### 1.没有节点满足pod的放置要求

-A pod can be configured with some placement constraints, such as [node-selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector),
-[affinity/anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity),
-do not have certain toleration for node [taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/), etc.
-To debug such issues, you can describe the pod by:
+可以在Pod中配置一些放置限制,例如[节点选择器(node-selector)](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector)、[亲合/反亲合性(affinity/anti-affinity)](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity)、对节点的[污点(taints)](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)没有一定的容忍度等。若要修正此类问题,你可以通过以下方法观察pod:

 ```shell script
-kubectl describe pod <pod-name> -n <namespace>
+kubectl describe pod <pod名称> -n <命名空间>
 ```

-the pod events will contain the predicate failures and that explains why nodes are not qualified for allocation.
+pod事件中包含预测失败,而这解释了为什么节点不符合分配条件

-### 2. The queue is running out of capacity
+### 2.队列的可用资源不足

-If the queue is running out of capacity, pods will be pending for available queue resources. To check if a queue is still
-having enough capacity for the pending pods, there are several approaches:
+如果队列的可用资源不足,Pod将等待队列资源。检查队列是否还有空间可以给Pending pod的方法有以下几种:

-1) check the queue usage from yunikorn UI
+1 ) 从Yunikorn UI检查队列使用情况

-If you do not know how to access the UI, you can refer the document [here](../get_started/get_started.md#访问-web-ui). Go
-to the `Queues` page, navigate to the queue where this job is submitted to. You will be able to see the available capacity
-left for the queue.
+如果你不知道如何访问UI,可以参考[这里](../get_started/get_started.md#访问-web-ui)的文档。在`Queues`页面中,寻找Pod对应到的队列。你将能够看到队列中剩馀的可用资源。

-2) check the pod events
+2 ) 检查pod事件

-Run the `kubectl describe pod` to get the pod events. If you see some event like:
-`Application <appID> does not fit into <queuePath> queue`. That means the pod could not get allocated because the queue
-is running out of capacity.
+运行`kubectl describe pod`以获取pod事件。如果你看到类似以下的事件:`Application <appID> does not fit into <队列路径> queue`。则代表pod无法分配,因为队列的资源用完了。

-The pod will be allocated if some other pods in this queue is completed or removed. If the pod remains pending even
-the queue has capacity, that may because it is waiting for the cluster to scale up.
+当队列中的其他Pod完成工作或被删除时,代表目前Pending的pod能得到分配,如果Pod依旧在有足够的剩馀资源下,保持pending状态,则可能是因为他正在等待集群扩展。

-## Restart the scheduler
+## 获取完整的状态

-YuniKorn can recover its state upon a restart. YuniKorn scheduler pod is deployed as a deployment, restart the scheduler
-can be done by scale down and up the replica:
+Yunikorn状态储存中,包含了对每个进程中每个对象的状态。透过端点来检索,我们可以有很多有用的信息,举一个故障排除的例子:分区列表、应用程序列表(包括正在运行的、已完成的以及历史应用程序的详细信息)、节点数量、节点利用率、通用集群信息、集群利用率的详细信息、容器历史纪录和队列信息。

-```shell script
-kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
-kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
-```
+状态是Yunikorn提供的用于故障排除的宝贵资源。
+
+有几种方法可以获得完整的状态:
+### 1.调度器URL
+
+步骤:
+*在浏览器中打开Yunikorn UI,且在URL中编辑:
+*将`/#/dashboard`取代为`/ws/v1/fullstatedump`,(例如,`http://localhost:9889/ws/v1/fullstatedump`)
+*按下回车键。

-## Gang Scheduling
+透过这个简单的方法来观看即时且完整的状态。

Review Comment:
The old troubleshooting.md is in English, and most of the lines are messed up after translation.
The translation of "Restart the scheduler" is placed on line 145.
The translation of "the gang scheduling" is placed on line 157.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
