This is an automated email from the ASF dual-hosted git repository.

astefanutti pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/camel-k.git

commit dbda6a360786f631de85da89800a691e162e529f
Author: Antonin Stefanutti <[email protected]>
AuthorDate: Wed Dec 2 11:07:19 2020 +0100

    chore(doc): Add standard operation procedures to troubleshooting guide
---
 docs/modules/ROOT/nav-end.adoc                     |  1 +
 .../ROOT/pages/troubleshooting/operating.adoc      | 80 ++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/docs/modules/ROOT/nav-end.adoc b/docs/modules/ROOT/nav-end.adoc
index 11d29c4..03d45be 100644
--- a/docs/modules/ROOT/nav-end.adoc
+++ b/docs/modules/ROOT/nav-end.adoc
@@ -1,5 +1,6 @@
 * Troubleshooting
 ** xref:troubleshooting/debugging.adoc[Debugging]
+** xref:troubleshooting/operating.adoc[Operating]
 ** xref:troubleshooting/known-issues.adoc[Known Issues]
 * xref:kamelets/kamelets.adoc[Kamelets]
 * xref:architecture/architecture.adoc[Architecture]
diff --git a/docs/modules/ROOT/pages/troubleshooting/operating.adoc b/docs/modules/ROOT/pages/troubleshooting/operating.adoc
new file mode 100644
index 0000000..e65db0e
--- /dev/null
+++ b/docs/modules/ROOT/pages/troubleshooting/operating.adoc
@@ -0,0 +1,80 @@
+[[operating]]
+= Operating
+
+NOTE: The following guide uses the terminology from the https://sre.google/sre-book/service-level-objectives/[Site Reliability Engineering] book.
+
+The Camel K operator exposes a monitoring endpoint, that publishes xref:observability/operator.adoc#metrics[metrics] indicating the _level of service_ provided to its users.
+These metrics materialize the Service Level Indicators (SLIs) for the Camel K operator.
+
+Service Level Objectives (SLOs) can be defined based on these SLIs.
+The xref:observability/operator.adoc#alerting[default alerts] created for the Camel K operator query the SLIs corresponding metrics, and match the SLOs for the Camel K operator, so that they fire up as soon as the _level of service_ is not met, and preemptive measures can be taken before breaching the Service Level Agreement (SLA) for the Camel K operator.
+
+[[operator-sops]]
+== Operator SOPs
+
+The following section lists the Standard Operating Procedures (SOPs), corresponding to the xref:observability/operator.adoc#alerting[default alerts], created for the Camel K operator.
+It assumes the operator has been installed according to the xref:observability/operator.adoc#installation[installation] section from the operator monitoring documentation.
+
+It documents the recommended troubleshooting actions to be performed when a particular alert fires.
+It is meant to be a living document, to be improved iteratively over time, as users face problematic situations, and actions to troubleshoot and solve them are perfected.
+
+=== CamelKReconciliationDuration
+
+==== Description
+
+This alert has severity level of "warning".
+It's firing when more than 10% of the reconciliation requests have their duration above 0.5s.
+
+==== Troubleshooting
+
+* Check the `rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m])` SLI, and identify the resource kinds for which the duration is longer than 0.5s.
+
+* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
+
+=== CamelKReconciliationFailure
+
+==== Description
+
+This alert has severity level of "warning".
+It's firing when some reconciliation requests have failed.
+
+==== Troubleshooting
+
+* Check the `camel_k_reconciliation_duration_seconds_count{result="Errored"}` SLI, and identify the `kind` label(s) for which the value is not zero.
+
+* Search the operator logs for errors, e.g.:
++
+[source,sh]
+----
+$ kubectl logs deployment/camel-k-operator --since=1h | jq -R 'fromjson? | select(.level == "error")'
+----
+Check the `error`, `errorVerbose` and `stacktrace` fields.
+
+* Inspect the resources corresponding to the errors, e.g.:
++
+[source,sh]
+----
+$ kubectl logs deployment/camel-k-operator --since=1h | jq -rR 'fromjson? | select(.level == "error") | [{namespace, name, controller}] | unique | .[] | "-n \(.namespace) \(.controller | rtrimstr("-controller"))/\(.name)"' | xargs kubectl describe
+----
+Check the resource events.
+
+* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
+
+=== CamelKSuccessBuildDuration2m
+
+==== Description
+
+This alert has severity level of "warning".
+It's firing when more than 1% of the successful builds have their duration above 2 min.
+
+==== Troubleshooting
+
+* Inspect the successful Builds whose duration is longer than 2 minutes, e.g.:
++
+[source,sh]
+----
+$ kubectl get builds.camel.apache.org -o json | jq -r '.items[] | select(.status.phase == "Succeeded") | select(.status.duration | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss") | mktime > 120) | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' | xargs kubectl describe
+----
+Check the resource events.
+
+* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
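
As a rough illustration of the 10% threshold in the CamelKReconciliationDuration description above, the following shell sketch computes the share of reconciliation requests slower than 0.5s from two per-second rates. It is not part of the commit. The numbers and the variable names (`fast_rate`, `total_rate`, `slow_pct`) are hypothetical. In practice the first value might come from `rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m])`, and the second, as an assumption, from the rate of the matching `camel_k_reconciliation_duration_seconds_count` metric. The exact expression used by the default alert is not shown in this patch.

```shell
#!/bin/sh
# Hypothetical sample values (requests per second), not real measurements.
fast_rate=4.2    # requests completing within 0.5s (the le="0.5" bucket)
total_rate=5.0   # all reconciliation requests

# Share of requests slower than 0.5s, as a percentage.
slow_pct=$(awk -v fast="$fast_rate" -v total="$total_rate" \
  'BEGIN { printf "%.1f", (1 - fast / total) * 100 }')

echo "slow requests: ${slow_pct}%"

# The alert description uses a 10% threshold.
if awk -v pct="$slow_pct" 'BEGIN { exit !(pct > 10) }'; then
  verdict="above the 10% threshold"
else
  verdict="within the 10% threshold"
fi
echo "$verdict"
```

With these made-up values, 16% of the requests exceed 0.5s, which is the kind of situation the alert is described as firing on.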
