This is an automated email from the ASF dual-hosted git repository.

astefanutti pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/camel-k.git

commit dbda6a360786f631de85da89800a691e162e529f
Author: Antonin Stefanutti <[email protected]>
AuthorDate: Wed Dec 2 11:07:19 2020 +0100

    chore(doc): Add standard operation procedures to troubleshooting guide
---
 docs/modules/ROOT/nav-end.adoc                     |  1 +
 .../ROOT/pages/troubleshooting/operating.adoc      | 80 ++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/docs/modules/ROOT/nav-end.adoc b/docs/modules/ROOT/nav-end.adoc
index 11d29c4..03d45be 100644
--- a/docs/modules/ROOT/nav-end.adoc
+++ b/docs/modules/ROOT/nav-end.adoc
@@ -1,5 +1,6 @@
 * Troubleshooting
 ** xref:troubleshooting/debugging.adoc[Debugging]
+** xref:troubleshooting/operating.adoc[Operating]
 ** xref:troubleshooting/known-issues.adoc[Known Issues]
 * xref:kamelets/kamelets.adoc[Kamelets]
 * xref:architecture/architecture.adoc[Architecture]
diff --git a/docs/modules/ROOT/pages/troubleshooting/operating.adoc b/docs/modules/ROOT/pages/troubleshooting/operating.adoc
new file mode 100644
index 0000000..e65db0e
--- /dev/null
+++ b/docs/modules/ROOT/pages/troubleshooting/operating.adoc
@@ -0,0 +1,80 @@
+[[operating]]
+= Operating
+
+NOTE: The following guide uses the terminology from the https://sre.google/sre-book/service-level-objectives/[Site Reliability Engineering] book.
+
+The Camel K operator exposes a monitoring endpoint that publishes xref:observability/operator.adoc#metrics[metrics] indicating the _level of service_ provided to its users.
+These metrics are the Service Level Indicators (SLIs) for the Camel K operator.
+
+Service Level Objectives (SLOs) can be defined based on these SLIs.
+The xref:observability/operator.adoc#alerting[default alerts] created for the Camel K operator query the metrics corresponding to these SLIs and match the SLOs for the Camel K operator, so that they fire as soon as the _level of service_ is not met and preemptive measures can be taken before breaching the Service Level Agreement (SLA) for the Camel K operator.
+
+[[operator-sops]]
+== Operator SOPs
+
+The following section lists the Standard Operating Procedures (SOPs) that correspond to the xref:observability/operator.adoc#alerting[default alerts] created for the Camel K operator.
+It assumes the operator has been installed according to the 
xref:observability/operator.adoc#installation[installation] section from the 
operator monitoring documentation.
+
+It documents the recommended troubleshooting actions to be performed when a 
particular alert fires.
+It is meant to be a living document, to be improved iteratively over time, as 
users face problematic situations, and actions to troubleshoot and solve them 
are perfected.
+
+=== CamelKReconciliationDuration
+
+==== Description
+
+This alert has a severity level of "warning".
+It fires when more than 10% of the reconciliation requests take longer than 0.5s.
+
+==== Troubleshooting
+
+* Check the 
`rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m])` SLI, and 
identify the resource kinds for which the duration is longer than 0.5s.
+
+* Improve this SOP if there's anything missing, and contact engineering if 
there are any changes they could make to make this easier in the future.
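As a rough sketch of the first check, the share of reconciliations slower than 0.5s can be derived per resource kind from the histogram's `_bucket` and `_count` series. The helper name, the `PROMETHEUS_URL` variable and the presence of a `kind` label on the bucket series are assumptions, not something this guide states. A result above 0.1 for a given kind would correspond to the 10% threshold of the alert.

```shell
# Hypothetical helper: print a PromQL expression giving, per resource kind,
# the fraction of reconciliations slower than the threshold.
# $1: latency threshold in seconds (must match an existing bucket boundary)
# $2: rate window
slow_reconciliation_query() {
  threshold="${1:-0.5}"
  window="${2:-5m}"
  printf '1 - (sum by (kind) (rate(camel_k_reconciliation_duration_seconds_bucket{le="%s"}[%s])) / sum by (kind) (rate(camel_k_reconciliation_duration_seconds_count[%s])))\n' \
    "$threshold" "$window" "$window"
}

# Possible usage against the Prometheus HTTP API (PROMETHEUS_URL is an assumption):
#   curl -sG "${PROMETHEUS_URL}/api/v1/query" \
#     --data-urlencode "query=$(slow_reconciliation_query)" \
#     | jq -r '.data.result[] | "\(.metric.kind) \(.value[1])"'
```

The expression can also be pasted directly into the Prometheus expression browser instead of going through `curl`.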
+
+=== CamelKReconciliationFailure
+
+==== Description
+
+This alert has a severity level of "warning".
+It fires when some reconciliation requests have failed.
+
+==== Troubleshooting
+
+* Check the `camel_k_reconciliation_duration_seconds_count{result="Errored"}` 
SLI, and identify the `kind` label(s) for which the value is not zero.
+
+* Search the operator logs for errors, e.g.:
++
+[source,sh]
+----
+$ kubectl logs deployment/camel-k-operator --since=1h | jq -R 'fromjson? | select(.level == "error")'
+----
+Check the `error`, `errorVerbose` and `stacktrace` fields.
+
+* Inspect the resources corresponding to the errors, e.g.:
++
+[source,sh]
+----
+$ kubectl logs deployment/camel-k-operator --since=1h | jq -rR 'fromjson? | select(.level == "error") | [{namespace, name, controller}] | unique | .[] | "-n \(.namespace) \(.controller | rtrimstr("-controller"))/\(.name)"' | xargs kubectl describe
+----
+Check the resource events.
+
+* Improve this SOP if there's anything missing, and contact engineering if 
there are any changes they could make to make this easier in the future.
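Because the log filter in this section works on one log entry at a time, a resource that fails repeatedly shows up once per error. A possible variation, assuming the same `namespace`, `name` and `controller` log fields, reads the whole log stream and counts the error entries per resource, most frequent first. `summarize_operator_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: read operator log lines on stdin and print
# "<error count> <controller> <namespace>/<name>", most frequent first.
summarize_operator_errors() {
  jq -rRn '
    # Parse each raw line, silently skipping lines that are not JSON objects.
    [inputs | fromjson? | objects | select(.level == "error")
      | {namespace, name, controller}]
    # Group identical resources together and count the entries of each group.
    | group_by(.)
    | map({count: length} + .[0])
    | sort_by(-.count)
    | .[]
    | "\(.count) \(.controller) \(.namespace)/\(.name)"'
}

# Possible usage:
#   kubectl logs deployment/camel-k-operator --since=1h | summarize_operator_errors
```

The resources at the top of the list are usually the first candidates for `kubectl describe`.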
+
+=== CamelKSuccessBuildDuration2m
+
+==== Description
+
+This alert has a severity level of "warning".
+It fires when more than 1% of the successful builds take longer than 2 minutes.
+
+==== Troubleshooting
+
+* Inspect the successful Builds whose duration is longer than 2 minutes, e.g.:
++
+[source,sh]
+----
+$ kubectl get builds.camel.apache.org -o json | jq -r '.items[] | select(.status.phase == "Succeeded") | select(.status.duration | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss") | mktime > 120) | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' | xargs kubectl describe
+----
+Check the resource events.
+
+* Improve this SOP if there's anything missing, and contact engineering if 
there are any changes they could make to make this easier in the future.
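The duration filter in this section relies on `strptime` and only covers the minutes and seconds forms. An alternative sketch, assuming `.status.duration` is a Go-style duration string such as `2m30.5s` or `1h2m3s`, converts it to seconds with a regular expression. `slow_builds`, `to_seconds` and `num` are hypothetical names. Durations that do not match the pattern (for instance millisecond values) are skipped.

```shell
# Hypothetical helper: read `kubectl get builds.camel.apache.org -o json` on stdin
# and print "<namespace>/<name> <seconds>" for successful builds slower than $1 seconds.
slow_builds() {
  jq -r --argjson threshold "${1:-120}" '
    # Treat a capture group that did not match as zero.
    def num: if . == null or . == "" then 0 else tonumber end;
    # Convert an assumed Go-style duration ("1h2m3.5s") to seconds.
    def to_seconds:
      capture("^((?<h>[0-9]+)h)?((?<m>[0-9]+)m)?((?<s>[0-9.]+)s)?$")
      | (.h | num) * 3600 + (.m | num) * 60 + (.s | num);
    .items[]
    | select(.status.phase == "Succeeded")
    | {ns: .metadata.namespace, name: .metadata.name,
       secs: (.status.duration | to_seconds)}
    | select(.secs > $threshold)
    | "\(.ns)/\(.name) \(.secs)"'
}

# Possible usage:
#   kubectl get builds.camel.apache.org -o json | slow_builds 120
```

Printing the parsed duration next to each build makes it easier to see how far above the 2 minute objective a build landed before describing it.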
