@bwplotka, thanks for your response. I see both 400 and 500 errors in the Cortex distributor. The 400s are NOT sent again, but the 500s are resent, and that is what caused the outage.
These are the two types of errors that I see in the distributor. There are no error logs in the ingesters (only 400 errors there).
```
level=warn ts=2020-09-11T15:14:15.55091129Z caller=logging.go:62 traceID=1e80d0d72c7dfb18 msg="POST /api/prom/push (500) 11.40001159s Response: \"context canceled\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 74202; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.16.0; X-Forwarded-For: 10.254.178.57; X-Forwarded-Host: cortex.devops.app.umusic.net; X-Forwarded-Port: 80; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 10.254.178.57; X-Request-Id: aa786f8ba1483741acdcbb8503f9fb0d; X-Scheme: http; X-Scope-Orgid: eks-11; "
level=warn ts=2020-09-11T15:14:09.942532161Z caller=logging.go:62 traceID=69a628f39a21de24 msg="POST /api/prom/push (500) 6.100572749s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5908; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.13.1; X-Forwarded-For: 10.104.33.77; X-Forwarded-Host: cortex.devops.app.umusic.net; X-Forwarded-Port: 80; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 10.104.33.77; X-Request-Id: 3859a4b2f0e3b3badc281b95c9d7b852; X-Scheme: http; X-Scope-Orgid: eks-13; "
```
In the Prometheus log I see 400s, so the Cortex gateway is not hiding the real status code.

Prometheus logs:
ts=2020-09-11T15:32:05.667Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=361 err="context canceled"
ts=2020-09-11T15:32:05.667Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=60 err="context canceled"
ts=2020-09-11T15:32:05.635Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Failed to flush all samples on shutdown"
ts=2020-09-11T15:32:02.947Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=1000 err="server returned HTTP status 400 Bad Request: user=aws10-eks: sample timestamp out of order; last timestamp: 1599838222.874, incoming timestamp: 1599838162.874 for series {__name__=\"kube_pod_status_ready\", app_kubernetes_io_instance=\"kube-state-metrics\", app_kubernetes_io_managed_by=\"H"
ts=2020-09-11T15:32:02.665Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable error" count=1000 err="server returned HTTP status 400 Bad Request: user=aws10-eks: sample timestamp out of order; last timestamp: 1599838222.874, incoming timestamp: 1599838162.874 for series {__name__=\"kube_secret_info\", app_kubernetes_io_instance=\"kube-state-metrics\", app_kubernetes_io_managed_by=\"Helm\","
........
ts=2020-09-11T15:01:22.707Z caller=dedupe.go:112 component=remote level=error remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Remote storage resharding" from=3 to=5
level=info ts=2020-09-11T15:00:08.014Z caller=head.go:731 component=tsdb msg="WAL checkpoint complete" first=232 last=234 duration=1.254153897s
level=info ts=2020-09-11T15:00:06.759Z caller=head.go:661 component=tsdb msg="head GC completed" duration=77.995686ms
level=info ts=2020-09-11T15:00:06.314Z caller=compact.go:496 component=tsdb msg="write block" mint=1599825600000 maxt=1599832800000 ulid=01EHYTWDPEX8SSCGBQT4PVCP95 duration=2.908463458s
ts=2020-09-11T14:36:42.706Z caller=dedupe.go:112 component=remote level=info remote_name=435af2 url= http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Remote storage resharding" from=2 to=3

I will be troubleshooting the Cortex installation and configuration, but I also want to increase the delay between resend retries so I don't end up in the same situation. What is the right value for 30 minutes in the Prometheus config? Is (min_backoff: 30m) correct?

I'm open to any recommendations for Cortex (what could be misconfigured so that I'm getting the messages above in the distributor?).

Thank you,
Ruben

On Saturday, September 12, 2020 at 11:52:12 PM UTC-7 [email protected] wrote:

> Hey,
>
> Unless there is some bug on the receiving side (maybe your front proxy
> masking the actual status code) or Cortex - both Cortex and Thanos Receive
> in cases of not accepting write for reasons like this (something that there
> is no point retrying for) returns the status code that tells Prometheus to
> drop those requests and not retry.
>
> Kind Regards,
> Bartek Płotka (@bwplotka)
>
>
> On Sat, 12 Sep 2020 at 22:33, Ruben Papovyan <[email protected]> wrote:
>
>> Hi team,
>> What are the options to disable remote write retry ?
>> Can I use following config to disable remote write retry ?
>> ```
>> remote_write:
>>   - url: http://cortex.local.int
>>     queue_config:
>>       min_backoff: 2h
>>       max_backoff: 2h
>> ```
>> or if I need to retry 4 times can I use config ?
>> ```
>> remote_write:
>>   - url: http://cortex.local.int
>>     queue_config:
>>       min_backoff: 30m
>>       max_backoff: 2h
>> ```
>>
>> What are recommendations ?
>>
>> My guess here is that after 2h the WAL will be compacted and the data
>> will not be resent ?
>> The motivation for this is that I had a network outage, Cortex would not
>> accept metrics (sample timestamp out of order), and it ended up with
>> Prometheus DDoSing Cortex.
>>
>>
>> Thank you,
>> Ruben
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/ac8567aa-de1d-4e34-8074-8e8a924a9c30n%40googlegroups.com
>>
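
A rough way to sanity-check the two min_backoff / max_backoff pairs quoted above. This is only a sketch: it assumes the remote-write queue waits min_backoff before the first retry and doubles the wait after every failed send until it reaches max_backoff, which is my understanding of the backoff and is not confirmed anywhere in this thread. The function name backoff_schedule is made up for illustration and is not part of Prometheus.

```python
from datetime import timedelta


def backoff_schedule(min_backoff, max_backoff, attempts):
    """Return the wait before each of `attempts` consecutive retries.

    Assumption (not confirmed in this thread): the wait starts at
    min_backoff and doubles after every failed attempt, capped at
    max_backoff.
    """
    delays = []
    delay = min_backoff
    for _ in range(attempts):
        delays.append(delay)
        # Double the wait for the next retry, but never exceed the cap.
        delay = min(delay * 2, max_backoff)
    return delays


if __name__ == "__main__":
    # The second quoted config: min_backoff 30m, max_backoff 2h.
    for wait in backoff_schedule(timedelta(minutes=30), timedelta(hours=2), 4):
        print(wait)
```

Under that assumption, 30m / 2h gives waits of 30m, 1h, 2h, 2h for the first four retries, and 2h / 2h gives a flat 2h each time. The sketch does not model any retry limit, so it says nothing about whether retries actually stop after the fourth attempt.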

