Re: [prometheus-users] Disable remote write retry

Ruben Papovyan Sun, 13 Sep 2020 14:32:10 -0700

@bwplotka,
Thanks for your response.
I see both 400 and 500 errors in the Cortex distributor.
400s are NOT sent again, but 500s are resent, and that is what caused the outage.

These are the two types of errors that I see in the distributor; there are no other error logs in 
the ingesters (only 400 errors there).

```
level=warn ts=2020-09-11T15:14:15.55091129Z caller=logging.go:62 
traceID=1e80d0d72c7dfb18 msg="POST /api/prom/push (500) 11.40001159s 
Response: \"context canceled\\n\" ws: false; Connection: close; 
Content-Encoding: snappy; Content-Length: 74202; Content-Type: 
application/x-protobuf; User-Agent: Prometheus/2.16.0; X-Forwarded-For: 
10.254.178.57; X-Forwarded-Host: cortex.devops.app.umusic.net; 
X-Forwarded-Port: 80; X-Forwarded-Proto: http; 
X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 10.254.178.57; 
X-Request-Id: aa786f8ba1483741acdcbb8503f9fb0d; X-Scheme: http; 
X-Scope-Orgid: eks-11; "
level=warn ts=2020-09-11T15:14:09.942532161Z caller=logging.go:62 
traceID=69a628f39a21de24 msg="POST /api/prom/push (500) 6.100572749s 
Response: \"rpc error: code = DeadlineExceeded desc = context deadline 
exceeded\\n\" ws: false; Connection: close; Content-Encoding: snappy; 
Content-Length: 5908; Content-Type: application/x-protobuf; User-Agent: 
Prometheus/2.13.1; X-Forwarded-For: 10.104.33.77; X-Forwarded-Host: 
cortex.devops.app.umusic.net; X-Forwarded-Port: 80; X-Forwarded-Proto: 
http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 10.104.33.77; 
X-Request-Id: 3859a4b2f0e3b3badc281b95c9d7b852; X-Scheme: http; 
X-Scope-Orgid: eks-13; "
```

In the Prometheus log I see the 400s, so the Cortex gateway is not hiding the real status code.
Prometheus logs:
ts=2020-09-11T15:32:05.667Z caller=dedupe.go:112 component=remote 
level=error remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable 
error" count=361 err="context canceled"
ts=2020-09-11T15:32:05.667Z caller=dedupe.go:112 component=remote 
level=error remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable 
error" count=60 err="context canceled"
ts=2020-09-11T15:32:05.635Z caller=dedupe.go:112 component=remote 
level=error remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Failed to flush 
all samples on shutdown"
ts=2020-09-11T15:32:02.947Z caller=dedupe.go:112 component=remote 
level=error remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable 
error" count=1000 err="server returned HTTP status 400 Bad Request: 
user=aws10-eks: sample timestamp out of order; last timestamp: 
1599838222.874, incoming timestamp: 1599838162.874 for series 
{__name__=\"kube_pod_status_ready\", 
app_kubernetes_io_instance=\"kube-state-metrics\", 
app_kubernetes_io_managed_by=\"H"
ts=2020-09-11T15:32:02.665Z caller=dedupe.go:112 component=remote 
level=error remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="non-recoverable 
error" count=1000 err="server returned HTTP status 400 Bad Request: 
user=aws10-eks: sample timestamp out of order; last timestamp: 
1599838222.874, incoming timestamp: 1599838162.874 for series 
{__name__=\"kube_secret_info\", 
app_kubernetes_io_instance=\"kube-state-metrics\", 
app_kubernetes_io_managed_by=\"Helm\","
........
ts=2020-09-11T15:01:22.707Z caller=dedupe.go:112 component=remote 
level=error remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Remote storage 
resharding" from=3 to=5
level=info ts=2020-09-11T15:00:08.014Z caller=head.go:731 component=tsdb 
msg="WAL checkpoint complete" first=232 last=234 duration=1.254153897s
level=info ts=2020-09-11T15:00:06.759Z caller=head.go:661 component=tsdb 
msg="head GC completed" duration=77.995686ms
level=info ts=2020-09-11T15:00:06.314Z caller=compact.go:496 component=tsdb 
msg="write block" mint=1599825600000 maxt=1599832800000 
ulid=01EHYTWDPEX8SSCGBQT4PVCP95 duration=2.908463458s
ts=2020-09-11T14:36:42.706Z caller=dedupe.go:112 component=remote 
level=info remote_name=435af2 url=
http://cortex.devops.local.int/api/prom/push/aws10-eks msg="Remote storage 
resharding" from=2 to=3

I will be troubleshooting the Cortex installation and configuration.

But I also want to increase the time between resend retries so I don't end up in the same 
situation.

What is the right way to set 30 minutes in the Prometheus config? Is (    min_backoff: 30m ) 
right?
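
For reference, a minimal sketch of where this setting would go, assuming the standard `queue_config` block under `remote_write` in prometheus.yml. The URL is the one from the Prometheus logs above, and the 2h cap is only an illustrative value, not a recommendation.

```yaml
# prometheus.yml (fragment); values are illustrative assumptions
remote_write:
  - url: http://cortex.devops.local.int/api/prom/push/aws10-eks  # URL as seen in the logs above
    queue_config:
      min_backoff: 30m   # initial delay before retrying a failed batch
      max_backoff: 2h    # cap on the retry delay, which doubles after each retry
```

As far as I understand, the delay starts at min_backoff and doubles on every retry until it reaches max_backoff, so with these values retries of a 5xx failure would be spaced 30m, 1h, 2h, 2h and so on. 400 responses are dropped without a retry, as the logs above show.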

I'm open to any recommendations for Cortex (what could be 
misconfigured so that I'm getting the messages above in the distributor?)

Thank you,
Ruben

On Saturday, September 12, 2020 at 11:52:12 PM UTC-7 [email protected] 
wrote:

> Hey, 
>
> Unless there is some bug on the receiving side (maybe your front proxy 
> masking the actual status code) or in Cortex, both Cortex and Thanos Receive, 
> when they reject a write for reasons like this (something that there 
> is no point retrying), return a status code that tells Prometheus to 
> drop those requests and not retry.
>
> Kind Regards,
> Bartek Płotka (@bwplotka)
>
>
> On Sat, 12 Sep 2020 at 22:33, Ruben Papovyan <[email protected]> wrote:
>
>> Hi team,
>> What are the options for disabling remote write retries?
>> Can I use the following config to disable remote write retries?
>> ```
>> remote_write:
>>   - url: http://cortex.local.int
>>     queue_config:
>>       min_backoff: 2h
>>       max_backoff: 2h
>> ```
>> Or, if I need to retry 4 times, can I use this config?
>> ```
>> remote_write:
>>   - url: http://cortex.local.int
>>     queue_config:
>>       min_backoff: 30m
>>       max_backoff: 2h
>> ```
>>
>> What are the recommendations?
>>
>> My guess is that after 2h the WAL will be compacted and the data will not be 
>> resent?
>> The motivation for this is that I had a network outage, after which Cortex would not accept 
>> the metrics (sample timestamp out of order), and it ended up with Prometheus 
>> DDoSing Cortex.
>>
>>
>> Thank you,
>> Ruben
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/ac8567aa-de1d-4e34-8074-8e8a924a9c30n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/prometheus-users/ac8567aa-de1d-4e34-8074-8e8a924a9c30n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

