Hello,
We are seeing remote_write crashing frequently. Any suggestions?
Prometheus version: v2.32.0
The cluster has close to 190 nodes. We are NOT running in "Agent" mode.
remoteWrite:
- basicAuth:
    password:
      key: password
      name: kuberemotewrite
    username:
      key: username
      name: kuberemotewrite
  queueConfig:
    batchSendDeadline: 1s
    capacity: 100000
    maxBackoff: 30s
    maxSamplesPerSend: 5000
    maxShards: 300
    minBackoff: 1s
    minShards: 20
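
As a rough sanity check on these queue settings, here is a small Python sketch of how many samples they allow to sit in the shard queues. It assumes that capacity is counted per shard, which is my reading of the Prometheus documentation and not something verified on this cluster. The helper names are made up for illustration.

```python
# Values copied from the queueConfig above.
queue_config = {
    "capacity": 100_000,
    "maxSamplesPerSend": 5_000,
    "minShards": 20,
    "maxShards": 300,
}


def buffer_bounds(cfg):
    """Return (low, high) queued-sample bounds, ASSUMING capacity is per shard."""
    return cfg["capacity"] * cfg["minShards"], cfg["capacity"] * cfg["maxShards"]


def capacity_ratio(cfg):
    """How many full batches fit into one shard's queue."""
    return cfg["capacity"] / cfg["maxSamplesPerSend"]


low, high = buffer_bounds(queue_config)
print(f"queued samples across shards: {low:,} to {high:,}")
print(f"capacity / maxSamplesPerSend = {capacity_ratio(queue_config):.0f}x")
```

If I remember the remote write tuning guide correctly, it suggests keeping capacity at roughly 3 to 10 times maxSamplesPerSend, so the ratio here may be worth revisiting.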
ts=2022-01-24T17:48:12.343Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Starting WAL watcher" queue=284bcb
ts=2022-01-24T17:48:12.343Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Starting scraped metadata watcher"
ts=2022-01-24T17:48:12.345Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Replaying WAL" queue=284bcb
ts=2022-01-24T17:48:12.575Z caller=main.go:1166 level=info msg="Completed
loading of configuration file"
filename=/etc/prometheus/config_out/prometheus.env.yaml
totalDuration=260.590898ms db_storage=1.7µs remote_storage=8.009608ms
web_handler=900ns query_engine=1.5µs scrape=488.694µs scrape_sd=67.369525ms
notify=27.199µs notify_sd=1.457283ms rules=162.536628ms
ts=2022-01-24T17:48:12.810Z caller=main.go:1166 level=info msg="Completed
loading of configuration file"
filename=/etc/prometheus/config_out/prometheus.env.yaml
totalDuration=235.052893ms db_storage=1.5µs remote_storage=185.498µs
web_handler=600ns query_engine=1µs scrape=171.198µs scrape_sd=68.57841ms
notify=16.5µs notify_sd=1.452883ms rules=141.813767ms
ts=2022-01-24T17:50:22.464Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Done
replaying WAL" duration=2m10.120135725s
ts=2022-01-24T17:52:32.345Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote
storage resharding" from=20 to=30
ts=2022-01-24T17:53:12.346Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote
storage resharding" from=30 to=40
ts=2022-01-24T17:54:02.345Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote
storage resharding" from=40 to=53
ts=2022-01-24T17:54:32.345Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Skipping resharding, last successful send was beyond threshold"
lastSendTimestamp=1643046869 minSendTimestamp=1643046870
ts=2022-01-24T17:54:42.345Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Skipping resharding, last successful send was beyond threshold"
lastSendTimestamp=1643046869 minSendTimestamp=1643046880
ts=2022-01-24T17:54:52.345Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Skipping resharding, last successful send was beyond threshold"
lastSendTimestamp=1643046884 minSendTimestamp=1643046890
ts=2022-01-24T17:55:02.347Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed
to send batch, retrying" err="Post \"https://DUMMY.IO/api/v1/write\":
context canceled"
ts=2022-01-24T17:55:02.347Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=5000 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:55:02.348Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=198 exemplarCount=0 err="context canceled"
ts=2022-01-24T17:55:02.348Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Failed to flush all samples on shutdown" count=167431393
ts=2022-01-24T17:55:02.349Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Currently resharding, skipping."
ts=2022-01-24T17:55:12.345Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote
storage resharding" from=53 to=90
ts=2022-01-24T17:55:52.344Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Skipping resharding, last successful send was beyond threshold"
lastSendTimestamp=1643046945 minSendTimestamp=1643046950
ts=2022-01-24T17:56:12.344Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Skipping resharding, last successful send was beyond threshold"
lastSendTimestamp=1643046966 minSendTimestamp=1643046970
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed
to send batch, retrying" err="Post \"https://DUMMY.IO/api/v1/write\":
context canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=5000 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=4448 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=4862 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=4573 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:56:12.346Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Failed to flush all samples on shutdown" count=1234089315
ts=2022-01-24T17:56:22.344Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Skipping resharding, last successful send was beyond threshold"
lastSendTimestamp=1643046979 minSendTimestamp=1643046980
ts=2022-01-24T17:56:32.344Z caller=dedupe.go:112 component=remote
level=info remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Remote
storage resharding" from=90 to=209
ts=2022-01-24T17:57:32.346Z caller=dedupe.go:112 component=remote
level=warn remote_name=284bcb url=https://DUMMY.IO/api/v1/write msg="Failed
to send batch, retrying" err="Post \"https://DUMMY.IO/api/v1/write\":
context canceled"
ts=2022-01-24T17:57:32.346Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=5000 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:57:32.347Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="non-recoverable error" count=4546 exemplarCount=0 err="context
canceled"
ts=2022-01-24T17:57:32.349Z caller=dedupe.go:112 component=remote
level=error remote_name=284bcb url=https://DUMMY.IO/api/v1/write
msg="Failed to flush all samples on shutdown" count=2884134481
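
To make the timing in the log above easier to see, here is a small Python sketch that measures the gap between each "Remote storage resharding" line and the next "Failed to flush all samples on shutdown" line, and decodes the epoch values from the first "Skipping resharding" warning. The timestamps are copied from the log; the helper names are made up for illustration. My interpretation (an assumption, the log does not say this) is that the old shards are stopped on each reshard and their in-flight sends are cancelled after a one-minute flush deadline, and that the resharding threshold is about twice our batchSendDeadline of 1s.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a Prometheus log timestamp such as 2022-01-24T17:54:02.345Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


# (resharding logged, "Failed to flush all samples on shutdown" logged)
pairs = [
    ("2022-01-24T17:54:02.345Z", "2022-01-24T17:55:02.348Z"),  # from=40 to=53
    ("2022-01-24T17:55:12.345Z", "2022-01-24T17:56:12.346Z"),  # from=53 to=90
    ("2022-01-24T17:56:32.344Z", "2022-01-24T17:57:32.349Z"),  # from=90 to=209
]


def gaps_seconds(events):
    """Seconds elapsed between each reshard and the following flush failure."""
    return [(parse_ts(end) - parse_ts(start)).total_seconds() for start, end in events]


def epoch_to_utc(seconds: int) -> str:
    """Render a Unix timestamp (seconds) as a UTC string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


for gap in gaps_seconds(pairs):
    print(f"reshard -> flush failure: {gap:.3f}s")

# First "Skipping resharding" warning, logged at 17:54:32.345Z.
print("lastSendTimestamp:", epoch_to_utc(1643046869))
print("minSendTimestamp: ", epoch_to_utc(1643046870))
```

Each gap comes out at almost exactly 60 seconds, and minSendTimestamp trails the log time by about 2 seconds, so any send that takes longer than that would cause resharding to be skipped.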
On Sun, 23 Jan 2022 at 11:53, Ramachandra Bhaskar Ayalavarapu <
[email protected]> wrote:
> Please advise on what I can do to investigate and prevent remote writes
> from stopping.
>
> On Fri, 21 Jan 2022 at 22:59, Ramachandra Bhaskar Ayalavarapu <
> [email protected]> wrote:
>
>> We are not running Prometheus as an agent.
>>
>> On Fri, 21 Jan 2022 at 17:49, Julien Pivotto <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> Are you using the "Prometheus agent" feature? What are the command line
>>> flags for Prometheus?
>>>
>>> Regards,
>>>
>>> On 21 Jan 17:44, Ramachandra Bhaskar Ayalavarapu wrote:
>>> > Hello
>>> >
>>> > Some of our Prometheus instances running inside Kubernetes frequently
>>> > stop sending metrics to the remote_write destination. The remote_write
>>> > destination is healthy and has no problems with other clusters. The
>>> > container is stable and not crashing, and we are not seeing any errors
>>> > either. Please advise.
>>> >
>>> > Prometheus version: v2.32.0
>>> >
>>> > RAM given to the container: 12-48 GB; CPU: 2-12 cores. The Kubernetes
>>> > cluster has 130+ nodes.
>>> >
>>> > remoteWrite:
>>> >   - url: "https://<redacted>/api/v1/write"
>>> >     queueConfig:
>>> >       capacity: 10000
>>> >       batchSendDeadline: 5s
>>> >       minShards: 1
>>> >       maxSamplesPerSend: 1000
>>> >       maxShards: 10
>>> >     basicAuth:
>>> >       username:
>>> >         name: kuberemotewrite
>>> >         key: username
>>> >       password:
>>> >         name: kuberemotewrite
>>> >         key: password
>>> > Please advise on how to debug further.
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google
>>> Groups "Prometheus Users" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> > To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/prometheus-users/CAHcJzTnQj9UvrG9JTEQJVpDsNRBYiiRtqn5etUdA7DOrE18uvA%40mail.gmail.com
>>> .
>>>
>>> --
>>> Julien Pivotto
>>> @roidelapluie
>>>
>>