Prometheus had been working fine, and then it stopped. I'm not sure what 
triggered this, but when I investigated, I found the following:

**Logs from prometheus-server:**  
```
level=info ts=2020-11-02T06:54:50.694Z caller=main.go:343 msg="Starting 
Prometheus" version="(version=2.20.1, branch=HEAD, 
revision=983ebb4a513302315a8117932ab832815f85e3d2)"
level=info ts=2020-11-02T06:54:50.694Z caller=main.go:344 
build_context="(go=go1.14.6, user=root@7cbd4d1c15e0, 
date=20200805-17:26:58)"
level=info ts=2020-11-02T06:54:50.694Z caller=main.go:345 
host_details="(Linux 5.4.0-1026-azure #26~18.04.1-Ubuntu SMP Thu Sep 10 
16:19:25 UTC 2020 x86_64 core-charts-prometheus-server-66fb9b87fb-vb4nx 
(none))"
level=info ts=2020-11-02T06:54:50.694Z caller=main.go:346 
fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-11-02T06:54:50.694Z caller=main.go:347 
vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-11-02T06:54:50.697Z caller=main.go:684 msg="Starting 
TSDB ..."
level=info ts=2020-11-02T06:54:50.697Z caller=web.go:524 component=web 
msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603451272010 maxt=1603476000000 
ulid=01ENBV1MSN88C71G1D7CHQPXAA
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603476000000 maxt=1603540800000 
ulid=01ENDHZP0PZYS2Q4BA9BT9YQQP
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603540800000 maxt=1603605600000 
ulid=01ENFFS5EXXZ5B88ECS1M5VMRK
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603605600000 maxt=1603670400000 
ulid=01ENHDJR6FCBYWQFSMMWA2FC5X
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603670400000 maxt=1603735200000 
ulid=01ENKBC7Z5T8XP5D25VVYDYNZA
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603735200000 maxt=1603800000000 
ulid=01ENN95VHH6J3AC1Z58YGW82VD
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603800000000 maxt=1603864800000 
ulid=01ENQ6ZBGXAHDA1HMKJQJFWRRA
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603864800000 maxt=1603929600000 
ulid=01ENS4SD4G3YYEP9MCZYS5JCPS
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603929600000 maxt=1603994400000 
ulid=01ENV2S9H6RJY0R9RVRV4WC10E
level=info ts=2020-11-02T06:54:50.700Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603994400000 maxt=1604059200000 
ulid=01ENX0XP6D12NFQX37W8EYAMPQ
level=info ts=2020-11-02T06:54:50.701Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604059200000 maxt=1604124000000 
ulid=01ENYY805WJKABY59AV3YATN1Q
level=info ts=2020-11-02T06:54:50.701Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604124000000 maxt=1604188800000 
ulid=01EP0W1F5AHBM92W6B3771B0FS
level=info ts=2020-11-02T06:54:50.701Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604210400000 maxt=1604217600000 
ulid=01EP1GHTZ4Z6SYQ14H3CPSS0ZP
level=info ts=2020-11-02T06:54:50.701Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604188800000 maxt=1604210400000 
ulid=01EP1GPRE0N3M80EQM68YXD2W2
level=info ts=2020-11-02T06:54:50.701Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604217600000 maxt=1604224800000 
ulid=01EP1QDJ35HGFS6TXG148CE3X9
level=info ts=2020-11-02T06:54:50.702Z caller=main.go:553 msg="Stopping 
scrape discovery manager..."
level=info ts=2020-11-02T06:54:50.702Z caller=main.go:567 msg="Stopping 
notify discovery manager..."
level=info ts=2020-11-02T06:54:50.702Z caller=main.go:549 msg="Scrape 
discovery manager stopped"
level=info ts=2020-11-02T06:54:50.702Z caller=main.go:563 msg="Notify 
discovery manager stopped"
level=info ts=2020-11-02T06:54:50.702Z caller=main.go:589 msg="Stopping 
scrape manager..."
level=info ts=2020-11-02T06:54:50.702Z caller=manager.go:888 
component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-11-02T06:54:50.702Z caller=manager.go:898 
component="rule manager" msg="Rule manager stopped"
level=info ts=2020-11-02T06:54:50.702Z caller=notifier.go:601 
component=notifier msg="Stopping notification manager..."
level=info ts=2020-11-02T06:54:50.703Z caller=main.go:755 msg="Notifier 
manager stopped"
level=info ts=2020-11-02T06:54:50.703Z caller=main.go:583 msg="Scrape 
manager stopped"
level=error ts=2020-11-02T06:54:50.704Z caller=main.go:764 err="opening 
storage failed: lock DB directory: resource temporarily unavailable"
```
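That "lock DB directory: resource temporarily unavailable" error generally 
means some other process is already holding the lock file in the TSDB data 
directory, so I listed the pods to look for a second instance. This is 
roughly what I ran (the label selector is just what our chart sets, so 
treat it as an assumption):
```
kubectl get pods -l app=prometheus,component=server -o wide
```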

**Sure enough, another instance of Prometheus was somehow running, with 
these logs:**
```
level=info ts=2020-11-02T06:54:03.213Z caller=main.go:343 msg="Starting 
Prometheus" version="(version=2.20.1, branch=HEAD, 
revision=983ebb4a513302315a8117932ab832815f85e3d2)"
level=info ts=2020-11-02T06:54:03.213Z caller=main.go:344 
build_context="(go=go1.14.6, user=root@7cbd4d1c15e0, 
date=20200805-17:26:58)"
level=info ts=2020-11-02T06:54:03.213Z caller=main.go:345 
host_details="(Linux 5.4.0-1026-azure #26~18.04.1-Ubuntu SMP Thu Sep 10 
16:19:25 UTC 2020 x86_64 core-charts-prometheus-server-5d855654cd-9btpg 
(none))"
level=info ts=2020-11-02T06:54:03.213Z caller=main.go:346 
fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-11-02T06:54:03.213Z caller=main.go:347 
vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-11-02T06:54:03.218Z caller=main.go:684 msg="Starting 
TSDB ..."
level=info ts=2020-11-02T06:54:03.218Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603451272010 maxt=1603476000000 
ulid=01ENBV1MSN88C71G1D7CHQPXAA
level=info ts=2020-11-02T06:54:03.218Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603476000000 maxt=1603540800000 
ulid=01ENDHZP0PZYS2Q4BA9BT9YQQP
level=info ts=2020-11-02T06:54:03.218Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603540800000 maxt=1603605600000 
ulid=01ENFFS5EXXZ5B88ECS1M5VMRK
level=info ts=2020-11-02T06:54:03.218Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603605600000 maxt=1603670400000 
ulid=01ENHDJR6FCBYWQFSMMWA2FC5X
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603670400000 maxt=1603735200000 
ulid=01ENKBC7Z5T8XP5D25VVYDYNZA
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603735200000 maxt=1603800000000 
ulid=01ENN95VHH6J3AC1Z58YGW82VD
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603800000000 maxt=1603864800000 
ulid=01ENQ6ZBGXAHDA1HMKJQJFWRRA
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603864800000 maxt=1603929600000 
ulid=01ENS4SD4G3YYEP9MCZYS5JCPS
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603929600000 maxt=1603994400000 
ulid=01ENV2S9H6RJY0R9RVRV4WC10E
level=info ts=2020-11-02T06:54:03.219Z caller=web.go:524 component=web 
msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603994400000 maxt=1604059200000 
ulid=01ENX0XP6D12NFQX37W8EYAMPQ
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604059200000 maxt=1604124000000 
ulid=01ENYY805WJKABY59AV3YATN1Q
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604124000000 maxt=1604188800000 
ulid=01EP0W1F5AHBM92W6B3771B0FS
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604210400000 maxt=1604217600000 
ulid=01EP1GHTZ4Z6SYQ14H3CPSS0ZP
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604188800000 maxt=1604210400000 
ulid=01EP1GPRE0N3M80EQM68YXD2W2
level=info ts=2020-11-02T06:54:03.219Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604217600000 maxt=1604224800000 
ulid=01EP1QDJ35HGFS6TXG148CE3X9
level=info ts=2020-11-02T06:54:03.671Z caller=head.go:641 component=tsdb 
msg="Replaying on-disk memory mappable chunks if any"
level=error ts=2020-11-02T06:54:04.959Z caller=head.go:646 component=tsdb 
msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of 
sequence m-mapped chunk for series ref 48449"
level=info ts=2020-11-02T06:54:04.959Z caller=head.go:757 component=tsdb 
msg="Deleting mmapped chunk files"
level=info ts=2020-11-02T06:54:04.959Z caller=head.go:760 component=tsdb 
msg="Deletion of mmap chunk files failed, discarding chunk files 
completely" err="cannot handle error: iterate on on-disk chunks: out of 
sequence m-mapped chunk for series ref 48449"
level=info ts=2020-11-02T06:54:04.959Z caller=head.go:655 component=tsdb 
msg="On-disk memory mappable chunks replay completed" duration=1.288173414s
level=info ts=2020-11-02T06:54:04.959Z caller=head.go:661 component=tsdb 
msg="Replaying WAL, this may take a while"
level=info ts=2020-11-02T06:54:35.906Z caller=head.go:687 component=tsdb 
msg="WAL checkpoint loaded"
level=info ts=2020-11-02T06:54:44.615Z caller=head.go:713 component=tsdb 
msg="WAL segment loaded" segment=166 maxSegment=364
level=info ts=2020-11-02T06:54:51.762Z caller=head.go:713 component=tsdb 
msg="WAL segment loaded" segment=167 maxSegment=364
panic: write header: write /data/chunks_head/000123.tmp: no space left on device

goroutine 122 [running]:
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc00b428a80, 0xc000645d40)
        /app/tsdb/head.go:2013 +0x218
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc00b428a80, 0x175835b59e7, 0xc000645d40, 0x1ea4416)
        /app/tsdb/head.go:1984 +0x39
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc00b428a80, 0x175835b59e7, 0x4040a774dfefe28f, 0x0, 0xc000645d40, 0x1)
        /app/tsdb/head.go:2140 +0x384
github.com/prometheus/prometheus/tsdb.(*Head).processWALSamples(0xc00049d860, 0x175833fd500, 0xc1d0439800, 0xc1d04397a0, 0x0)
        /app/tsdb/head.go:365 +0x2ae
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func5(0xc00049d860, 0xc1d152fc70, 0xc1d152fc80, 0xc1d0439800, 0xc1d04397a0)
        /app/tsdb/head.go:459 +0x48
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
        /app/tsdb/head.go:458 +0x37f
```
**Investigating the two Prometheus instances:**
```
$ kubectl get deploy
NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
core-charts-prometheus-server   0/1     1            0           9d

$ kubectl get rs
NAME                                       DESIRED   CURRENT   READY   AGE
core-charts-prometheus-server-5d855654cd   1         1         0       13h
core-charts-prometheus-server-66fb9b87fb   1         1         0       9d
```
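To double-check that each ReplicaSet really owned one of the running pods, 
I also dumped the pods' owner references (the custom-columns output below 
is just a quick sketch, same label-selector assumption as above):
```
kubectl get pods -l app=prometheus,component=server \
  -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name
```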
I think a rollout somehow left both the old and the new ReplicaSet each 
running a pod, giving two Prometheus instances fighting over the storage 
lock.
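To recover, I deleted both ReplicaSets so the Deployment would recreate a 
single fresh one; the ReplicaSet names are the ones from the `kubectl get 
rs` output above:
```
kubectl delete rs core-charts-prometheus-server-5d855654cd \
                  core-charts-prometheus-server-66fb9b87fb
```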

That worked as far as it went: the Deployment created a new ReplicaSet as 
expected, which solved the "two Prometheus" issue. However, the new pod 
that got created is also failing with:
```
level=info ts=2020-11-02T07:17:40.252Z caller=main.go:343 msg="Starting 
Prometheus" version="(version=2.20.1, branch=HEAD, 
revision=983ebb4a513302315a8117932ab832815f85e3d2)"
level=info ts=2020-11-02T07:17:40.252Z caller=main.go:344 
build_context="(go=go1.14.6, user=root@7cbd4d1c15e0, 
date=20200805-17:26:58)"
level=info ts=2020-11-02T07:17:40.252Z caller=main.go:345 
host_details="(Linux 5.4.0-1026-azure #26~18.04.1-Ubuntu SMP Thu Sep 10 
16:19:25 UTC 2020 x86_64 core-charts-prometheus-server-5d855654cd-dc4k9 
(none))"
level=info ts=2020-11-02T07:17:40.252Z caller=main.go:346 
fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-11-02T07:17:40.252Z caller=main.go:347 
vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-11-02T07:17:40.264Z caller=main.go:684 msg="Starting 
TSDB ..."
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603451272010 maxt=1603476000000 
ulid=01ENBV1MSN88C71G1D7CHQPXAA
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603476000000 maxt=1603540800000 
ulid=01ENDHZP0PZYS2Q4BA9BT9YQQP
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603540800000 maxt=1603605600000 
ulid=01ENFFS5EXXZ5B88ECS1M5VMRK
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603605600000 maxt=1603670400000 
ulid=01ENHDJR6FCBYWQFSMMWA2FC5X
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603670400000 maxt=1603735200000 
ulid=01ENKBC7Z5T8XP5D25VVYDYNZA
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603735200000 maxt=1603800000000 
ulid=01ENN95VHH6J3AC1Z58YGW82VD
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603800000000 maxt=1603864800000 
ulid=01ENQ6ZBGXAHDA1HMKJQJFWRRA
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603864800000 maxt=1603929600000 
ulid=01ENS4SD4G3YYEP9MCZYS5JCPS
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603929600000 maxt=1603994400000 
ulid=01ENV2S9H6RJY0R9RVRV4WC10E
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1603994400000 maxt=1604059200000 
ulid=01ENX0XP6D12NFQX37W8EYAMPQ
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604059200000 maxt=1604124000000 
ulid=01ENYY805WJKABY59AV3YATN1Q
level=info ts=2020-11-02T07:17:40.265Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604124000000 maxt=1604188800000 
ulid=01EP0W1F5AHBM92W6B3771B0FS
level=info ts=2020-11-02T07:17:40.266Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604210400000 maxt=1604217600000 
ulid=01EP1GHTZ4Z6SYQ14H3CPSS0ZP
level=info ts=2020-11-02T07:17:40.266Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604188800000 maxt=1604210400000 
ulid=01EP1GPRE0N3M80EQM68YXD2W2
level=info ts=2020-11-02T07:17:40.266Z caller=repair.go:59 component=tsdb 
msg="Found healthy block" mint=1604217600000 maxt=1604224800000 
ulid=01EP1QDJ35HGFS6TXG148CE3X9
level=info ts=2020-11-02T07:17:40.275Z caller=web.go:524 component=web 
msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-11-02T07:17:40.672Z caller=head.go:641 component=tsdb 
msg="Replaying on-disk memory mappable chunks if any"
level=error ts=2020-11-02T07:17:42.108Z caller=head.go:646 component=tsdb 
msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of 
sequence m-mapped chunk for series ref 48449"
level=info ts=2020-11-02T07:17:42.108Z caller=head.go:757 component=tsdb 
msg="Deleting mmapped chunk files"
level=info ts=2020-11-02T07:17:42.108Z caller=head.go:760 component=tsdb 
msg="Deletion of mmap chunk files failed, discarding chunk files 
completely" err="cannot handle error: iterate on on-disk chunks: out of 
sequence m-mapped chunk for series ref 48449"
level=info ts=2020-11-02T07:17:42.108Z caller=head.go:655 component=tsdb 
msg="On-disk memory mappable chunks replay completed" duration=1.435886582s
level=info ts=2020-11-02T07:17:42.108Z caller=head.go:661 component=tsdb 
msg="Replaying WAL, this may take a while"
level=info ts=2020-11-02T07:18:10.209Z caller=head.go:687 component=tsdb 
msg="WAL checkpoint loaded"
level=info ts=2020-11-02T07:18:18.265Z caller=head.go:713 component=tsdb 
msg="WAL segment loaded" segment=166 maxSegment=371
level=info ts=2020-11-02T07:18:25.553Z caller=head.go:713 component=tsdb 
msg="WAL segment loaded" segment=167 maxSegment=371
panic: write header: write /data/chunks_head/000123.tmp: no space left on device

goroutine 281 [running]:
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc00a766540, 0xc000618840)
        /app/tsdb/head.go:2013 +0x218
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc00a766540, 0x175835b59e7, 0xc000618840, 0xc04f4cddc0)
        /app/tsdb/head.go:1984 +0x39
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc00a766540, 0x175835b59e7, 0x40412aeeeeeef556, 0x0, 0xc000618840, 0x1)
        /app/tsdb/head.go:2140 +0x384
github.com/prometheus/prometheus/tsdb.(*Head).processWALSamples(0xc0001ea1a0, 0x175833fd500, 0xc1c9dff560, 0xc1c9dff500, 0x0)
        /app/tsdb/head.go:365 +0x2ae
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func5(0xc0001ea1a0, 0xc1ca20e660, 0xc1ca20e670, 0xc1c9dff560, 0xc1c9dff500)
        /app/tsdb/head.go:459 +0x48
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
        /app/tsdb/head.go:458 +0x37f
```

I need help with understanding:
- How do I solve this "no space left on device" issue? (My first instinct 
is sketched after this list; is it the right direction?)
- How did the cluster reach this state?
- How can I prevent it from happening in the future?
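
For the disk question, here is roughly what I would try first. The pod 
name is the current one from the logs above, the container and PVC names 
are guesses based on the chart's usual naming, and growing the PVC assumes 
the StorageClass has allowVolumeExpansion enabled:
```
# See what is actually eating the volume (the data dir is /data, per the
# panic message), assuming the container stays up long enough to exec into:
kubectl exec core-charts-prometheus-server-5d855654cd-dc4k9 \
  -c prometheus-server -- df -h /data
kubectl exec core-charts-prometheus-server-5d855654cd-dc4k9 \
  -c prometheus-server -- du -sh /data/wal /data/chunks_head

# If the volume really is full, grow the PVC (name is my guess; requires a
# StorageClass with allowVolumeExpansion: true):
kubectl patch pvc core-charts-prometheus-server \
  -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
```
For prevention, I assume capping disk usage with 
`--storage.tsdb.retention.size` (alongside `--storage.tsdb.retention.time`) 
would keep the TSDB from filling the volume again, but I would appreciate 
confirmation.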
