Hi all.
Today I have started getting errors like below in logs (seems that I have not
changed anything for last week). When it happens the db gets lots of
connections in state active, eats 100% cpu and clients get errors (due to
timeout).
2014-02-12 15:44:24.562
MSK,"rpop","rpopdb_p6",30061,"localhost:58350",52fb5e53.756d,1,"SELECT
waiting",2014-02-12 15:43:15 MSK,143/264877,1002850566,LOG,00000,"process 30061
still waiting for ExclusiveLock on extension of relation 26118 of database
24590 after 1000.082 ms",,,,,"SQL statement ""insert into rpop.rpop_imap_uidls
(folder_id, uidl) values (i_folder_id, i_uidl)""
I have read several topics [1, 2, 3, 4] with similar problems but haven't find
a good solution. Below is some more diagnostics.
I am running PostgreSQL 9.3.2 installed from RPM packages on RHEL 6.4. Host is
running with the following CPU (32 cores) and memory:
root@rpopdb01e ~ # fgrep -m1 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
root@rpopdb01e ~ # free -m
total used free shared buffers cached
Mem: 129028 123558 5469 0 135 119504
-/+ buffers/cache: 3918 125110
Swap: 16378 0 16378
root@rpopdb01e ~ #
PGDATA lives on RAID6 array of 8 ssd-disks with ext4, iostat and atop say the
disks are really free. Right now PGDATA takes only 95G.
The settings changed in postgresql.conf are here [5].
When it happens the last query from here [6] shows that almost all queries are
waiting for ExclusiveLock, but they do a simple insert.
(extend,26647,26825,,,,,,,) | 5459 | ExclusiveLock | 1 |
(extend,26647,26825,,,,,,,) | 8053 | ExclusiveLock | 5459,8053
(extend,26647,26828,,,,,,,) | 5567 | ExclusiveLock | 1 |
(extend,26647,26828,,,,,,,) | 5490 | ExclusiveLock | 5567,5490
(extend,24584,25626,,,,,,,) | 5611 | ExclusiveLock | 1 |
(extend,24584,25626,,,,,,,) | 3963 | ExclusiveLock | 5611,3963
I have several databases running on one host with one postmaster process and
ExclusiveLock is being waited by many oids. I suppose the only common thing for
all of them is that they are bigger than others and they almost do not get
updates and deletes (only inserts and reads). Some more info about one of such
tables is here [7].
I have tried to look at the source code (src/backend/access/heap/hio.c) to
understand when the exclusive lock can be taken, but I could only read comments
:) I have also examined FSM for this tables and their indexes and found that
for most of them there are free pages but there are, for example, such cases:
rpopdb_p0=# select count(*) from pg_freespace('rpop.rpop_uidl') where avail !=
0;
count
--------
115953
(1 row)
rpopdb_p0=# select count(*) from pg_freespace('rpop.pk_rpop_uidl') where avail
!= 0;
count
-------
0
(1 row)
rpopdb_p0=# \dS+ rpop.rpop_uidl
Table "rpop.rpop_uidl"
Column | Type | Modifiers | Storage | Stats target |
Description
--------+------------------------+-----------+----------+--------------+-------------
popid | bigint | not null | plain | |
uidl | character varying(200) | not null | extended | |
Indexes:
"pk_rpop_uidl" PRIMARY KEY, btree (popid, uidl)
Has OIDs: no
rpopdb_p0=#
My questions are:
1. Do we consume 100% cpu (in system) trying to get page from FSM? Or does it
happen during exclusive lock acquiring? How can I dig it?
2. How much space do we extend to the relation when we get exclusive lock on it?
3. Why extended page is not visible for other backends?
4. Is there any possibility of situation where backend A got exclusive lock on
some relation to extend it. Then OS CPU scheduler made a context switch to
backend B while backend B is waiting for exclusive lock on the same relation.
And so on for many backends.
5. (and the main question) what can I do to get rid of such situations? It is a
production cluster and I do not have any ideas what to do with this situation
:( Any help would be really appropriate.
[1]
http://www.postgresql.org/message-id/[email protected]
[2]
http://pgsql.performance.narkive.com/IrkPbl3f/postgresql-9-2-3-performance-problem-caused-exclusive-locks
[3] http://www.postgresql.org/message-id/[email protected]
[4]
http://www.postgresql.org/message-id/cal_0b1sypyeoynkynv95nnv2d+4jxtug3hkkf6fahfw7gvg...@mail.gmail.com
[5] http://pastebin.com/raw.php?i=Bd40Vn6h
[6] http://wiki.postgresql.org/wiki/Lock_dependency_information
[7 http://pastebin.com/raw.php?i=eGrtG524]
--
Vladimir