Hello, All!
We run Firebird to power large systems (for 12 government agencies and 3
banks).
We had a pretty spectacular crash on one of our systems today.
It has approximately 100000 end users multiplexed through 2500 (max) pooled
connections.
Here is a snapshot of the nearly idle system at night:
top - 03:20:39 up 10 days, 8:39, 7 users, load average: 2.08, 1.87, 2.15
Tasks: 1732 total, 1 running, 1730 sleeping, 1 stopped, 0 zombie
Cpu(s): 11.9%us, 4.0%sy, 0.0%ni, 83.5%id, 0.0%wa, 0.0%hi, 0.6%si, 0.0%st
Mem: 529177288k total, 378587600k used, 150589688k free, 761532k buffers
Swap: 1073741816k total, 130612k used, 1073611204k free, 333281232k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15840 t-mvv 20 0 33.3g 9.1g 25m S 209.9 1.8 932:23.24 java
15931 root 20 0 578m 226m 165m S 75.3 0.0 286:12.21 rdb_inet_server
16101 root 20 0 486m 198m 164m S 41.4 0.0 60:34.22 rdb_inet_server
15897 root 20 0 956m 509m 166m S 21.5 0.1 126:36.86 rdb_inet_server
46960 qemu 20 0 1365m 1.0g 2156 S 5.2 0.2 973:33.28 qemu-kvm
61680 qemu 20 0 1366m 1.0g 2536 S 4.6 0.2 934:21.36 qemu-kvm
24615 root 20 0 466m 112m 96m S 3.6 0.0 0:08.07 rdb_inet_server
...
[root@mvv bin]# ps aux | grep -c rdb_inet_server
719
The database is on a small FusionIO drive:
mount:
/dev/fiob on /mnt/db type ext2 (rw,noatime)
df -h:
/dev/fiob 587G 423G 135G 76% /mnt/db
ls -l:
-rw-rw---- 1 root root 453031493632 Feb 11 03:26 ncore-mvv.fdb
We run Firebird as root because we hit the kernel's 5000-thread-per-user
limit, and we haven't figured out how to lift it yet.
pthread_create() returns 11 (EAGAIN) in this case.
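For illustration only, here is a minimal C++ sketch (not our production code,
and assuming the limit we hit is RLIMIT_NPROC) of the failure mode and of how
the soft limit could be raised from inside the process instead of running as
root:

#include <pthread.h>
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>

static void* worker(void*) { return nullptr; }

int main()
{
    // Try to raise the per-user process/thread limit before spawning workers.
    rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
    {
        rl.rlim_cur = rl.rlim_max;      // bump soft limit up to the hard limit
        if (setrlimit(RLIMIT_NPROC, &rl) != 0)
            perror("setrlimit(RLIMIT_NPROC)");
    }

    pthread_t tid;
    int rc = pthread_create(&tid, nullptr, worker, nullptr);
    if (rc == EAGAIN)   // the error we see once the per-user limit is reached
        fprintf(stderr, "pthread_create: EAGAIN, thread limit reached\n");
    else if (rc == 0)
        pthread_join(tid, nullptr);
    return 0;
}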
The lock table is configured to grow in 35 MB increments, and after system
warm-up it reaches about 150 MB:
LOCK_HEADER BLOCK
Version: 145, Active owner: 0, Length: 175000000, Used: 157142248
Flags: 0x0001
Enqs: 1820518593, Converts: 8135269, Rejects: 2192413, Blocks: 3961057
Deadlock scans: 0, Deadlocks: 0, Scan interval: 100
Acquires: 1916210916, Acquire blocks: 217985004, Spin count: 0
Mutex wait: 11.4%
Hash slots: 1009, Hash lengths (min/avg/max): 0/ 4/ 12
Remove node: 0, Insert queue: 0, Insert prior: 0
Owners (675): forward: 3489832, backward: 4286776
Free owners (2): forward: 4610504, backward: 1464048
Free locks (646): forward: 22816, backward: 115160
Free requests (2333726): forward: 112908000, backward: 143407712
Lock Ordering: Enabled
The core dump shows that the lock manager table was 2170000000 bytes at crash
time, and the crash happened due to SLONG offsets wrapping in lock.cpp.
There were 2500 connections active during the crash period. Each lock request
consumes 64 bytes of RAM, so the table held roughly 34 million lock requests
(2170000000 / 64), i.e. about 13500 per connection on average.
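To illustrate the failure mode (this is not the actual lock.cpp code, just a
sketch assuming positions inside the shared lock table are stored as 32-bit
SLONG byte offsets):

#include <cstdint>
#include <cstdio>

typedef int32_t SLONG;   // Firebird's 32-bit signed type

int main()
{
    const uint64_t table_size  = 2170000000ULL;   // size seen in the core dump
    const uint64_t some_offset = table_size - 64; // block near the table's end

    // Past 2^31 bytes the value no longer fits in SLONG; on common platforms
    // the conversion wraps to a negative number (implementation-defined
    // behaviour before C++20).
    SLONG wrapped = static_cast<SLONG>(some_offset);
    printf("64-bit offset: %llu, as SLONG: %d\n",
           (unsigned long long) some_offset, (int) wrapped);

    // Any relative pointer computed from such a wrapped offset points outside
    // the mapped region, which is consistent with the crash we observed.
    return 0;
}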
We had core dumps enabled on the system (but unfortunately without the
coredump_filter=0x37 kernel option :-((( ).
The system with the overflowed lock table continued to serve requests for 3
hours (at night). During this time it produced ~2 TB of core dumps and
exhausted the space on the root partition (causing minor data loss).
An unidentified bug in the wire protocol implementation made one of the
clients think that a cursor returned EOF instead of a fetch error (causing
massive data loss on the consumer side).
We'll work around this problem by tuning our connection pool. But I think
that to support large installations:
1) The engine should not crash when the lock memory size limit is reached; it
should fail gracefully instead (a rough sketch of such a guard follows below).
2) Should the various offsets inside the lock manager be made 64-bit?
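A rough sketch of the kind of guard I mean in (1), not actual lock.cpp code;
can_extend_lock_table() is just a hypothetical helper name:

#include <cstdint>
#include <limits>

typedef int32_t SLONG;          // current 32-bit offset type
// typedef int64_t SRQ_OFFSET;  // option (2): widen offsets to 64 bits

// Returns false when extending the table would overflow SLONG offsets, so the
// caller can reject the lock request with a proper error instead of wrapping
// and crashing.
bool can_extend_lock_table(uint64_t current_size, uint64_t increment)
{
    const uint64_t new_size = current_size + increment;
    return new_size <= static_cast<uint64_t>(std::numeric_limits<SLONG>::max());
}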
Nikolay Samofatov