On Sat, Feb 11, 2012 at 2:59 AM, Nikolay Samofatov
<nikolay.samofa...@red-soft.biz> wrote:
> Hello, All!
>
> We run Firebird to power larger systems (for 12 government agencies and 3 
> banks).
>
> We had a pretty spectacular crash on one of our systems today.
>
> It has approximately 100000 end users multiplexed through 2500 (max) pooled 
> connections.
>
> Here is the snapshot of nearly idle system at night:
>
> top - 03:20:39 up 10 days,  8:39,  7 users,  load average: 2.08, 1.87, 2.15
> Tasks: 1732 total,   1 running, 1730 sleeping,   1 stopped,   0 zombie
> Cpu(s): 11.9%us,  4.0%sy,  0.0%ni, 83.5%id,  0.0%wa,  0.0%hi,  0.6%si,  0.0%st
> Mem:  529177288k total, 378587600k used, 150589688k free,   761532k buffers
> Swap: 1073741816k total,   130612k used, 1073611204k free, 333281232k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 15840 t-mvv     20   0 33.3g 9.1g  25m S 209.9  1.8 932:23.24 java
> 15931 root      20   0  578m 226m 165m S 75.3  0.0 286:12.21 rdb_inet_server
> 16101 root      20   0  486m 198m 164m S 41.4  0.0  60:34.22 rdb_inet_server
> 15897 root      20   0  956m 509m 166m S 21.5  0.1 126:36.86 rdb_inet_server
> 46960 qemu      20   0 1365m 1.0g 2156 S  5.2  0.2 973:33.28 qemu-kvm
> 61680 qemu      20   0 1366m 1.0g 2536 S  4.6  0.2 934:21.36 qemu-kvm
> 24615 root      20   0  466m 112m  96m S  3.6  0.0   0:08.07 rdb_inet_server
> ...
>
> [root@mvv bin]# ps aux | grep -c rdb_inet_server
> 719
>
> Database is on a small FusionIO drive:
>
> mount:
> /dev/fiob on /mnt/db type ext2 (rw,noatime)
>
> df -h:
> /dev/fiob             587G  423G  135G  76% /mnt/db
>
> ls -l:
> -rw-rw---- 1 root root 453031493632 Feb 11 03:26 ncore-mvv.fdb
>
> We run Firebird as root because we hit the kernel's 5000-threads-per-user
> limit, and we haven't figured out how to disable it yet.
> pthread_create() returns 11 (EAGAIN) in this case.
>
> The lock table is configured to grow in 35 MB increments; after system
> warm-up it grows to about 150 MB:
>
> LOCK_HEADER BLOCK
>         Version: 145, Active owner:      0, Length: 175000000, Used: 157142248
>         Flags: 0x0001
>         Enqs: 1820518593, Converts: 8135269, Rejects: 2192413, Blocks: 3961057
>         Deadlock scans:      0, Deadlocks:      0, Scan interval: 100
>         Acquires: 1916210916, Acquire blocks: 217985004, Spin count:   0
>         Mutex wait: 11.4%
>         Hash slots: 1009, Hash lengths (min/avg/max):    0/   4/  12
>         Remove node:      0, Insert queue:      0, Insert prior:      0
>         Owners (675):   forward: 3489832, backward: 4286776
>         Free owners (2):        forward: 4610504, backward: 1464048
>         Free locks (646):       forward:  22816, backward: 115160
>         Free requests (2333726):        forward: 112908000, backward: 143407712
>         Lock Ordering: Enabled
>
> The core dump shows that the lock manager table was 2170000000 bytes at
> crash time, and the crash happened due to SLONG offsets wrapping in
> lock.cpp.
> There were 2500 connections active during the crash period. Each lock
> request consumes 64 bytes of RAM, so there should have been ~13000 locks
> per connection on average.
>
> We had coredumps enabled on the system (but unfortunately without the
> coredump_filter=0x37 kernel option :-((( ).
> The system with the overflowed lock table continued to serve requests for
> 3 hours (at night). During this time it produced ~2 TB of core dumps and
> exhausted the space on the root partition (causing minor data loss).
> An unidentified bug in the wire protocol implementation made one of the
> clients think that a cursor returned EOF instead of a fetch error
> (causing massive data loss on the consumer side).
>
> We'll work around this problem by tuning our connection pool. But I think
> that to support large installations:
> 1) The engine should not crash when the lock memory size limit is
> reached; it should fail gracefully instead.
> 2) Various offsets inside the lock manager should be made 64-bit?
I'll respond to 2) first: changing the offsets to 64-bit doesn't look too
hard, just replacing SLONG with SINT64 in the lock.cpp functions. But I
might be wrong, so I'll leave it to the core developers to say whether it
is really that easy. My guess is that you can send patches.
>
> Nikolay Samofatov
PS: nice production system :) 512 GB of RAM and 1 TB of swap

Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
