On Sat, Feb 11, 2012 at 2:59 AM, Nikolay Samofatov <nikolay.samofa...@red-soft.biz> wrote:
> Hello, All!
>
> We run Firebird to power larger systems (for 12 government agencies and 3 banks).
>
> We have had pretty spectacular crash on one of our systems today.
>
> It has approximately 100000 end users multiplexed through 2500 (max) pooled connections.
>
> Here is the snapshot of nearly idle system at night:
>
> top - 03:20:39 up 10 days, 8:39, 7 users, load average: 2.08, 1.87, 2.15
> Tasks: 1732 total, 1 running, 1730 sleeping, 1 stopped, 0 zombie
> Cpu(s): 11.9%us, 4.0%sy, 0.0%ni, 83.5%id, 0.0%wa, 0.0%hi, 0.6%si, 0.0%st
> Mem:  529177288k total, 378587600k used, 150589688k free, 761532k buffers
> Swap: 1073741816k total, 130612k used, 1073611204k free, 333281232k cached
>
>   PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM TIME+     COMMAND
> 15840 t-mvv  20  0 33.3g 9.1g  25m S 209.9  1.8 932:23.24 java
> 15931 root   20  0  578m 226m 165m S  75.3  0.0 286:12.21 rdb_inet_server
> 16101 root   20  0  486m 198m 164m S  41.4  0.0  60:34.22 rdb_inet_server
> 15897 root   20  0  956m 509m 166m S  21.5  0.1 126:36.86 rdb_inet_server
> 46960 qemu   20  0 1365m 1.0g 2156 S   5.2  0.2 973:33.28 qemu-kvm
> 61680 qemu   20  0 1366m 1.0g 2536 S   4.6  0.2 934:21.36 qemu-kvm
> 24615 root   20  0  466m 112m  96m S   3.6  0.0   0:08.07 rdb_inet_server
> ...
>
> [root@mvv bin]# ps aux | grep -c rdb_inet_server
> 719
>
> Database is on a small FusionIO drive:
>
> mount:
> /dev/fiob on /mnt/db type ext2 (rw,noatime)
>
> df -h:
> /dev/fiob    587G  423G  135G  76% /mnt/db
>
> ls -l:
> -rw-rw---- 1 root root 453031493632 Feb 11 03:26 ncore-mvv.fdb
>
> We run Firebird as root because we hit 5000 thread count limit in the kernel per user, and we
> haven't figured how to disable it yet.
> pthread_create returns 11 (EAGAIN) in this case.
>
> Lock table is configured to grow in 35 MB increments, and after system warm-up grows to about 150 MB:
>
> LOCK_HEADER BLOCK
>       Version: 145, Active owner: 0, Length: 175000000, Used: 157142248
>       Flags: 0x0001
>       Enqs: 1820518593, Converts: 8135269, Rejects: 2192413, Blocks: 3961057
>       Deadlock scans: 0, Deadlocks: 0, Scan interval: 100
>       Acquires: 1916210916, Acquire blocks: 217985004, Spin count: 0
>       Mutex wait: 11.4%
>       Hash slots: 1009, Hash lengths (min/avg/max): 0/ 4/ 12
>       Remove node: 0, Insert queue: 0, Insert prior: 0
>       Owners (675): forward: 3489832, backward: 4286776
>       Free owners (2): forward: 4610504, backward: 1464048
>       Free locks (646): forward: 22816, backward: 115160
>       Free requests (2333726): forward: 112908000, backward: 143407712
>       Lock Ordering: Enabled
>
> The core dump shows that lock manager table is 2170000000 bytes at crash time, and the crash
> happened due to SLONG offsets wrapping in lock.cpp.
> There were 2500 connections active during crash period. Each lock request consumes 64 bytes of RAM
> thus there should have been ~13000 locks per connection on average.
>
> We had coredumps enabled on the system (but without coredump_filter=0x37 kernel option
> unfortunately :-((( ).
> The system with overflowed lock table continued to serve requests for 3 hours (at night).
> During this time the system produced ~2 TB of core dumps and exhausted space on root partition
> (causing minor data loss).
> Unidentified bug in the wire protocol implementation made one of the clients think that cursor
> returned EOF instead of fetch error (causing massive data loss at consumer side).
>
> We'll workaround this problem via tuning our connections pool. But I think to support large
> installations:
> 1) Engine should not crash when lock memory size limit is reached, and fail gracefully instead
> 2) Various offsets inside lock manager should be made 64-bit?
I will respond to 2) first: changing the offsets to 64-bit doesn't look too hard, just replace SLONG with SINT64 in the lock.cpp functions, but I might be wrong, so I leave it to the core developers to say whether it really is that easy. My guess is that you can send the patches.

>
> Nikolay Samofatov

PS: nice production system :) 512G of RAM and 1T of swap
------------------------------------------------------------------------------
Firebird-Devel mailing list, web interface at https://lists.sourceforge.net/lists/listinfo/firebird-devel