Greetings,
We have a three-node Riak cluster set up in a pre-production environment
with LevelDB configured as the backend. The systems are beefy: dual
6-core, 96 GB RAM, running all SSDs. Preliminary testing showed long
peak latencies (~10-30 seconds and increasing) in
node_get_fsm_time_100. We raised our initial concerns at the Riak
workgroup in San Francisco last week.
After the workgroup, we made the following changes to our configuration:
1. Tuned /etc/security/limits.conf to add:
riak soft nofile 2048
riak hard nofile 10240
2. Added noatime to riak filesystem mount (running on 6-device RAID
6/RAID 10 Intel 710 200 GB SSD)
/dev/mapper/vg_raid10-lv_riak on /var/lib/riak type ext4 (rw,noatime)
3. Edited eleveldb config to set write_buffer_size and cache_size:
%% eLevelDB Config
{eleveldb, [
{data_root, "/var/lib/riak/leveldb"},
{write_buffer_size, 16777216},
{cache_size, 1073741824}
]},
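For reference, here is a rough sketch of what these settings could add
up to per node. It assumes that cache_size and write_buffer_size apply
per vnode (my understanding of eleveldb, not confirmed here) and that
the ring uses Riak's default size of 64, which is also an assumption.
The helper name leveldb_memory_estimate is made up for illustration.

```python
import math

def leveldb_memory_estimate(ring_size, node_count, cache_size, write_buffer_size):
    """Estimate vnodes per node and the bytes used by cache + write buffers.

    Assumes both settings are applied once per vnode (an assumption,
    not something verified against this cluster).
    """
    # Partitions are spread across nodes; round up for the busiest node.
    vnodes_per_node = math.ceil(ring_size / node_count)
    per_vnode_bytes = cache_size + write_buffer_size
    return vnodes_per_node, vnodes_per_node * per_vnode_bytes

vnodes, total_bytes = leveldb_memory_estimate(
    ring_size=64,                  # assumption: default ring size
    node_count=3,                  # three-node cluster from this post
    cache_size=1073741824,         # from the eleveldb config above
    write_buffer_size=16777216,    # from the eleveldb config above
)
print(f"{vnodes} vnodes/node, ~{total_bytes / 2**30:.1f} GiB for cache + write buffers")
```

If those assumptions hold, the estimate is a sanity check against the
96 GB of RAM per node rather than an exact figure.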
At first blush, this tuning seemed to correct the problem. Basho Bench
testing failed to uncover any latency problems, and get_fsm_time
returned to near zero. However, over the weekend and into this week the
peak delays started to creep back up linearly. See graphs from Ganglia:
http://www.flickr.com/photos/dmourati/sets/72157629758658870/
Average get times remain constant. Put times do not show similar delay.
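To compare the average against the peak outside of Ganglia, something
like the sketch below could poll a node directly. It assumes the HTTP
/stats endpoint is enabled on the default port 8098 and that the
reported times are in microseconds. The function names and the URL are
placeholders for illustration, not our actual setup.

```python
import json
from urllib.request import urlopen

def get_latency_summary(stats):
    """Convert mean and 100th-percentile GET FSM times to seconds.

    `stats` is the decoded JSON from /stats; values are assumed to be
    reported in microseconds.
    """
    return {
        "mean_s": stats["node_get_fsm_time_mean"] / 1e6,
        "max_s": stats["node_get_fsm_time_100"] / 1e6,
    }

def fetch_stats(url="http://127.0.0.1:8098/stats"):
    """Fetch and decode the stats document (URL is an assumed default)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)

# Example use against a live node:
#   summary = get_latency_summary(fetch_stats())
#   print(summary)
```

A large gap between mean_s and max_s would match what the graphs show:
a steady average with occasional very slow reads.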
In talking with Basho folks, we learned the behavior is likely caused by
"LevelDB Compaction."
http://leveldb.googlecode.com/svn/trunk/doc/impl.html
Question:
What can we do to reduce/eliminate the latency shown in
node_get_fsm_time_100?
Thanks,
Demetri