Greetings,

We have a three-node Riak cluster set up in a pre-production environment with LevelDB configured as the backend. The systems are beefy: dual 6-core CPUs, 96 GB RAM, all SSDs. Preliminary testing showed long latencies (~10-30 seconds and increasing) in node_get_fsm_time_100. We raised our initial concerns at the Riak workgroup in San Francisco last week.

After the workgroup, we made the following changes to our configuration:

1.  Tuned /etc/security/limits.conf to add:

riak            soft    nofile          2048
riak            hard    nofile          10240

2.  Added noatime to the Riak filesystem mount (running on 6-device RAID 6/RAID 10 Intel 710 200 GB SSD)

/dev/mapper/vg_raid10-lv_riak on /var/lib/riak type ext4 (rw,noatime)

3.  Edited the eleveldb config to set write_buffer_size and cache_size:

%% eLevelDB Config
{eleveldb, [
            {data_root, "/var/lib/riak/leveldb"},
            {write_buffer_size, 16777216},
            {cache_size, 1073741824}
           ]},
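For a sense of scale, here is a minimal Python sketch of what those two byte values mean in practice. It assumes, as far as I understand eleveldb in this generation of Riak, that write_buffer_size and cache_size apply per vnode rather than per node. The ring size of 64 and the even spread of vnodes across the three nodes are also assumptions; neither is stated above.

```python
# Values copied from the eleveldb section above (bytes).
WRITE_BUFFER_SIZE = 16777216
CACHE_SIZE = 1073741824

# Assumptions, not taken from the configuration shown above:
RING_SIZE = 64   # assumed ring size
NODES = 3        # three-node cluster, vnodes assumed evenly spread


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


def vnodes_per_node(ring_size, nodes):
    """Worst-case vnode count on one node (ceiling division)."""
    return -(-ring_size // nodes)


def per_node_bytes(per_vnode_bytes, ring_size=RING_SIZE, nodes=NODES):
    """Total bytes on one node if the setting is applied once per vnode."""
    return per_vnode_bytes * vnodes_per_node(ring_size, nodes)


if __name__ == "__main__":
    print("write buffer per vnode: %.0f MiB" % to_mib(WRITE_BUFFER_SIZE))
    print("cache per vnode:        %.0f MiB" % to_mib(CACHE_SIZE))
    print("vnodes per node:        %d" % vnodes_per_node(RING_SIZE, NODES))
    print("cache per node:         %.0f MiB" % to_mib(per_node_bytes(CACHE_SIZE)))
```

If the per-vnode assumption holds, the per-node cache total is the number to compare against the 96 GB of RAM in each machine.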

At first blush, this tuning seemed to correct the problem. Basho Bench testing no longer showed the latency, and the get_fsm_time values returned to near zero. However, over the weekend and into this week, the peak delays started to creep back up linearly. See the graphs from Ganglia:

http://www.flickr.com/photos/dmourati/sets/72157629758658870/

Average get times remain constant.  Put times do not show similar delay.
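To compare the mean against the peak outside of Ganglia, a minimal Python sketch along these lines could read the same counters from a node's HTTP /stats endpoint. The URL (localhost, port 8098) is an assumption about the listener configuration, and the helper names are hypothetical. The get FSM timings are reported in microseconds, so the sketch converts them to seconds.

```python
import json
from urllib.request import urlopen

# Assumption: the node's HTTP listener is on the default local address and port.
STATS_URL = "http://127.0.0.1:8098/stats"

GET_FSM_KEYS = (
    "node_get_fsm_time_mean",
    "node_get_fsm_time_95",
    "node_get_fsm_time_100",
)


def fetch_stats(url=STATS_URL, timeout=5):
    """Fetch and decode the JSON document served by /stats."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def get_fsm_seconds(stats):
    """Return the get FSM timings in seconds, skipping missing or non-numeric values."""
    result = {}
    for key in GET_FSM_KEYS:
        value = stats.get(key)
        if isinstance(value, (int, float)):
            result[key] = value / 1e6  # microseconds to seconds
    return result
```

Calling get_fsm_seconds(fetch_stats()) on each node in turn would show whether the growing maximum is confined to one node or shared by all three.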

In talking with the Basho folks, we learned that the behavior is likely caused by LevelDB compaction:

http://leveldb.googlecode.com/svn/trunk/doc/impl.html

Question:

What can we do to reduce/eliminate the latency shown in node_get_fsm_time_100?

Thanks,

Demetri

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
