I have brick multiplexing functional to the point that it passes all basic
AFR, EC, and quota tests. There are still some issues with tiering, and I
wouldn't consider snapshots functional at all, but it seemed like a good point
to see how well it works. I ran some *very simple* tests with 20 volumes, each
2x distribute on top of 2x replicate.

First, the good news: it worked! Getting 80 bricks to come up in the same
process, and then run I/O correctly across all of those, is pretty cool. Also,
memory consumption is *way* down. RSS went from 1.1GB before (total
across 80 processes) to about 400MB (one process) with multiplexing. Each
process seems to consume approximately 8MB globally plus 5MB per brick, so
(8+5)*80 = 1040 vs. 8+(5*80) = 408. Just considering the amount of memory,
this means we could support about three times as many bricks as before. When
memory *contention* is considered, the difference is likely to be even greater.
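The arithmetic above can be written out as a small model. This is only a sketch: the 8MB and 5MB constants are the approximate figures quoted above, and the function names are made up for illustration.

```python
# Rough memory model from the measurements above (all values in MB).
# PER_PROCESS_MB and PER_BRICK_MB are the approximate figures quoted in
# the text; the function names are hypothetical.
PER_PROCESS_MB = 8
PER_BRICK_MB = 5


def rss_separate(bricks):
    """Estimated total RSS with one process per brick."""
    return (PER_PROCESS_MB + PER_BRICK_MB) * bricks


def rss_multiplexed(bricks):
    """Estimated RSS with all bricks in a single process."""
    return PER_PROCESS_MB + PER_BRICK_MB * bricks


def max_bricks(budget_mb, multiplexed):
    """Largest brick count that fits in budget_mb under this model."""
    if multiplexed:
        return max(0, (budget_mb - PER_PROCESS_MB) // PER_BRICK_MB)
    return budget_mb // (PER_PROCESS_MB + PER_BRICK_MB)


if __name__ == "__main__":
    for n in (20, 80, 320):
        print(n, rss_separate(n), rss_multiplexed(n))
    print(max_bricks(1040, False), max_bricks(1040, True))
```

Under this model the ratio of supportable bricks for a fixed memory budget approaches 13/5 = 2.6 as the budget grows, which is where the rough "about three times" figure comes from.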
Bad news: some of our code doesn't scale very well in terms of CPU use. To
test performance I ran a test which would create 20,000 files across all 20
volumes, then write and delete them, all using 100 client threads. This is
similar to what smallfile does, but deliberately constructed to use a minimum
of disk space - at any given time, at most one file per thread actually has
4KB worth of data in it. This allows me to run it against SSDs or even
ramdisks even with high brick counts, to factor out slow disks in a study of
CPU/memory issues. Here are some results and observations.

* On my first run, the multiplexed version of the test took 77% longer to run
than the non-multiplexed version (5:42 vs. 3:13). And that was after I'd done
some hacking to use 16 epoll threads. Setting that option the normal way is a
bit broken: the value you set doesn't actually make it to the code that spawns
the threads. Bumping this up further to 32 threads didn't seem to help.
* A little profiling showed me that we're spending almost all of our time in
pthread_spin_lock. I disabled the code that uses spinlocks in place of regular
mutexes, which immediately improved performance and also reduced CPU time by
* The next round of profiling showed that a lot of the locking is in mem-pool
code, and a lot of that in turn is from dictionary code. Changing the dict
code to use malloc/free instead of mem_get/mem_put gave another noticeable

At this point run time was down to 4:50, which is 20% better than where I
started but still far short of non-multiplexed performance. I can drive that
down still further by converting more things to use malloc/free. There seems
to be a significant opportunity here to improve performance - even without
multiplexing - by taking a more careful look at our memory-management
* Tune the mem-pool implementation to scale better with hundreds of threads.
* Use mem-pools more selectively, or even abandon them altogether.
* Try a different memory allocator such as jemalloc.
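
To make the first option a bit more concrete, here is one common way a pool can be restructured to avoid a shared lock: give each thread its own free list. This is a structural sketch only, written in Python for brevity. It is not the existing mem-pool code, the class names are hypothetical, and a real version would be in C and would need to be measured.

```python
import threading


class SharedPool:
    """One free list guarded by one lock: every get/put takes that lock."""

    def __init__(self, factory):
        self._factory = factory
        self._free = []
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._free:
                return self._free.pop()
        return self._factory()

    def put(self, obj):
        with self._lock:
            self._free.append(obj)


class PerThreadPool:
    """One free list per thread: get/put never touch a shared lock."""

    def __init__(self, factory, max_cached=64):
        self._factory = factory
        self._max_cached = max_cached
        self._local = threading.local()

    def _free_list(self):
        try:
            return self._local.free
        except AttributeError:
            self._local.free = []
            return self._local.free

    def get(self):
        free = self._free_list()
        return free.pop() if free else self._factory()

    def put(self, obj):
        free = self._free_list()
        if len(free) < self._max_cached:
            free.append(obj)
        # Otherwise drop obj and let the general allocator reclaim it.
```

The trade-off is that objects cached by one thread are not available to others, so the per-thread cache is capped (max_cached is an arbitrary value here) to keep idle memory bounded.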
I'd certainly appreciate some help/collaboration in studying these options
further. It's a great opportunity to make a large impact on overall
performance without a lot of code or specialized knowledge. Even so,
I don't think memory management is our only internal scalability problem.
There must be something else limiting parallelism, and quite severely at that.
My first guess is io-threads, so I'll be looking into that first, but if
anybody else has any ideas please let me know. There's no *good* reason why
running many bricks in one process should be slower than running them in
separate processes. If it remains slower, then the limit on the number of
bricks and volumes we can support will remain unreasonably low. Also, the
problems I'm seeing here probably don't *only* affect multiplexing. Excessive
memory/CPU use and poor parallelism are issues that we kind of need to address
anyway, so if anybody has any ideas please let me know.
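
For anyone who wants to try something similar, the create/write/delete workload described earlier can be approximated along these lines. This is a sketch under assumptions, not the script I actually ran: run_workload and its parameters are hypothetical names, and the "mounts" are just directories (in a real test they would be the client mount points of the volumes).

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def run_workload(mounts, total_files=20000, threads=100, payload_size=4096):
    """Create empty files across mounts, then write and delete each one.

    Each thread writes and deletes one file at a time, so at most one
    file per thread holds data at any moment.
    """
    payload = b"x" * payload_size
    paths = [os.path.join(mounts[i % len(mounts)], "f%06d" % i)
             for i in range(total_files)]
    shares = [paths[t::threads] for t in range(threads)]

    def create(share):
        for path in share:
            with open(path, "wb"):
                pass
        return len(share)

    def write_and_delete(share):
        for path in share:
            with open(path, "wb") as f:
                f.write(payload)
            os.unlink(path)
        return len(share)

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # sum() drains each map, so all creates finish before writes begin.
        created = sum(pool.map(create, shares))
        deleted = sum(pool.map(write_and_delete, shares))
    return created, deleted, time.monotonic() - start


if __name__ == "__main__":
    # Small local dry run against temporary directories.
    with tempfile.TemporaryDirectory() as root:
        mounts = []
        for v in range(4):
            d = os.path.join(root, "vol%d" % v)
            os.mkdir(d)
            mounts.append(d)
        print(run_workload(mounts, total_files=200, threads=8))
```

Pointing mounts at directories on a ramdisk or SSD keeps disk speed out of the picture, which is the point of keeping the data footprint this small.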
Gluster-devel mailing list