I have brick multiplexing[1] functional to the point that it passes all basic 
AFR, EC, and quota tests.  There are still some issues with tiering, and I 
wouldn't consider snapshots functional at all, but it seemed like a good point 
to see how well it works.  I ran some *very simple* tests with 20 volumes, each 
2x distribute on top of 2x replicate.

First, the good news: it worked!  Getting 80 bricks to come up in the same 
process, and then run I/O correctly across all of those, is pretty cool.  Also, 
memory consumption is *way* down.  RSS size went from 1.1GB before (total 
across 80 processes) to about 400MB (one process) with multiplexing.  Each 
process seems to consume approximately 8MB globally plus 5MB per brick, so 
(8+5)*80 = 1040 vs. 8+(5*80) = 408.  Just considering the amount of memory, 
this means we could support about three times as many bricks as before.  When 
memory *contention* is considered, the difference is likely to be even greater.

Bad news: some of our code doesn't scale very well in terms of CPU use.  To 
test performance I ran a test which would create 20,000 files across all 20 
volumes, then write and delete them, all using 100 client threads.  This is 
similar to what smallfile does, but deliberately constructed to use a minimum 
of disk space - at any given, only one file per thread (maximum) actually has 
4KB worth of data in it.  This allows me to run it against SSDs or even 
ramdisks even with high brick counts, to factor out slow disks in a study of 
CPU/memory issues.  Here are some results and observations.

* On my first run, the multiplexed version of the test took 77% longer to run 
than the non-multiplexed version (5:42 vs. 3:13).  And that was after I'd done 
some hacking to use 16 epoll threads.  There's something a bit broken about 
trying to set that option normally, so that the value you set doesn't actually 
make it to the place that tries to spawn the threads.  Bumping this up further 
to 32 threads didn't seem to help.

* A little profiling showed me that we're spending almost all of our time in 
pthread_spin_lock.  I disabled the code to use spinlocks instead of regular 
mutexes, which immediately improved performance and also reduced CPU time by 
almost 50%.

* The next round of profiling showed that a lot of the locking is in mem-pool 
code, and a lot of that in turn is from dictionary code.  Changing the dict 
code to use malloc/free instead of mem_get/mem_put gave another noticeable 

At this point run time was down to 4:50, which is 20% better than where I 
started but still far short of non-multiplexed performance.  I can drive that 
down still further by converting more things to use malloc/free.  There seems 
to be a significant opportunity here to improve performance - even without 
multiplexing - by taking a more careful look at our memory-management 

* Tune the mem-pool implementation to scale better with hundreds of threads.

* Use mem-pools more selectively, or even abandon them altogether.

* Try a different memory allocator such as jemalloc.

I'd certainly appreciate some help/collaboration in studying these options 
further.  It's a great opportunity to make a large impact on overall 
performance without a lot of code or specialized knowledge.  Even so, however, 
I don't think memory management is our only internal scalability problem.  
There must be something else limiting parallelism, and quite severely at that.  
My first guess is io-threads, so I'll be looking into that first, but if 
anybody else has any ideas please let me know.  There's no *good* reason why 
running many bricks in one process should be slower than running them in 
separate processes.  If it remains slower, then the limit on the number of 
bricks and volumes we can support will remain unreasonably low.  Also, the 
problems I'm seeing here probably don't *only* affect multiplexing.  Excessive 
memory/CPU use and poor parallelism are issues that we kind of need to address 
anyway, so if anybody has any ideas please let me know.

[1] http://review.gluster.org/#/c/14763/
Gluster-devel mailing list

Reply via email to