I know y'all are probably getting tired of these updates, but developing out in 
the open and all that.  Executive summary: the combination of disabling memory 
pools and using jemalloc makes multiplexing shine.  You can skip forward to 
***RESULTS*** if you're not interested in my tired ramblings.

Let's talk about memory pools first.  I had identified this as a problem area a 
while ago, leading to a new memory-pool implementation[1].  I was rather proud 
of it, actually, but one of the lessons I've learned is that empirical results 
trump pride.  Another lesson is that it's really important to test 
performance-related changes on more than one kind of system.  On my default 
test system and at scale up to 100 volumes (400 bricks) the new mem-pool 
implementation was looking really good.  Unfortunately, at about 120 volumes it 
would run into a limit on the number of keys accessible via 
pthread_getspecific.  Well, crap.  I made some changes to overcome this limit; 
they hurt performance a little, but I thought they would at least salvage the 
effort I'd already put in.  Then I 
realized that there's *no limit* to how many pools a thread might use.  Each 
brick creates a dozen or so pools, and with multiplexing there's a potentially 
unlimited number of bricks in a process.  As a worker thread jumps from brick 
to brick, it might hit all of those pools.  This left three options.

(1) Bind threads to bricks, which I've already shown is bad for scalability.

(2) Tweak the mem-pool implementation to handle even more (almost infinitely 
more) thread/pool combinations, adding complexity and hurting performance even 
more.

(3) Reduce the number of pools by combining all pools for the same size.
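
Before getting to which option won, a quick aside for anyone who hasn't run 
into the pthread key limit.  This is just an illustration, not GlusterFS code: 
each pool that wants per-thread caching needs its own thread-specific-data 
key, and pthread_key_create() starts failing with EAGAIN once the process 
holds PTHREAD_KEYS_MAX keys, which is commonly 1024 on glibc.

#include <limits.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_key_t key;
    unsigned long created = 0;

    /* Keep creating keys until the per-process limit kicks in. */
    while (pthread_key_create(&key, NULL) == 0)
        created++;

    printf("created %lu keys (PTHREAD_KEYS_MAX here is %d)\n",
           created, PTHREAD_KEYS_MAX);
    return 0;
}

With a dozen or so pools per brick and no cap on bricks per process, that 
limit isn't hard to hit.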

Well, (3) sure sounds great, doesn't it?  There are only a couple of dozen 
sizes we use for pools, so only a couple of dozen pools no matter how many 
threads or bricks we have, and it's all wonderful.  We've also effectively 
reinvented yet another general-purpose memory allocator at that point, and 
there are a few of those out there already.  I hate letting my code die as much 
as anyone, but sometimes that's what the empirical results dictate must happen. 
 In fact, every developer should go through this periodically to keep them 
humble.  Lord knows I need that kind of lesson more often.  ;)
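
To make the "reinventing malloc" point concrete, here's the shape of what (3) 
amounts to.  This is a rough sketch with made-up names, not the actual patch: 
one shared pool per size class, chosen by the requested size, instead of one 
pool per brick per object type.

#include <stddef.h>
#include <pthread.h>

#define NUM_CLASSES 24   /* ~two dozen sizes cover everything we allocate */

struct size_pool {
    size_t          obj_size;    /* object size served by this class */
    pthread_mutex_t lock;
    void           *free_list;   /* returned objects, singly linked */
};

/* Global table, shared by every brick and thread in the process;
 * assumed to be initialized in ascending obj_size order. */
static struct size_pool size_pools[NUM_CLASSES];

/* Map a request to the smallest class that fits it. */
static struct size_pool *
pool_for_size(size_t size)
{
    for (int i = 0; i < NUM_CLASSES; i++) {
        if (size_pools[i].obj_size >= size)
            return &size_pools[i];
    }
    return NULL;   /* oversized requests fall back to plain malloc */
}

Add a per-thread cache and an oversize path and that's just malloc with extra 
steps, which is exactly why it didn't seem worth keeping my own version alive.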

OK, so my own mem-pool implementation was out.  First experiment was to just 
disable mem-pools entirely (turn them into plain malloc/free) and see what the 
results were.  For these tests I used my standard create/write/delete 20,000 
files test, on each of two different machines: a 16-core 8GB (artificially 
constrained) machine in Westford, and a 12-core 32GB machine with a much faster 
SSD at Digital Ocean.  The results were good on the Westford machine, with 
multiplexing mostly outperforming the alternative at scales up to 145 volumes 
(580 bricks).  However, on the DO machine multiplexing performance degraded 
badly even at 40 volumes.  Remember what I said about testing on more than one 
kind of machine?  This kind of result is why that matters.  My io-threads patch[2] 
seemed to help some, but not much.
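
For clarity, "disable mem-pools entirely" really is as simple as it sounds.  
These are hypothetical wrapper names standing in for the real mem-pool get/put 
calls, but mechanically the experiment was just this:

#include <stdlib.h>

/* With pooling compiled out, the pool calls collapse to plain
 * calloc/free and all of the per-pool bookkeeping disappears. */
static inline void *
pool_get(size_t size)
{
    return calloc(1, size);
}

static inline void
pool_put(void *ptr)
{
    free(ptr);
}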

Now it was time to revisit jemalloc.  Last time I looked at it, the benefit 
seemed minimal at best.  However, with the new load presented by the removal of 
memory pools, things were different this time.  Now performance remained smooth 
on the DO configuration with multiplexing up to 220 volumes.  Without 
multiplexing, I ran into a swap storm at 180 volumes and then everything died.  
I mean *everything*; I had to do a hard reboot.  Similarly, on the Westford 
machine the current code died at 100 volumes while the multiplexing version was 
still going strong.  We have a winner.  With some more tweaking, I'm pretty 
confident that we'll be able to support 1000 bricks on a 32GB machine this way 
- not that anyone will have that many disks, but once we start slicing and 
dicing physical disks into smaller units for container-type workloads it's 
pretty easy to get there.
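
I'm not going to describe the whole build setup here, but the usual ways to 
drop jemalloc in are linking the brick process against -ljemalloc, or running 
an unmodified binary under LD_PRELOAD with libjemalloc.  If you link it in, a 
quick way to confirm it actually took effect is jemalloc's mallctl() 
interface, which plain glibc malloc doesn't have (note that some jemalloc 
builds prefix the symbols with je_):

#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
    const char *version;
    size_t      len = sizeof(version);

    /* "version" is a read-only string exposed by jemalloc. */
    if (mallctl("version", &version, &len, NULL, 0) == 0)
        printf("running on jemalloc %s\n", version);
    return 0;
}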

***RESULTS***

Relying on jemalloc instead of our own mem-pools will likely double the number 
of bricks we can support with the same memory (assuming further fixes to reduce 
memory over-use).  Also, performance of brick addition/removal is around 2x 
what it was before, because manipulating the graph in an existing process is a 
lot cheaper than starting a new one.  On the other hand, multiplexing 
performance is generally worse than non-multiplexed until we get close to those 
scalability limits.  We'll probably need to use an "adaptive" approach that 
will continue to use the current process-per-brick scheme until we get close to 
maximum capacity.
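
To spell out what I mean by "adaptive" (nothing is implemented yet, and the 
threshold below is a made-up knob, not a real option), the placement decision 
would be something like:

/* Made-up sketch of the adaptive idea: one process per brick while the
 * node has headroom, multiplexing once it starts getting crowded. */
#define MAX_STANDALONE_BRICKS 64

enum placement { SPAWN_NEW_PROCESS, ATTACH_TO_EXISTING };

static enum placement
place_brick(unsigned int bricks_on_node)
{
    if (bricks_on_node < MAX_STANDALONE_BRICKS)
        return SPAWN_NEW_PROCESS;   /* best per-brick performance */
    return ATTACH_TO_EXISTING;      /* best memory use at scale */
}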

[1] http://review.gluster.org/#/c/15645/
[2] http://review.gluster.org/#/c/15643/