On 15.01.2007, at 10:27, Mike wrote:
Although not entirely sure why, I have spent some time analyzing the
behavior and code of both of these allocators.
Well, I'd say the "why" is simple: the results I have presented
are just too "tempting", so you really wanted to know *why*.
This is normal. I'd do the same.
a)
The test program Zoran includes biases Zippy toward the "standard"
(system) allocator, which it does not do for VT. The following patch
"corrects" this behavior:
+++ memtest.c    Sun Jan 14 16:43:23 2007
@@ -211,6 +211,7 @@
         } else {
             size &= 0x3FFF;     /* Limit to 16K */
         }
+        if (size > 16000)
             size = 16000;
         *toallocptr++ = size;
     }
 }
First of all, I wanted to give Zippy a fair chance. If I increase
the max allocation size, Zippy becomes even slower than it is.
And, Zippy handles 16K pages, whereas we handle 32K pages.
Hence the

    size &= 0x3FFF;     /* Limit to 16K */

which limits the allocation size to 16K max. Increasing that
would hurt Zippy even more than us.
b)
The key difference between the Zippy and VT allocators arises from
their use of the shared "freed" memory pool. Zippy calls this the
shared cache, VT calls this the global cache. Zippy's goal appears to
have been to minimize memory usage (while its stated goal is to reduce
lock contention). Zippy does this by aggressively moving freed blocks
to the shared cache, allowing any thread to later allocate memory from
this shared pool. VT, meanwhile, targets speed, trading off some
bloat, and allows freed blocks to return to the private per-thread
pools. To allow for this speed optimization, VT keeps within each
"page" a pointer to the cache that allocated it, something that could
also be done for Zippy if speed were the goal.
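As an illustration of the technique (a minimal sketch under
assumptions, not the actual VT code): if pages are allocated aligned
to their size, the owning page header, and with it the owning cache,
can be found from any block pointer with a single mask, which is what
makes the lock-free free path cheap. The names Page and PTR2PAGE are
mine.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGESIZE (32 * 1024)        /* VT uses 32K pages. */

    /* Hypothetical page header; the real VT layout may differ. */
    typedef struct Page {
        struct Cache *p_cachePtr;       /* Thread cache that carved this page. */
    } Page;

    /*
     * With pages aligned on a PAGESIZE boundary, any pointer into a
     * page maps back to its header with one mask: no lookup, no lock.
     */
    #define PTR2PAGE(ptr) \
        ((Page *) ((uintptr_t) (ptr) & ~((uintptr_t) PAGESIZE - 1)))

    int
    main(void)
    {
        static char arena[2 * PAGESIZE];    /* stand-in for an mmap()ed area */
        char *page = (char *) PTR2PAGE(arena + PAGESIZE);
        char *blk  = page + 100;            /* some block inside that page */

        printf("block %p -> page %p\n", (void *) blk, (void *) PTR2PAGE(blk));
        return 0;
    }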
Hmhm... Still, our intention is to be more conservative in *overall*
memory usage. That means, I'm prepared to give myself more memory
if I can be faster with it *temporarily* (after all, modern systems
have huge memory banks), but I would not like to be greedy and keep
that memory for myself all the time. And this is precisely how VT
behaves: it is more memory-hungry in terms of temporarily allocated
memory (although not significantly enough for this to be a problem),
but it is social enough to release it when it is not needed any more.
c)
The key reason why Zippy substantially lags behind VT in performance
is actually that Zippy beats itself at its own game. While its
stated goal is to minimize lock contention, the hardcoded constants
used in Zippy actually sacrifice lock contention entirely in favor of
storage. Naturally, thread-local pools can be used to allocate blocks
immediately, while the shared pool must be locked by a mutex when
allocation is performed. The current Zippy configuration minimizes
the amount of storage "wasted" in per-thread pools by aggressively
moving larger blocks to the shared cache. The more threads attempt to
allocate/free large blocks, the worse the contention and the lower the
performance.
Zoran's test program produces allocation sizes that are uniformly
random, so large blocks are just as likely as small blocks, and
performance therefore suffers substantially. A more accurate benchmark
would take common usage patterns from Tcl/NaviServer, which I suspect
are heavily biased toward allocation of small objects.
If you can modify memtest.c to be like that, I'd have nothing against it!
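For what it's worth, here is a sketch of how a small-biased size
generator for memtest.c could look (my code under assumptions, not the
existing memtest.c; the 16000 cap mirrors the test's limit). Squaring
a uniform variate makes small requests dominate, roughly like a
scripting workload:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sizes in [0, 16000), skewed so that small requests dominate. */
    static size_t
    NextSize(void)
    {
        double u = (double) rand() / ((double) RAND_MAX + 1.0);
        return (size_t) (u * u * 16000);
    }

    int
    main(void)
    {
        int i, small = 0;

        for (i = 0; i < 100000; i++) {
            if (NextSize() < 1024) {
                small++;
            }
        }
        printf("%d%% of requests under 1K\n", small / 1000);
        return 0;
    }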
Actually, we have no problems with either small or large allocations,
as they are all handled by the same mechanism. In Zippy, large
allocations (over 16K) are just handed to the system malloc, with all
the trade-offs that this brings.
The following patch allows Zippy to be a lot less aggressive in
putting blocks into the shared pool, bringing the performance of Zippy
much closer to VT, at the expense of substantially higher memory
"waste":
@@ -128,12 +174,12 @@
     {   64,  256, 128, NULL},
     {  128,  128,  64, NULL},
     {  256,   64,  32, NULL},
-    {  512,   32,  16, NULL},
-    { 1024,   16,   8, NULL},
-    { 2048,    8,   4, NULL},
-    { 4096,    4,   2, NULL},
-    { 8192,    2,   1, NULL},
-    {16284,    1,   1, NULL},
+    {  512,   64,  32, NULL},
+    { 1024,   64,  32, NULL},
+    { 2048,   64,  32, NULL},
+    { 4096,   64,  32, NULL},
+    { 8192,   64,  32, NULL},
+    {16284,   64,  32, NULL},
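To make the table readable: as far as I can tell from the Zippy source
(the field names here are my paraphrase), the columns are:

    struct binfo {
        size_t blocksize;   /* Block size served from this bucket.          */
        int    maxblocks;   /* Max free blocks a thread may cache before    */
                            /* a batch is moved to the shared, locked pool. */
        int    nmove;       /* How many blocks are moved per batch.         */
        void  *lockPtr;     /* Lock for this bucket's shared free list.     */
    };

If that reading is right, the original { 512, 32, 16, NULL} means: for
512-byte blocks, a thread keeps at most 32 free blocks and moves 16 at
a time to the shared pool; the patch raises both numbers across the
large buckets, trading memory for less locking.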
I cannot comment on that. Possibly you are right, but I do not
see much benefit in that except speeding up Zippy to be on par
with VT, whereas the most important VT feature is not the speed,
it is the memory handling.
d)
VT uses mmap by default to allocate memory, Zippy uses the system
malloc. By doing this, VT actually penalizes itself in an environment
where lots of small blocks are frequently allocated and threads are
often created/destroyed.
Partly right. Lots of small blocks are no problem. We allocate 32K
pages, so one page yields 2048 16-byte blocks, 1024 32-byte blocks,
etc. So, small allocations are pretty much OK.
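A trivial sketch of that arithmetic (assuming a 32K page as above, and
ignoring the few words of per-page and per-block header overhead the
real allocator reserves):

    #include <stdio.h>

    #define PAGESIZE (32 * 1024)

    /* One mmap()ed 32K page serves many small allocations. */
    int
    main(void)
    {
        size_t blocksize;

        for (blocksize = 16; blocksize <= 256; blocksize *= 2) {
            printf("%4zu-byte blocks: %4zu per 32K page\n",
                   blocksize, (size_t) PAGESIZE / blocksize);
        }
        return 0;
    }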
If you have a situation where a large number of very short-lived
threads is used, then it is perhaps better to use the system allocator,
as at that point you will not experience much locking contention.
Admittedly, VT is not designed for such an extreme. It works best when
you have a number of "reasonably"-lived threads. We can discuss what
"reasonably" means. For example, a connection thread in NaviServer
is a "reasonably"-lived thread. Also, the job-queue thread is one,
to give just two examples.
VT releases the memory held in a thread's
local pool when the thread terminates. Since it uses mmap by default,
this means that de-allocated storage is actually released to the
operating system, forcing new threads to call mmap() again to get
memory, thereby incurring system call overhead that could be avoided
in some cases if the system malloc implementation did not lower the
sbrk point at each deallocation. Using malloc() in the VT allocator
should give it much more uniform and consistent performance.
Not necessarily. We'd shoot ourselves in the foot by doing so,
because most OS allocators never return memory to the system, and
one of our major benefits would be gone.
What we could do: timestamp each page, return all pages to the
global cache, and prune the older ones. Or, put a size constraint on
the global cache. But then you'd have yet another knob to adjust,
and the difficulty would be to find the right setup. VT is simpler
in that respect, as it does not offer you ANY knobs you can trim
(for better or for worse). In some early stages of the design
we had a number of knobs and were not certain how to adjust them.
So we threw that away and redesigned all parts to be "self-adjusting"
where possible.
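Just to make the idea concrete, a purely hypothetical sketch; none of
this exists in VT, and MAXAGE is exactly the kind of knob we would
rather avoid:

    #include <stddef.h>
    #include <time.h>

    /* Pages sitting in the global cache carry the time they were freed. */
    typedef struct GlobalPage {
        struct GlobalPage *nextPtr;
        time_t freedAt;
    } GlobalPage;

    #define MAXAGE 60                   /* Seconds; the dreaded knob. */

    static GlobalPage *globalPages;     /* Assumed: global lock is held. */

    /* Periodic sweep: give pages older than MAXAGE back to the OS. */
    static void
    PruneGlobalCache(void)
    {
        GlobalPage **linkPtr = &globalPages;
        time_t now = time(NULL);

        while (*linkPtr != NULL) {
            GlobalPage *pagePtr = *linkPtr;

            if (now - pagePtr->freedAt > MAXAGE) {
                *linkPtr = pagePtr->nextPtr;
                /* munmap(pagePtr, PAGESIZE); -- return page to the OS */
            } else {
                linkPtr = &pagePtr->nextPtr;
            }
        }
    }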
Using mmap() in Zippy would have less performance impact, since memory
is never released by Zippy (at thread termination it is just placed
back into the shared pool).
Right. Which we also could do, but that would again mean we are
trashing our major benefit: being a social player, memory-wise.
Another obvious downside of using mmap() for Zippy is that realloc()
must always fall back to the slow allocate/copy/free mechanism and can
never be optimized.
This is true.
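For illustration, the slow path in question is essentially this
(hypothetical names, not the VT source): with mmap()ed pages there is
no in-place grow, so every realloc degenerates to allocate, copy, free.

    #include <stdlib.h>
    #include <string.h>

    /* realloc() fallback when blocks can never be grown in place. */
    static void *
    SlowRealloc(void *oldPtr, size_t oldSize, size_t newSize)
    {
        void *newPtr = malloc(newSize);     /* stands in for the allocator */

        if (newPtr != NULL && oldPtr != NULL) {
            memcpy(newPtr, oldPtr, oldSize < newSize ? oldSize : newSize);
            free(oldPtr);
        }
        return newPtr;
    }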
e)
Both allocators use an O(n) algorithm to compute the power-of-two
"bucket" for the allocated size. This is just plain silly, since an
O(log n) algorithm would offer a non-negligible speedup in both
allocators. This is the current O(n) code:

    while (bucket < NBUCKETS &&
           globalCache.sizes[bucket].blocksize < size) {
        ++bucket;
    }
How about adding this into the code?
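Something like the following might do, assuming, as both allocators
do, that bucket block sizes are successive powers of two starting at
16 bytes (this is my sketch, not code from either allocator; with gcc
one could even get O(1) out of __builtin_clz()):

    #include <stdio.h>

    #define MINALLOC 16                 /* Smallest bucket's block size. */

    /* Map a size to its power-of-two bucket in O(log size) steps. */
    static int
    SizeToBucket(size_t size)
    {
        size_t n = (size - 1) / MINALLOC;   /* 0 for sizes 1..16 */
        int bucket = 0;

        while (n != 0) {
            n >>= 1;
            ++bucket;
        }
        return bucket;
    }

    int
    main(void)
    {
        printf("%d %d %d %d\n",
               SizeToBucket(16),            /* 0  */
               SizeToBucket(17),            /* 1  */
               SizeToBucket(4096),          /* 8  */
               SizeToBucket(16000));        /* 10 */
        return 0;
    }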
f)
Zippy uses Ptr2Block and Block2Ptr functions, whereas VT uses macros
for this. Zippy also does more checks on MAGIC numbers on each
allocation, which VT performs only on de-allocation. I am not sure if
current compilers are smart enough to inline the functions in Zippy; I
did not test this. When compiled with -O0 with gcc, changing Zippy to
use macros instead of function calls offers a non-trivial and
statistically significant speedup.
I believe you must add "inline" to the function definitions for the
compilers to do so. Anyway, you get some percent of speed by inlining
or using macros, that is correct. But I would not expect this to be
a breakthrough. On the bottom line: yes, in all places that are
repeatedly called, one should resort to inlining or macros, if possible.
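A minimal sketch of the three variants (hypothetical Block layout, not
Zippy's actual struct). The macro is expanded in place
unconditionally; whether the plain or the "inline" function actually
gets inlined depends on the compiler and optimization level (gcc at
-O0 typically inlines neither):

    #include <stdio.h>

    typedef struct Block {
        struct Block *nextPtr;
    } Block;

    /* Function version: a real call unless the compiler inlines it. */
    static char *
    Block2Ptr(Block *blockPtr)
    {
        return (char *) (blockPtr + 1);
    }

    /* Macro version: always expanded at the call site. */
    #define BLOCK2PTR(blockPtr) ((char *) ((blockPtr) + 1))

    /* C99 inline: the hint Zoran mentions. */
    static inline char *
    Block2PtrInline(Block *blockPtr)
    {
        return (char *) (blockPtr + 1);
    }

    int
    main(void)
    {
        Block b;

        printf("%d %d\n",
               Block2Ptr(&b) == BLOCK2PTR(&b),
               Block2PtrInline(&b) == BLOCK2PTR(&b));
        return 0;
    }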
g)
VT does some things that are too complicated for me to follow to
achieve its high-speed, no-lock-needed allocation. In particular, the
following code worries me:

    if (pagePtr->p_cachePtr == cachePtr /* Dirty read! */) {
        blockPtr->nextPtr = cachePtr->blocks[bucket].head;
        cachePtr->blocks[bucket].head = blockPtr;
        cachePtr->blocks[bucket].count++;
    } else {
although to its credit, I have not figured out any case where it
would break.
Ha! It is pretty simple: you can atomically check pointer equivalence
without risking a core dump (at least this is my experience). You are
not expected to make far-reaching decisions based on it, though.
In this particular example, even if the test were false, there would be
no "harm" done; just a suboptimal path would be selected.
I marked that "Dirty read" to draw people's attention to that place.
And I obviously succeeded :-)
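To spell the argument out, here is a compilable reconstruction of the
whole pattern; the else branch is my guess at the slow path, not the
actual VT source. A stale comparison can only divert a block onto the
locked path: the lock-free push runs only on the thread that owns the
cache, so the free lists stay consistent.

    #include <pthread.h>
    #include <stddef.h>

    #define NBUCKETS 11

    /* Hypothetical stand-ins for the structures in the excerpt. */
    typedef struct Block { struct Block *nextPtr; } Block;
    typedef struct { Block *head; int count; } Bucket;
    typedef struct Cache { Bucket blocks[NBUCKETS]; } Cache;
    typedef struct Page { Cache *p_cachePtr; } Page;

    static struct {
        pthread_mutex_t lock;
        Bucket blocks[NBUCKETS];
    } globalCache = { PTHREAD_MUTEX_INITIALIZER };

    static void
    PutBlock(Cache *cachePtr, Page *pagePtr, Block *blockPtr, int bucket)
    {
        if (pagePtr->p_cachePtr == cachePtr /* Dirty read! */) {
            /* Fast path: our own page, no lock needed. */
            blockPtr->nextPtr = cachePtr->blocks[bucket].head;
            cachePtr->blocks[bucket].head = blockPtr;
            cachePtr->blocks[bucket].count++;
        } else {
            /* Slow path: foreign block goes to the locked global cache. */
            pthread_mutex_lock(&globalCache.lock);
            blockPtr->nextPtr = globalCache.blocks[bucket].head;
            globalCache.blocks[bucket].head = blockPtr;
            globalCache.blocks[bucket].count++;
            pthread_mutex_unlock(&globalCache.lock);
        }
    }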
h)
One of the key drivers behind VT over Zippy is the "memory leak
problem" requiring server restart after a day of operation.
Yes. This is a *major* pain in the neck.
Although
this has been attributed to Zippy "never freeing" memory, it does not
look like this is related to the problem. Furthermore, Zoran's test
in this respect (looking at the data footprint before program exit but
after thread join) is inherently flawed. Of course Zippy will show
higher usage, since VT explicitly deallocates memory when a thread
exits and Zippy retains it in the shared pool. But this is not a
leak - Zippy should not grow in size as a result of this.
Consider this scenario: you start a process, then run a task which
requires 50 threads to execute. After that you never again need
that many threads for the lifetime of the process. Zippy will
allocate and *hold* that memory all the time, which is a complete
waste. VT will allocate and then release all the memory. It will more
naturally follow the "landscape" of the process...
Just *one* usage pattern is better served by Zippy: you always have
a constant number of threads in the process. VT would incur more
locking without giving you any additional benefit. So, only in such
scenarios is Zippy actually better. In my experience I have not
encountered such scenarios in long-lived application servers. I can
imagine that some specialized applications may follow that pattern.
What does appear to be responsible for Zippy's growth is a design bug.
When allocating a new block, Zippy will first look in the bucket
responsible for the given size, and then look in larger buckets,
seeking a block that can be split up and placed into the smaller
buckets. This is a design flaw which I believe is the most likely
cause of the "needs daily restart" phenomenon. Zippy lacks any
ability to coalesce smaller blocks in order to promote them to a
larger bucket. This most likely leads to more and more allocation of
large blocks when they are needed, and slow and gradual movement and
splitting of blocks from larger buckets when smaller buckets do not
have any available entries. Of course, this is nothing but a theory,
as I have nothing to back it up other than my reading of the code, but
it should be reasonably easy for someone who experiences the daily
restart problem to check, by applying the following simple patch:
@@ -910,6 +960,7 @@
     blockPtr = NULL;
     n = NBUCKETS;
     size = 0; /* lint */
+#if 0
     while (--n > bucket) {
         if (cachePtr->buckets[n].nfree > 0) {
             size = binfo[n].blocksize;
@@ -919,6 +970,7 @@
             break;
         }
     }
+#endif

     /*
      * Otherwise, allocate a big new block directly.
Interesting...
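Mike's theory is easy to model. The toy simulation below is a model of
the described behavior, not Zippy code: blocks move from large buckets
to small ones and never back, so after a burst of small allocations,
later large requests must grow the heap even though plenty of free
memory sits in the small buckets.

    #include <stdio.h>

    #define NBUCKETS 11

    static int nfree[NBUCKETS];     /* Free blocks per bucket. */
    static int grown;               /* Times the heap had to grow. */

    /* Take one block from 'bucket', splitting a larger block if needed. */
    static void
    Alloc(int bucket)
    {
        int n;

        if (nfree[bucket] > 0) {
            nfree[bucket]--;
            return;
        }
        for (n = bucket + 1; n < NBUCKETS; n++) {
            if (nfree[n] > 0) {
                nfree[n]--;                               /* split a larger block */
                nfree[bucket] += (1 << (n - bucket)) - 1; /* into 2^(n-b) pieces, */
                return;                                   /* one is used here     */
            }
        }
        grown++;        /* nothing to split, no coalescing: take new memory */
    }

    int
    main(void)
    {
        int i;

        nfree[NBUCKETS - 1] = 8;        /* start with eight 16K blocks      */
        for (i = 0; i < 4096; i++) {
            Alloc(0);                   /* burst of 16-byte allocations     */
        }
        nfree[0] += 4096;               /* ... which are all freed again    */
        for (i = 0; i < 8; i++) {
            Alloc(NBUCKETS - 1);        /* 16K requests: 4 hit, 4 must grow */
        }
        printf("heap grown %d times despite %d free small blocks\n",
               grown, nfree[0]);
        return 0;
    }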
i)
After having reviewed the code for both of the allocators, I must say
that I like VT better. Its design and implementation are simpler, it
translates to running fewer instructions per allocation, and it
actually achieves Zippy's goal of minimizing lock contention.
Yes. I believe (although I'm maximally biased) that VT is better
for general-purpose code and when used on modern machines.
After all, my speed tests show that. Sometimes in a head-shaking
way (as for example on the Mac OS X platform), but so far I haven't
found any problem with it, however hard I try.
-----
As I've stated way up top, I've already spent a lot more time on this
than I should have, and thus I cannot even go back up to proof-read
this long mail or check it for completeness of my findings. I do
notice that I "blame" Zoran in a bunch of the above paragraphs. This
is purely accidental, I mean no offense, and the statements are meant
simply to provide an actor for describing the benchmark/methodology,
not to find a scapegoat. Zoran - if you feel offended in any way, I
apologize; it was not at all intended.
I hope the above is at least in some ways helpful.
It was VERY helpful. We now know more about Zippy, and we have an
unbiased view on the matter which will definitely help other
people and add to the overall understanding. Thanks for taking the
time and going to the extent of writing such a detailed report!
Now, if we could have more of those, that would be very exciting!
Cheers,
Zoran