Casper Dik,

        After my posting, I assumed that a code question should be
        directed to the ZFS code alias, so I apologize to the people
        who don't read code. However, since the discussion is here,
        I will post a code proof here. Just use "time program" to get
        a generic time frame. It runs in under 0.1 secs for 500k loops
        (each loop removes an object and puts it back).

        It is intended only as a proof of concept that a simple
        pre-alloc'ed set of objects can be accessed much faster
        than allocating and assigning them.

        To support the change to the intent log objects, I would suggest
        first identifying the number of objects normally allocated and
        using that as the working set of objects. A time element is also
        needed to identify when objects should be released from the
        free list back to the memory cache. Yes, the initial thought of
        per-CPU allocations is coded in, which would allow simultaneous
        access to a free list on each CPU. This should remove most of
        the mutex overhead that otherwise limits scalability.
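
        As a rough illustration of that per-CPU idea, here is a minimal
        sketch of the data structure I have in mind; the names (NCPU,
        WORKING_SET, struct pcpu_freelist and its fields) are mine and
        illustrative only, not taken from node_alloc.c:

            #include <stddef.h>
            #include <time.h>

            #define NCPU            4       /* assumed CPU count for the sketch */
            #define WORKING_SET     1024    /* nodes kept per CPU before any release */

            struct node {
                    struct node *next;
                    char payload[56];       /* node is roughly 64 bytes on a 64-bit box */
            };

            struct pcpu_freelist {
                    struct node *fl_head;           /* per-CPU singly linked free list */
                    size_t       fl_count;          /* nodes currently on this list */
                    time_t       fl_last_release;   /* when nodes were last given back */
            };

            /* One free list per CPU, so hot-path get/put needs no shared mutex. */
            static struct pcpu_freelist freelists[NCPU];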

        Objective
        ---------
           What this app code proves is that pre-alloc'ed items can be
        removed from and returned to a simple free list very cheaply.
        This is just an initial proof that shows "fast access" to a
        working set of objects.
        
           The time to make one chunk alloc, place all of the pieces on a
        free list, and then perform 500k removal and insertion ops is
        probably somewhere between 50x and 1000x faster than having even
        the best memory allocators allocate and free 500k items. If a
        dynamic list of nodes is required, the chunk alloc would need to
        change.
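
        For the other half of that comparison, the user-level analogue of
        the allocator path can be timed with an equally small loop. This
        sketch just times malloc()/free() of 500k 64-byte objects; it is
        not the kernel kmem path, only a rough, illustrative stand-in
        for it:

            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>

            #define LOOPS   500000
            #define OBJSZ   64

            int
            main(void)
            {
                    clock_t start = clock();

                    /* Allocate and immediately free one 64-byte object per loop. */
                    for (int i = 0; i < LOOPS; i++) {
                            void *p = malloc(OBJSZ);
                            if (p == NULL)
                                    return (1);
                            free(p);
                    }

                    printf("malloc/free of %d %d-byte objects: %.3f secs\n",
                        LOOPS, OBJSZ, (double)(clock() - start) / CLOCKS_PER_SEC);
                    return (0);
            }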

          This quick piece of app code runs in less than 0.1 sec for 500k
        retrieve and store ops. That is fast enough to grab 64-byte chunks
        even when dealing with a 1Gb Eth link. Even though this code is
        simplified, it indicates that the same benefit would hold against
        kmem_allocs, even ones that do not sleep.

        -------------

        The code does a single pre-alloc and then breaks up that allocation
        into N node-sized pieces. In the init section it takes each piece
        and places it on a free list. The assumption here is that we have
        a fixed, reasonable number of items. If the number of items is
        dynamic, the init could easily alloc a number of nodes and then
        use watermarks to alloc into and free from the free list as nodes
        are consumed.

        If this logic is used to front kmem ops, then any excess nodes
        on the free list could be returned to memory.
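
        A minimal sketch of that init path and of the get/put operations,
        under the fixed-size assumption (the identifiers -- node_init,
        node_get, node_put, NNODES, NODESZ -- are mine and not
        necessarily the ones used in the attached node_alloc.c):

            #include <stdlib.h>

            #define NNODES  500000          /* fixed working set size (assumed) */
            #define NODESZ  64              /* 64-byte objects, as above */

            struct node {
                    struct node *next;      /* rest of the 64 bytes is payload */
            };

            static struct node *free_head;  /* simple singly linked free list */
            static char *chunk;             /* the one large pre-allocation */

            static int
            node_init(void)
            {
                    /* Single chunk alloc instead of NNODES separate allocs. */
                    chunk = malloc((size_t)NNODES * NODESZ);
                    if (chunk == NULL)
                            return (-1);

                    /* Break the chunk into NODESZ pieces; push each on the list. */
                    for (size_t i = 0; i < NNODES; i++) {
                            struct node *n = (struct node *)(chunk + i * NODESZ);
                            n->next = free_head;
                            free_head = n;
                    }
                    return (0);
            }

            static struct node *
            node_get(void)
            {
                    struct node *n = free_head;     /* pop: no allocator call */

                    if (n != NULL)
                            free_head = n->next;
                    return (n);
            }

            static void
            node_put(struct node *n)
            {
                    n->next = free_head;            /* push back on the free list */
                    free_head = n;
            }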

        The base level of this type of logic is normally used when a
        special project requires non-standard interfaces to guarantee a
        HIGH level of performance.

        The main hard-codes 500k loops; each loop allocs one node and
        then frees it, so the equivalent of 500k allocs is done. This ran
        in 0.02 to 0.35 secs on a 1.3GHz laptop Linux box.
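
        The timing loop itself is trivial; appended to the sketch above
        it would look roughly like this (again illustrative, not a copy
        of node_alloc.c), and something like "time ./node_alloc" gives
        the numbers quoted:

            #include <stdio.h>

            int
            main(void)
            {
                    if (node_init() != 0)
                            return (1);

                    /* 500k loops: take one node off the free list, put it back. */
                    for (int i = 0; i < 500000; i++) {
                            struct node *n = node_get();
                            if (n == NULL)
                                    return (1);
                            node_put(n);
                    }

                    printf("done\n");
                    return (0);
            }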

        -----

        It is my understanding that Bonwick's new allocator was created
        to remove fragmentation. And yes, it also allows the OS to reduce
        the overhead of dealing with memory objects that processes free
        and alloc frequently. When the system gets low on memory, it
        steals freed objects that are being cached. However, even with no
        SLEEPING, I have yet to see it perform as fast as simple retrieves
        and stores.

        Years ago, the amount of memory on a system was limited due to
        its expense. This is no longer the case. Some/most
        processes/threads could see a decent increase in performance if
        the work done to obtain a working set of objects is minimized.
        Up to the size of this working set, I propose that an almost
        guaranteed level of performance could be achieved.

        I agree with the comment that any type of functionality that has
        merit should get an API so multiple processes / threads can use
        that functionality. Years ago I was in the process of doing that
        when I left a company with an ARC group. It was to add a layer
        of working-set memory objects that would have "fast access"
        properties.

        I will ONLY GUARANTEE that X working set objects, once freed to
        the FREE LIST, can be re-allocated without waiting for the
        objects. Any count beyond that working set has the same
        underlying properties as a normal allocation. Except that, if I
        KNOW the number on my free list has dropped to a small value, I
        could pre-request more objects, so the latency of retrieving
        those objects could be minimized.

        This logic then removes on-demand memory allocations, so any WAIT
        time MAY not affect the other parts of the process that need more
        objects.
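
        A hedged sketch of that low-watermark refill, building on the
        node_get() sketch above; LOW_WATER, REFILL_COUNT, free_count and
        refill_backend() are hypothetical names, not existing interfaces:

            #define LOW_WATER       64      /* pre-request when the list gets this low */
            #define REFILL_COUNT    256     /* how many nodes to ask for at once */

            static size_t free_count = NNODES;      /* maintained alongside the list */

            /* Backend that fetches more nodes, e.g. via kmem_alloc() in a kernel. */
            extern int refill_backend(int count);

            static struct node *
            node_get_with_refill(void)
            {
                    struct node *n = node_get();

                    if (n != NULL)
                            free_count--;

                    /*
                     * Kick off the refill before the list actually empties, so
                     * callers on the hot path never wait for the allocator.
                     */
                    if (free_count < LOW_WATER)
                            (void) refill_backend(REFILL_COUNT);

                    return (n);
            }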

        -----------------

        Mitchell Erblich


[EMAIL PROTECTED] wrote:
> 
> >Casper Dik,
> >
> >       Yes, I am familiar with Bonwick's slab allocators and tried
> >       it for wirespeed test of 64byte pieces for a 1Gb and then
> >       100Mb Eths and lastly 10Mb Eth. My results were not
> >       encouraging. I assume it has improved over time.
> 
> Nothing which tries to send 64 byte pieces over 1Gb ethernet or 100Mb
> ethernet will give encouraging results.
> 
> >       First, let me ask what happens to the FS if the allocs
> >       in the intent log code are sleeping waiting for memory????
> 
> How are you going to guarantee that there is *always* memory available?
> 
> I think that's barking up the wrong tree.  I think that a proper solution is
> not trying to find a way which prevents memory from running out but
> rather a way of dealing with the case of it running out.
> 
> If KMEM_SLEEP is used in a path where it is causing problems, then no
> amount of freelists is going to solve that.  There needs to be a solution
> which does not sleep.
> 
> >       - getting memory from a "cache" of ones own size/type
> >         is orders of magnitude higher than just getting some
> >         off one's own freelist,
> 
> Actually, that's not true; Bonwick's allocator is *FASTER* by a *wide*
> margin than your own freelist.
> 
> Believe me, I've measured this, I've seen "my own freelist" collapse
> on the floor when confronted with as few as two CPUs.
> 
> As a minimum, you will need *per CPU* free lists.
> 
> And that's precisely what the kernel memory allocator gives you.
> 
> >       In the time when memory was expensive, maybe a global
> >       sharing mechanisms would make sense, but when  the amount
> >       of memory is somewhat plentiful and cheap,
> 
> Not if all bits of the system are going to keep their own freelists * #CPUs.
> 
> Then you are suddenly faced with a *MUCH* higher memory demand.  The
> Bonwick allocator does keep quite a bit cached and keeps more memory
> unavailable already.
> 
> >       *** It then makes sense for a 2 stage implementation of
> >           preallocation of a working set and then normal allocation
> >           with the added latency.
> 
> But the normal Bonwick allocation *is* two-stage; you are proposing to
> add a 3rd stage.
> 
> >       So, it makes sense to pre-allocate a working set of allocs
> >       by a single alloc call, break up the alloc into needed sizes,
> >       and then alloc from your own free list,
> 
> That's what the Bonwick allocator does; so why are you duplicating this?
> 
> Apart from the questionable performance gain (I believe there to be none),
> the loss of the kernel memory allocator debugging functionality is severe:
> 
>         - you can no longer track where the individual blocks are allocated
>         - you can no longer track buffer overruns
>         - buffers run into one another, so one overrun buffer corrupts another
>           without trace
> 
> >       -----> if that freelist then empties, maybe then take the extra
> >       overhead with the kmem call. Consider this a expected cost to exceed
> >       a certain watermark.
> 
> This is exactly how the magazine layer works.
> 
> >       But otherwise, I bet if I give you some code for the pre-alloc, I bet
> >10
> >       allocs from the freelist can be done versus the kmem_alloc call, and
> >       at least 100 to 10k allocs if sleep occurs on your side.
> 
> I hope you're not designing this with a single lock per queue.
> 
> I have eradicated code in Solaris 9 which looked like this:
> 
>         struct au_buff *
>         au_get_buff(void)
>         {
>                 au_buff_t *buffer = NULL;
> 
>                 mutex_enter(&au_free_queue_lock);
> 
>                 if (au_free_queue == NULL) {
>                         if (au_get_chunk(1)) {
>                                 mutex_exit(&au_free_queue_lock);
>                                 return (NULL);
>                         }
>                 }
> 
>                 buffer = au_free_queue;
>                 au_free_queue = au_free_queue->next_buf;
>                 mutex_exit(&au_free_queue_lock);
>                 buffer->next_buf = NULL;
>                 return (buffer);
>         }
> 
> (with a corresponding free routine which never returned memory to the
> system but kept it in the freelist)
> 
> This was replaced with essentially:
> 
>         buffer = kmem_cache_alloc(au_buf_cache, KM_SLEEP);
> 
> The first bit of code stopped scaling at 1 CPU (the performance
> with two CPUs was slightly worse than with one CPU)
> 
> The second bit of code was both FASTER in the single CPU case and
> scaled to the twelve CPUs I had for testing.
> 
> >       Actually, I think it is so bad, that why don't you time 1 kmem_free
> >       versus grabbing elements off the freelist,
> 
> I did, it's horrendous.
> 
> Don't forget that in the typical case, when the magazine layer is properly
> sized after the system has been running for a while, no locks need to be
> grabbed to get memory as the magazines are per-CPU.
> 
> But with your single freelist, you must grab a lock.  Somewhere in
> the grab/release lock cycle there's at least one atomic operation
> and memory barrier.
> 
> Those are perhaps cheap on single CPU systems but run in the hundreds
> of cycles on larger systems.
> 
> Here's something else from Sun's "collective" memory which I think
> illustrates why we think private freelists are bad, /on principle/.
> 
> When Jeff Bonwick redid the memory allocator he did this because he
> noticed that the performance was bad, so bad even that people generally
> avoided the allocator if they could (hence the use of private freelists
> in particular parts of the system such as the above example from
> auditing).  This is particularly poor software engineering for two important
> reasons; if a core bit of software does not work or perform properly, it
> needs to be rewritten and not worked around; and if a core bit of software
> is worked around, any future improvements will go unnoticed.
> 
> The auditing example and the resulting poor performance of the auditing
> code amply demonstrates both points; there's even additional similarity
> between it and the intent log: both allocate buffers to be written to
> disk.
> 
> The object lesson of the Bonwick memory allocator, the addition of
> the magazine layer in a later release is this: if this tool is not
> up to the job, it must be improved not worked around.
> 
> So *if* the memory allocator is not able to deliver the QoS required
> for a particular part of the system, it stands to reason that it
> should be improved with QoS features to serve the need of those
> consumers.
> 
> (But I'm not sure that that is even the case; but then, I'm not sufficiently
> familiar with how ZIL or ZFS work and what exact issues there are)
> 
> Sorry for the lecture; my experience with single freelists of
> buffers to be written to disk is just something I thought I needed to
> share in this discussion.
> 
> Casper

Attachment: node_alloc.c
