On 30.09.2010 19:15, Matthew Fleming wrote:
On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann <an...@freebsd.org> wrote:
Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.

The VM system has two major structures:
  1) the VM map, which is per process and manages its address space
    by tracking all regions that are allocated and their permissions;
  2) the VM page table, which is per VM map and provides the backing
    memory pages to the regions and interfaces with the pmap layer
    (virtual->physical).

Both of these are very frequently accessed through memory allocations
from userspace and page faults from the pmap layer.  Their efficiency
and SMP scalability are crucial for many operations, beginning with
fork/exec, going through malloc/mmap and ending with free/exit of a
process.

Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long-to-traverse linked
lists or more memory-expensive hash tables.  Some structures do have
an additional linked list, though, to simplify ordered traversals.

A splay tree is an interesting binary search tree with insertion,
lookup and removal performed in O(log n) *amortized* time.  The
*amortized* bound is the crucial difference to other binary trees.
On every access, *including* lookups, it rotates the tree to make the
just-found element the new root node.  For all the gory details see:
  http://en.wikipedia.org/wiki/Splay_tree
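
As a minimal illustration, here is a userland sketch built on the
generic SPLAY_*() macros from <sys/tree.h> (the VM code uses its own
hand-rolled splay, so this only demonstrates the behavior, not the
actual implementation).  Note that a plain lookup already rewrites
the tree:

  #include <sys/tree.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct node {
      SPLAY_ENTRY(node) links;
      int key;
  };

  static int
  node_cmp(struct node *a, struct node *b)
  {
      return (a->key - b->key);
  }

  SPLAY_HEAD(ntree, node);
  SPLAY_PROTOTYPE(ntree, node, links, node_cmp);
  SPLAY_GENERATE(ntree, node, links, node_cmp);

  int
  main(void)
  {
      struct ntree head = SPLAY_INITIALIZER(&head);
      struct node *n, find;
      int i;

      for (i = 0; i < 8; i++) {
          n = malloc(sizeof(*n));
          n->key = i;
          SPLAY_INSERT(ntree, &head, n);
      }

      find.key = 3;
      /* A read-only operation from the caller's point of view ... */
      n = SPLAY_FIND(ntree, &head, &find);
      /* ... yet the tree has been rotated: 3 is now the root. */
      printf("found %d, root is now %d\n",
          n->key, SPLAY_ROOT(&head)->key);
      return (0);
  }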

This behavior has a few important implications:
  1) the root node and the constant rotation work like a cache, with
    the most recently looked-up node at the root and other recently
    used nodes close to the top;
  2) every lookup that doesn't match the root node, i.e. a cache miss,
    causes a rotation of the tree to make the newly found node the new
    root;
  3) during the rotation it has to re-arrange and touch a possibly
    large number of other nodes;
  4) in the worst case the tree may resemble a linked list.  A splay
    tree makes no guarantee that it is balanced.

For the kernel and the splay tree usage in the VM map and page table
some more implications come up:
  a) the amortization effect/cost only pays off if the majority of
    lookups are root node (cache) hits; otherwise the cost of
    rebalancing destroys any advantage and quickly turns into a
    large overhead.
  b) the overhead becomes even worse due to touching many nodes and
    dirtying a lot of CPU cache lines.  For processes with shared
    memory or threads spread across CPUs this overhead becomes
    almost excessive because the CPUs stomp on each other's cached
    nodes and get their own caches invalidated.  The penalty is a
    high-latency memory refetch not only for the root node lookup
    but for every hop in the following rebalancing operation =>
    thrashing.

To quantify the positive and negative effects of the splay tree I
instrumented the code and added counters to track its behavior in
the vmmap and page table.
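
The counters mirror the labels in the charts below (insert, remove,
lookup, cachehit, rotates, rotatemodifies).  As a rough sketch of the
approach, with hypothetical names and sysctl paths (the real
instrumentation is in the stats diff linked further down):

  #include <sys/param.h>
  #include <sys/kernel.h>
  #include <sys/sysctl.h>

  /* One counter per splay event, bumped in the splay code paths. */
  static u_long vm_page_splay_inserts;
  static u_long vm_page_splay_removes;
  static u_long vm_page_splay_lookups;
  static u_long vm_page_splay_cachehits;   /* lookup matched the root */
  static u_long vm_page_splay_rotates;     /* splay rotations performed */
  static u_long vm_page_splay_rotatemods;  /* nodes touched by rotations */

  /* Exported read-only so the numbers can be collected after a run. */
  SYSCTL_ULONG(_vm, OID_AUTO, page_splay_lookups, CTLFLAG_RD,
      &vm_page_splay_lookups, 0, "vm_page splay lookups");
  SYSCTL_ULONG(_vm, OID_AUTO, page_splay_cachehits, CTLFLAG_RD,
      &vm_page_splay_cachehits, 0, "vm_page splay root node hits");
  /* ... and so on for the remaining counters. */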

The test case is an AMD64 kernel build to get a first overview.
Other system usages may have different results depending on their
fork and memory usage patterns.

The results are very interesting:

  The page table shows a *very* poor root node hit rate and an excessive
  amount of rotational node modifications representing almost the
  worst case for splay trees.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

  The vmmap shows a high root node hit rate but still a significant
  amount of rotational node modifications.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
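
For reference, the raw counters encoded in the two chart URLs are:

                    page table       vmmap
  insert              16946915      724293
  remove              16719791      723701
  lookup              48872230    20881583
  cachehit              131057    19044544
  rotates             74858589     3719582
  rotatemodifies     105206121     4553350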

From these observations I come to the conclusion that a splay tree
is suboptimal for the common case and quickly becomes horrible even
on small deviations from it in the vmmap.  For the page table it is
always horrible.  Not to mention the locking complications due to
modifying the tree on lookups.

Based on an examination of other binary trees I put forward the
hypothesis that a Red-Black tree is a much better, if not the optimal,
fit for the vmmap and page table.  RB trees are balanced binary trees
and O(log n) in all cases.  The big advantage in this context is that
lookups are pure reads and do not cause CPU cache invalidations on
other CPUs, and they only ever require a read lock, without the worst
case properties of the unbalanced splay tree.  The high cache locality
of the vmmap lookups can be exploited with an RB tree as well by
simply adding a pointer to the most recently found node.  To avoid
write locking this can be done lazily.  More profiling is required to
make a non-speculative statement on this though.  In addition, some
of the linked lists that currently accompany the vmmap and page
structures would no longer be necessary, since they can easily be
replaced with standard RB tree accessors.  Our standard userspace
jemalloc also uses RB trees for its internal housekeeping.  RB tree
details:
  http://en.wikipedia.org/wiki/Red-black_tree
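
A minimal sketch of the lookup-with-hint idea, using the generic
RB_*() macros from <sys/tree.h> and made-up names (the posted diffs
may structure this differently):

  #include <sys/tree.h>

  struct vmp {
      RB_ENTRY(vmp) links;
      unsigned long pindex;
  };

  static int
  vmp_cmp(struct vmp *a, struct vmp *b)
  {
      if (a->pindex < b->pindex)
          return (-1);
      return (a->pindex > b->pindex);
  }

  RB_HEAD(vmp_tree, vmp);
  RB_PROTOTYPE(vmp_tree, vmp, links, vmp_cmp);
  RB_GENERATE(vmp_tree, vmp, links, vmp_cmp);

  struct vmp_root {
      struct vmp_tree tree;
      struct vmp *hint;             /* most recently found node */
  };

  static struct vmp *
  vmp_lookup(struct vmp_root *r, unsigned long pindex)
  {
      struct vmp *p, key;

      /* Check the hint first; a hit avoids walking the tree at all. */
      p = r->hint;
      if (p != NULL && p->pindex == pindex)
          return (p);

      key.pindex = pindex;
      p = RB_FIND(vmp_tree, &r->tree, &key);  /* pure read, no rotation */
      if (p != NULL)
          r->hint = p;              /* lazy update, as proposed above */
      return (p);
  }

The hint is only an accelerator: if it is stale the lookup simply
falls back to the ordinary O(log n) RB_FIND(), which never modifies
the tree.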

I say hypothesis because I haven't measured the difference against
an RB tree implementation yet.  I've hacked up a crude and somewhat
mechanical patch to convert the vmmap and page VM structures to
use RB trees; the vmmap part is not stable yet, but the page part
seems to work fine so far.

This is what I've hacked together so far:
  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff

The diffs are still in their early stages and do not yet make use of
the code simplifications made possible by using RB trees instead of
the current splay trees.

Comments on the VM issue and splay vs. RB tree hypothesis welcome.

Do you have actual performance numbers from a real workload?

No, not yet, otherwise I obviously would have posted those numbers
as well.  What do you consider a representative real workload?

I implemented an AA tree locally (basically a 2-3 tree that uses
coloring to represent the 3-nodes) and the performance was much worse
than the splay tree.  I assume this is due to the overhead of keeping
the tree balanced; I didn't instrument either the splay or the AA
code.  I suspect that the overhead of an RB tree would similarly wash
out the benefits of the O(log_2 n) bound, but this is theory.  The
facts would come from measuring several different workloads and
comparing the system/real times between the two solutions.

I think there are two pronounced effects to consider:
 a) the algorithmic complexity
 b) the real-world implications of said complexity and of the
    operations on today's CPUs with their multi-level caches and
    the associated SMP side effects.

We've talked internally at $work about using a B+-tree with a
branching factor of maybe 5-7, whatever makes sense for the size of
a cache line.  This seems likely to be better for performance than an RB-tree
but does require a lot more changes, since separate memory is needed
for the tree's nodes outside the vm_page structure.  There just hasn't
been time to implement it and try it out.
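
As a back-of-the-envelope sketch of such a node (the numbers are
illustrative only, nothing here is from an actual implementation):
with 64-byte cache lines and 8-byte keys and pointers, a fanout of 7
keeps an interior node within two cache lines and the key comparisons
within the first one.  The node lives outside the vm_page structure,
as noted above.

  #include <stdint.h>

  #define BT_FANOUT 7                    /* children per interior node */

  /* Interior node, allocated separately from the vm_page structures. */
  struct bt_inode {
      uint64_t keys[BT_FANOUT - 1];      /* 6 * 8 = 48 bytes, line 1 */
      uint64_t nkeys;                    /* number of valid keys */
      void *child[BT_FANOUT];            /* 7 * 8 = 56 bytes, mostly line 2 */
  } __attribute__((aligned(64)));        /* sizeof == 128: two cache lines */
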
Unfortunately I won't have the time to experiment at $work for a few
months on this problem.

--
Andre
