This is some really nice analysis, Pete. One thing we might consider for reducing context switches would be to reuse the coalescing ideas we've applied to metadata syncs for context switches as well. If a lot of operations are completing in a particular module (trove, for example), we can signal once instead of once per operation. The sync coalescing code already does this, but you would only see benefits from that with a bunch of clients.

The locking and context switches are inherent in our design: the separation of server modules and the queueing framework. I think it would be hard to eliminate the context switches without some serious redesign.

We might be able to eliminate the trove thread, though. It doesn't do anything but move items from the trove completion queue to the job completion queue. Since that thread waits on a condition variable (and gets signalled by trove), and then signals the job completion condition variable, we're essentially doing a double context switch where we only need one. Instead we could change the trove APIs to take a callback and user pointer, and have the callback add the completed job to the completion queue directly. The bits of flow that use job callbacks with trove would have to be changed too, but I think the flows would also benefit from having the BMI callback called directly from trove. Does this seem reasonable?

-sam

On Dec 11, 2006, at 10:26 AM, Pete Wyckoff wrote:

I've been looking at IB latency, and made some improvements.
Thought I'd report to the list some more general observations too.

We call gettimeofday() a lot on the server.  We also do lots
of pthreads mutex and condition-wait operations.  These all have a
significant cost, and show up in system-wide profiling.

There are a few options for kernel-provided time services.  On a
single-processor setup, the TSC is your best option, as it uses a
cycle counter in the processor.  But on multi-processor machines
this rarely works, because the counters on different processors are
not synchronized, so the kernel disables it for SMP.  If you have an
HPET, that is supposed to be very fast and work for SMP, but we
aren't so lucky here on our 2-way Opterons.  Finally, the old slow
fallback called "pmtimer" reads the ACPI PM timer, requiring
inb/outb operations to get the time.

Test setup: 1 client, 1 MD + IO server.  Disable client acache.  Put
storage on a tmpfs.  Create a single file in an empty file system.
Use PVFS_sys_getattr() to get the attributes 10k times in a loop.
The results are very repeatable with low standard deviation.
Round-trip time to do one operation is:

    4-threaded server, 2 cpu, pmtimer:   44 us
    1-thread   server, 2 cpu, pmtimer:   35 us
    1-thread   server, 1 cpu, TSC:       29 us

Note the first line is the default build.  You have to edit
Makefile.in to get a single-threaded server.

Using the slow pmtimer compared to the fast TSC costs 6 us (21%).
Nothing to do about that but avoid using gettimeofday().

Using four threads on the server adds another 9 us (26%).  This
comes from mutex and condition activity in the fast path of every
operation.

Looking at create times in the same scenario, the results are almost
exactly multiplied by four, for the four RPCs necessary to do a
create.

I looked a bit at how to reduce some of the thread overheads, but
was afraid to change anything significant.  I'm not advocating
getting rid of the threads, since they may allow overlapping of
operations, especially when the network, disk, and state machines
are all busy.  But there are a lot of little locks to grab and
release along the way for every trove op and every BMI op, and they
add up, and there are many context switches that have to happen to
push an op through its path on the server.  I don't have any
thoughts on how to simplify all that.

If you actually do anything to real disk, none of this overhead will
show up.  But for those with battery-backed cache or solid state
RAM disk, these overheads will be in the way.

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

