I've been looking at IB latency, and made some improvements.
Thought I'd report to the list some more general observations too.
We call gettimeofday() a lot on the server. We also do lots
of pthreads mutex and condition-wait operations. These all have a
significant cost, and show up in system-wide profiling.
There are a few options for kernel-provided time services. On a
single-processor machine, the TSC is the best choice: it reads a
cycle counter right on the processor. But on multi-processor
machines the TSCs are generally not synchronized across CPUs, so
the kernel disables that clocksource for SMP. An HPET, if present,
is supposed to be fast and SMP-safe, but we aren't so lucky here
on our 2-way Opterons. That leaves the old slow fallback,
"pmtimer", which reads the ACPI PM timer with inb port I/O to get
the time.
Test setup: 1 client, 1 MD + IO server. Disable client acache. Put
storage on a tmpfs. Create a single file in an empty file system.
Use PVFS_sys_getattr() to get the attributes 10k times in a loop.
The results are very repeatable with low standard deviation.
Round-trip time to do one operation is:
  4-threaded server, 2 CPUs, pmtimer: 44 us
  1-thread server,   2 CPUs, pmtimer: 35 us
  1-thread server,   1 CPU,  TSC:     29 us
Note the first line is the default build. You have to edit
Makefile.in to get a single-threaded server.
Using the slow pmtimer instead of the fast TSC costs 6 us (21%).
There's nothing to be done about that except to avoid calling
gettimeofday() where we can.
Using four threads on the server adds another 9 us (26%). This
comes from mutex and condition activity in the fast path of every
operation.
Looking at create times in the same scenario, the results are
almost exactly four times larger, matching the four RPCs needed
to do a create.
I looked a bit at how to reduce some of the thread overhead, but
was afraid to change anything significant. I'm not advocating
getting rid of the threads; they presumably allow overlapping of
operations, especially when the network, the disk, and the state
machines are all busy. But there are a lot of little locks to grab
and release along the way for every trove op and every BMI op, and
they add up, and many context switches are needed to push an op
through its path on the server. I don't have any concrete thoughts
on how to simplify all that.
If you're actually doing I/O to a real disk, none of this overhead
will show up, since disk latency dominates. But for those with
battery-backed caches or solid-state RAM disks, these overheads
will be in the way.
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers