On Mon, 2006-03-06 at 12:15 -0800, Luck, Tony wrote: 
> > Why do we need this?
> >
> > Correctness: ...
> > Consolidation: ...
> 
> Any performance measurements?  ia64 has gone to some lengths to
> ensure that gettimeofday(2) has a very low overhead (it does
> show up in the kernel profiles of some benchmarks ... managed
> runtime systems in particular, but others like to get timestamps
> at quite frighteningly short intervals).  So I'm interested in
> smp cases where a moderate number of cpus (4-16) are pounding
> on gettimeofday().  I think that the huge SMP systems running
> HPC workloads spend less of their time asking what time it is,
> so I'm not as worried about the 512 cpu ... but if all 512 cpus
> do happen to call gettimeofday() at the same time the system
> shouldn't sink into the swamp as cache-lines bounce around.

This is a good question, as I haven't run much in the way of performance
measurements recently. More below on that.

I'd be interested in hearing more about specifically what ia64 has done
(reading the fsyscall asm is not my idea of fun :), but I'd like to
think that generally it shouldn't be an issue. First of all, the
infrastructure is there for arch specific optimizations like
VDSO/vsyscall implementations (or fsyscall on ia64). In fact, it makes
it much easier to implement the vsyscall feature on arches that do not
yet support it (I have a patch for i386 that was pretty straight
forward, although I still need to sort out the unwind bits for gdb). And
secondly, larger SMP systems (PPC and SPARC being exceptions) tend to
require slower mmapped clocksources, and the cost of reading those tends
to dominate gettimeofday() anyway.

But part of the issue is that a number of arches have done their own
arch-specific optimizations that are, for the most part, generically
applicable. This goes back to the consolidation point. My
hope is to join these divergent efforts to the benefit of all. This may
sound a bit naive and wide-eyed, and I'm fine allowing for arch specific
optimizations where they are necessary, but really, what are we doing in
almost all cases? Reading some hardware, converting it to nanoseconds
and adding it to a base value, all under a seq_read_lock. We don't need
a dozen implementations of this, and it makes other features harder to
implement, because we don't know things like: which arches will function
if we disable interrupts for a bit, or what the hardware-level
resolution of clock_gettime() is.
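For what it's worth, the common pattern I'm describing can be sketched
roughly like this. This is a toy userspace mock, not actual kernel code:
the names, the fake counter, and the mult/shift values are all mine, and
the seqlock is reduced to a bare sequence counter for illustration:

```c
#include <stdint.h>

/* Toy sketch of the common timekeeping read path: read the hardware
 * counter, convert the cycle delta to nanoseconds, and add it to a
 * base value, all inside a seqlock-style retry loop. All names and
 * values here are made up for illustration.
 */

static unsigned seq;             /* sequence counter; writers bump it */
static uint64_t base_ns = 1000;  /* nanoseconds at last update */
static uint64_t cycles_last;     /* counter value at last update */
static uint32_t mult = 3;        /* cycles -> ns multiplier (fake) */
static unsigned shift = 1;       /* cycles -> ns shift (fake) */

/* Fake hardware counter; returns a fixed value for determinism. */
static uint64_t clocksource_read(void)
{
	return 10;
}

uint64_t sketch_gettime_ns(void)
{
	unsigned s;
	uint64_t cycles, delta, ns;

	do {
		s = seq;                     /* like read_seqbegin()  */
		cycles = clocksource_read(); /* 1. read the hardware  */
		delta = cycles - cycles_last;
		ns = base_ns + ((delta * mult) >> shift); /* 2. cyc->ns */
	} while (s != seq);                  /* like read_seqretry()  */

	return ns;                           /* 3. base + delta       */
}
```

A caller just invokes sketch_gettime_ns(); everything arch-specific is
hidden behind the counter read and the mult/shift pair, which is exactly
why so little of this needs to be per-arch.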


But back to the actual performance numbers: 

The last time I generated numbers for i386, the patch slowed
gettimeofday() by ~2%, which was the worst case I could generate, using
the clocksource
with the lowest overhead (TSC). Most of this was due to some extra u64
usage and the lack of a generic mul_u64xu32 wrapper. However for this
cost, you get correct behavior (which I think is *much* more important,
at least for i386) and nanosecond resolution in clock_gettime().
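To illustrate the mul_u64xu32 point: the conversion only ever needs a
64x32 multiply, so a generic wrapper could look something like the
sketch below. The name and the splitting strategy are my own
assumptions, not an existing kernel API; the idea is that a 32-bit arch
could override it with a cheap 64x32 asm multiply instead of paying for
a full 64x64 libgcc multiply:

```c
#include <stdint.h>

/* Hypothetical generic 64x32 multiply-and-shift helper. Splitting the
 * 64-bit cycle count means each partial product is a 64x32 -> 64
 * multiply. Exact for shift <= 32 (the hi partial product contributes
 * no bits below the shift point), overflow above 64 bits is ignored,
 * as in the usual cycles -> ns conversion. */
static inline uint64_t mul_u64xu32_shr(uint64_t cycles, uint32_t mult,
				       unsigned shift)
{
	uint64_t lo = (uint32_t)cycles;
	uint64_t hi = cycles >> 32;

	return ((lo * mult) >> shift) + ((hi * mult) << (32 - shift));
}
```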

Currently I suspect the impact is a bit worse with the patches in -mm,
since cycle_t was set back to a u64 to be extra robust in the case of 2
seconds of lost ticks. That's more of a -RT tree concern, so I'd be fine
setting that back to an unsigned long for mainline.
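To make that tradeoff concrete: a 32-bit cycle_t wraps after 2^32
counter ticks, so the safe interval between accumulations shrinks as
counter frequency goes up. A little arithmetic sketch (the frequencies
below are just illustrative assumptions, not measurements):

```c
#include <stdint.h>

/* How long a 32-bit cycle delta stays unambiguous, in nanoseconds:
 * 2^32 ticks divided by the counter frequency. At 1 GHz that's about
 * 4.3 seconds; at 4 GHz it's only about 1.07 seconds, i.e. already
 * less than the 2 seconds of lost ticks mentioned above. */
uint64_t wrap_window_ns(uint64_t freq_hz)
{
	return (UINT64_C(1) << 32) * UINT64_C(1000000000) / freq_hz;
}
```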

And while it's not gettimeofday(), I'm working with Roman on further
reducing the overhead in the periodic_hook() accumulation function, so
there is less periodic overhead as well.

I'll try to generate some more numbers and put some more focus on the
performance aspect of the current patches. If anyone does get a chance
to look the code over in detail, please do point out any specific
concerns or ideas for improvements.  

thanks
-john

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
