On Jul 7, 6:23 am, Jon Harrop <j...@ffconsultancy.com> wrote:
> On Tuesday 07 July 2009 02:08:57 Bradbev wrote:
>
> > On Jul 6, 4:30 pm, fft1976 <fft1...@gmail.com> wrote:
> > > On Jul 5, 11:42 pm, Bradbev <brad.beveri...@gmail.com> wrote:
> > > > more to modern x86 chips.  After you have the best algorithm for the
> > > > job, you very quickly find that going fast is entirely bound by memory
> > > > speed (actually latency) - cache misses are the enemy.
>
> > > IME (outside JVM), this depends strongly on the kind of problem you
> > > are solving as well as your implementation (you need to know how to
> > > cache-optimize). One can easily think of problems that would fit
> > > entirely in cache, but take an enormous amount of time.
>
> > What sort of problems did you have in mind?  Anything that I can think
> > of quickly spills the cache.
>
> There are many examples in scientific computing where many small problems are
> attacked instead of one large problem. For example, practical use of FFTs
> falls into this category, with most users performing many transforms of no
> more than 1,024 elements rather than fewer, longer transforms. Time-frequency
> analysis usually takes n samples and produces an n×n grid over time and
> frequency representing the signal, where each frequency is computed from a
> separate FFT. So you can get away with naive distribution of FFTs across
> cores with no regard for cache coherence and still get very good performance.
>
Interesting - I've never dealt with problems in this domain; most of
my performance problems involve relatively simple transforms over
streams of data.  It would be quite a different mindset to program for
problems that fit entirely in cache, and it would be fun to try to
squeeze the theoretical peak out of a chip.  I guess at that level
you're mostly concerned with preventing pipeline stalls & instruction
conflicts, and I think I'd pretty much go for handwritten assembler.
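
The many-small-transforms workload described above can be sketched like this in Java (a minimal illustration, not anyone's actual code: the class name, pool size, transform count, and the toy impulse input are my own choices; the point is that each 1,024-element FFT is independent, so they can be thrown at a thread pool naively):

```java
import java.util.concurrent.*;

public class ManySmallFFTs {
    // Iterative radix-2 Cooley-Tukey FFT, in place on separate re/im arrays.
    static void fft(double[] re, double[] im) {
        int n = re.length;
        // Bit-reversal permutation.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) {
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // Butterfly passes.
        for (int len = 2; len <= n; len <<= 1) {
            double ang = -2 * Math.PI / len;
            double wRe = Math.cos(ang), wIm = Math.sin(ang);
            for (int i = 0; i < n; i += len) {
                double curRe = 1, curIm = 0;
                for (int k = 0; k < len / 2; k++) {
                    int a = i + k, b = i + k + len / 2;
                    double tRe = re[b] * curRe - im[b] * curIm;
                    double tIm = re[b] * curIm + im[b] * curRe;
                    re[b] = re[a] - tRe; im[b] = im[a] - tIm;
                    re[a] += tRe;        im[a] += tIm;
                    double nRe = curRe * wRe - curIm * wIm;
                    curIm = curRe * wIm + curIm * wRe;
                    curRe = nRe;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int transforms = 256, n = 1024;   // many small problems, each cache-resident
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        CompletionService<double[]> cs = new ExecutorCompletionService<>(pool);
        for (int t = 0; t < transforms; t++) {
            cs.submit(() -> {
                double[] re = new double[n], im = new double[n];
                re[0] = 1.0;              // unit impulse -> flat spectrum, easy to check
                fft(re, im);
                return re;
            });
        }
        for (int t = 0; t < transforms; t++) {
            double[] spectrum = cs.take().get();
            for (double v : spectrum)
                if (Math.abs(v - 1.0) > 1e-9) throw new AssertionError();
        }
        pool.shutdown();
        System.out.println("all " + transforms + " transforms OK");
    }
}
```

Each task allocates its own 1,024-element arrays, so the working set per core is a few KB and the scheduling really can be this naive.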

> I would actually say that inter-core cache effects are more important than
> conventional cache coherence today, because you get massive performance
> degradation if you cock it up and it is not at all obvious when that might
> occur, because it depends upon things like where the allocator places your
> data. For example, if you have cores mutating global counters then you must
> make sure they are spaced far enough apart in memory that none share a cache line.
Effects like that & cache-line aliasing are difficult to diagnose
without good tools.  Not to mention that on most machines the OS is
scheduling other threads & polluting your cache.
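
The counter-spacing Jon describes can be sketched in Java (a minimal illustration, with names and the 64-byte cache-line assumption mine; it demonstrates the layout trick, not a benchmark - each thread's counter lives at a stride-8 slot so no two counters share a line):

```java
public class PaddedCounters {
    // x86 cache lines are typically 64 bytes; a long is 8 bytes,
    // so a stride of 8 slots keeps each counter on its own line.
    static final int PAD = 8;

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        final long itersPerThread = 1_000_000L;

        // Counter for thread t sits at slot t * PAD; the intervening
        // slots are never touched and exist purely as padding.
        final long[] counters = new long[threads * PAD];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t * PAD;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < itersPerThread; i++) counters[slot]++;
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();

        long total = 0;
        for (int t = 0; t < threads; t++) total += counters[t * PAD];
        System.out.println(total == (long) threads * itersPerThread
                ? "counts OK" : "lost updates");
    }
}
```

With PAD = 1 the counters would be adjacent, every increment would bounce the shared line between cores, and throughput would drop badly even though the result stays correct - which is exactly why this is so hard to spot without tools.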

Brad


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---