On Aug 6, 3:07 am, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:
> On Aug 5, 6:09 am, Rich Hickey <richhic...@gmail.com> wrote:
>
>
>
> > On Wed, Aug 5, 2009 at 8:29 AM, Johann Kraus<johann.kr...@gmail.com> wrote:
>
> > >> Could it be that your CPU has a single floating-point unit shared by 4
> > >> cores on a single die, and thus only 2 floating-point units total for
> > >> all 8 of your cores?  If so, then that fact, plus the fact that each
> > >> core has its own separate ALU for integer operations, would seem to
> > >> explain the results you are seeing.
>
> > > Exactly, this would explain the behaviour. But unfortunately it is not
> > > the case. I implemented a small example using Java (Java Threads) and
> > > C (PThreads) and both times I get a linear speedup. See the attached
> > > code below. The cores only share 12 MB cache, but this should be
> > > enough memory for my micro-benchmark. Seeing the linear speedup in
> > > Java and C, I would negate a hardware limitation.
>
> > > _
> > > Johann
>
> > I looked briefly at your problem and don't see anything right off the
> > bat. Do you have a profiler and could you try that out? I'm
> > interested.
> > Rich
>
> I ran these tests on my iMac with 2.16 GHz Intel Core 2 Duo (2 cores)
> using latest Clojure and clojure-contrib from git as of some time on
> Aug 4, 2009.  The Java implementation is from Apple, version 1.6.0_13.
>
> ----------------------------------------------------------------------
> For int, there are 64 "jobs" run, each of which consists of doing
> (inc 0) 1,000,000,000 times.  See pmap-batch.sh and pmap-testing.clj
> for details.
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/398688c71525964...
>
> Yes, yes, I know.  I should really use a library for command line
> argument parsing to avoid so much repetitive code.  I may do that some
> day.
>
> Results for int 1 thread - jobs run sequentially
>
> "Elapsed time: 267547.789 msecs"
> real       269.22
> user       268.61
> sys          1.79
>
> int 2 threads - jobs run in 2 threads using modified-pmap, which
> limits the number of futures causing threads to run jobs to be at most
> 2 at a time.
>
> "Elapsed time: 177428.626 msecs"
> real       179.14
> user       330.30
> sys         15.46
>
> Comment: Elapsed time with 2 threads is about 2/3 of elapsed time with
> 1 thread.  Not as good as the 1/2 we'd like with a 2 core machine,
> but better than not being faster at all.
>
> ----------------------------------------------------------------------
> For double, there are 16 "jobs" run, each of which consists of doing
> (inc 0.1) 1,000,000,000 times.
>
> double 1 thread
>
> "Elapsed time: 258659.424 msecs"
> real       263.28
> user       247.29
> sys         12.17
>
> double 2 threads
>
> "Elapsed time: 229382.68 msecs"
> Dumping CPU usage by sampling running threads ... done.
> real       231.05
> user       380.79
> sys         11.49
>
> Comment: Elapsed time with 2 threads is about 7/8 of elapsed time with
> 1 thread.  Hardly any improvement at all for something that should be
> "embarrassingly parallel", and the user time reported by Mac OS X's
> /usr/bin/time increased by a factor of about 1.5.  That seems like way
> too much overhead for thread coordination.
>
> Here are hprof output files for the "double 1 thread" and "double 2
> threads" tests:
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> http://github.com/jafingerhut/clojure-benchmarks/blob/51d499c2679c2d5...
>
> In both cases, over 98% of the time is spent in
> java.lang.Double.valueOf(double d).  See the files for the full stack
> backtraces if you are curious.
>
> I don't see any reason why that method should have any kind of
> contention or worse performance when running on 2 cores vs. 1 core,
> but I don't know the guts of how it is implemented.  At least in
> OpenJDK all it does is "return new Double(d)", where d is the double
> arg to valueOf().  Is there any reason why "new" might exhibit
> contention between parallel threads?

Can you run your benchmarks with the number of concurrent threads
equal to the number of cores that you have?  The increase in
system time is interesting to me - is it possible that the JVM or OS
can detect threads that don't use floating point registers and
therefore not bother to save those registers when doing a thread
context switch?  If so, that is a significant amount of memory that
doesn't need to be touched during a context switch.
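
To take Clojure out of the picture, here is a rough Java sketch of the
kind of test I have in mind.  It is not taken from the benchmark
repository: the class name, method names and iteration count are all
made up for illustration, and it assumes Java 8 or later for the
lambda.  Each job just boxes a double over and over through
Double.valueOf, and the jobs are timed once on a single thread and once
on as many threads as the JVM reports cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the Clojure test; all names are illustrative.
public class BoxingScaling {

    // One "job": repeatedly box a double via Double.valueOf, the method
    // that dominated the hprof samples. Returns the final unboxed value.
    public static double job(long iterations) {
        Double boxed = Double.valueOf(0.0);
        for (long i = 0; i < iterations; i++) {
            boxed = Double.valueOf(boxed.doubleValue() + 1.0);
        }
        return boxed.doubleValue();
    }

    // Runs `jobs` jobs on a fixed pool of `threads` threads and returns
    // the elapsed wall-clock time in milliseconds.
    public static long timeJobs(int threads, int jobs, final long iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Double> task = () -> job(iterations);
            List<Future<Double>> pending = new ArrayList<>();
            long start = System.nanoTime();
            for (int j = 0; j < jobs; j++) {
                pending.add(pool.submit(task));
            }
            for (Future<Double> f : pending) {
                f.get(); // wait for every job to finish
            }
            return (System.nanoTime() - start) / 1_000_000L;
        } catch (Exception e) {
            throw new IllegalStateException("benchmark job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        long iterations = 100_000_000L; // assumed size; scale up as needed
        int jobs = 16;
        System.out.println("1 thread:  " + timeJobs(1, jobs, iterations) + " msecs");
        System.out.println(cores + " threads: " + timeJobs(cores, jobs, iterations) + " msecs");
    }
}
```

Running it under /usr/bin/time as well would show whether user and sys
time grow the same way they do in the Clojure runs.  One caveat: a JIT
that does escape analysis may optimize the allocation away, so the
numbers are only a hint and not a faithful copy of what Clojure does.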

Brad