Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-28 Thread Don Stewart
so I had a look at the code. The loops are all fine. replicateM_ isn't 
a problem, but getDot is decidedly non trivial. Lots of pattern matching
on different vector forms, and to top it off ffi calls.

With some inlining in the blas library I was able to cut a few seconds
off the running time, but getDot looks to be fundamentally a bit
complicated in the current implementation.

I wonder if you'll get different results with hmatrix?

Anyway, this is a library issue. Better take it up with Patrick.
Pass on to the library author the C code, the Haskell you think should
be compiled identically.

-- Don

aeyakovenko:
> i get the same crappy performance with:
> 
> $ cat htestdot.hs
> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
> -fglasgow-exts -fbang-patterns -lcblas#-}
> module Main where
> 
> import Data.Vector.Dense.IO
> import Control.Monad
> 
> main = do
>let size = 10
>let times = 10*1000*1000
>v1::IOVector Int Double <- newListVector size $ replicate size 0.1
>v2::IOVector Int Double <- newListVector size $ replicate size 0.1
>replicateM_ times $ v1 `getDot` v2
> 
> 
> 
> On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel <[EMAIL PROTECTED]> wrote:
> > On Friday 27 June 2008, Anatoly Yakovenko wrote:
> >> $ cat htestdot.hs
> >> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
> >> -fglasgow-exts -fbang-patterns -lcblas#-}
> >> module Main where
> >>
> >> import Data.Vector.Dense.IO
> >> import Control.Monad
> >>
> >> main = do
> >>let size = 10
> >>let times = 10*1000*1000
> >>v1::IOVector Int Double <- newListVector size $ replicate size 0.1
> >>v2::IOVector Int Double <- newListVector size $ replicate size 0.1
> >>sum <- foldM (\ ii zz -> do
> >>   rv <- v1 `getDot` v2
> >>   return $ zz + rv
> >>   ) 0.0 [0..times]
> >>print $ sum
> >
> > Hackage is down for the time being, so I can't install blas and look at the
> > core for your program. However, there are still some reasons why this code
> > would be slow.
> >
> > For instance, a brief experiment seems to indicate that foldM is not a good
> > consumer in the foldr/build sense, so no deforestation occurs. Your program
> > is iterating over a 10-million element lazy list. That's going to add
> > overhead. I wrote a simple test program which just adds 0.1 in each
> > iteration:
> >
> >  snip 
> >
> > {-# LANGUAGE BangPatterns #-}
> >
> > module Main (main) where
> >
> > import Control.Monad
> >
> > main = do
> >  let times = 10*1000*1000
> >  sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times]
> > --  sum <- foo 0 times 0.0
> >  print $ sum
> >
> > foo :: Int -> Int -> Double -> IO Double
> > foo k m !zz
> >  | k <= m = foo (k+1) m (zz + 0.1)
> >  | otherwise = return zz
> >
> >  snip 
> >
> > With foldM, it takes 2.5 seconds on my machine. If you comment that line, 
> > and
> > use foo instead, it takes around .1 seconds. So that's a factor of what, 
> > 250?
> > That loop allows for a lot more unboxing, which allows much better code to 
> > be
> > generated.
> >
> > When Hackage comes back online, I'll take a look at your code, and see if I
> > can make it run faster, but you might want to try it yourself in the time
> > being. Strictifying the addition of the accumulator is probably a good idea,
> > for instance.
> >
> > Cheers,
> > -- Dan
> >
> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
Sat Jun 28 00:33:17 PDT 2008  Don Stewart <[EMAIL PROTECTED]>
  * Some inlining and unpacking for DVectors

New patches:

[Some inlining and unpacking for DVectors
Don Stewart <[EMAIL PROTECTED]>**20080628073317] {
hunk ./BLAS/Internal.hs 115
+{-# INLINE checkVecVecOp #-}
hunk ./Data/Vector/Dense/Internal.hs 78
-data DVector t n e = 
-  DV { fptr   :: !(ForeignPtr e) -- ^ a pointer to the storage region
- , offset :: !Int-- ^ an offset (in elements, not bytes) 
to the first element in the vector. 
- , len:: !Int-- ^ the length of the vector
- , stride :: !Int-- ^ the stride (in elements, not bytes) 
between elements.
+data DVector t n e =
+  DV { fptr   :: {-# UNPACK #-} !(ForeignPtr e) -- ^ a pointer to the 
storage region
+ , offset :: {-# UNPACK #-} !Int-- ^ an offset (in 
elements, not bytes) to the first element in the vector. 
+ , len:: {-# UNPACK #-} !Int-- ^ the length of the 
vector
+ , stride :: {-# UNPACK #-} !Int-- ^ the stride (in 
elements, not bytes) between elements.
hunk ./Data/Vector/Dense/Internal.hs 84
-| C !(DVector t n e)-- ^ a conjugated vector
+| C {-# UNPACK #-} !(DVector t n e)-- ^ a conjugated vector
hunk ./Data/Vector/Dense/Internal.hs 92
+{-# INLINE coerceVector #-}
hunk ./Data/Vector/Dense/Internal.hs 424
+
hunk ./Data/Vector/Dense/Operations.hs 144
+{-# INLI

Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-27 Thread Don Stewart
aeyakovenko:
> i get the same crappy performance with:
> 
> $ cat htestdot.hs
> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
> -fglasgow-exts -fbang-patterns -lcblas#-}
> module Main where
> 
> import Data.Vector.Dense.IO
> import Control.Monad
> 
> main = do
>let size = 10
>let times = 10*1000*1000
>v1::IOVector Int Double <- newListVector size $ replicate size 0.1
>v2::IOVector Int Double <- newListVector size $ replicate size 0.1
>replicateM_ times $ v1 `getDot` v2

replicateM_ is using a list underneath for control as well,

replicateM n x= sequence (replicate n x)

Try writing a simple recursive loop, as Dan suggested. No list node
forcing overhead, so in a very tight loop you'll just want the index in
a register.

See here for more examples of tight register loops,

http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16#fast

In general, if you're chasing C performance for a loop, your best bet is
to write a loop first. Then later see if you can get the same kind of
code from higher order, lazy, monadic functions.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-27 Thread Anatoly Yakovenko
i get the same crappy performance with:

$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   replicateM_ times $ v1 `getDot` v2



On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel <[EMAIL PROTECTED]> wrote:
> On Friday 27 June 2008, Anatoly Yakovenko wrote:
>> $ cat htestdot.hs
>> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
>> -fglasgow-exts -fbang-patterns -lcblas#-}
>> module Main where
>>
>> import Data.Vector.Dense.IO
>> import Control.Monad
>>
>> main = do
>>let size = 10
>>let times = 10*1000*1000
>>v1::IOVector Int Double <- newListVector size $ replicate size 0.1
>>v2::IOVector Int Double <- newListVector size $ replicate size 0.1
>>sum <- foldM (\ ii zz -> do
>>   rv <- v1 `getDot` v2
>>   return $ zz + rv
>>   ) 0.0 [0..times]
>>print $ sum
>
> Hackage is down for the time being, so I can't install blas and look at the
> core for your program. However, there are still some reasons why this code
> would be slow.
>
> For instance, a brief experiment seems to indicate that foldM is not a good
> consumer in the foldr/build sense, so no deforestation occurs. Your program
> is iterating over a 10-million element lazy list. That's going to add
> overhead. I wrote a simple test program which just adds 0.1 in each
> iteration:
>
>  snip 
>
> {-# LANGUAGE BangPatterns #-}
>
> module Main (main) where
>
> import Control.Monad
>
> main = do
>  let times = 10*1000*1000
>  sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times]
> --  sum <- foo 0 times 0.0
>  print $ sum
>
> foo :: Int -> Int -> Double -> IO Double
> foo k m !zz
>  | k <= m = foo (k+1) m (zz + 0.1)
>  | otherwise = return zz
>
>  snip 
>
> With foldM, it takes 2.5 seconds on my machine. If you comment that line, and
> use foo instead, it takes around .1 seconds. So that's a factor of what, 250?
> That loop allows for a lot more unboxing, which allows much better code to be
> generated.
>
> When Hackage comes back online, I'll take a look at your code, and see if I
> can make it run faster, but you might want to try it yourself in the time
> being. Strictifying the addition of the accumulator is probably a good idea,
> for instance.
>
> Cheers,
> -- Dan
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-27 Thread Dan Doel
On Friday 27 June 2008, Anatoly Yakovenko wrote:
> $ cat htestdot.hs
> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
> -fglasgow-exts -fbang-patterns -lcblas#-}
> module Main where
>
> import Data.Vector.Dense.IO
> import Control.Monad
>
> main = do
>let size = 10
>let times = 10*1000*1000
>v1::IOVector Int Double <- newListVector size $ replicate size 0.1
>v2::IOVector Int Double <- newListVector size $ replicate size 0.1
>sum <- foldM (\ ii zz -> do
>   rv <- v1 `getDot` v2
>   return $ zz + rv
>   ) 0.0 [0..times]
>print $ sum

Hackage is down for the time being, so I can't install blas and look at the 
core for your program. However, there are still some reasons why this code 
would be slow.

For instance, a brief experiment seems to indicate that foldM is not a good 
consumer in the foldr/build sense, so no deforestation occurs. Your program 
is iterating over a 10-million element lazy list. That's going to add 
overhead. I wrote a simple test program which just adds 0.1 in each 
iteration:

 snip 

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Control.Monad

main = do
  let times = 10*1000*1000
  sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times]
--  sum <- foo 0 times 0.0
  print $ sum

foo :: Int -> Int -> Double -> IO Double
foo k m !zz
  | k <= m = foo (k+1) m (zz + 0.1)
  | otherwise = return zz

 snip 

With foldM, it takes 2.5 seconds on my machine. If you comment that line, and 
use foo instead, it takes around .1 seconds. So that's a factor of what, 250? 
That loop allows for a lot more unboxing, which allows much better code to be 
generated.

When Hackage comes back online, I'll take a look at your code, and see if I 
can make it run faster, but you might want to try it yourself in the time 
being. Strictifying the addition of the accumulator is probably a good idea, 
for instance.

Cheers,
-- Dan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-27 Thread Anatoly Yakovenko
> I suspect that it is your initialization that is the difference.  For
> one thing, you've initialized the arrays to different values, and in
> your C code you've fused what are two separate loops in your Haskell
> code.  So you've not only given the C compiler an easier loop to run
> (since you're initializing the array to a constant rather than to a
> sequence of numbers), but you've also manually optimized that
> initialization.  In fact, this fusion could be precisely the factor of
> two.  Why not see what happens in Haskell if you create just one
> vector and dot it with itself? (of course, that'll also make the blas
> call faster, so you'll need to be careful in your interpretation of
> your results.)

The difference cant be in the initialization.   I am calling the dot
product a million times, the malloc and init in both cases are
insignificant.  Also, "fusing" the two loops in C probably wont help,
if anything having each loop run separate is likely to be faster and
result in less cache misses.

In this case, i am using vectors of size 10 only, and calling the loop
10 million times, haskell is far far slower, or 35 times.  That's
pretty crappy.


$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   sum <- foldM (\ ii zz -> do
  rv <- v1 `getDot` v2
  return $ zz + rv
  ) 0.0 [0..times]
   print $ sum


$ ghc --make htestdot.hs
$ time ./htestdot
1.0001e7

real0m17.328s
user0m17.320s
sys 0m0.010

$ cat testdot.c
#include 
#include 
#include 
#include 

int main() {
   int size = 10;
   int times = 10*1000*1000;
   int ii = 0;
   double dd = 0.0;
   double* v1 = malloc(sizeof(double) * (size));
   double* v2 = malloc(sizeof(double) * (size));
   for(ii = 0; ii < size; ++ii) {
  v1[ii] = 0.1;
   }
   for(ii = 0; ii < size; ++ii) {
  v2[ii] = 0.1;
   }
   for(ii = 0; ii < times; ++ii) {
  dd += cblas_ddot(size, v1, 1, v2, 1);
   }
   free(v1);
   free(v2);
   printf("%f\n", dd);
   return 0;
}

$ gcc -O2 testdot.c -lcblas -o testdot
$ time ./testdot
99.999839

real0m0.491s
user0m0.480s
sys 0m0.020s


Just to make sure that fold isnt causing the slowdown, i reverted the
haskell program to use the mapM_, i still got almost the same
performance:

$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   mapM_ (\ ii -> do v1 `getDot` v2) [0..times]
$ ghc --make htestdot

$ time ./htestdot

real0m15.660s
user0m15.630s
sys 0m0.030s

This is what the profiler has to say:

 $ cat htestdot.prof
Fri Jun 27 18:06 2008 Time and Allocation Profiling Report  (Final)

   htestdot +RTS -p -RTS

total time  =   22.00 secs   (1100 ticks @ 20 ms)
total alloc = 3,320,010,716 bytes  (excludes profiling overheads)

COST CENTREMODULE   %time %alloc

main   Main 100.0  100.0



individualinherited
COST CENTRE  MODULE
   no.entries  %time %alloc   %time %alloc

MAIN MAIN
 1   0   0.00.0   100.0  100.0
 mainMain
   222   1  93.6   88.093.6   88.0
 CAF Main
   216   5   0.00.0 6.4   12.0
  main   Main
   223   0   6.4   12.0 6.4   12.0
 CAF GHC.Handle
   168   3   0.00.0 0.00.0
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread Richard A. O'Keefe


On 19 Jun 2008, at 4:16 am, Anatoly Yakovenko wrote:

C doesn't work like that :).  functions always get called.


Not true.  A C compiler must produce the same *effect* as if
the function had been called, but if by some means the compiler
knows that the function has no effect, it is entitled to skip
the call.  In particular, the C compiler I normally use offers
these pragmas, amongst others:

#pragma does_not_write_global_data (funcname [, funcname])
#pragma no_side_effect(funcname[, funcname])

So with a declaration like

extern double cblas_ddot(
int,
double const *, int,
double const *, int);
#pragma no_side_effect (cblas_ddot)



the compiler would be completely within its rights to discard
any call to cblas_ddot() whose result was not used.  (As it
happens, it didn't, but it would have been allowed to.)
If using gcc,

extern double cblas_ddot( ... as before ...)
__attribute__ ((const));

seems to have the same effect, certainly the test case I tried
did in fact completely eliminate a call to cblas_ddot() when
so declared.

Since the malloc() results pointed to uninitialised memory,
the C compiler was entitled to do anything it pleased anyway.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread David Roundy
On Wed, Jun 18, 2008 at 06:03:42PM +0100, Jules Bean wrote:
> Anatoly Yakovenko wrote:
> >>>#include 
> >>>#include 
> >>>
> >>>int main() {
> >>>  int size = 1024;
> >>>  int ii = 0;
> >>>  double* v1 = malloc(sizeof(double) * (size));
> >>>  double* v2 = malloc(sizeof(double) * (size));
> >>>  for(ii = 0; ii < size*size; ++ii) {
> >>> double _dd = cblas_ddot(0, v1, size, v2, size);
> >>>  }
> >>>  free(v1);
> >>>  free(v2);
> >>>}
> >>Your C compiler sees that you're not using the result of cblas_ddot,
> >>so it doesn't even bother to call it. That loop never gets run. All
> >>your program does at runtime is call malloc and free twice, which is
> >>very fast :-)
> >
> >C doesn't work like that :). 
> 
> C compilers can do what they like ;)
> 
> GCC in particular is pretty good at removing dead code, including entire 
> loops. However it shouldn't eliminate the call to cblas_ddot unless it 
> thinks cblas_ddot has no side effects at all, which would be surprising 
> unless it's inlined somehow.

Or unless it's been annotated as pure, which it should be.

David
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread David Roundy
On Wed, Jun 18, 2008 at 09:16:24AM -0700, Anatoly Yakovenko wrote:
> >> #include 
> >> #include 
> >>
> >> int main() {
> >>   int size = 1024;
> >>   int ii = 0;
> >>   double* v1 = malloc(sizeof(double) * (size));
> >>   double* v2 = malloc(sizeof(double) * (size));
> >>   for(ii = 0; ii < size*size; ++ii) {
> >>  double _dd = cblas_ddot(0, v1, size, v2, size);
> >>   }
> >>   free(v1);
> >>   free(v2);
> >> }
> >
> > Your C compiler sees that you're not using the result of cblas_ddot,
> > so it doesn't even bother to call it. That loop never gets run. All
> > your program does at runtime is call malloc and free twice, which is
> > very fast :-)
> 
> C doesn't work like that :).  functions always get called.  but i did
> find a problem with my C code, i am incorrectly calling the dot
> production function:

See a recent article in lwn on pure and const functions to see how gcc
is able to perform dead code elimination and CSE, provided its given
annotations on the relevant functions.  I'd certainly hope that your
blas library is properly annotated!

> #include 
> #include 
> #include 
> #include 
> 
> int main() {
>int size = 1024;
>int ii = 0;
>double dd = 0.0;
>double* v1 = malloc(sizeof(double) * (size));
>double* v2 = malloc(sizeof(double) * (size));
>for(ii = 0; ii < size; ++ii) {
>   v1[ii] = 0.1;
>   v2[ii] = 0.1;
>}
>for(ii = 0; ii < size*size; ++ii) {
>   dd += cblas_ddot(size, v1, 0, v2, 0);
>}
>free(v1);
>free(v2);
>printf("%f\n", dd);
>return 0;
> }
> 
> time ./testdot
> 10737418.240187
> 
> real0m2.200s
> user0m2.190s
> sys 0m0.010s
> 
> So C is about twice as fast.  I can live with that.

I suspect that it is your initialization that is the difference.  For
one thing, you've initialized the arrays to different values, and in
your C code you've fused what are two separate loops in your Haskell
code.  So you've not only given the C compiler an easier loop to run
(since you're initializing the array to a constant rather than to a
sequence of numbers), but you've also manually optimized that
initialization.  In fact, this fusion could be precisely the factor of
two.  Why not see what happens in Haskell if you create just one
vector and dot it with itself? (of course, that'll also make the blas
call faster, so you'll need to be careful in your interpretation of
your results.)

David
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread Jules Bean

Anatoly Yakovenko wrote:

#include 
#include 

int main() {
  int size = 1024;
  int ii = 0;
  double* v1 = malloc(sizeof(double) * (size));
  double* v2 = malloc(sizeof(double) * (size));
  for(ii = 0; ii < size*size; ++ii) {
 double _dd = cblas_ddot(0, v1, size, v2, size);
  }
  free(v1);
  free(v2);
}

Your C compiler sees that you're not using the result of cblas_ddot,
so it doesn't even bother to call it. That loop never gets run. All
your program does at runtime is call malloc and free twice, which is
very fast :-)


C doesn't work like that :). 


C compilers can do what they like ;)

GCC in particular is pretty good at removing dead code, including entire 
loops. However it shouldn't eliminate the call to cblas_ddot unless it 
thinks cblas_ddot has no side effects at all, which would be surprising 
unless it's inlined somehow.


Jules
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread Adam Langley
On Wed, Jun 18, 2008 at 9:16 AM, Anatoly Yakovenko
<[EMAIL PROTECTED]> wrote:
> C doesn't work like that :)

Yes it can. You would have to check the disassembly to be sure, but C
compilers can, and do, perform dead code elimination.


AGL

-- 
Adam Langley [EMAIL PROTECTED] http://www.imperialviolet.org
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread Anatoly Yakovenko
>> #include 
>> #include 
>>
>> int main() {
>>   int size = 1024;
>>   int ii = 0;
>>   double* v1 = malloc(sizeof(double) * (size));
>>   double* v2 = malloc(sizeof(double) * (size));
>>   for(ii = 0; ii < size*size; ++ii) {
>>  double _dd = cblas_ddot(0, v1, size, v2, size);
>>   }
>>   free(v1);
>>   free(v2);
>> }
>
> Your C compiler sees that you're not using the result of cblas_ddot,
> so it doesn't even bother to call it. That loop never gets run. All
> your program does at runtime is call malloc and free twice, which is
> very fast :-)

C doesn't work like that :).  functions always get called.  but i did
find a problem with my C code, i am incorrectly calling the dot
production function:

#include 
#include 
#include 
#include 

int main() {
   int size = 1024;
   int ii = 0;
   double dd = 0.0;
   double* v1 = malloc(sizeof(double) * (size));
   double* v2 = malloc(sizeof(double) * (size));
   for(ii = 0; ii < size; ++ii) {
  v1[ii] = 0.1;
  v2[ii] = 0.1;
   }
   for(ii = 0; ii < size*size; ++ii) {
  dd += cblas_ddot(size, v1, 0, v2, 0);
   }
   free(v1);
   free(v2);
   printf("%f\n", dd);
   return 0;
}

time ./testdot
10737418.240187

real0m2.200s
user0m2.190s
sys 0m0.010s

So C is about twice as fast.  I can live with that.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread Bryan O'Sullivan
On Tue, Jun 17, 2008 at 9:00 PM, Anatoly Yakovenko
<[EMAIL PROTECTED]> wrote:
> here is the C:
>
> #include 
> #include 
>
> int main() {
>   int size = 1024;
>   int ii = 0;
>   double* v1 = malloc(sizeof(double) * (size));
>   double* v2 = malloc(sizeof(double) * (size));
>   for(ii = 0; ii < size*size; ++ii) {
>  double _dd = cblas_ddot(0, v1, size, v2, size);
>   }
>   free(v1);
>   free(v2);
> }

Your C compiler sees that you're not using the result of cblas_ddot,
so it doesn't even bother to call it. That loop never gets run. All
your program does at runtime is call malloc and free twice, which is
very fast :-)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] blas bindings, why are they so much slower the C?

2008-06-18 Thread Anatoly Yakovenko
here is the C:

#include 
#include 

int main() {
   int size = 1024;
   int ii = 0;
   double* v1 = malloc(sizeof(double) * (size));
   double* v2 = malloc(sizeof(double) * (size));
   for(ii = 0; ii < size*size; ++ii) {
  double _dd = cblas_ddot(0, v1, size, v2, size);
   }
   free(v1);
   free(v2);
}

this is the haskell:

module Main where

import Data.Vector.Dense.IO

main = do
   let size = 1024
   v1::IOVector Int Double <- newListVector size [0..]
   v2::IOVector Int Double <- newListVector size [0..]
   mapM_ (\ ii -> do v1 `getDot` v2) [0..size*size]

time ./testdot

real0m0.017s
user0m0.010s
sys 0m0.010s

time ./htestdot

real0m4.692s
user0m4.670s
sys 0m0.030s

so like 250x difference

htestdot.prof is no help

   Tue Jun 17 20:46 2008 Time and Allocation Profiling Report  (Final)

  htestdot +RTS -p -RTS

   total time  =3.92 secs   (196 ticks @ 20 ms)
   total alloc = 419,653,032 bytes  (excludes profiling overheads)

COST CENTREMODULE   %time %alloc

main   Main  88.3   83.0
CAFMain  11.7   17.0



individualinherited
COST CENTRE  MODULE
   no.entries  %time %alloc   %time %alloc

MAIN MAIN
 1   0   0.00.0   100.0  100.0
 CAF Main
   216   7  11.7   17.0   100.0  100.0
  main   Main
   222   1  88.3   83.088.3   83.0
 CAF GHC.Float
   187   1   0.00.0 0.00.0
 CAF GHC.Handle
   168   3   0.00.0 0.00.0
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe