You do realize that the throughput from onboard (video) RAM is going
to be much higher, right? It's not just the parallelization but the
memory bandwidth. And as James pointed out, if you can keep most of
your intermediate computation on-card, you stand to benefit immensely,
even if doing
On Thursday 10 September 2009 09:45:29, Rohit Garg wrote:
You do realize that the throughput from onboard (video) RAM is going
to be much higher, right? It's not just the parallelization but the
memory bandwidth. And as James pointed out, if you can keep most of
your intermediate
Where are you getting this info from? IMO the technology of memory in
graphics boards cannot be so different from that in commercial motherboards. It
could be a *bit* faster (at the expense of packing less of it), but I'd say
not as much as 4x faster (100 GB/s vs 25 GB/s of Intel i7 in sequential
Hi Sturla,
The proper way to speed up dot(a*b+c*sqrt(d), e) is to get rid of
temporary intermediates.
I implemented a patch
http://projects.scipy.org/numpy/ticket/1153
that reduces the number of temporary intermediates.
In your example, from 4 to 2.
There is a big improvement in terms of
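As a sketch of the effect the patch aims for (this is not the patch itself; it just shows how reusing buffers via the ufunc `out=` argument cuts the temporaries for this expression):

```python
import numpy as np

a, b, c, d, e = (np.random.rand(1000) for _ in range(5))

# Naive evaluation: each intermediate op allocates a fresh temporary array
# (a*b, sqrt(d), c*sqrt(d), and the sum -> roughly 4 temporaries).
naive = np.dot(a * b + c * np.sqrt(d), e)

# Reusing buffers with out= cuts the allocations down to 2:
t1 = np.multiply(a, b)          # first temporary
t2 = np.sqrt(d)                 # second temporary
np.multiply(c, t2, out=t2)      # reuse t2 for c*sqrt(d)
np.add(t1, t2, out=t1)          # reuse t1 for the sum
fewer = np.dot(t1, e)

assert np.allclose(naive, fewer)
```

The same floating-point operations happen in the same order; only the number of freshly allocated arrays changes.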
Citi, Luca wrote:
That is exactly why numexpr is faster in these cases.
I hope one day numpy will be able to perform such
optimizations.
I think it is going to require lazy evaluation. Whenever possible, an
operator would just return a symbolic representation of the operation.
This would
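A minimal pure-Python sketch of such lazy operators (all class names here are invented for illustration): instead of computing, each operator returns a node in an expression tree, and evaluation is deferred until explicitly requested.

```python
# Sketch: operators build a symbolic tree instead of computing immediately.
class Expr:
    def __add__(self, other):
        return Node('+', self, other)
    def __mul__(self, other):
        return Node('*', self, other)

class Leaf(Expr):
    def __init__(self, value):
        self.value = value
    def evaluate(self):
        return self.value

class Node(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def evaluate(self):
        # A real implementation would optimize the whole tree here
        # (fuse loops, eliminate temporaries) before computing.
        l, r = self.left.evaluate(), self.right.evaluate()
        return l + r if self.op == '+' else l * r

expr = Leaf(2) * Leaf(3) + Leaf(4)   # nothing computed yet, just a tree
assert expr.evaluate() == 10
```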
Rohit Garg wrote:
GTX 280 -- 141 GB/s -- has 1 GB
ATI 4870 -- 115 GB/s -- has 1 GB
ATI 5870 -- 153 GB/s (launches Sept 22, 2009) -- 2 GB models will be there too
That is going to help if buffers are kept in graphics memory. But the
problem is that graphics memory is a scarce resource.
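A back-of-the-envelope way to read those bandwidth figures for a memory-bound elementwise operation (using only the numbers quoted in this thread; the function name is made up):

```python
# Lower bound on runtime for a memory-bound elementwise op c = a + b
# over n float64 elements: read a, read b, write c = 24 bytes/element.
def min_time_seconds(n_elements, bandwidth_gb_s):
    bytes_moved = n_elements * 3 * 8
    return bytes_moved / (bandwidth_gb_s * 1e9)

n = 100_000_000  # 100M elements
cpu = min_time_seconds(n, 25)    # ~25 GB/s quoted for the Intel i7
gpu = min_time_seconds(n, 141)   # ~141 GB/s quoted for the GTX 280

# Best-case speedup is just the bandwidth ratio, about 5.6x here --
# assuming the buffers already live in graphics memory.
assert abs(cpu / gpu - 141 / 25) < 1e-9
```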
S.M.
On Thursday 10 September 2009 11:11:22, Sturla Molden wrote:
Citi, Luca wrote:
That is exactly why numexpr is faster in these cases.
I hope one day numpy will be able to perform such
optimizations.
I think it is going to require lazy evaluation. Whenever possible, an
operator would just
On Thu, Sep 10, 2009 at 10:36:27AM +0200, Francesc Alted wrote:
Where are you getting this info from? IMO the technology of memory in
graphics boards cannot be so different from that in commercial motherboards. It
could be a *bit* faster (at the expense of packing less of it), but I'd
On Thursday 10 September 2009 10:58:13, Rohit Garg wrote:
Where are you getting this info from? IMO the technology of memory in
graphics boards cannot be so different from that in commercial motherboards.
It could be a *bit* faster (at the expense of packing less of it), but
I'd say not as
On Thursday 10 September 2009 11:20:21, Gael Varoquaux wrote:
On Thu, Sep 10, 2009 at 10:36:27AM +0200, Francesc Alted wrote:
Where are you getting this info from? IMO the technology of memory in
graphics boards cannot be so different from that in commercial
motherboards. It could be a
On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote:
The point is: are GPUs prepared to compete with general-purpose CPUs in
all-round operations, like evaluating transcendental functions,
conditionals, all of this with a rich set of data types? I would like to
believe
Sure. Especially because NumPy is all about embarrassingly parallel problems
(after all, this is how a ufunc works, doing operations
element-by-element).
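For illustration, the element-by-element pattern can be seen with `np.frompyfunc`, which applies a scalar function independently to every element of its input; no element depends on any other, which is exactly the embarrassingly parallel shape being described:

```python
import numpy as np

# np.frompyfunc turns a scalar Python function into a ufunc that is
# applied independently to each element of the input array.
def scaled_square(x):
    return 2.0 * x * x

square_ufunc = np.frompyfunc(scaled_square, 1, 1)
out = square_ufunc(np.array([1.0, 2.0, 3.0]))
assert list(out) == [2.0, 8.0, 18.0]
```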
The point is: are GPUs prepared to compete with general-purpose CPUs in
all-round operations, like evaluating transcendental functions,
The point is: are GPUs prepared to compete with general-purpose CPUs in
all-round operations, like evaluating transcendental functions, conditionals,
all of this with a rich set of data types?
Yup.
--
Rohit Garg
http://rpg-314.blogspot.com/
Senior Undergraduate
Department of Physics
Indian
On Thursday 10 September 2009 11:40:48, Sturla Molden wrote:
Francesc Alted skrev:
Numexpr already uses the Python parser, instead of building a new one.
However the bytecode emitted after the compilation process is
different, of course.
Also, I don't see the point in requiring immutable
On Thursday 10 September 2009 11:37:24, Gael Varoquaux wrote:
On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote:
The point is: are GPUs prepared to compete with general-purpose CPUs
in all-round operations, like evaluating transcendental functions,
conditionals, all of this
a = np.cos(b)
where b is a 1x1 matrix is *very* embarrassing (in the parallel
meaning of the term ;-)
On this operation, GPUs will eat up CPUs like a pack of piranhas. :)
That's nice to see. I think I'll change my mind if someone could perform a
vector-vector multiplication (an operation that is typically memory-bound)
You mean a dot product?
On Thursday 10 September 2009 14:36:16, Rohit Garg wrote:
That's nice to see. I think I'll change my mind if someone could perform
a vector-vector multiplication (an operation that is typically
memory-bound)
You mean a dot product?
Whatever, dot product or element-wise product. Both
On 09/10/2009 07:40 AM, Francesc Alted wrote:
On Thursday 10 September 2009 14:36:16, Rohit Garg wrote:
That's nice to see. I think I'll change my mind if someone could perform
a vector-vector multiplication (an operation that is typically
memory-bound)
You mean a dot product?
Apart from float and double, which floating point formats are
supported by numpy?
On Thu, Sep 10, 2009 at 7:09 PM, Bruce Southey bsout...@gmail.com wrote:
On 09/10/2009 07:40 AM, Francesc Alted wrote:
On Thursday 10 September 2009 14:36:16, Rohit Garg wrote:
That's nice to see. I think
On Thursday 10 September 2009 15:51:15, Rohit Garg wrote:
Apart from float and double, which floating point formats are
supported by numpy?
I think whatever is supported by the underlying CPU, whether it is extended
double precision (12 bytes) or quad precision (16 bytes).
--
Francesc Alted
I think whatever is supported by the underlying CPU, whether it is extended
double precision (12 bytes) or quad precision (16 bytes).
Classic 64-bit CPUs support neither.
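For reference, what numpy provides here is platform-dependent: `np.longdouble` maps to whatever the C compiler's `long double` is, which may be a plain 8-byte double, a 12- or 16-byte padded x87 extended format, or a true quad. A quick way to check on any given machine:

```python
import numpy as np

# float32 and float64 always exist with fixed sizes...
assert np.dtype(np.float32).itemsize == 4
assert np.dtype(np.float64).itemsize == 8

# ...but longdouble is whatever the platform's C 'long double' is:
# 8 bytes (plain double), 12/16 bytes (x87 extended), or true quad.
info = np.finfo(np.longdouble)
print(np.dtype(np.longdouble).itemsize, info.eps)
```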
On Thu, Sep 10, 2009 at 07:28, Francesc Alted fal...@pytables.org wrote:
On Thursday 10 September 2009 11:37:24, Gael Varoquaux wrote:
On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote:
The point is: are GPUs prepared to compete with general-purpose CPUs
in all-round
Yes. However, it is worth making the distinction between
embarrassingly parallel problems and SIMD problems. Not all
embarrassingly parallel problems are SIMD-capable. GPUs do SIMD, not
generally embarrassing problems.
GPUs exploit both dimensions of parallelism, both SIMD (aka
vectorization)
On Tuesday 08 September 2009 21:19:05, George Dahl wrote:
Sturla Molden sturla at molden.no writes:
Erik Tollerud skrev:
NumPy arrays on the GPU memory is an easy task. But then I would have
to write the computation in OpenCL's dialect of C99?
This is true to some extent, but also
On Tuesday 08 September 2009 23:21:53, Christopher Barker wrote:
Also, perhaps a GPU-aware numexpr could be helpful which I think is the
kind of thing that Sturla was referring to when she wrote:
Incidentally, this will also make it easier to leverage on modern GPUs.
Numexpr mainly supports
Received from Francesc Alted on Wed, Sep 09, 2009 at 05:18:48AM EDT:
(snip)
The point here is that matrix-matrix multiplications (or, in general,
functions with a large operation/element ratio) are a *tiny* part of all the
possible operations between arrays that NumPy supports. This is why
On Wednesday 09 September 2009 11:26:06, Francesc Alted wrote:
On Tuesday 08 September 2009 23:21:53, Christopher Barker wrote:
Also, perhaps a GPU-aware numexpr could be helpful which I think is the
kind of thing that Sturla was referring to when she wrote:
Incidentally, this will
On Wed, Sep 9, 2009 at 10:41 AM, Francesc Alted fal...@pytables.org wrote:
Numexpr mainly supports functions that are meant to be used element-wise,
so the operation/element ratio is normally 1 (or close to 1). These are the
scenarios where improved memory access is much more important than CPU
Christopher Barker wrote:
George Dahl wrote:
Sturla Molden sturla at molden.no writes:
Teraflops peak performance of modern GPUs is impressive. But NumPy
cannot easily benefit from that.
I know that for my work, I can get around a 50-fold speedup over
numpy using a python
George Dahl wrote:
I know that for my work, I can get around a 50-fold speedup over
numpy using a python wrapper for a simple GPU matrix class. So I
might be dealing with a lot of matrix products where I multiply a fixed
512 by 784 matrix by a 784 by 256 matrix that changes
James Bergstra wrote:
Suppose you want to evaluate dot(a*b+c*sqrt(d), e). The GPU is
great for doing dot(),
The CPU is equally great (or better?) for doing dot(). In both cases:
- memory access scales O(n) for dot products.
- computation scales O(n) for dot products.
- memory is slow
- computation
On 10-Sep-09, at 12:47 AM, Sturla Molden wrote:
The CPU is equally great (or better?) for doing dot(). In both cases:
- memory access scales O(n) for dot products.
- computation scales O(n) for dot products.
- memory is slow
- computation is fast (faster for GPU)
You do realize that the
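The O(n)-memory, O(n)-compute point can be restated as arithmetic intensity (flops per byte), which for a dot product is a small constant; that is why the operation is bandwidth-bound on both devices. A rough sketch (function name invented for illustration):

```python
# Arithmetic intensity of a float64 dot product of length n:
# 2n - 1 flops (n multiplies, n - 1 adds) over 2n * 8 bytes read.
def dot_intensity(n):
    flops = 2 * n - 1
    bytes_read = 2 * n * 8
    return flops / bytes_read

# Intensity approaches 1/8 flop/byte regardless of n, so whichever device
# has more memory bandwidth wins -- the extra ALUs on a GPU sit idle.
assert abs(dot_intensity(10**6) - 0.125) < 1e-6
```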
On Wed, Sep 9, 2009 at 9:47 PM, Sturla Molden stu...@molden.no wrote:
James Bergstra wrote:
Suppose you want to evaluate dot(a*b+c*sqrt(d), e). The GPU is
great for doing dot(),
The CPU is equally great (or better?) for doing dot(). In both cases:
- memory access scales O(n) for dot products.
George Dahl wrote:
Sturla Molden sturla at molden.no writes:
Teraflops peak performance of modern GPUs is impressive. But NumPy
cannot easily benefit from that.
I know that for my work, I can get around a 50-fold speedup over
numpy using a python wrapper for a simple GPU matrix
Sturla Molden sturla at molden.no writes:
Erik Tollerud skrev:
NumPy arrays on the GPU memory is an easy task. But then I would have to
write the computation in OpenCL's dialect of C99?
This is true to some extent, but also probably difficult to do given
the fact that parallelizable
Hi everyone,
In case anyone is interested, I just set up a google group to discuss
GPU-based simulation for our Python neural simulator Brian:
http://groups.google.fr/group/brian-on-gpu
Our simulator relies heavily on NumPy. I would be very happy if the GPU
experts here would like to share their
Erik Tollerud skrev:
NumPy arrays on the GPU memory is an easy task. But then I would have to
write the computation in OpenCL's dialect of C99?
This is true to some extent, but also probably difficult to do given
the fact that parallelizable algorithms are generally more difficult
to
I realize this topic is a bit old, but I couldn't help but add
something I forgot to mention earlier...
I mean, once the computations are moved elsewhere numpy is basically a
convenient way to address memory.
That is how I mostly use NumPy, though. Computations I often do in
Fortran 95 or C.
Sturla Molden wrote:
Thus, here is my plan:
1. a special context-manager class
2. immutable arrays inside with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit
There seems to be some similarity with what we want to do to
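A pure-Python sketch of how steps 1, 3 and 5 of the plan above could fit together (class and method names are invented for illustration; the real code generation of step 4 is elided):

```python
# Sketch of the plan: record operations inside a 'with' block (step 3),
# then evaluate everything when the block exits (step 5).
class LazyScope:
    def __init__(self):
        self.results = {}
        self._pending = []

    def defer(self, name, func, *args):
        # Step 3: build up a deferred-work list instead of computing now.
        self._pending.append((name, func, args))

    def __enter__(self):                      # step 1: context manager
        return self

    def __exit__(self, *exc):
        # Step 5: __exit__ triggers evaluation; a real version would
        # generate and compile fused code (step 4) before running it.
        for name, func, args in self._pending:
            self.results[name] = func(*args)
        return False

with LazyScope() as scope:
    scope.defer('total', sum, [1, 2, 3])
    assert scope.results == {}          # nothing evaluated yet

assert scope.results['total'] == 6      # evaluated on exit
```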
David Warde-Farley dwf at cs.toronto.edu writes:
It did inspire some of our colleagues in Montreal to create this,
though:
http://code.google.com/p/cuda-ndarray/
I gather it is VERY early in development, but I'm sure they'd love
contributions!
Hi David,
That does look quite close to
On Thu, Aug 6, 2009 at 11:12 AM, James Bergstra
bergs...@iro.umontreal.ca wrote:
David Warde-Farley dwf at cs.toronto.edu writes:
It did inspire some of our colleagues in Montreal to create this,
though:
http://code.google.com/p/cuda-ndarray/
I gather it is VERY early in
On Thu, Aug 6, 2009 at 1:19 PM, Charles R
Harris charlesr.har...@gmail.com wrote:
It almost looks like you are reimplementing numpy, in C++ no less. Is there
any reason why you aren't working with a numpy branch and just adding
ufuncs?
I don't know how that would work. The Ufuncs need a
Note that this is from a user perspective, as I have no particular plan of
developing the details of this implementation, but I've thought for a long
time that GPU support could be great for numpy (I would also vote for OpenCL
support over CUDA, although conceptually they seem quite similar)...
2009/8/6 Erik Tollerud erik.tolle...@gmail.com:
Note that this is from a user perspective, as I have no particular plan of
developing the details of this implementation, but I've thought for a long
time that GPU support could be great for numpy (I would also vote for OpenCL
support over CUDA,
On 6-Aug-09, at 2:54 PM, Erik Tollerud wrote:
Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll
admit - especially if it's in the form of a drop-in replacement for the
numpy or scipy versions.
The word I'm hearing from people in my direct acquaintance who are
Now linear algebra or FFTs on a GPU would probably be a huge boon,
I'll admit - especially if it's in the form of a drop-in replacement
for the numpy or scipy versions.
NumPy generates temporary arrays for expressions involving ndarrays. This
extra allocation and copying often takes more
On Thu, Aug 6, 2009 at 15:57, Sturla Molden stu...@molden.no wrote:
Now linear algebra or FFTs on a GPU would probably be a huge boon,
I'll admit - especially if it's in the form of a drop-in replacement
for the numpy or scipy versions.
NumPy generates temporary arrays for expressions
Robert Kern wrote:
I believe that is exactly the point that Erik is making. :-)
I wasn't arguing against him, just suggesting a solution. :-)
I have big hopes for lazy evaluation, if we can find a way to do it right.
Sturla
On Thu, Aug 6, 2009 at 4:57 PM, Sturla Molden stu...@molden.no wrote:
Now linear algebra or FFTs on a GPU would probably be a huge boon,
I'll admit - especially if it's in the form of a drop-in replacement
for the numpy or scipy versions.
NumPy generates temporary arrays for expressions
On Thu, Aug 6, 2009 at 3:29 PM, James Bergstra bergs...@iro.umontreal.ca wrote:
On Thu, Aug 6, 2009 at 4:57 PM, Sturla Molden stu...@molden.no wrote:
Now linear algebra or FFTs on a GPU would probably be a huge boon,
I'll admit - especially if it's in the form of a drop-in replacement
for
Charles R Harris wrote:
Whether the code that gets compiled is written using lazy evaluation
(ala Sturla), or is expressed some other way seems like an independent
issue. It sounds like one important thing would be having arrays that
reside on the GPU.
Memory management is slow compared to
Sturla Molden wrote:
Memory management is slow compared to computation. Operations like
malloc, free and memcpy are not faster for VRAM than for RAM.
Actually it's not VRAM anymore, but whatever you call the memory
dedicated to the GPU.
It is cheap to put 8 GB of RAM into a computer, but
On Thu, Aug 6, 2009 at 4:36 PM, Sturla Molden stu...@molden.no wrote:
Charles R Harris wrote:
Whether the code that gets compiled is written using lazy evaluation
(ala Sturla), or is expressed some other way seems like an independent
issue. It sounds like one important thing would be
Charles R Harris wrote:
I mean, once the computations are moved elsewhere numpy is basically a
convenient way to address memory.
That is how I mostly use NumPy, though. Computations I often do in
Fortran 95 or C.
NumPy arrays on the GPU memory is an easy task. But then I would have to
James Bergstra wrote:
The plan you describe is a good one, and Theano
(www.pylearn.org/theano) almost exactly implements it. You should
check it out. It does not use 'with' syntax at the moment, but it
could provide the backend machinery for your mechanism if you want to
go forward with
On Thu, Aug 6, 2009 at 5:10 PM, Sturla Molden stu...@molden.no wrote:
Charles R Harris wrote:
I mean, once the computations are moved elsewhere numpy is basically a
convenient way to address memory.
That is how I mostly use NumPy, though. Computations I often do in
Fortran 95 or C.
On Thu, Aug 6, 2009 at 1:57 PM, Sturla Molden stu...@molden.no wrote:
In order to reduce the effect of immutable arrays, we could introduce a
context-manager. Inside the with statement, all arrays would be
immutable. Second, the __exit__ method could trigger the code generator
and do all the
On Thu, Aug 6, 2009 at 19:00, Fernando Perez fperez@gmail.com wrote:
On Thu, Aug 6, 2009 at 1:57 PM, Sturla Molden stu...@molden.no wrote:
In order to reduce the effect of immutable arrays, we could introduce a
context-manager. Inside the with statement, all arrays would be
immutable.