Re: Using CUDA within Spark / boosting linear algebra

2016-02-04 Thread Max Grossman
Allen,

Currently it only supports OpenCL because the code generator we’ve extended 
targets OpenCL. There’s no technical reason CUDA couldn’t be supported if 
people are interested, but it would require rewriting part of the code 
generator, plus some ifdefs in the runtime to let us compile with either 
OpenCL or CUDA support. A few components actually support both OpenCL and 
CUDA already, because they’ve been reused in other projects that did use 
CUDA; just not all of them.

Thanks,

Max

> On Feb 4, 2016, at 9:42 AM, Allen Zhang <allenzhang...@126.com> wrote:
> 
> Hi Max,
> 
> I will look at it tomorrow. But a quick question: does it support CUDA from 
> Nvidia, not only OpenCL?
> 
> Thanks,
> Allen
> 
> 
> 
> 
> 
> At 2016-02-04 23:13:05, "Max Grossman" <j...@rice.edu> wrote:
> Hi all,
> 
> I’m jumping on this thread to point out another Spark+GPU project for people 
to take a look at: https://github.com/agrippa/spark-swat
> 
> SWAT (Spark with Accelerated Tasks) is a third-party JAR sitting on top of 
> Spark that uses runtime code generation to convert user-written 
> transformations into OpenCL kernels. SWAT’s lightweight runtime supports 
> multi-GPU systems, managing each device and its memory automatically. You 
> write your own Spark programs, and the runtime takes care of offloading your 
> transformations to the GPUs in your system:
> 
> val rdd = CLWrapper.cl(sc.objectFile(inputPath))
> val next = rdd.map(i => 2 * i).collect
> 
> SWAT primarily distinguishes itself in programmability: an explicit goal of 
> this project is to have as few user-visible API changes as possible from what 
> people have come to know and love in Spark. There are a number of 
> fixed-function GPU libraries out there now, so we wanted to look instead at 
> something that could be used to build new but still well-performing Spark 
> apps.
> 
> SWAT is currently more of a research project than a production-ready system, 
> so there’s a chance it won’t work out-of-the-box on some systems. With that 
> said, it does have fairly comprehensive functional and code generation 
> testing. If you’re interested in trying it out and having trouble setting up, 
> feel free to contact me directly. And of course, any questions or feedback 
> from the community are always welcome.
> 
> Thanks,
> 
> Max
> 
>> On Jan 22, 2016, at 3:42 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
>> 
>> Hi Alexander,
>> The goal of our columnar storage is to effectively drive GPUs in Spark. One 
>> of the important items is to effectively and easily enable highly-tuned GPU 
>> libraries such as BIDMach.
>> 
>> We will enable BIDMach with our columnar storage. On the other hand, it is 
>> not an easy task to scale BIDMach with current Spark. I expect that this talk 
>> would help us.
>> http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565
>> 
>> We appreciate your great feedback.
>> 
>> Best Regards,
>> Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo
>> 
>> 
>> 
>> From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
>> To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
>> Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday <sam.halli...@gmail.com>
>> Date: 2016/01/22 04:20
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> 
>> 
>> 
>> Hi Kazuaki,
>>  
>> Indeed, moving data to/from the GPU is costly, and this benchmark summarizes 
>> the costs of moving different data sizes with regard to matrix multiplication. 
>> These costs are paid for the convenience of using the standard BLAS API that 
>> Nvidia NVBLAS provides. The point is that no code changes are required (in 
>> Spark); one just needs to reference the BLAS implementation with the system 
>> variable.

Re: Using CUDA within Spark / boosting linear algebra

2016-02-04 Thread Max Grossman
Hi all,

I’m jumping on this thread to point out another Spark+GPU project for people to 
take a look at: https://github.com/agrippa/spark-swat

SWAT (Spark with Accelerated Tasks) is a third-party JAR sitting on top of 
Spark that uses runtime code generation to convert user-written transformations 
into OpenCL kernels. SWAT’s lightweight runtime supports multi-GPU systems, 
managing each device and its memory automatically. You write your own Spark 
programs, and the runtime takes care of offloading your transformations to the 
GPUs in your system:

val rdd = CLWrapper.cl(sc.objectFile(inputPath))
val next = rdd.map(i => 2 * i).collect

SWAT primarily distinguishes itself in programmability: an explicit goal of 
this project is to have as few user-visible API changes as possible from what 
people have come to know and love in Spark. There are a number of 
fixed-function GPU libraries out there now, so we wanted to look instead at 
something that could be used to build new but still well-performing Spark apps.

SWAT is currently more of a research project than a production-ready system, so 
there’s a chance it won’t work out-of-the-box on some systems. With that said, 
it does have fairly comprehensive functional and code generation testing. If 
you’re interested in trying it out and having trouble setting up, feel free to 
contact me directly. And of course, any questions or feedback from the 
community are always welcome.

Thanks,

Max

> On Jan 22, 2016, at 3:42 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> 
> Hi Alexander,
> The goal of our columnar storage is to effectively drive GPUs in Spark. One of 
> the important items is to effectively and easily enable highly-tuned GPU 
> libraries such as BIDMach.
> 
> We will enable BIDMach with our columnar storage. On the other hand, it is 
> not an easy task to scale BIDMach with current Spark. I expect that this talk 
> would help us.
> http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565
> 
> We appreciate your great feedback.
> 
> Best Regards,
> Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo
> 
> 
> 
> From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
> To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
> Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday <sam.halli...@gmail.com>
> Date: 2016/01/22 04:20
> Subject: RE: Using CUDA within Spark / boosting linear algebra
> 
> 
> 
> Hi Kazuaki,
>  
> Indeed, moving data to/from the GPU is costly, and this benchmark summarizes 
> the costs of moving different data sizes with regard to matrix multiplication. 
> These costs are paid for the convenience of using the standard BLAS API that 
> Nvidia NVBLAS provides. The point is that no code changes are required (in 
> Spark); one just needs to reference the BLAS implementation with the system 
> variable. Naturally, a hardware-specific implementation will always be faster 
> than the default. The benchmark results show that by comparing jCuda (by means 
> of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS 
> for large matrices, because it can take advantage of several GPUs and will be 
> faster despite the copying overhead. That is also a known point advertised by 
> Nvidia.
>  
> By the way, I don’t think the column/row-friendly format is an issue, because 
> one can use transposed matrices to fit the required format. I believe that is 
> just a software preference.
>  
> My suggestion regarding your prototype would be to compare it with Spark’s 
> implementation of logistic regression (which does not take advantage of the 
> GPU) and also with BIDMach’s (which does). That will give users a better 
> understanding of your implementation’s performance. Currently you compare it 
> with Spark’s example logistic regression implementation, which is meant as a 
> learning reference rather than a performance benchmark.
>  
> Best regards, Alexander
>  
> From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
> Sent: Thursday, January 21, 2016 3:34 AM
> To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
> Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>  
> Dear all,
> 
> >>>> Hi Alexander,

RE: Using CUDA within Spark / boosting linear algebra

2016-01-22 Thread Kazuaki Ishizaki
Hi Allen,
Thank you for your feedback.
An API to launch GPU kernels with JCuda is our first step. One purpose of 
releasing our prototype is to get feedback. In the future, we may use 
other wrappers instead of JCuda.

We would really appreciate it if you would suggest or propose APIs to 
effectively exploit GPUs, such as BIDMat’s, in Spark.
If we ran BIDMat with our columnar storage, the performance boost 
would be as good as others have reported.

Best Regards,
Kazuaki Ishizaki,



From:   "Allen Zhang" <allenzhang...@126.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "Ulanov, Alexander" 
<alexander.ula...@hpe.com>, "Joseph Bradley" <jos...@databricks.com>, 
"John Canny" <ca...@berkeley.edu>, "Evan R. Sparks" 
<evan.spa...@gmail.com>, "Xiangrui Meng" <men...@gmail.com>, "Sam 
Halliday" <sam.halli...@gmail.com>
Date:   2016/01/21 21:05
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Kazuaki,

JCuda is actually a wrapper around **pure** CUDA, and as your wiki page 
shows, the 3.15x performance boost for logistic regression seems slower than 
BIDMat-cublas or pure CUDA.
Could you elaborate on why you chose JCuda rather than JNI to call CUDA 
directly?

Regards,
Allen Zhang






At 2016-01-21 19:34:14, "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partitions. We would really appreciate your 
comments, suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra



Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 
John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at least 
for non-image workloads in industry (and for image processing you would 
probably want a deep learning/SGD solution with convolution kernels). e.g. 
it was only relevant for 1/7 of our recent benchmarks, which should be a 
reasonable sample. What really matters is sparse BLAS performance. BIDMat 
is still an order of magnitude faster there. Those kernels are only in 
BIDMat, since NVIDIA’s sparse BLAS don’t perform well on power-law data. 

It’s also the case that the overall performance of an algorithm is determined 
by the slowest kernel, not the fastest.

RE: Using CUDA within Spark / boosting linear algebra

2016-01-22 Thread Kazuaki Ishizaki
Hi Alexander,
The goal of our columnar storage is to effectively drive GPUs in Spark. One 
of the important items is to effectively and easily enable highly-tuned GPU 
libraries such as BIDMach.

We will enable BIDMach with our columnar storage. On the other hand, it is 
not an easy task to scale BIDMach with current Spark. I expect that this 
talk would help us.
http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565

We appreciate your great feedback.

Best Regards,
Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - 
Tokyo



From:   "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" 
<dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" 
<evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday 
<sam.halli...@gmail.com>
Date:   2016/01/22 04:20
Subject: RE: Using CUDA within Spark / boosting linear algebra



Hi Kazuaki,
 
Indeed, moving data to/from the GPU is costly, and this benchmark summarizes 
the costs of moving different data sizes with regard to matrix 
multiplication. These costs are paid for the convenience of using the 
standard BLAS API that Nvidia NVBLAS provides. The point is that no code 
changes are required (in Spark); one just needs to reference the BLAS 
implementation with the system variable. Naturally, a hardware-specific 
implementation will always be faster than the default. The benchmark results 
show that by comparing jCuda (by means of BIDMat) and NVBLAS. However, they 
also show that it is worth using NVBLAS for large matrices, because it can 
take advantage of several GPUs and will be faster despite the copying 
overhead. That is also a known point advertised by Nvidia.
 
By the way, I don’t think the column/row-friendly format is an issue, 
because one can use transposed matrices to fit the required format. 
I believe that is just a software preference.
 
My suggestion regarding your prototype would be to compare it with Spark’s 
implementation of logistic regression (which does not take advantage of the 
GPU) and also with BIDMach’s (which does). That will give users a better 
understanding of your implementation’s performance. Currently you compare it 
with Spark’s example logistic regression implementation, which is meant as a 
learning reference rather than a performance benchmark.
 
Best regards, Alexander
 
From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
Sent: Thursday, January 21, 2016 3:34 AM
To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
Subject: RE: Using CUDA within Spark / boosting linear algebra
 
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partitions. We would really appreciate your 
comments, suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Allen Zhang


Hi Kazuaki,


JCuda is actually a wrapper around **pure** CUDA, and as your wiki page shows, 
the 3.15x performance boost for logistic regression seems slower than 
BIDMat-cublas or pure CUDA.
Could you elaborate on why you chose JCuda rather than JNI to call CUDA directly?


Regards,
Allen Zhang








At 2016-01-21 19:34:14, "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as another 
>>>> part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to efficiently 
exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and GPU-friendly 
column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by 
supporting data partition caching in GPU device memory and by providing binary 
column storage for data partitions. We would really appreciate your comments, 
suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 

John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS 
are not very important for most machine learning workloads: at least for 
non-image workloads in industry (and for image processing you would probably 
want a deep learning/SGD solution with convolution kernels). e.g. it was only 
relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. 
What really matters is sparse BLAS performance. BIDMat is still an order of 
magnitude faster there. Those kernels are only in BIDMat, since NVIDIA’s sparse 
BLAS don’t perform well on power-law data.

It’s also the case that the overall performance of an algorithm is determined by 
the slowest kernel, not the fastest. If the goal is to get closer to BIDMach’s 
performance on typical problems, you need to make sure that every kernel goes 
at comparable speed. So the real question is how much faster MLlib routines run 
on a complete problem with/without GPU acceleration. For BIDMach, it’s close to 
a factor of 10. But that required running entirely on the GPU, and making sure 
every kernel is close to its limit.

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks.
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU performance 
from breeze/netlib-java - meaning there's no compelling performance reason to 
switch out our current linear algebra library (at least as far as this 
benchmark is concerned).
 
Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make sense 
to finally ship openblas compiled for some common platforms (64-bit linux, 
windows, mac) directly with Spark, hopefully eliminating the jblas warnings 
once and for all for most users? (Licensing is BSD) Or am I missing something?

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Kazuaki Ishizaki
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partitions. We would really appreciate your 
comments, suggestions, or feedback.
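
Joseph’s amortization point above (issue 1) can be illustrated with a small 
sketch. This is not Spark or spark-gpu API; `Amortize`, `localIterations`, 
and the gradient function are hypothetical stand-ins for a kernel operating 
on data that has already been transferred once:

```scala
// Hypothetical sketch: amortize the one-time transfer cost by running k
// local iterations on data that stays resident (e.g. in GPU device memory).
object Amortize {
  // `grad` stands in for a kernel launched on already-transferred data.
  def localIterations(data: Array[Double], w0: Double, k: Int,
                      grad: (Array[Double], Double) => Double): Double = {
    var w = w0
    var i = 0
    while (i < k) {   // k iterations reuse the single copy of `data`
      w -= 0.1 * grad(data, w)
      i += 1
    }
    w                 // only the small model is sent back for aggregation
  }
}
// In Spark this would correspond to something like:
//   rdd.mapPartitions(p => Iterator(Amortize.localIterations(p.toArray, w, k, grad)))
//      .reduce(_ + _) / numPartitions
```

The point of the sketch is only the loop structure: one host-to-device copy 
serves k iterations instead of one, so the copy cost is divided by k.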

Best Regards
Kazuaki Ishizaki



From:   "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny 
<ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" 
<dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. 
Sparks" <evan.spa...@gmail.com>
Date:   2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra



Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 
John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at least 
for non-image workloads in industry (and for image processing you would 
probably want a deep learning/SGD solution with convolution kernels). e.g. 
it was only relevant for 1/7 of our recent benchmarks, which should be a 
reasonable sample. What really matters is sparse BLAS performance. BIDMat 
is still an order of magnitude faster there. Those kernels are only in 
BIDMat, since NVIDIA’s sparse BLAS don’t perform well on power-law data. 

It’s also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach’s performance on typical problems, you need to make sure 
that every kernel goes at comparable speed. So the real question is how 
much faster MLlib routines run on a complete problem with/without GPU 
acceleration. For BIDMach, it’s close to a factor of 10. But that required 
running entirely on the GPU, and making sure every kernel is close to its 
limit.

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks. 
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU 
performance from breeze/netlib-java - meaning there's no compelling 
performance reason to switch out our current linear algebra library (at 
least as far as this benchmark is concerned). 
 
Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make 
sense to finally ship openblas compiled for some common platforms (64-bit 
linux, windows, mac) directly with Spark - hopefully eliminating the jblas 
warnings once and for all for most users? (Licensing is BSD) Or am I 
missing something?
 
On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I 
double-checked them. It turns out that nvblas did not do multip

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Ulanov, Alexander
Hi Kazuaki,

Indeed, moving data to/from the GPU is costly, and this benchmark summarizes the 
costs of moving different data sizes with regard to matrix multiplication. 
These costs are paid for the convenience of using the standard BLAS API that 
Nvidia NVBLAS provides. The point is that no code changes are required (in 
Spark); one just needs to reference the BLAS implementation with the system 
variable. Naturally, a hardware-specific implementation will always be faster 
than the default. The benchmark results show that by comparing jCuda (by means 
of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS for 
large matrices, because it can take advantage of several GPUs and will be faster 
despite the copying overhead. That is also a known point advertised by Nvidia.
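
For readers who want to try this route, here is a rough sketch of the setup 
being described. All paths and file locations below are illustrative 
assumptions, not tested values: NVBLAS intercepts standard BLAS3 calls, needs 
a config file naming a CPU BLAS for the routines it does not offload, and 
netlib-java is then pointed at the native "system" BLAS, which the preloaded 
libnvblas provides.

```shell
# Illustrative sketch; adjust library paths for your system.
# 1) NVBLAS requires a CPU BLAS to fall back to for non-offloaded routines.
cat > /tmp/nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
EOF
export NVBLAS_CONFIG_FILE=/tmp/nvblas.conf

# 2) Preload libnvblas so BLAS3 calls are intercepted, then tell
#    netlib-java to use the native system BLAS via a system property.
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
spark-submit \
  --conf "spark.executor.extraJavaOptions=-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS" \
  --class MyApp myapp.jar   # hypothetical application
```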

By the way, I don't think the column/row-friendly format is an issue, because 
one can use transposed matrices to fit the required format. I believe that is 
just a software preference.

My suggestion regarding your prototype would be to compare it with Spark's 
implementation of logistic regression (which does not take advantage of the 
GPU) and also with BIDMach's (which does). That will give users a better 
understanding of your implementation's performance. Currently you compare it 
with Spark's example logistic regression implementation, which is meant as a 
learning reference rather than a performance benchmark.

Best regards, Alexander

From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com]
Sent: Thursday, January 21, 2016 3:34 AM
To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
Subject: RE: Using CUDA within Spark / boosting linear algebra

Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as another 
>>>> part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to efficiently 
exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and GPU-friendly 
column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by 
supporting data partition caching in GPU device memory and by providing binary 
column storage for data partitions. We would really appreciate your comments, 
suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,

I've updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).

This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.

Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
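For reference, the "approximated FLOPS" figure for an n×n dgemm is the standard 2·n³ floating-point operations divided by wall time; a minimal sketch of that bookkeeping, including the mean/median summary over repeated runs (the timing value is hypothetical):

```scala
// Sketch: approximating GFLOPS for a square matrix multiply from wall time,
// and summarizing repeated runs by mean and median as in the benchmark.
object Flops {
  // An n x n dgemm performs ~2*n^3 floating-point operations.
  def gflops(n: Long, seconds: Double): Double = 2.0 * n * n * n / seconds / 1e9

  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    if (s.length % 2 == 1) s(s.length / 2)
    else (s(s.length / 2 - 1) + s(s.length / 2)) / 2.0
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical run: a 10000x10000 multiply finishing in 2.5 s.
    println(f"${gflops(10000L, 2.5)}%.1f GFLOPS")
  }
}
```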

Best regards, Alexander


From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra


John, I have to disagree with you there. Dense matrices come up a lot in 
industry, although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:

RE: Using CUDA within Spark / boosting linear algebra

2016-01-20 Thread Ulanov, Alexander
Hi Everyone,

I’ve updated the benchmark and ran experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).

This time I computed the average and median of 10 runs for each experiment and 
approximated FLOPS.

Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas

Best regards, Alexander


From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra


John, I have to disagree with you there. Dense matrices come up a lot in 
industry, although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS 
are not very important for most machine learning workloads: at least for 
non-image workloads in industry (and for image processing you would probably 
want a deep learning/SGD solution with convolution kernels). e.g. it was only 
relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. 
What really matters is sparse BLAS performance. BIDMat is still an order of 
magnitude faster there. Those kernels are only in BIDMat, since NVIDIA's sparse 
BLAS don't perform well on power-law data.

It's also the case that the overall performance of an algorithm is determined by 
the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's 
performance on typical problems, you need to make sure that every kernel goes 
at a comparable speed. So the real question is how much faster MLlib routines are 
on a complete problem with/without GPU acceleration. For BIDMach, it's close to 
a factor of 10. But that required running entirely on the GPU, and making sure 
every kernel is close to its limit.
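The "slowest kernel" argument above is Amdahl's law applied per kernel; a quick sketch with purely hypothetical timing fractions:

```scala
// Sketch: overall speedup is bounded by the un-accelerated kernels
// (Amdahl's law). The runtime fractions below are hypothetical.
object Amdahl {
  // fractions: each kernel's share of total runtime (sums to 1.0)
  // speedups:  per-kernel speedup achieved
  def overall(fractions: Seq[Double], speedups: Seq[Double]): Double =
    1.0 / fractions.zip(speedups).map { case (f, s) => f / s }.sum

  def main(args: Array[String]): Unit = {
    // Accelerating only a dense-GEMM kernel (say 40% of runtime) by 10x
    // yields ~1.56x overall...
    println(overall(Seq(0.4, 0.6), Seq(10.0, 1.0)))
    // ...whereas accelerating every kernel by 10x yields the full 10x.
    println(overall(Seq(0.4, 0.6), Seq(10.0, 10.0)))
  }
}
```

This is why speeding up one BLAS routine in isolation moves an end-to-end benchmark far less than the kernel-level numbers suggest.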

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks.
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU performance 
from breeze/netlib-java - meaning there's no compelling performance reason to 
switch out our current linear algebra library (at least as far as this 
benchmark is concerned).

Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make sense 
to finally ship openblas compiled for some common platforms (64-bit linux, 
windows, mac) directly with Spark - hopefully eliminating the jblas warnings 
once and for all for most users? (Licensing is BSD) Or am I missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I 
double-checked them. It turns out that nvblas did not do the multiplication, due 
to the NVBLAS_TILE_DIM parameter from "nvblas.conf", and returned a zero matrix. 
My previously posted results with nvblas reflect matrix copying only. The default 
NVBLAS_TILE_DIM == 2048 is too big for my graphics card/matrix size. I handpicked 
other values that worked. As a result, netlib+nvblas is on par with 
BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
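Until that how-to is posted, here is roughly what such an nvblas.conf might look like. The key names follow the NVBLAS documentation, but every value below (library path, GPU list, tile size) is an illustrative guess that must be adapted per system:

```
# nvblas.conf -- located via the NVBLAS_CONFIG_FILE environment variable.
# Key names follow the NVBLAS docs; all values here are illustrative only.

NVBLAS_LOGFILE       nvblas.log

# CPU BLAS used as fallback for calls nvblas does not route to the GPU
# (mandatory; the path is an example).
NVBLAS_CPU_BLAS_LIB  /usr/lib64/libopenblas.so

# GPUs to use: ALL, or a space-separated device list.
NVBLAS_GPU_LIST      ALL

# Tile size used to split large GEMMs across GPUs. The default of 2048 was
# too large for the card/matrix sizes above; a smaller value worked.
NVBLAS_TILE_DIM      1024
```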



-Original Message-
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional 
performance for big matrices with Double, faster than BIDMat-cuda with Float. 
But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL 
might be a better choice. This correlates with the original nvblas presentation at 
GPU conf 2013 (slide 21): 
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case, these tests are not meant to generalize the performance of the 
different libraries. I just want to pick the library that performs dense matrix 
multiplication best for my task.

P.S. My previous issue with nvblas was the following: it has Fortran BLAS 
functions, while netlib-java uses C cblas functions. So one needs a cblas 
shared library to use nvblas through netlib-java.

Re: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Xiangrui Meng
Hi Alex,

Since it is non-trivial to make nvblas work with netlib-java, it would
be great if you could send the instructions to netlib-java as part of
the README. Hopefully we don't need to modify the netlib-java code to use
nvblas.

Best,
Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen so...@cloudera.com wrote:
 The license issue is with libgfortran, rather than OpenBLAS.

 (FWIW I am going through the motions to get OpenBLAS set up by default
 on CDH in the near future, and the hard part is just handling
 libgfortran.)

 On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
 Alright Sam - you are the expert here. If the GPL issues are unavoidable,
 that's fine - what is the exact bit of code that is GPL?

 The suggestion to use OpenBLAS is not to say it's the best option, but that
 it's a *free, reasonable default* for many users - keep in mind the most
 common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
 Additionally, for many of the problems we're targeting, this reasonable
 default can provide a 1-2 orders of magnitude improvement in performance
 over the f2jblas implementation that netlib-java falls back on.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





RE: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Ulanov, Alexander
Hi Sam, 

What is the best way to do it? Should I clone netlib-java, edit readme.md and 
make a PR?

Best regards, Alexander


-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, March 30, 2015 2:43 PM
To: Sean Owen
Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; 
jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alex,

Since it is non-trivial to make nvblas work with netlib-java, it would be great 
if you could send the instructions to netlib-java as part of the README. 
Hopefully we don't need to modify the netlib-java code to use nvblas.

Best,
Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen so...@cloudera.com wrote:
 The license issue is with libgfortran, rather than OpenBLAS.

 (FWIW I am going through the motions to get OpenBLAS set up by default 
 on CDH in the near future, and the hard part is just handling
 libgfortran.)

 On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
 Alright Sam - you are the expert here. If the GPL issues are 
 unavoidable, that's fine - what is the exact bit of code that is GPL?

 The suggestion to use OpenBLAS is not to say it's the best option, 
 but that it's a *free, reasonable default* for many users - keep in 
 mind the most common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
 Additionally, for many of the problems we're targeting, this 
 reasonable default can provide a 1-2 orders of magnitude improvement 
 in performance over the f2jblas implementation that netlib-java falls back 
 on.




Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread John Canny
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at 
least for non-image workloads in industry (and for image processing you 
would probably want a deep learning/SGD solution with convolution 
kernels). e.g. it was only relevant for 1/7 of our recent benchmarks, 
which should be a reasonable sample. What really matters is sparse BLAS 
performance. BIDMat is still an order of magnitude faster there. Those 
kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well 
on power-law data.


It's also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach's performance on typical problems, you need to make 
sure that every kernel goes at a comparable speed. So the real question is 
how much faster MLlib routines are on a complete problem with/without GPU 
acceleration. For BIDMach, it's close to a factor of 10. But that 
required running entirely on the GPU, and making sure every kernel is 
close to its limit.


-John

If you think nvblas would be helpful, you should try it in some 
end-to-end benchmarks.

On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU 
performance from breeze/netlib-java - meaning there's no compelling 
performance reason to switch out our current linear algebra library 
(at least as far as this benchmark is concerned).


Instead, it looks like a user guide for configuring Spark/MLlib to use 
the right BLAS library will get us most of the way there. Or, would it 
make sense to finally ship openblas compiled for some common platforms 
(64-bit linux, windows, mac) directly with Spark - hopefully 
eliminating the jblas warnings once and for all for most users? 
(Licensing is BSD) Or am I missing something?


Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
I'm not at all surprised ;-) I fully expect the GPU performance to get
better automatically as the hardware improves.

Netlib natives still need to be shipped separately. I'd also oppose any
move to make Open BLAS the default - is not always better and I think
natives really need DevOps buy-in. It's not the right solution for
everybody.
On 26 Mar 2015 01:23, Evan R. Sparks evan.spa...@gmail.com wrote:

 Yeah, much more reasonable - nice to know that we can get full GPU
 performance from breeze/netlib-java - meaning there's no compelling
 performance reason to switch out our current linear algebra library (at
 least as far as this benchmark is concerned).

 Instead, it looks like a user guide for configuring Spark/MLlib to use the
 right BLAS library will get us most of the way there. Or, would it make
 sense to finally ship openblas compiled for some common platforms (64-bit
 linux, windows, mac) directly with Spark - hopefully eliminating the jblas
 warnings once and for all for most users? (Licensing is BSD) Or am I
 missing something?


Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
Btw, OpenBLAS requires GPL runtime binaries which are typically considered
system libraries (and these fall under something similar to the Java
classpath exception rule)... so it's basically impossible to distribute
OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
Spark right now to clear up something of this nature.

On a more technical level, I'd recommend watching my talk at ScalaX which
explains in detail why high performance only comes from machine optimised
binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
on the CPU, not OpenBLAS).

On an even deeper level, using natives has consequences to JIT and GC which
isn't suitable for everybody and we'd really like people to go into that
with their eyes wide open.

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional 
performance for big matrices with Double, faster than BIDMat-cuda with Float. 
But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL 
might be a better choice. This correlates with the original nvblas presentation at 
GPU conf 2013 (slide 21): 
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case, these tests are not meant to generalize the performance of the 
different libraries. I just want to pick the library that performs dense matrix 
multiplication best for my task.

P.S. My previous issue with nvblas was the following: it has Fortran BLAS 
functions, while netlib-java uses C cblas functions. So one needs a cblas 
shared library to use nvblas through netlib-java. Fedora does not have 
cblas (but Debian and Ubuntu do), so I needed to compile it. I could not use 
the cblas from ATLAS or OpenBLAS because they link to their own implementations 
and not to Fortran BLAS.

Best regards, Alexander

-Original Message-
From: Ulanov, Alexander 
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas functions should 
replace the current BLAS function calls after setting LD_PRELOAD, as suggested in 
http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. 
It seems to work for a simple Java example, but I cannot make it work with Spark. 
I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set to use the GPU:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873     C  bash                                          39MiB   |
|    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java              39MiB   |
+-----------------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded 
/tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and cblas is supposedly used. However, 
the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% 
GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.

Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander



From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on 
various pieces of hardware...
On 9 Mar 2015 21:08, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested: added the 
comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see 
support for Double in the current source code) and did the test with BIDMat and CPU 
Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <men...@gmail.com> writes:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x 
 slower than netlib-openblas. What is the overhead of using a GPU BLAS 
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
 Better documentation for linking

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
That would be a difficult task that would only benefit users of
netlib-java. MultiBLAS is easily implemented (although a lot of
boilerplate) and benefits all BLAS users on the system.

If anyone knows of a funding route for it, I'd love to hear from them,
because it's too much work for me to take on at the moment as a hobby.
On 25 Mar 2015 22:16, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Sam,

 would it be easier to hack netlib-java to allow multiple (configurable)
 library contexts, and so enable 3rd-party configurations and optimizers to
 make their own choices until then?

 On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday sam.halli...@gmail.com
 wrote:

 Yeah, MultiBLAS... it is dynamic.

 Except, I haven't written it yet :-P
 On 25 Mar 2015 22:06, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

  Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols
 from the provided libblas.so.3 library at runtime. So, you can switch
 at runtime by providing another library. Sam, please suggest if there
 is another way.



 *From:* Dmitriy Lyubimov [mailto:dlie...@gmail.com]
 *Sent:* Wednesday, March 25, 2015 2:55 PM
 *To:* Ulanov, Alexander
 *Cc:* Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph
 Bradley; Evan R. Sparks; jfcanny
 *Subject:* Re: Using CUDA within Spark / boosting linear algebra



 Alexander,



  does using netlib imply that one cannot switch between CPU and GPU BLAS
 alternatives at will at the same time? The choice is always determined by
 linking alternatives to libblas.so, right?




Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between CPU/GPU? I have a project called
MultiBLAS that was going to do this; it should be easy (but boring to
write).

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Sure, I will write a how-to after I re-check the results.


RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Netlib knows nothing about the GPU (or CPU); it just uses CBLAS symbols from the
provided libblas.so.3 library at runtime. So you can switch at runtime
by providing another library. Sam, please suggest if there is another way.

From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
Sent: Wednesday, March 25, 2015 2:55 PM
To: Ulanov, Alexander
Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. 
Sparks; jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Alexander,

does using netlib imply that one cannot switch between CPU and GPU BLAS
alternatives at will at the same time? The choice is always determined by
linking alternatives to libblas.so, right?


RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
As everyone suggested, the results were too good to be true, so I
double-checked them. It turns out that nvblas did not do the multiplication,
due to the NVBLAS_TILE_DIM parameter in nvblas.conf, and returned a zero
matrix. My previously posted nvblas results therefore measured only matrix
copying. The default NVBLAS_TILE_DIM==2048 is too big for my graphics
card/matrix size; I handpicked other values that worked. As a result,
netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a
how-to for nvblas configuration.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
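For reference, a minimal nvblas.conf sketch covering the settings discussed here. The CPU BLAS path and tile size are illustrative values for this thread's symptom, not recommendations; consult the NVBLAS documentation for the full key list.

```
# nvblas.conf -- point the NVBLAS_CONFIG_FILE environment variable at this
# file, or keep it in the working directory of the JVM process.

# CPU BLAS that nvblas falls back to for small or unsupported calls
# (illustrative path -- use whatever OpenBLAS/MKL build you have).
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so

# Which GPUs to use.
NVBLAS_GPU_LIST ALL

# Tile size used to split GEMM across the GPU. The default of 2048 was too
# large for the card/matrix sizes in this thread and silently produced a
# zero result; smaller values (handpicked per card) worked.
NVBLAS_TILE_DIM 1024
```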



-Original Message-
From: Ulanov, Alexander 
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional
performance for big matrices with Double, faster than BIDMat-cuda with Float.
But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL
might be a better choice. This correlates with the original nvblas presentation
at GPU conf 2013 (slide 21):
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
 
My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 

Just in case: these tests are not meant to generalize the performance of
different libraries. I just want to pick the library that performs dense
matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS
functions, while netlib-java uses C CBLAS functions. So one needs a CBLAS
shared library to use nvblas through netlib-java. Fedora does not ship CBLAS
(Debian and Ubuntu do), so I needed to compile it. I could not use the CBLAS
from ATLAS or OpenBLAS because they link to their own implementation and not
to the Fortran BLAS.

Best regards, Alexander

-Original Message-
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas functions should
replace the current BLAS function calls via LD_PRELOAD, as suggested in
http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java.
It seems to work for a simple Java example, but I cannot make it work with
Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
In nvidia-smi I observe that the Java process is attached to the GPU:
+-+
| Processes:   GPU Memory |
|  GPU   PID  Type  Process name   Usage  |
|=|
|0  8873C   bash39MiB |
|0  8910C   /usr/lib/jvm/java-1.7.0/bin/java39MiB |
+-+

In Spark shell I do matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded 
/tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and CBLAS is supposedly used. However,
matrix multiplication executes on the CPU, since I see 16% CPU usage and 0%
GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.

Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander



From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on 
various pieces of hardware...
On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi everyone, I've updated the benchmark as Xiangrui suggested: added a comment
that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for
Double in the current source code), and ran the test with BIDMat and CPU
Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
too good... did you check the results for correctness? - also, is it
possible that the unified memory model of nvblas is somehow hiding pci
transfer time?)

This last bit (getting nvblas + netlib-java to play together) sounds
non-trivial and like it took you a while to figure out! Would you mind posting
a gist of the shell scripts/exports you used to make this work? I can imagine
it being highly useful for others in the future.

Thanks!
Evan
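Evan's correctness question is worth automating. Below is a minimal, library-agnostic sketch (plain Python, helper names hypothetical) of the kind of sanity check that would catch a silently failing backend, such as the all-zero nvblas results reported elsewhere in this thread: probe the multiply routine on a small random input and compare against a naive reference.

```python
import random

def matmul_ref(a, b):
    """Naive triple-loop reference multiply for small probe matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def sanity_check(multiply, n=8, tol=1e-9):
    """Run `multiply` on a small random probe and compare to the reference.

    Catches silent failures such as an all-zero result (the NVBLAS_TILE_DIM
    symptom described in this thread) before trusting benchmark timings."""
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = multiply(a, b)
    if all(abs(x) < tol for row in c for x in row):
        return False  # all-zero output: the multiplication silently failed
    ref = matmul_ref(a, b)
    return all(abs(c[i][j] - ref[i][j]) < 1e-6
               for i in range(n) for j in range(n))

# A correct backend passes; a backend that returns zeros is caught.
assert sanity_check(matmul_ref)
assert not sanity_check(lambda a, b: [[0.0] * len(b[0]) for _ in a])
```

In a real benchmark the `multiply` argument would wrap the BLAS-backed call (e.g. netlib-java's gemm); the checker only assumes it takes and returns row-major lists.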


Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread jfcanny
Alex,
I think you should recheck your numbers. Both BIDMat and nvblas are
wrappers for cublas. The speeds are identical, except on machines that
have multiple GPUs, which nvblas exploits and cublas doesn't.

It would be a good idea to add a column with Gflop throughput. Your
numbers for the BIDMat 10kx10k multiply give about 300 single-float gflops,
which seems about right for a Quadro 4000 (current-generation devices
are >10x faster than a 4000).

Your numbers for netlib-nvblas would indicate a double-float throughput
of 8 tflops, which is physically impossible on that device.

It shouldn't matter which interface you use if you have a single GPU.

-John
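The throughput column John suggests is simple arithmetic: an n x n x n GEMM performs 2n^3 floating-point operations, so effective GFLOPS follow directly from the wall-clock time. A small sketch (helper name hypothetical):

```python
def gemm_gflops(n, seconds):
    """Effective throughput of an n x n by n x n multiply.

    A square GEMM performs 2*n^3 floating-point operations
    (n^3 multiplies plus n^3 adds)."""
    return 2.0 * n**3 / seconds / 1e9

# A 10k x 10k multiply finishing in ~6.7 s is ~300 GFLOPS -- plausible for
# a Quadro 4000 in single precision.
assert abs(gemm_gflops(10_000, 6.7) - 300) < 5

# The same multiply in 0.25 s would be 8 TFLOPS -- the physically
# impossible double-precision figure John flags, and the tell that the
# timing measured data copying rather than computation.
assert gemm_gflops(10_000, 0.25) == 8000.0
```

This is exactly the cross-check that exposes a benchmark measuring copies instead of compute: convert every timing to GFLOPS and compare against the device's peak.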


Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread Chester At Work
Reynold,

Prof. Canny gave me the slides yesterday. I will post the link to the
slides to both the SF Big Analytics and SF Machine Learning meetups.

Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin r...@databricks.com wrote:

 Thanks for chiming in, John. I missed your meetup last night - do you have
 any writeups or slides about roofline design? In particular, I'm curious
 about what optimizations are available for power-law dense * sparse? (I
 don't have any background in optimizations)
 
 
 
 On Thu, Mar 12, 2015 at 8:50 PM, jfcanny ca...@berkeley.edu wrote:
 
 If you're contemplating GPU acceleration in Spark, it's important to look
 beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
 datasets we've tested in BIDMach, and we've tried to make them
 representative of industry machine learning workloads. Unless you're
 crunching images or audio, the majority of data will be very sparse and
 power-law distributed. You need a good sparse BLAS, and in practice it
 seems like you need a sparse BLAS tailored for power-law data. We had to
 write our own, since the NVIDIA libraries didn't perform well on typical
 power-law data. The Intel MKL sparse BLAS also have issues, and we only use
 some of them.
 
 You also need 2D reductions, scan operations, slicing, element-wise
 transcendental functions and operators, many kinds of sort, random number
 generators, etc., and some kind of memory management strategy. Some of this
 was layered on top of Thrust in BIDMat, but most had to be written from
 scratch. It's all been rooflined, typically to the memory throughput of
 current GPUs (around 200 GB/s).
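The rooflining John describes bounds each kernel by the smaller of the device's compute peak and its memory bandwidth times the kernel's arithmetic intensity. A toy sketch (all numbers illustrative, not measurements from BIDMach):

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable throughput under the roofline model: the minimum of the
    compute peak and memory bandwidth * arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A streaming kernel doing 0.25 flops per byte on a 200 GB/s GPU is
# memory-bound at 50 GFLOPS, no matter how high the compute peak is.
assert roofline_gflops(4000, 200, 0.25) == 50
```

Kernels sitting on the memory-bandwidth roof, as described above, are as fast as the hardware allows; a kernel below the roof has a quantified gap worth profiling.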
 
 When you have all this you can write learning algorithms with the same
 high-level primitives available in Breeze or Numpy/Scipy. It's literally the
 same in BIDMat, since the generic matrix operations are implemented on both
 CPU and GPU, so the same code runs on either platform.
 
 A lesser-known fact is that GPUs are around 10x faster for *all* those
 operations, not just dense BLAS. It's mostly due to faster streaming memory
 speeds, but some kernels (random number generation and transcendentals) are
 more than an order of magnitude faster, thanks to some specialized hardware
 for power series on the GPU chip.
 
 When you have all this there is no need to move data back and forth across
 the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
 and feed them to the available GPUs. Most models fit comfortably in GPU
 memory these days (4-12 GB). With minibatch algorithms you can push TBs of
 data through the GPU this way.
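The streaming arithmetic is easy to check: with minibatches, a device holding a few GB can process TBs. A toy calculation (editor's sketch; all sizes here are illustrative assumptions, not BIDMach's actual configuration):

```scala
// Schematic minibatch flow: the host loop only loads and hands off
// fixed-size chunks; in the real pipeline all math would stay on the GPU.
val gpuMemBytes = 4L * 1024 * 1024 * 1024         // a 4 GB device
val batchBytes  = 64L * 1024 * 1024               // 64 MB minibatches
val totalBytes  = 1L * 1024 * 1024 * 1024 * 1024  // 1 TB of training data

val batches = totalBytes / batchBytes             // 16384 host->device transfers
// The model plus one minibatch never approaches 4 GB, so terabytes stream
// through a device that only holds gigabytes at any instant.
val batchesResidentAtOnce = gpuMemBytes / batchBytes  // 64 could fit at once
```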
 
 
 
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 




Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread jfcanny
Hi Reynold,
I left Chester with a copy of the slides, so I assume they'll be posted 
on the SF ML or Big Data sites. We have a draft paper under review. I 
can ask the co-authors about arXiv-ing it.

We have a few heuristics for power-law data. One of them is to keep the 
feature set sorted by frequency. Power-law data has roughly the same 
mass in each power-of-two range of feature frequency. By keeping the 
most frequent features together, you get a lot more value out of the 
caches on the device (even GPUs have them, albeit smaller ones). E.g., 
with 100 million features, 1/2 of the feature instances will be in the 
range 1,...,10,000. If they're consecutive, they will all hit a fast 
cache. Another 1/4 will be in 1,...,1,000,000, hitting the next cache, etc.
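The reordering heuristic can be sketched in a few lines of Scala (an editor's illustration with made-up feature ids, not BIDMach code): renumber features so the most frequent ones get the smallest ids, keeping the hot part of the model adjacent in memory.

```scala
// One feature id per nonzero in the dataset (toy data):
val featureIds = Array(7, 2, 7, 7, 5, 2, 9)
// Count occurrences of each feature:
val freq = featureIds.groupBy(identity).map { case (f, xs) => (f, xs.length) }
// Map old id -> new id, ordered by descending frequency:
val remap = freq.toSeq.sortBy(-_._2).map(_._1).zipWithIndex.toMap
val remapped = featureIds.map(remap)
// Feature 7 (3 occurrences) becomes id 0, feature 2 (2 occurrences) id 1,
// and the singleton features 5 and 9 take ids 2 and 3.
```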

Another is to subdivide sparse matrices using the vector of elements 
rather than rows or columns. Splitting power-law matrices by either rows 
or columns gives very uneven splits. That means we store sparse matrices 
in coordinate form rather than compressed row or column format.
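A sketch of that element-wise splitting on a coordinate-form matrix (illustrative code, not BIDMat's actual representation): chunks of the nonzero vector are equal-sized no matter how skewed the rows or columns are.

```scala
// Coordinate (COO) storage: one (row, col, value) triple per nonzero.
case class Coo(row: Int, col: Int, v: Double)
val nnz = Seq(Coo(0,0,1.0), Coo(0,1,2.0), Coo(0,2,3.0),
              Coo(5,0,4.0), Coo(9,3,5.0), Coo(9,7,6.0))
val parts = 3
// Split the element vector itself, not rows or columns:
val chunks = nnz.grouped((nnz.length + parts - 1) / parts).toSeq
// Every chunk holds the same number of nonzeros (here 2 each), even though
// row 0 has three elements and row 5 only one.
```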

Other than that, rooflining gives you a goal that you should be able to 
reach. If you aren't at the limit, just knowing that gives you a target 
to aim at. You can try profiling the kernel to figure out why it's slower 
than it should be. There are a few common reasons (low occupancy, 
imbalanced thread blocks, thread divergence) that you can discover with 
the profiler. Then hopefully you can solve them.
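The roofline bound itself is one line of arithmetic: attainable throughput is the minimum of the compute peak and bandwidth times arithmetic intensity. A sketch with illustrative numbers (editor's example, not measurements of any particular GPU):

```scala
// Roofline model: attainable GFLOP/s given a device's peak compute,
// memory bandwidth, and the kernel's flops-per-byte ratio.
def roofline(peakGflops: Double, bwGBs: Double, flopsPerByte: Double): Double =
  math.min(peakGflops, bwGBs * flopsPerByte)

// A streaming kernel like SpMV on power-law data might do ~0.25 flops/byte,
// so on a hypothetical 4 TFLOP/s, 200 GB/s device:
val spmv = roofline(peakGflops = 4000.0, bwGBs = 200.0, flopsPerByte = 0.25)
// min(4000, 200 * 0.25) = 50 GFLOP/s: memory-bound, so the 200 GB/s
// streaming bandwidth is the number to chase, exactly as described above.
```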

-John


On 3/12/2015 10:56 PM, rxin [via Apache Spark Developers List] wrote:
 Thanks for chiming in, John. I missed your meetup last night - do you 
 have
 any writeups or slides about roofline design? In particular, I'm curious
 about what optimizations are available for power-law dense * sparse? (I
 don't have any background in optimizations)




RE: Using CUDA within Spark / boosting linear algebra

2015-03-10 Thread Ulanov, Alexander
I can run the benchmark on another machine with an Nvidia Titan GPU and an Intel 
Xeon E5-2650 v2, although it runs Windows and I have to run the Linux tests in 
VirtualBox.

It would also be interesting to add results for netlib+nvblas; however, I am not 
sure I understand in detail how to build this, and I will appreciate any help 
from you ☺

RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Ulanov, Alexander
Hi Everyone, I've updated the benchmark as Xiangrui suggested: I added a comment 
that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for 
Double in the current source code) and ran the test with BIDMat and CPU Double 
matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander


RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Sam Halliday
Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on
various pieces of hardware...

Re: Using CUDA within Spark / boosting linear algebra

2015-03-03 Thread Sam Halliday
BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng men...@gmail.com writes:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:
 Better documentation for linking would be very helpful!  Here's a JIRA:
 https://issues.apache.org/jira/browse/SPARK-6019


 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Thanks for compiling all the data and running these benchmarks, Alex. The
 big takeaways here can be seen with this chart:

 https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

 1) A properly configured GPU matrix multiply implementation (e.g.
 BIDMat+GPU) can provide substantial (but less than an order of magnitude)
 benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
 netlib-java+openblas-compiled).
 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
 than a well-tuned CPU implementation, particularly for larger matrices.
 (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
 basically agrees with the author's own benchmarks (
 https://github.com/fommil/netlib-java)

 I think that most of our users are in a situation where using GPUs may not
 be practical - although we could consider having a good GPU backend
 available as an option. However, *ALL* users of MLlib could benefit
 (potentially tremendously) from using a well-tuned CPU-based BLAS
 implementation. Perhaps we should consider updating the mllib guide with a
 more complete section for enabling high performance binaries on OSX and
 Linux? Or better, figure out a way for the system to fetch these
 automatically.

 - Evan



 On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

 Just to summarize this thread, I was finally able to make all performance
 comparisons that we discussed. It turns out that:
 BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
 netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform copying
 to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Thanks, Evan! It seems that the ticket was marked as a duplicate, though the
 original one discusses a slightly different topic. I was able to link netlib
 with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
 60MB library.

 |A*B size             | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
 |100x100*100x100      | 0,00205596  | 0,000381    | 0,03810324  | 0,002556    |
 |1000x1000*1000x1000  | 0,018320947 | 0,038316857 | 0,51803557  | 1,638475459 |
 |1x1*1x1              | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |

 It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
 my machine. Probably I'll add two more columns with locally compiled
 OpenBLAS and CUDA.

 Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Monday, February 09, 2015 6:06 PM
 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Great - perhaps we can move this discussion off-list and onto a JIRA
 ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

 It seems like this is going to be somewhat exploratory for a while (and
 there's probably only a handful of us who really care about fast linear
 algebra!)

 - Evan

 On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:
 Hi Evan,

 Thank you for explanation and useful link. I am going to build OpenBLAS,
 link it with Netlib-java and perform benchmark again.

 Do I understand correctly that BIDMat binaries contain statically linked
 Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
 having MKL BLAS installed on my server. If it is true, I wonder if it is OK
 because Intel sells this library. Nevertheless, it seems that in my case
 precompiled MKL BLAS performs better than precompiled OpenBLAS given that
 BIDMat and Netlib-java are supposed to be on par

RE: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Ulanov, Alexander
Thanks, Sam, for the suggestion! I should try doing this. Now I suppose that 
netlib-java linked with cuBLAS does fall back at execution time to the cblas 
library on my system, which is ATLAS. If I remove ATLAS, netlib (linked with 
cuBLAS) fails with the message "undefined symbol: cblas_dgemm".

In the meantime, I have updated my spreadsheet with BIDMat-cuda results that 
copy from main memory to the GPU, multiply, and then copy the result back to 
main memory (similar to what Xiangrui did). Surprisingly (for myself), the 
copying overhead seems quite small, especially for the bigger matrices.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
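For anyone trying to reproduce a setup like the one above, a minimal sketch might look like the following. All paths, and the choice of OpenBLAS as the CPU fallback, are illustrative assumptions rather than the configuration used in this thread; see the NVIDIA NVBLAS documentation linked elsewhere in the thread for the authoritative keys.

```shell
# Editor's sketch of a minimal NVBLAS setup (paths are illustrative).
# nvblas.conf names the CPU BLAS that NVBLAS falls back to for the routines
# it does not intercept -- this fallback is where the missing cblas_dgemm
# symbol came from when ATLAS was removed.
cat > nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
EOF
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
# Preload NVBLAS so its GEMM intercepts the Level 3 calls netlib-java makes
# (the java command is a placeholder for your Spark driver):
# LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so java -jar my-spark-app.jar
```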

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Monday, March 02, 2015 1:24 PM
To: Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra

That's correct. It's highly unusual for a libblas.so to only provide the 
Fortran API. Oh well... CBLAS sources are available in the netlib-java 
repository so you could simply compile them and link against whatever 
libblas.so[fortran] you like.

On 2 March 2015 at 21:04, Ulanov, Alexander alexander.ula...@hp.com wrote:
 Hi Xiangrui,

 Thanks for the link, I am currently trying to use nvblas. It seems that 
 netlib wrappers are implemented with C-BLAS interface and nvblas does not 
 have c-blas. I wonder how it is going to work. I'll keep you updated.

 Alexander

 -Original Message-
 From: Xiangrui Meng [mailto:men...@gmail.com]
 Sent: Monday, March 02, 2015 11:42 AM
 To: Sam Halliday
 Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote:
 Also, check the JNILoader output.

 Remember, for netlib-java to use your system libblas all you need to 
 do is setup libblas.so.3 like any native application would expect.

 I haven't ever used the cublas real BLAS  implementation, so I'd be 
 interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to 
 check that all the runtime links are in order.


 There are two shared libraries in this hybrid setup. nvblas.so must be 
 loaded before libblas.so to intercept level 3 routines using GPU. More 
 details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage

 Btw, I have some DGEMM wrappers in my netlib-java performance 
 module... and I also planned to write more in MultiBLAS (until I 
 mothballed the project for the hardware to catch up, which it 
 probably has, and now I just need a reason to look at it)

 On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

 Hey Sam,

 The running times are not big O estimates:

  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.

 I think there is something wrong with the netlib/cublas combination.
 Sam already mentioned that cuBLAS doesn't implement the CPU BLAS 
 interfaces. I checked the CUDA doc and it seems that to use GPU BLAS 
 through the CPU BLAS interface we need to use NVBLAS, which 
 intercepts some Level 3 CPU BLAS calls (including GEMM). So we need 
 to load nvblas.so first and then some CPU BLAS library in JNI. I 
 wonder whether the setup was correct.

 Alexander, could you check whether GPU is used in the netlib-cublas 
 experiments? You can tell it by watching CPU/GPU usage.

 Best,
 Xiangrui

 On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
 sam.halli...@gmail.com
 wrote:
  Don't use big O estimates, always measure. It used to work back 
  in the days when double multiplication was a bottleneck. The 
  computation cost is effectively free on both the CPU and GPU and 
  you're seeing pure copying costs. Also, I'm dubious that cublas is 
  doing what you think it is. Can you link me to the source code for 
  DGEMM?
 
  I show all of this in my talk, with explanations, I can't stress 
  enough how much I recommend that you watch it if you want to 
  understand high performance hardware acceleration for linear 
  algebra :-)
 
  On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:
 
  The copying overhead should be quadratic on n, while the 
  computation cost is cubic on n. I can understand that 
  netlib-cublas is slower than netlib-openblas on small problems.
  But I'm surprised to see that it is still 20x slower on 
  1x1. I did the following on a g2.2xlarge instance with BIDMat:
 
  val n = 1
 
  val f = rand(n, n)
  flip; f*f; val rf = flop
 
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val 
  rg = flop
 
  flip; g*g; val rgg = flop
 
  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.
 
  I'm not sure whether my CPU-GPU-CPU code simulates the 
  netlib-cublas path. But based on the result, the data copying 
  overhead is definitely not as big
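Xiangrui's point about quadratic copying versus cubic compute is easy to put in numbers. A sketch with assumed PCIe bandwidth and GEMM throughput (editor's illustration with made-up figures, not measurements from this thread):

```scala
// Copying moves O(n^2) bytes while GEMM does O(n^3) flops, so the copy
// share of total time shrinks as n grows.
def copyFraction(n: Long, bwBytesPerSec: Double, gflops: Double): Double = {
  val copySec = 3.0 * 8 * n * n / bwBytesPerSec   // A, B in; C out (doubles)
  val gemmSec = 2.0 * n * n * n / (gflops * 1e9)  // 2n^3 flops for GEMM
  copySec / (copySec + gemmSec)
}

// With an assumed ~6 GB/s over PCIe and ~1 TFLOP/s of GEMM, copying
// dominates for 1000x1000 matrices but is a modest overhead at 10000x10000:
val small = copyFraction(1000,  6e9, 1000.0)  // about 2/3 of time is copying
val big   = copyFraction(10000, 6e9, 1000.0)  // about 1/6 of time is copying
```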

RE: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Ulanov, Alexander
Hi Xiangrui,

Thanks for the link, I am currently trying to use nvblas. It seems that netlib 
wrappers are implemented with C-BLAS interface and nvblas does not have c-blas. 
I wonder how it is going to work. I'll keep you updated.

Alexander

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, March 02, 2015 11:42 AM
To: Sam Halliday
Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra

On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote:
 Also, check the JNILoader output.

 Remember, for netlib-java to use your system libblas all you need to 
 do is setup libblas.so.3 like any native application would expect.

 I haven't ever used the cublas real BLAS  implementation, so I'd be 
 interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to 
 check that all the runtime links are in order.


There are two shared libraries in this hybrid setup. nvblas.so must be loaded 
before libblas.so to intercept level 3 routines using GPU. More details are at: 
http://docs.nvidia.com/cuda/nvblas/index.html#Usage

 Btw, I have some DGEMM wrappers in my netlib-java performance 
 module... and I also planned to write more in MultiBLAS (until I 
 mothballed the project for the hardware to catch up, which it probably 
 has, and now I just need a reason to look at it)

 On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

 Hey Sam,

 The running times are not big O estimates:

  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.

 I think there is something wrong with the netlib/cublas combination.
 Sam already mentioned that cuBLAS doesn't implement the CPU BLAS 
 interfaces. I checked the CUDA doc and it seems that to use GPU BLAS 
 through the CPU BLAS interface we need to use NVBLAS, which 
 intercepts some Level 3 CPU BLAS calls (including GEMM). So we need 
 to load nvblas.so first and then some CPU BLAS library in JNI. I 
 wonder whether the setup was correct.

 Alexander, could you check whether GPU is used in the netlib-cublas 
 experiments? You can tell it by watching CPU/GPU usage.

 Best,
 Xiangrui

 On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
 sam.halli...@gmail.com
 wrote:
  Don't use big O estimates, always measure. It used to work back 
  in the days when double multiplication was a bottleneck. The 
  computation cost is effectively free on both the CPU and GPU and 
  you're seeing pure copying costs. Also, I'm dubious that cublas is 
  doing what you think it is. Can you link me to the source code for 
  DGEMM?
 
  I show all of this in my talk, with explanations, I can't stress 
  enough how much I recommend that you watch it if you want to 
  understand high performance hardware acceleration for linear 
  algebra :-)
 
  On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:
 
  The copying overhead should be quadratic on n, while the 
  computation cost is cubic on n. I can understand that 
  netlib-cublas is slower than netlib-openblas on small problems. 
   But I'm surprised to see that it is still 20x slower on 
   10000x10000. I did the following on a g2.2xlarge instance with BIDMat:
  
   val n = 10000
 
  val f = rand(n, n)
  flip; f*f; val rf = flop
 
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val 
  rg = flop
 
  flip; g*g; val rgg = flop
 
  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.
 
  I'm not sure whether my CPU-GPU-CPU code simulates the 
  netlib-cublas path. But based on the result, the data copying 
   overhead is definitely not as big as 20x at n = 10000.
 
  Best,
  Xiangrui
 
 
  On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
  sam.halli...@gmail.com
  wrote:
   I've had some email exchanges with the author of BIDMat: it does 
   exactly what you need to get the GPU benefit and writes higher 
   level algorithms entirely in the GPU kernels so that the memory 
   stays there as long as possible. The restriction with this 
   approach is that it is only offering high-level algorithms so is 
   not a toolkit for applied mathematics research and development 
   --- but it works well as a toolkit for higher level analysis 
   (e.g. for analysts and practitioners).
  
   I believe BIDMat's approach is the best way to get performance 
   out of GPU hardware at the moment but I also have strong 
   evidence to suggest that the hardware will catch up and the 
   memory transfer costs between CPU/GPU will disappear meaning 
   that there will be no need for custom GPU kernel 
   implementations. i.e. please continue to use BLAS primitives 
   when writing new algorithms and only go to the GPU for an 
   alternative optimised implementation.
  
    Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, 
    and offer an API that looks like BLAS but takes pointers to special 
    regions in GPU memory.

Re: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Xiangrui Meng
 the
   potential to eliminate the green line.
  
   Best regards,
   Sam
  
  
  
   Ulanov, Alexander alexander.ula...@hp.com writes:
  
    Evan, thank you for the summary. I would like to add some more
    observations. The GPU that I used is 2.5 times cheaper than the CPU
    ($250 vs $100). They both are 3 years old. I also did a small test
    with modern hardware, and the new GPU nVidia Titan was slightly more
    than 1 order of magnitude faster than Intel E5-2650 v2 for the same
    tests. However, it costs as much as the CPU ($1200). My takeaway is
    that GPUs are making better price/value progress.
  
  
  
    Xiangrui, I was also surprised that BIDMat-cuda was faster than
    netlib-cuda, and the most reasonable explanation is that it holds
    the result in GPU memory, as Sam suggested. At the same time, it is
    OK because you can copy the result back from the GPU only when
    needed. However, to be sure, I am going to ask the developer of
    BIDMat at his upcoming talk.
  
  
  
   Best regards, Alexander
  
  
   From: Sam Halliday [mailto:sam.halli...@gmail.com]
   Sent: Thursday, February 26, 2015 1:56 PM
   To: Xiangrui Meng
   Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
   Sparks
   Subject: Re: Using CUDA within Spark / boosting linear algebra
  
  
   Btw, I wish people would stop cheating when comparing CPU and GPU
   timings for things like matrix multiply :-P
  
    Please always compare apples with apples and include the time it takes
    to set up the matrices, send them to the processing unit, do the
    calculation AND copy the results back to where you need to see them.
  
   Ignoring this method will make you believe that your GPU is
   thousands
   of times faster than it really is. Again, jump to the end of my talk
   for
    graphs and more discussion & especially the bit about me being
   keen on
   funding to investigate APU hardware further ;-) (I believe it will
   solve the
   problem)
   On 26 Feb 2015 21:16, Xiangrui Meng
    men...@gmail.com wrote:
   Hey Alexander,
  
   I don't quite understand the part where netlib-cublas is about 20x
   slower than netlib-openblas. What is the overhead of using a GPU
   BLAS
   with netlib-java?
  
   CC'ed Sam, the author of netlib-java.
  
   Best,
   Xiangrui
  
   On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
    jos...@databricks.com wrote:
   Better documentation for linking would be very helpful!  Here's a
   JIRA:
   https://issues.apache.org/jira/browse/SPARK-6019
  
  
   On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
    evan.spa...@gmail.com
   wrote:
  
   Thanks for compiling all the data and running these benchmarks,
   Alex.
   The
   big takeaways here can be seen with this chart:
  
  
  
    https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
  
   1) A properly configured GPU matrix multiply implementation (e.g.
   BIDMat+GPU) can provide substantial (but less than an order of
   magnitude)
   benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
   netlib-java+openblas-compiled).
   2) A poorly tuned CPU implementation can be 1-2 orders of
   magnitude
   worse
   than a well-tuned CPU implementation, particularly for larger
   matrices.
   (netlib-f2jblas or netlib-ref) This is not to pick on netlib -
   this
    basically agrees with the author's own benchmarks (
   https://github.com/fommil/netlib-java)
  
   I think that most of our users are in a situation where using GPUs
   may not
   be practical - although we could consider having a good GPU
   backend
   available as an option. However, *ALL* users of MLlib could
   benefit
   (potentially tremendously) from using a well-tuned CPU-based BLAS
   implementation. Perhaps we should consider updating the mllib
   guide
   with a
   more complete section for enabling high performance binaries on
   OSX
   and
   Linux? Or better, figure out a way for the system to fetch these
   automatically.
  
   - Evan
  
  
  
   On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
    alexander.ula...@hp.com wrote:
  
   Just to summarize this thread, I was finally able to make all
   performance
   comparisons that we discussed. It turns out that:
    BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
    netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
  
   Below is the link to the spreadsheet with full results.
  
  
  
   https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
  
   One thing still needs exploration: does BIDMat-cublas perform
   copying
   to/from machine’s RAM?
  
    -----Original Message-----
   From: Ulanov, Alexander
   Sent: Tuesday, February 10, 2015 2:12 PM
   To: Evan R. Sparks
   Cc: Joseph Bradley;
    dev@spark.apache.org

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
 26, 2015 1:56 PM
  To: Xiangrui Meng
  Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
  Sparks
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
 
  Btw, I wish people would stop cheating when comparing CPU and GPU
  timings for things like matrix multiply :-P
 
   Please always compare apples with apples and include the time it takes
   to set up the matrices, send them to the processing unit, do the
   calculation AND copy the results back to where you need to see them.
 
  Ignoring this method will make you believe that your GPU is thousands
  of times faster than it really is. Again, jump to the end of my talk for
   graphs and more discussion & especially the bit about me being keen on
  funding to investigate APU hardware further ;-) (I believe it will solve 
  the
  problem)
  On 26 Feb 2015 21:16, Xiangrui Meng
   men...@gmail.com wrote:
  Hey Alexander,
 
  I don't quite understand the part where netlib-cublas is about 20x
  slower than netlib-openblas. What is the overhead of using a GPU BLAS
  with netlib-java?
 
  CC'ed Sam, the author of netlib-java.
 
  Best,
  Xiangrui
 
  On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
   jos...@databricks.com wrote:
  Better documentation for linking would be very helpful!  Here's a
  JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
   evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks, Alex.
  The
  big takeaways here can be seen with this chart:
 
 
   https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
 
  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide substantial (but less than an order of
  magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
  worse
  than a well-tuned CPU implementation, particularly for larger
  matrices.
  (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
   basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs
  may not
  be practical - although we could consider having a good GPU backend
  available as an option. However, *ALL* users of MLlib could benefit
  (potentially tremendously) from using a well-tuned CPU-based BLAS
  implementation. Perhaps we should consider updating the mllib guide
  with a
  more complete section for enabling high performance binaries on OSX
  and
  Linux? Or better, figure out a way for the system to fetch these
  automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
   alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all
  performance
  comparisons that we discussed. It turns out that:
   BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
   netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
 
  Below is the link to the spreadsheet with full results.
 
 
  https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 
  One thing still needs exploration: does BIDMat-cublas perform
  copying
  to/from machine’s RAM?
 
   -----Original Message-----
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley;
   dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
   Thanks, Evan! It seems that the ticket was marked as a duplicate though
   the original one discusses a slightly different topic. I was able to link
  netlib
  with MKL from BIDMat binaries. Indeed, MKL is statically linked
  inside a
  60MB library.
 
  |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
  Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
 
  +---+
  |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
   |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
   |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
 
   It turns out that pre-compiled MKL is faster than pre-compiled
  OpenBlas on
  my machine. Probably, I’ll add two more columns with locally
  compiled
  openblas and cuda.
 
  Alexander
 
  From: Evan R. Sparks
   [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley;
   dev@spark.apache.org
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
  Great - perhaps we can move

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Sam Halliday
. I also did a small test with
 modern
   hardware, and the new GPU nVidia Titan was slightly more than 1
 order of
   magnitude faster than Intel E5-2650 v2 for the same tests. However,
 it costs
    as much as the CPU ($1200). My takeaway is that GPUs are making better
   price/value progress.
  
  
  
   Xiangrui, I was also surprised that BIDMat-cuda was faster than
   netlib-cuda and the most reasonable explanation is that it holds the
 result
   in GPU memory, as Sam suggested. At the same time, it is OK because
 you can
   copy the result back from GPU only when needed. However, to be sure,
 I am
  going to ask the developer of BIDMat at his upcoming talk.
  
  
  
   Best regards, Alexander
  
  
   From: Sam Halliday [mailto:sam.halli...@gmail.com]
   Sent: Thursday, February 26, 2015 1:56 PM
   To: Xiangrui Meng
   Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
   Sparks
   Subject: Re: Using CUDA within Spark / boosting linear algebra
  
  
   Btw, I wish people would stop cheating when comparing CPU and GPU
   timings for things like matrix multiply :-P
  
    Please always compare apples with apples and include the time it takes
    to set up the matrices, send them to the processing unit, do the
    calculation AND copy the results back to where you need to see them.
  
   Ignoring this method will make you believe that your GPU is thousands
   of times faster than it really is. Again, jump to the end of my talk
 for
    graphs and more discussion & especially the bit about me being
 keen on
   funding to investigate APU hardware further ;-) (I believe it will
 solve the
   problem)
   On 26 Feb 2015 21:16, Xiangrui Meng
    men...@gmail.com wrote:
   Hey Alexander,
  
   I don't quite understand the part where netlib-cublas is about 20x
   slower than netlib-openblas. What is the overhead of using a GPU BLAS
   with netlib-java?
  
   CC'ed Sam, the author of netlib-java.
  
   Best,
   Xiangrui
  
   On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
    jos...@databricks.com wrote:
   Better documentation for linking would be very helpful!  Here's a
   JIRA:
   https://issues.apache.org/jira/browse/SPARK-6019
  
  
   On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
    evan.spa...@gmail.com
   wrote:
  
   Thanks for compiling all the data and running these benchmarks,
 Alex.
   The
   big takeaways here can be seen with this chart:
  
  
  
  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
  
   1) A properly configured GPU matrix multiply implementation (e.g.
   BIDMat+GPU) can provide substantial (but less than an order of
   magnitude)
   benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
   netlib-java+openblas-compiled).
   2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
   worse
   than a well-tuned CPU implementation, particularly for larger
   matrices.
   (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
    basically agrees with the author's own benchmarks (
   https://github.com/fommil/netlib-java)
  
   I think that most of our users are in a situation where using GPUs
   may not
   be practical - although we could consider having a good GPU backend
   available as an option. However, *ALL* users of MLlib could benefit
   (potentially tremendously) from using a well-tuned CPU-based BLAS
   implementation. Perhaps we should consider updating the mllib guide
   with a
   more complete section for enabling high performance binaries on OSX
   and
   Linux? Or better, figure out a way for the system to fetch these
   automatically.
  
   - Evan
  
  
  
   On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
    alexander.ula...@hp.com wrote:
  
   Just to summarize this thread, I was finally able to make all
   performance
   comparisons that we discussed. It turns out that:
    BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
    netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
  
   Below is the link to the spreadsheet with full results.
  
  
  
 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
  
   One thing still needs exploration: does BIDMat-cublas perform
   copying
   to/from machine’s RAM?
  
    -----Original Message-----
   From: Ulanov, Alexander
   Sent: Tuesday, February 10, 2015 2:12 PM
   To: Evan R. Sparks
   Cc: Joseph Bradley;
    dev@spark.apache.org
   Subject: RE: Using CUDA within Spark / boosting linear algebra
  
    Thanks, Evan! It seems that the ticket was marked as a duplicate though
    the original one discusses a slightly different topic. I was able to link
    netlib
   with MKL from BIDMat binaries. Indeed, MKL is statically linked
   inside a
   60MB library.
  
   |A*B  size | BIDMat MKL | Breeze+Netlib-MKL

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Don't use big O estimates, always measure. It used to work back in the
days when double multiplication was a bottleneck. The computation cost is
effectively free on both the CPU and GPU and you're seeing pure copying
costs. Also, I'm dubious that cublas is doing what you think it is. Can you
link me to the source code for DGEMM?

I show all of this in my talk, with explanations, I can't stress enough how
much I recommend that you watch it if you want to understand high
performance hardware acceleration for linear algebra :-)
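Sam's point about pure copying costs can be sanity-checked with a back-of-the-envelope sketch in the REPL (the bandwidth and throughput figures below are assumed round numbers, not measurements):

```scala
// For C = A*B with dense doubles, PCIe traffic grows as n^2 while DGEMM
// work grows as n^3, so at n = 10000 the transfer is a small fraction.
val n = 10000L
val bytesMoved = 3 * 8 * n * n        // A and B to the device, C back
val flops = 2 * n * n * n             // multiply-adds in a dense GEMM
// Assumed: ~6 GB/s effective PCIe bandwidth, ~1 TFLOP/s sustained DGEMM.
val copySeconds = bytesMoved / 6e9    // ~0.4 s
val computeSeconds = flops / 1e12     // ~2.0 s
println(f"copy ~ $copySeconds%.2f s, compute ~ $computeSeconds%.2f s")
```

Under these assumptions the copies add well under a second to roughly two seconds of compute, consistent with the 2.2 s versus 1.7 s figures quoted below and nowhere near a 20x penalty.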
On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:

 The copying overhead should be quadratic on n, while the computation
 cost is cubic on n. I can understand that netlib-cublas is slower than
 netlib-openblas on small problems. But I'm surprised to see that it is
 still 20x slower on 10000x10000. I did the following on a g2.2xlarge
 instance with BIDMat:

 val n = 10000

 val f = rand(n, n)
 flip; f*f; val rf = flop

 flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

 flip; g*g; val rgg = flop

 The CPU version finished in 12 seconds.
 The CPU-GPU-CPU version finished in 2.2 seconds.
 The GPU version finished in 1.7 seconds.

 I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
 path. But based on the result, the data copying overhead is definitely
 not as big as 20x at n = 10000.

 Best,
 Xiangrui


 On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com
 wrote:
  I've had some email exchanges with the author of BIDMat: it does exactly
  what you need to get the GPU benefit and writes higher level algorithms
  entirely in the GPU kernels so that the memory stays there as long as
  possible. The restriction with this approach is that it is only offering
  high-level algorithms so is not a toolkit for applied mathematics
  research and development --- but it works well as a toolkit for higher
  level analysis (e.g. for analysts and practitioners).
 
  I believe BIDMat's approach is the best way to get performance out of
  GPU hardware at the moment but I also have strong evidence to suggest
  that the hardware will catch up and the memory transfer costs between
  CPU/GPU will disappear meaning that there will be no need for custom GPU
  kernel implementations. i.e. please continue to use BLAS primitives when
  writing new algorithms and only go to the GPU for an alternative
  optimised implementation.
 
  Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
  an API that looks like BLAS but takes pointers to special regions in the
  GPU memory region. Somebody has written a wrapper around CUDA to create
  a proper BLAS library but it only gives marginal performance over the
  CPU because of the memory transfer overhead.
 
  This slide from my talk
 
http://fommil.github.io/scalax14/#/11/2
 
  says it all. X axis is matrix size, Y axis is logarithmic time to do
  DGEMM. Black line is the cheating time for the GPU and the green line
  is after copying the memory to/from the GPU memory. APUs have the
  potential to eliminate the green line.
 
  Best regards,
  Sam
 
 
 
  Ulanov, Alexander alexander.ula...@hp.com writes:
 
  Evan, thank you for the summary. I would like to add some more
 observations. The GPU that I used is 2.5 times cheaper than the CPU ($250
 vs $100). They both are 3 years old. I also did a small test with modern
 hardware, and the new GPU nVidia Titan was slightly more than 1 order of
 magnitude faster than Intel E5-2650 v2 for the same tests. However, it
 costs as much as the CPU ($1200). My takeaway is that GPUs are making better
 price/value progress.
 
 
 
  Xiangrui, I was also surprised that BIDMat-cuda was faster than
 netlib-cuda and the most reasonable explanation is that it holds the result
 in GPU memory, as Sam suggested. At the same time, it is OK because you can
 copy the result back from GPU only when needed. However, to be sure, I am
 going to ask the developer of BIDMat at his upcoming talk.
 
 
 
  Best regards, Alexander
 
 
  From: Sam Halliday [mailto:sam.halli...@gmail.com]
  Sent: Thursday, February 26, 2015 1:56 PM
  To: Xiangrui Meng
  Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
 Sparks
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
 
  Btw, I wish people would stop cheating when comparing CPU and GPU
 timings for things like matrix multiply :-P
 
   Please always compare apples with apples and include the time it takes
  to set up the matrices, send them to the processing unit, do the
  calculation AND copy the results back to where you need to see them.
 
  Ignoring this method will make you believe that your GPU is thousands
 of times faster than it really is. Again, jump to the end of my talk for
  graphs and more discussion & especially the bit about me being keen on
 funding to investigate APU hardware further ;-) (I believe it will solve
 the problem)
  On 26 Feb 2015 21:16, Xiangrui Meng men

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Btw, I wish people would stop cheating when comparing CPU and GPU timings
for things like matrix multiply :-P

Please always compare apples with apples and include the time it takes to
set up the matrices, send them to the processing unit, do the calculation
AND copy the results back to where you need to see them.

Ignoring this method will make you believe that your GPU is thousands of
times faster than it really is. Again, jump to the end of my talk for
graphs and more discussion & especially the bit about me being keen on
funding to investigate APU hardware further ;-) (I believe it will solve
the problem)
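In REPL terms, a fair harness times the whole round trip as one unit. A minimal sketch (the `toDevice`/`fromDevice` helpers are hypothetical stand-ins that merely copy arrays; a real benchmark would put cuBLAS/OpenCL transfer calls there):

```scala
// Time a block and return its result plus elapsed seconds.
def time[A](body: => A): (A, Double) = {
  val t0 = System.nanoTime()
  val r = body
  (r, (System.nanoTime() - t0) / 1e9)
}

// Hypothetical transfer stand-ins: plain copies here, real host<->device
// transfers in an actual GPU benchmark.
def toDevice(a: Array[Double]): Array[Double] = a.clone()
def fromDevice(a: Array[Double]): Array[Double] = a.clone()

val host = Array.tabulate(1000 * 1000)(_.toDouble)
// The honest number includes transfer in, the kernel, and transfer out:
val (result, total) = time {
  val dev = toDevice(host)
  val out = dev.map(_ * 2.0)      // kernel stand-in
  fromDevice(out)
}
println(f"end-to-end: $total%.4f s")
```

Reporting only the kernel step would be the "cheating" number; `total` is the one to compare against a CPU baseline.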
On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com
 wrote:
  Better documentation for linking would be very helpful!  Here's a JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks, Alex.
 The
  big takeaways here can be seen with this chart:
 
 
  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
 
  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide substantial (but less than an order of
 magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
 worse
  than a well-tuned CPU implementation, particularly for larger matrices.
  (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
   basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs may
 not
  be practical - although we could consider having a good GPU backend
  available as an option. However, *ALL* users of MLlib could benefit
  (potentially tremendously) from using a well-tuned CPU-based BLAS
  implementation. Perhaps we should consider updating the mllib guide
 with a
  more complete section for enabling high performance binaries on OSX and
  Linux? Or better, figure out a way for the system to fetch these
  automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all
 performance
  comparisons that we discussed. It turns out that:
   BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
   netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
 
  Below is the link to the spreadsheet with full results.
 
 
 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 
  One thing still needs exploration: does BIDMat-cublas perform copying
  to/from machine’s RAM?
 
   -----Original Message-----
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
   Thanks, Evan! It seems that the ticket was marked as a duplicate though the
   original one discusses a slightly different topic. I was able to link
 netlib
  with MKL from BIDMat binaries. Indeed, MKL is statically linked inside
 a
  60MB library.
 
  |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
  Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
 
 +---+
  |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
   |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
   |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
 
   It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas
 on
  my machine. Probably, I’ll add two more columns with locally compiled
  openblas and cuda.
 
  Alexander
 
  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
  Great - perhaps we can move this discussion off-list and onto a JIRA
  ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
 
  It seems like this is going to be somewhat exploratory for a while (and
  there's probably only a handful of us who really care about fast linear
  algebra!)
 
  - Evan
 
  On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
   alexander.ula...@hp.com wrote:
  Hi Evan,
 
  Thank you for explanation and useful link

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks
I couldn't agree with you more, Sam. The GPU/Matrix guys typically don't
count their copy times, but claim that you should be doing *as much as
possible* on the GPU - so, maybe for some applications where you can
generate the data on the GPU this makes sense. But, in the context of Spark
we should be *very* careful about enumerating the applications we want GPU
support for and deciding whether it's appropriate to measure the overheads
of getting the data to the GPU.

On Thu, Feb 26, 2015 at 1:55 PM, Sam Halliday sam.halli...@gmail.com
wrote:

 Btw, I wish people would stop cheating when comparing CPU and GPU timings
 for things like matrix multiply :-P

  Please always compare apples with apples and include the time it takes to
  set up the matrices, send them to the processing unit, do the calculation
  AND copy the results back to where you need to see them.

 Ignoring this method will make you believe that your GPU is thousands of
 times faster than it really is. Again, jump to the end of my talk for
  graphs and more discussion & especially the bit about me being keen on
 funding to investigate APU hardware further ;-) (I believe it will solve
 the problem)
 On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com
 wrote:
  Better documentation for linking would be very helpful!  Here's a JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks, Alex.
 The
  big takeaways here can be seen with this chart:
 
 
  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
 
  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide substantial (but less than an order of
 magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
 worse
  than a well-tuned CPU implementation, particularly for larger matrices.
  (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
   basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs may
 not
  be practical - although we could consider having a good GPU backend
  available as an option. However, *ALL* users of MLlib could benefit
  (potentially tremendously) from using a well-tuned CPU-based BLAS
  implementation. Perhaps we should consider updating the mllib guide
 with a
  more complete section for enabling high performance binaries on OSX and
  Linux? Or better, figure out a way for the system to fetch these
  automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all
 performance
  comparisons that we discussed. It turns out that:
  BIDMat-cublasBIDMat
 
 MKL==netlib-mkl==netlib-openblas-compilednetlib-openblas-yum-repo==netlib-cublasnetlib-blasf2jblas
 
  Below is the link to the spreadsheet with full results.
 
 
 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 
  One thing still needs exploration: does BIDMat-cublas perform copying
  to/from machine’s RAM?
 
  -Original Message-
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
  Thanks, Evan! It seems that ticket was marked as duplicate though the
  original one discusses slightly different topic. I was able to link
 netlib
  with MKL from BIDMat binaries. Indeed, MKL is statically linked
 inside a
  60MB library.
 
  |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
  Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
 
 +---+
  |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
  |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
  |1,638475459 |
  |1x1*1x1 | 23,78046632 | 32,94546697 |445,0935211 |
  1569,233228 |
 
  It turn out that pre-compiled MKL is faster than precompiled OpenBlas
 on
  my machine. Probably, I’ll add two more columns with locally compiled
  openblas and cuda.
 
  Alexander
 
  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
Hey Alexander,

I don't quite understand the part where netlib-cublas is about 20x
slower than netlib-openblas. What is the overhead of using a GPU BLAS
with netlib-java?

CC'ed Sam, the author of netlib-java.

Best,
Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:
 Better documentation for linking would be very helpful!  Here's a JIRA:
 https://issues.apache.org/jira/browse/SPARK-6019


 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Thanks for compiling all the data and running these benchmarks, Alex. The
 big takeaways here can be seen with this chart:

 https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119format=interactive

 1) A properly configured GPU matrix multiply implementation (e.g.
 BIDMat+GPU) can provide substantial (but less than an order of magnitude)
 benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
 netlib-java+openblas-compiled).
 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
 than a well-tuned CPU implementation, particularly for larger matrices.
 (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
  basically agrees with the author's own benchmarks (
 https://github.com/fommil/netlib-java)

 I think that most of our users are in a situation where using GPUs may not
 be practical - although we could consider having a good GPU backend
 available as an option. However, *ALL* users of MLlib could benefit
 (potentially tremendously) from using a well-tuned CPU-based BLAS
 implementation. Perhaps we should consider updating the mllib guide with a
 more complete section for enabling high performance binaries on OSX and
 Linux? Or better, figure out a way for the system to fetch these
 automatically.

 - Evan



 On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

 Just to summarize this thread, I was finally able to make all performance
 comparisons that we discussed. It turns out that:
  BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform copying
 to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

  Thanks, Evan! It seems that ticket was marked as a duplicate, though the
  original one discusses a slightly different topic. I was able to link netlib
  with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
  60MB library.

  |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
  |100x100*100x100         | 0.00205596  | 0.000381    | 0.03810324  | 0.002556    |
  |1000x1000*1000x1000     | 0.018320947 | 0.038316857 | 0.51803557  | 1.638475459 |
  |10000x10000*10000x10000 | 23.78046632 | 32.94546697 | 445.0935211 | 1569.233228 |

  It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
 my machine. Probably, I’ll add two more columns with locally compiled
 openblas and cuda.

 Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Monday, February 09, 2015 6:06 PM
 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Great - perhaps we can move this discussion off-list and onto a JIRA
 ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

 It seems like this is going to be somewhat exploratory for a while (and
 there's probably only a handful of us who really care about fast linear
 algebra!)

 - Evan

 On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 Hi Evan,

  Thank you for the explanation and the useful link. I am going to build OpenBLAS,
  link it with Netlib-java, and run the benchmarks again.

  Do I understand correctly that the BIDMat binaries contain statically linked
  Intel MKL BLAS? That might be why I am able to run BIDMat without having
  MKL BLAS installed on my server. If so, I wonder whether that is OK, given
  that Intel sells this library. Nevertheless, it seems that in my case the
  precompiled MKL BLAS performs better than the precompiled OpenBLAS, given that
  BIDMat and Netlib-java are supposed to be on par in JNI overhead.

  Though, it might be interesting to link Netlib-java with Intel MKL, as
  you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
  (Netlib-java) would be interested in comparing their libraries.

 Best regards, Alexander

 From: Evan R. Sparks [mailto:evan.spa

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Hi all,

I'm not surprised that the GPU is slow: the bottleneck is copying the
memory between host and device. Watch my talk, linked from the netlib-java
github page, to understand further. The only way to make good use of a GPU
at present is to do all the operations inside GPU kernels. You can find some
prepackaged high-level algorithms that do this, but it's extremely limiting.

I believe hardware will fix this problem eventually, so I still advocate
using the netlib primitives. I'm particularly interested in APU approaches
and I'm very interested in finding somebody to fund me to look into it.
It's too much work for a side project.

Look on the last few slides of my talk to see the potential performance
gains.

Best regards, Sam
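
Sam's apples-to-apples point can be sketched as a timing harness that measures the whole round trip, not just the kernel. This is a sketch, not anything from the thread: the "transfers" below are simulated with host-side array copies standing in for cudaMemcpy-style moves, and the kernel is a trivial daxpy.

```scala
// Time a whole pipeline, not just the kernel.
def timeSec[A](body: => A): (A, Double) = {
  val t0 = System.nanoTime()
  val r = body
  (r, (System.nanoTime() - t0) / 1e9)
}

// A trivial "kernel": y := a*x + y (daxpy), mutating and returning y.
def daxpy(a: Double, x: Array[Double], y: Array[Double]): Array[Double] = {
  var i = 0
  while (i < x.length) { y(i) += a * x(i); i += 1 }
  y
}

val n = 1 << 20
val x = Array.fill(n)(1.0)
val y = Array.fill(n)(2.0)

// "Cheating" timing: kernel only, data already in place.
val (_, kernelOnly) = timeSec(daxpy(3.0, x, y))

// Honest timing: copy in, compute, copy the result back out.
val (_, endToEnd) = timeSec {
  val dx = x.clone(); val dy = y.clone() // host -> device (simulated)
  val dz = daxpy(3.0, dx, dy)            // kernel
  dz.clone()                             // device -> host (simulated)
}
println(f"kernel-only: $kernelOnly%.4f s, end-to-end: $endToEnd%.4f s")
```

For a memory-bound kernel like daxpy, the copies typically cost more than the compute, which is exactly why element-wise GPU offload rarely pays off.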

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Typo - the CPU was 2.5 times cheaper (not the GPU!)


RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
I've had some email exchanges with the author of BIDMat: it does exactly
what you need to get the GPU benefit and writes higher level algorithms
entirely in the GPU kernels so that the memory stays there as long as
possible. The restriction of this approach is that it only offers
high-level algorithms, so it is not a toolkit for applied-mathematics
research and development --- but it works well as a toolkit for higher-level
analysis (e.g. for analysts and practitioners).

I believe BIDMat's approach is the best way to get performance out of
GPU hardware at the moment, but I also have strong evidence to suggest
that the hardware will catch up and the memory-transfer costs between
CPU and GPU will disappear, meaning there will be no need for custom GPU
kernel implementations. I.e. please continue to use BLAS primitives when
writing new algorithms, and only go to the GPU for an alternative
optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
an API that looks like BLAS but takes pointers to special regions in the
GPU memory region. Somebody has written a wrapper around CUDA to create
a proper BLAS library but it only gives marginal performance over the
CPU because of the memory transfer overhead.

This slide from my talk

  http://fommil.github.io/scalax14/#/11/2

says it all. The X axis is matrix size, the Y axis is (logarithmic) time to
do DGEMM. The black line is the "cheating" time for the GPU and the green
line is after copying the memory to/from the GPU. APUs have the
potential to eliminate the green line.
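
The shape of those two lines follows from a back-of-the-envelope cost model. The bandwidth and throughput figures below are illustrative assumptions for this sketch, not measurements from the thread:

```scala
// Illustrative, assumed figures: ~6 GB/s effective host<->device
// bandwidth and ~1 TFLOP/s double-precision GPU DGEMM throughput.
val busBytesPerSec = 6e9
val gpuFlopsPerSec = 1e12

// n x n DGEMM costs 2n^3 flops; shipping A and B in and C back out
// moves 3 * n^2 * 8 bytes across the bus.
def computeSec(n: Long): Double = 2.0 * n * n * n / gpuFlopsPerSec
def copySec(n: Long): Double = 3.0 * 8.0 * n * n / busBytesPerSec

for (n <- Seq(100L, 1000L, 10000L)) {
  val total = copySec(n) + computeSec(n)
  println(f"n=$n%6d  kernel-only=${computeSec(n)}%.2e s  with-copies=$total%.2e s")
}
// For small n the copy dominates (the gap between the black and green
// lines); for large n the O(n^3) kernel amortises the O(n^2) transfer.
```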

Best regards,
Sam


RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Evan, thank you for the summary. I would like to add some more observations. 
The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both 
are 3 years old. I also did a small test with modern hardware, and the new 
GPU, nVidia Titan, was slightly more than 1 order of magnitude faster than an 
Intel E5-2650 v2 for the same tests. However, it costs as much as the CPU 
($1200). My takeaway is that GPUs are making better price/performance progress.



Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and 
the most reasonable explanation is that it holds the result in GPU memory, as 
Sam suggested. At the same time, that is OK, because you can copy the result 
back from the GPU only when needed. To be sure, though, I am going to ask the 
developer of BIDMat at his upcoming talk.



Best regards, Alexander


From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, February 26, 2015 1:56 PM
To: Xiangrui Meng
Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra


Btw, I wish people would stop cheating when comparing CPU and GPU timings for 
things like matrix multiply :-P

Please always compare apples with apples and include the time it takes to set 
up the matrices, send it to the processing unit, doing the calculation AND 
copying it back to where you need to see the results.

Ignoring this method will make you believe that your GPU is thousands of times 
faster than it really is. Again, jump to the end of my talk for graphs and more 
discussion -- especially the bit about me being keen on funding to 
investigate APU hardware further ;-) (I believe it will solve the problem)

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
The copying overhead should be quadratic in n, while the computation
cost is cubic in n. I can understand that netlib-cublas is slower than
netlib-openblas on small problems. But I'm surprised to see that it is
still 20x slower on 10000x10000. I did the following on a g2.2xlarge
instance with BIDMat:

val n = 10000

// Pure-CPU multiply (flip/flop are BIDMat's timing helpers).
val f = rand(n, n)
flip; f*f; val rf = flop

// CPU -> GPU copy, multiply on the GPU, copy the result back.
flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

// GPU multiply only, with the data already resident on the device.
flip; g*g; val rgg = flop

The CPU version finished in 12 seconds.
The CPU-GPU-CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
path. But based on the result, the data copying overhead is definitely
not as big as 20x at n = 10000.
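
The quadratic-vs-cubic argument can be made concrete with simple arithmetic (a sketch; the actual bus traffic of netlib-cublas depends on how the wrapper stages its copies):

```scala
// For n x n DGEMM, bytes over the bus grow as O(n^2) while flops
// grow as O(n^3), so bytes-per-flop falls off like 12/n.
def bytesMoved(n: Long): Double = 3.0 * 8.0 * n * n // A, B in; C out
def flops(n: Long): Double = 2.0 * n * n * n
def bytesPerFlop(n: Long): Double = bytesMoved(n) / flops(n)

for (n <- Seq(100L, 1000L, 10000L))
  println(f"n=$n%6d  bytes/flop=${bytesPerFlop(n)}%.4f")
// At n = 10000 only ~0.0012 bytes cross the bus per flop, so a
// fixed-bandwidth transfer alone cannot explain a 20x slowdown.
```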

Best,
Xiangrui



Re: Using CUDA within Spark / boosting linear algebra

2015-02-25 Thread Evan R. Sparks
Thanks for compiling all the data and running these benchmarks, Alex. The
big takeaways here can be seen with this chart:
https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g.
BIDMat+GPU) can provide substantial (but less than an order of magnitude)
benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
netlib-java+openblas-compiled).
2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can be 1-2
orders of magnitude worse than a well-tuned CPU implementation, particularly
for larger matrices. This is not to pick on netlib - this basically agrees
with the author's own benchmarks (https://github.com/fommil/netlib-java)

I think that most of our users are in a situation where using GPUs may not
be practical - although we could consider having a good GPU backend
available as an option. However, *ALL* users of MLlib could benefit
(potentially tremendously) from using a well-tuned CPU-based BLAS
implementation. Perhaps we should consider updating the mllib guide with a
more complete section for enabling high performance binaries on OSX and
Linux? Or better, figure out a way for the system to fetch these
automatically.

- Evan
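
The gap in point 2 between poorly and well tuned CPU implementations comes largely from memory locality. Here is a hedged illustration in plain Java (this is *not* how OpenBLAS is implemented, which adds blocking, SIMD and threading on top): merely swapping the loop order of a naive multiply from cache-hostile to cache-friendly typically changes performance noticeably while computing the same result. Single-shot timing with JIT warmup ignored; illustrative only.

```java
// Same O(n^3) matrix multiply, two loop orders, very different locality.
public class Gemm {
    // Naive i-j-k order: the inner loop reads b with stride n (cache-hostile).
    static double[] multIJK(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int k = 0; k < n; k++) s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
        return c;
    }
    // i-k-j order: all inner-loop accesses are sequential (cache-friendly).
    static double[] multIKJ(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) c[i * n + j] += aik * b[k * n + j];
            }
        return c;
    }
    public static void main(String[] args) {
        int n = 512;
        double[] a = new double[n * n], b = new double[n * n];
        for (int i = 0; i < n * n; i++) { a[i] = i % 5; b[i] = i % 7; }
        long t0 = System.nanoTime(); double[] c1 = multIJK(a, b, n);
        long t1 = System.nanoTime(); double[] c2 = multIKJ(a, b, n);
        long t2 = System.nanoTime();
        if (c1[0] != c2[0]) throw new AssertionError("results differ");
        System.out.printf("ijk: %.3f s, ikj: %.3f s%n",
                (t1 - t0) / 1e9, (t2 - t1) / 1e9);
    }
}
```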



On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Just to summarize this thread, I was finally able to make all the performance
 comparisons that we discussed. It turns out that:
 BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas >> f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform copying
 to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Thanks, Evan! It seems that ticket was marked as a duplicate, though the
 original one discusses a slightly different topic. I was able to link netlib
 with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
 60MB library.

 |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
 |100x100*100x100         | 0,00205596  | 0,000381    | 0,03810324  | 0,002556    |
 |1000x1000*1000x1000     | 0,018320947 | 0,038316857 | 0,51803557  | 1,638475459 |
 |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |

 It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on
 my machine. Probably I’ll add two more columns with locally compiled
 openblas and cuda.

 Alexander
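
The times in the table above can be converted into throughput, since a dense n x n multiply costs about 2n^3 floating point operations. A small sketch (the times are copied from the table with decimal commas read as dots, and the largest size is assumed to be 10000x10000; both assumptions should be checked against the spreadsheet):

```java
// Convert measured GEMM times into GFLOPS using the ~2*n^3 flop count.
public class Gflops {
    static double gflops(int n, double seconds) {
        return 2.0 * n * n * n / seconds / 1e9;
    }
    public static void main(String[] args) {
        System.out.printf("BIDMat MKL, n=1000:  %.1f GFLOPS%n", gflops(1000, 0.018320947));
        System.out.printf("f2jblas,    n=1000:  %.1f GFLOPS%n", gflops(1000, 1.638475459));
        System.out.printf("BIDMat MKL, n=10000: %.1f GFLOPS%n", gflops(10000, 23.78046632));
    }
}
```

The two-orders-of-magnitude spread between MKL and f2jblas at n=1000 is visible directly in these numbers.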

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Monday, February 09, 2015 6:06 PM
 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Great - perhaps we can move this discussion off-list and onto a JIRA
 ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

 It seems like this is going to be somewhat exploratory for a while (and
 there's probably only a handful of us who really care about fast linear
 algebra!)

 - Evan

 On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
 Hi Evan,

 Thank you for explanation and useful link. I am going to build OpenBLAS,
 link it with Netlib-java and perform benchmark again.

 Do I understand correctly that BIDMat binaries contain statically linked
 Intel MKL BLAS? That might be the reason why I am able to run BIDMat without
 MKL BLAS installed on my server. If so, I wonder whether it is OK, because
 Intel sells this library. Nevertheless, it seems that in my case precompiled
 MKL BLAS performs better than precompiled OpenBLAS, given that BIDMat and
 Netlib-java are supposed to be on par in JNI overheads.

 Though, it might be interesting to link Netlib-java with Intel MKL, as you
 suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
 (Netlib-java) would be interested in comparing their libraries.

 Best regards, Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Friday, February 06, 2015 5:58 PM

 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I would build OpenBLAS yourself, since good BLAS performance comes from
 getting cache sizes, etc. set up correctly for your particular hardware -
 this is often a very tricky process (see, e.g., ATLAS), but we found that on
 relatively modern Xeon chips, OpenBLAS builds quickly and yields performance
 competitive with MKL.

RE: Using CUDA within Spark / boosting linear algebra

2015-02-12 Thread Ulanov, Alexander
From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:58 PM

To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I would build OpenBLAS yourself, since good BLAS performance comes from getting 
cache sizes, etc. set up correctly for your particular hardware - this is often 
a very tricky process (see, e.g. ATLAS), but we found that on relatively modern 
Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's 
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will 
do the trick here.

For some examples of getting netlib-java setup on an ec2 node and some example 
benchmarking code we ran a while back, see: 
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library and 
set up symlinks correctly, and scala/run-netlib.sh shows you how to get the 
path setup and get that library picked up by netlib-java.

In this way - you could probably get cuBLAS set up to be used by netlib-java as 
well.

- Evan
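
When checking that a self-built `libopenblas.so` really is picked up first, it can help to print the search paths the JVM actually sees. A small generic sketch (not specific to netlib-java; the paths on any given machine will of course differ):

```java
// Print the native-library search paths the JVM will use, to debug which
// BLAS shared library gets loaded.
public class BlasPath {
    public static void main(String[] args) {
        String jlp = System.getProperty("java.library.path", "");
        String ldp = System.getenv("LD_LIBRARY_PATH");
        System.out.println("java.library.path entries:");
        for (String dir : jlp.split(java.io.File.pathSeparator))
            System.out.println("  " + dir);
        System.out.println("LD_LIBRARY_PATH = " + (ldp == null ? "(unset)" : ldp));
    }
}
```

If the directory containing the tuned BLAS is not first in these lists, a system-wide reference BLAS may silently win.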

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Evan, could you elaborate on how to force BIDMat and netlib-java to load the 
right blas? For netlib, there are a few JVM flags, such as 
-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can 
force it to use the Java implementation. I am not sure how to force the use of 
a specific blas (not a specific wrapper for blas).

Btw. I have installed openblas (yum install openblas), so I suppose that netlib 
is using it.
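
The `-D` flag works because netlib-java selects its implementation class by name at runtime. A self-contained sketch of that general mechanism follows; the interface, class, and property name (`demo.blas`) here are illustrative stand-ins, not netlib-java's real internals:

```java
// How a -D system property can select an implementation class at runtime,
// mimicking -Dcom.github.fommil.netlib.BLAS=...F2jBLAS.
public class BlasSelector {
    public interface Blas { double ddot(double[] x, double[] y); }

    // Pure-Java fallback implementation of a dot product.
    public static class JavaBlas implements Blas {
        public double ddot(double[] x, double[] y) {
            double s = 0;
            for (int i = 0; i < x.length; i++) s += x[i] * y[i];
            return s;
        }
    }

    // Load the class named by -Ddemo.blas=..., falling back to JavaBlas.
    public static Blas getInstance() {
        String cls = System.getProperty("demo.blas", JavaBlas.class.getName());
        try {
            return (Blas) Class.forName(cls).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            System.err.println("falling back to pure-Java BLAS: " + e);
            return new JavaBlas();
        }
    }

    public static void main(String[] args) {
        Blas blas = getInstance(); // run with -Ddemo.blas=<impl class> to override
        System.out.println(blas.ddot(new double[]{1, 2, 3}, new double[]{4, 5, 6})); // 32.0
    }
}
```

This is why the flag controls only the *wrapper* class: which native `.so` that wrapper then binds to is decided separately by the library search path.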

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org

RE: Using CUDA within Spark / boosting linear algebra

2015-02-10 Thread Ulanov, Alexander
From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org

Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right blas library is critical for performance. I 
recommend using OpenBLAS (or MKL, if you already have it). It might make sense 
to force BIDMat to use the same underlying BLAS library as well.

Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Evan R. Sparks
 On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

 Hi Evan, Joseph

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster 
than netlib-java+breeze (sorry for the weird table formatting):

 |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
 |100x100*100x100         | 0,00205596  | 0,03810324  | 0,002556    |
 |1000x1000*1000x1000     | 0,018320947 | 0,51803557  | 1,638475459 |
 |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will make tests with Cuda. I need to install new Cuda version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Chester @work
Maybe you can ask Prof. John Canny himself :-) as I invited him to give a talk 
at Alpine Data Labs at March's meetup (SF Big Analytics & SF Machine Learning 
joint meetup), 3/11. To be announced in the next day or so.

Chester

Sent from my iPhone


RE: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Ulanov, Alexander
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting.  Small comment: Concerning your 
question earlier about keeping data stored on the GPU rather than having to 
move it between main memory and GPU memory on each iteration, I would guess 
this would be critical to getting good performance.  If you could do multiple 
local iterations before aggregating results, then the cost of data movement to 
the GPU could be amortized (and I believe that is done in practice).  Having 
Spark be aware of the GPU and using it as another part of memory sounds like a 
much bigger undertaking.

Joseph
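
Joseph's amortization argument can be put in numbers: with one host-to-GPU transfer per aggregation step and k local iterations, the effective per-iteration cost is (transfer + k * compute) / k, which tends to the pure compute cost as k grows. A sketch with purely illustrative (assumed, not measured) costs:

```java
// Back-of-envelope amortization of a host->GPU transfer over k local
// iterations. The cost numbers are illustrative assumptions, not measurements.
public class Amortize {
    // Effective seconds per iteration: one transfer amortized over k iterations.
    static double perIter(double transfer, double compute, int k) {
        return (transfer + k * compute) / k;
    }
    public static void main(String[] args) {
        double transfer = 0.10, compute = 0.02; // assumed costs in seconds
        for (int k : new int[]{1, 5, 25}) {
            System.out.printf("k=%d: %.4f s/iteration%n",
                    k, perIter(transfer, compute, k));
        }
        // k=1 -> 0.12, k=5 -> 0.04, k=25 -> 0.024: approaching compute (0.02).
    }
}
```

With these assumed costs, even 5 local iterations cut the transfer overhead per iteration by 5x, which is the point of batching work before aggregating.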


Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
I would build OpenBLAS yourself, since good BLAS performance comes from
getting cache sizes, etc. set up correctly for your particular hardware -
this is often a very tricky process (see, e.g. ATLAS), but we found that on
relatively modern Xeon chips, OpenBLAS builds quickly and yields
performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so
will do the trick here.

For some examples of getting netlib-java setup on an ec2 node and some
example benchmarking code we ran a while back, see:
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library
and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
the path setup and get that library picked up by netlib-java.

In this way - you could probably get cuBLAS set up to be used by
netlib-java as well.

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Evan, could you elaborate on how to force BIDMat and netlib-java to
 force loading the right blas? For netlib, I there are few JVM flags, such
 as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
 can force it to use Java implementation. Not sure I understand how to force
 use a specific blas (not specific wrapper for blas).



 Btw. I have installed openblas (yum install openblas), so I suppose that
 netlib is using it.



 *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
 *Sent:* Friday, February 06, 2015 5:19 PM
 *To:* Ulanov, Alexander
 *Cc:* Joseph Bradley; dev@spark.apache.org

 *Subject:* Re: Using CUDA within Spark / boosting linear algebra



 Getting breeze to pick up the right blas library is critical for
 performance. I recommend using OpenBLAS (or MKL, if you already have it).
 It might make sense to force BIDMat to use the same underlying BLAS library
 as well.



 On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

 Hi Evan, Joseph

 I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
 than netlib-java+breeze (sorry for the weird table formatting):

 |A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
 |---------------------|-------------|------------------------------------------------|----------------------------|
 |100x100*100x100     | 0,00205596  | 0,03810324   | 0,002556    |
 |1000x1000*1000x1000 | 0,018320947 | 0,51803557   | 1,638475459 |
 |1x1*1x1             | 23,78046632 | 445,0935211  | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will run tests with CUDA. I need to install a new CUDA version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander

 From: Joseph Bradley [mailto:jos...@databricks.com]
 Sent: Thursday, February 05, 2015 5:29 PM
 To: Ulanov, Alexander
 Cc: Evan R. Sparks; dev@spark.apache.org

 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Hi Alexander,

 Using GPUs with Spark would be very exciting.  Small comment: Concerning
 your question earlier about keeping data stored on the GPU rather than
 having to move it between main memory and GPU memory on each iteration, I
 would guess this would be critical to getting good performance.  If you
 could do multiple local iterations before aggregating results, then the
 cost of data movement to the GPU could be amortized (and I believe that is
 done in practice).  Having Spark be aware of the GPU and using it as
 another part of memory sounds like a much bigger undertaking.

 Joseph
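Joseph's amortization point can be made concrete with a toy cost model (a sketch; the millisecond figures below are hypothetical placeholders, not measurements): if k local iterations share one host-to-GPU round trip, the per-iteration cost approaches the pure compute cost as k grows.

```scala
// Toy model of amortizing host<->GPU transfer cost over k local iterations.
object TransferAmortization {
  /** Average cost of one iteration when k iterations share one transfer. */
  def perIterationCost(computeMs: Double, transferMs: Double, k: Int): Double =
    computeMs + transferMs / k

  def main(args: Array[String]): Unit = {
    val c = 5.0   // hypothetical GPU compute time per iteration, ms
    val t = 20.0  // hypothetical transfer time per aggregation, ms
    for (k <- Seq(1, 4, 16))
      println(f"k=$k%2d -> ${perIterationCost(c, t, k)}%.2f ms/iteration")
  }
}
```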

 On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Thank you for the explanation! I’ve watched the BIDMach presentation by John
 Canny and I am really inspired by his talk and comparisons with Spark MLlib.

 I am very interested to find out which will be better within Spark: BIDMat
 or netlib-java with CPU or GPU natives. Could you suggest a fair way to
 benchmark them? Currently I do benchmarks on artificial neural networks in
 batch mode. While it is not a “pure” test of linear algebra, it involves
 some other things that are essential to machine learning.

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 1:29 PM
 To: Ulanov, Alexander
 Cc: dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I'd be surprised if BIDMat+OpenBLAS was significantly faster than
 netlib-java+OpenBLAS, but if it is much faster it's probably due to data
 layout and fewer levels of indirection - it's definitely a worthwhile
 experiment to run. The main speedups I've seen from using it come from
 highly optimized GPU code for linear algebra. I know that in the past Canny
 has gone as far as to write custom GPU kernels for performance-critical
 regions of code.[1]

 BIDMach is highly

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
Getting breeze to pick up the right blas library is critical for
performance. I recommend using OpenBLAS (or MKL, if you already have it).
It might make sense to force BIDMat to use the same underlying BLAS library
as well.

RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
Evan, could you elaborate on how to force BIDMat and netlib-java to load the 
right blas? For netlib, there are a few JVM flags, such as 
-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can 
force it to use the Java implementation. I'm not sure I understand how to force 
the use of a specific blas (not a specific wrapper for blas).

Btw. I have installed openblas (yum install openblas), so I suppose that netlib 
is using it.


RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
Hi Evan, Joseph

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than 
netlib-java+breeze (sorry for the weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|---------------------|-------------|------------------------------------------------|----------------------------|
|100x100*100x100     | 0,00205596  | 0,03810324   | 0,002556    |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557   | 1,638475459 |
|1x1*1x1             | 23,78046632 | 445,0935211  | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, 
Scala 2.11.

Later I will run tests with CUDA. I need to install a new CUDA version for this 
purpose.

Do you have any ideas why breeze-netlib with native blas is so much slower than 
BIDMat MKL?

Best regards, Alexander
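For context, timings like those in the table convert to throughput via the 2n^3 flops of a square GEMM. Here is a derived back-of-the-envelope calculation for the 1000x1000 row (assuming the reported times are seconds with comma decimal separators):

```scala
// Convert GEMM wall-clock times to GFLOPS: 2*n^3 flops per n x n multiply.
object Gflops {
  def gflops(n: Long, seconds: Double): Double =
    2.0 * n * n * n / seconds / 1e9

  def main(args: Array[String]): Unit = {
    // 1000x1000 timings from the table above, commas read as decimal points.
    val timings = Seq(
      "BIDMat MKL"            -> 0.018320947,
      "Breeze+Netlib native"  -> 0.51803557,
      "Breeze+Netlib f2jblas" -> 1.638475459
    )
    for ((name, t) <- timings)
      println(f"$name%-22s ${gflops(1000, t)}%8.2f GFLOPS")
  }
}
```

On those numbers MKL works out to roughly 110 GFLOPS versus about 4 for the native netlib path on that row, so the ~10x overall gap is plausible.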


Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Nicholas Chammas
Lemme butt in randomly here and say there is an interesting discussion on
this Spark PR https://github.com/apache/spark/pull/4448 about
netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all
may find interesting. Among the participants is the author of netlib-java.


RE: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
Hi Evan,

Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know 
what makes it faster than netlib-java?

The same group has the BIDMach library, which implements machine learning. For 
some examples they use the Caffe convolutional neural network library, developed 
by another group at Berkeley. Could you elaborate on how these all might be 
connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t 
you take BIDMach for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many 
cases.

You might consider taking a look at the codepaths that BIDMat 
(https://github.com/BIDData/BIDMat) takes and comparing them to 
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to 
make this work really fast from Scala. I've run it on my laptop and compared to 
MKL and in certain cases it's 10x faster at matrix multiply. There are a lot of 
layers of indirection here and you really want to avoid data copying as much as 
possible.

We could also consider swapping Breeze out for BIDMat, but that would be a big 
project, and if we can figure out how to get breeze+cublas to comparable 
performance that would be a big win.

On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One 
way of doing this is to use Scala Breeze library that is bundled with Spark. 
For matrix operations, it employs Netlib-java that has a Java wrapper for BLAS 
(basic linear algebra subprograms) and LAPACK native binaries if they are 
available on the worker node. It also has its own optimized Java implementation 
of BLAS. It is worth mentioning that native binaries provide better 
performance only for BLAS level 3, i.e. matrix-matrix operations or general 
matrix multiplication (GEMM). This is confirmed by GEMM test on Netlib-java 
page https://github.com/fommil/netlib-java. I also confirmed it with my 
experiments with training of artificial neural network 
https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I 
would like to boost performance more.

GPU is supposed to work fast with linear algebra and there is Nvidia CUDA 
implementation of BLAS, called cublas. I have one Linux server with Nvidia GPU 
and I was able to do the following. I linked cublas (instead of cpu-based blas) 
with Netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it. 
Then I did some performance measurements with regards to artificial neural 
network batch learning in Spark MLlib that involves matrix-matrix 
multiplications. It turns out that for matrices of size less than ~1000x780 GPU 
cublas has the same speed as CPU blas. Cublas becomes slower for bigger 
matrices. It is worth mentioning that this was not a test of ONLY multiplication, 
since there are other operations involved. One of the reasons for the slowdown 
might be the overhead of copying the matrices from main memory to graphics 
card memory and back.

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is copy overhead, are there any libraries that allow 
intermediate results to stay in graphics card memory, thus removing the 
overhead?
3) Any other options to speed up linear algebra in Spark?

Thank you, Alexander
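On question 2, a rough sketch of the underlying arithmetic (the bandwidth and throughput numbers below are illustrative assumptions, not measurements): a square GEMM moves O(n^2) data over PCIe but performs 2n^3 flops on the device, so the transfer share of the total time shrinks roughly as 1/n, consistent with the crossover around ~1000x780 reported above.

```scala
// Rough transfer-vs-compute estimate for an n x n double GEMM offload.
object CopyOverhead {
  /** Host<->device transfer time: two inputs and one result, 8-byte doubles. */
  def transferSec(n: Long, gbPerSec: Double): Double =
    3.0 * n * n * 8 / (gbPerSec * 1e9)

  /** Kernel time for one GEMM at a given sustained throughput. */
  def computeSec(n: Long, gflops: Double): Double =
    2.0 * n * n * n / (gflops * 1e9)

  def main(args: Array[String]): Unit = {
    // Hypothetical figures: 6 GB/s effective PCIe bandwidth, 300 GFLOPS DGEMM.
    for (n <- Seq(500L, 1000L, 4000L)) {
      val ratio = transferSec(n, 6.0) / computeSec(n, 300.0)
      println(f"n=$n%5d: transfer/compute = $ratio%.2f")
    }
  }
}
```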




Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data
layout and fewer levels of indirection - it's definitely a worthwhile
experiment to run. The main speedups I've seen from using it come from
highly optimized GPU code for linear algebra. I know that in the past Canny
has gone as far as to write custom GPU kernels for performance-critical
regions of code.[1]

BIDMach is highly optimized for single node performance or performance on
small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
batched in that way) the performance tends to fall off. Canny argues for
hardware/software codesign and as such prefers machine configurations that
are quite different than what we find in most commodity cluster nodes -
e.g. 10 disk channels and 4 GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address
slightly different use cases. That said, there may be bits of BIDMach we
could repurpose for MLlib - keep in mind we need to be careful about
maintaining cross-language compatibility for our Java and Python-users,
though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

RE: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny 
and I am really inspired by his talk and the comparisons with Spark MLlib.

I am very interested to find out what will be better within Spark: BIDMat or 
netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark 
them? Currently I do benchmarks on artificial neural networks in batch mode. 
While it is not a “pure” test of linear algebra, it involves some other things 
that are essential to machine learning.
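For a comparison like this to be fair on the JVM, warm-up and repetition matter (JIT compilation and GC can easily skew a single timed run). A minimal harness sketch follows; the naive triple-loop GEMM is only a stand-in for whichever BLAS backend is under test, and all names here are illustrative, not from any of the libraries discussed:

```scala
// Hypothetical micro-benchmark harness: discards warm-up runs so JIT
// compilation does not skew results, then reports the median of timed runs.
object GemmBench {
  // Naive triple-loop GEMM (C += A * B) over flat row-major n x n arrays,
  // standing in for the backend being benchmarked.
  def gemm(n: Int, a: Array[Double], b: Array[Double], c: Array[Double]): Unit = {
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) {
          c(i * n + j) += aik * b(k * n + j)
          j += 1
        }
        k += 1
      }
      i += 1
    }
  }

  // Median wall-clock time in ms over `reps` timed runs after `warmup` runs.
  def time(reps: Int, warmup: Int)(body: => Unit): Double = {
    (0 until warmup).foreach(_ => body)
    val samples = (0 until reps).map { _ =>
      val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e6
    }.sorted
    samples(reps / 2)
  }

  def main(args: Array[String]): Unit = {
    val n = 256
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    val c = new Array[Double](n * n)
    val ms = time(reps = 5, warmup = 2) {
      java.util.Arrays.fill(c, 0.0); gemm(n, a, b, c)
    }
    println(f"naive GEMM, n=$n: $ms%.1f ms")
  }
}
```

The same `time` wrapper can then be applied to each backend (netlib-java, BIDMat, GPU-backed BLAS) on identical inputs, so only the kernel under test varies between runs.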

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than 
netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout 
and fewer levels of indirection - it's definitely a worthwhile experiment to 
run. The main speedups I've seen from using it come from highly optimized GPU 
code for linear algebra. I know that in the past Canny has gone as far as to 
write custom GPU kernels for performance-critical regions of code.[1]

BIDMach is highly optimized for single node performance or performance on small 
clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in 
that way) the performance tends to fall off. Canny argues for hardware/software 
codesign and as such prefers machine configurations that are quite different 
than what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 
GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity 
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address slightly 
different use cases. That said, there may be bits of BIDMach we could repurpose 
for MLlib - keep in mind we need to be careful about maintaining cross-language 
compatibility for our Java and Python users, though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Hi Evan,

Thank you for suggestion! BIDMat seems to have terrific speed. Do you know what 
makes them faster than netlib-java?

The same group has BIDMach library that implements machine learning. For some 
examples they use Caffe convolutional neural network library owned by another 
group in Berkeley. Could you elaborate on how these all might be connected with 
Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach 
for optimization and learning?

Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Joseph Bradley
Hi Alexander,

Using GPUs with Spark would be very exciting. A small comment: concerning
your question earlier about keeping data stored on the GPU rather than
having to move it between main memory and GPU memory on each iteration, I
would guess this would be critical to getting good performance.  If you
could do multiple local iterations before aggregating results, then the
cost of data movement to the GPU could be amortized (and I believe that is
done in practice).  Having Spark be aware of the GPU and using it as
another part of memory sounds like a much bigger undertaking.

Joseph
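
The amortization argument above can be sketched numerically. The sketch below is only a back-of-the-envelope model; the transfer and kernel times are hypothetical constants, not measurements from this thread:

```scala
// Model of amortizing GPU transfer cost: if each aggregation round requires
// one host->device->host round trip, doing k local iterations per round
// spreads that fixed copy cost over k iterations.
object AmortizeTransfer {
  // Total time for one round: one round-trip copy plus k kernel executions.
  def roundMs(transferMs: Double, kernelMs: Double, k: Int): Double =
    transferMs + k * kernelMs

  // Effective cost per iteration; approaches kernelMs as k grows.
  def perIterMs(transferMs: Double, kernelMs: Double, k: Int): Double =
    roundMs(transferMs, kernelMs, k) / k

  def main(args: Array[String]): Unit = {
    val (transfer, kernel) = (8.0, 2.0) // hypothetical: 8 ms copy, 2 ms kernel
    for (k <- Seq(1, 4, 16, 64))
      println(f"k=$k%3d local iterations -> ${perIterMs(transfer, kernel, k)}%.2f ms/iter")
  }
}
```

With these assumed constants, a single iteration per copy pays the full transfer price every time, while 64 local iterations push the per-iteration cost close to the kernel time alone, which is the intuition behind doing multiple local iterations before aggregating.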


Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
many cases.

You might consider taking a look at the codepaths that BIDMat (
https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
to make this work really fast from Scala. I've run it on my laptop and
compared to MKL and in certain cases it's 10x faster at matrix multiply.
There are a lot of layers of indirection here and you really want to avoid
data copying as much as possible.

We could also consider swapping out BIDMat for Breeze, but that would be a
big project and if we can figure out how to get breeze+cublas to comparable
performance that would be a big win.
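
The data-layout and indirection point can be illustrated in plain Scala: a nested `Array[Array[Double]]` GEMM chases a pointer per row access, while a flat row-major array with an i-k-j loop order streams memory sequentially. This is only an illustration of the general idea, not BIDMat's actual implementation:

```scala
// Two layouts for the same GEMM. The nested version indirects through one
// row pointer per access and walks b column-wise (cache-unfriendly); the
// flat version uses one contiguous row-major array and an i-k-j loop order
// so the inner loop streams sequentially through b and c.
object Layout {
  def gemmNested(n: Int, a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val c = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
      c(i)(j) += a(i)(k) * b(k)(j) // b(k)(j): strided column walk
    c
  }

  def gemmFlat(n: Int, a: Array[Double], b: Array[Double]): Array[Double] = {
    val c = new Array[Double](n * n)
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) { c(i * n + j) += aik * b(k * n + j); j += 1 } // sequential walk
        k += 1
      }
      i += 1
    }
    c
  }
}
```

Both compute identical results; the difference only shows up in memory-access behavior, which is one reason a layout-conscious library can beat a generic one before GPUs even enter the picture.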


Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One 
way of doing this is to use the Scala Breeze library that is bundled with Spark. 
For matrix operations, it employs Netlib-java, which wraps native BLAS (basic 
linear algebra subprograms) and LAPACK binaries if they are available on the 
worker node, and also ships its own optimized Java implementation of BLAS. It is 
worth mentioning that native binaries provide better performance only for BLAS 
level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). 
This is confirmed by the GEMM test on the Netlib-java page 
https://github.com/fommil/netlib-java. I also confirmed it in my experiments 
with training an artificial neural network 
https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I 
would like to boost performance further.

GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA 
implementation of BLAS called cuBLAS. I have a Linux server with an Nvidia GPU 
and was able to do the following: I linked cuBLAS (instead of a CPU-based BLAS) 
with the Netlib-java wrapper and put it into Spark, so Breeze/Netlib uses it. 
Then I did some performance measurements of artificial neural network batch 
learning in Spark MLlib, which involves matrix-matrix multiplications. It turns 
out that for matrices smaller than ~1000x780, GPU cuBLAS has the same speed as 
CPU BLAS. For bigger matrices, cuBLAS becomes slower. It is worth mentioning 
that this was not a test of ONLY multiplication, since other operations are 
involved. One of the reasons for the slowdown might be the overhead of copying 
the matrices from main memory to graphics card memory and back.
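
One way to gauge how much that copy overhead can matter: an n x n GEMM offloaded per call moves O(n^2) bytes over the bus but performs O(n^3) flops on the device. The sketch below models this trade-off; the bandwidth and throughput constants are hypothetical, not measurements from the experiments described here:

```scala
// Back-of-the-envelope model of per-call copy overhead for an offloaded GEMM.
// Constants are assumed values chosen for illustration only.
object CopyOverheadModel {
  val pcieGBs = 1.5     // assumed effective host<->device bandwidth, GB/s
  val gpuGflops = 300.0 // assumed sustained GPU GEMM throughput, GFLOP/s

  // Copy two n x n inputs and one n x n result: 3 * n^2 doubles (8 B each), in ms.
  def copyMs(n: Long): Double = 3.0 * n * n * 8 / (pcieGBs * 1e9) * 1e3

  // Kernel time for C = A * B: 2 * n^3 flops, in ms.
  def gemmMs(n: Long): Double = 2.0 * n * n * n / (gpuGflops * 1e9) * 1e3

  // True when the round-trip copy costs more than the GEMM itself, i.e. when
  // keeping intermediate results on the device would pay off the most.
  def copyDominates(n: Long): Boolean = copyMs(n) > gemmMs(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(256L, 512L, 1024L, 2048L, 4096L))
      println(f"n=$n%5d copy=${copyMs(n)}%8.2f ms  gemm=${gemmMs(n)}%8.2f ms  copyDominates=${copyDominates(n)}")
}
```

Under these assumed constants the copy dominates for small matrices and only fades for large ones, which is why libraries that keep intermediates resident in device memory (the subject of question 2 below) are the natural thing to look for.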

So, a few questions:
1) Do these results with CUDA make sense? 
2) If the problem is copy overhead, are there any libraries that allow forcing 
intermediate results to stay in graphics card memory, thus removing the 
overhead?
3) Any other options to speed up linear algebra in Spark?

Thank you, Alexander

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org