Re: Using CUDA within Spark / boosting linear algebra

2016-02-04 Thread Max Grossman
Allen,

Currently it only supports OpenCL because the code generator we’ve extended 
targets OpenCL. There’s no technical reason CUDA couldn’t be supported if 
people are interested, but it would require rewriting part of the code 
generator, plus some ifdefs in the runtime to let us compile with either 
OpenCL or CUDA support. A few components actually support both OpenCL and 
CUDA already, because they’ve been reused in other projects that did use 
CUDA; just not all of them.

Thanks,

Max

> On Feb 4, 2016, at 9:42 AM, Allen Zhang <allenzhang...@126.com> wrote:
> 
> Hi Max,
> 
> I will look at it tomorrow. But a quick question: does it support CUDA from 
> Nvidia, not only OpenCL?
> 
> Thanks,
> Allen
> 
> 
> 
> 
> 
> At 2016-02-04 23:13:05, "Max Grossman" <j...@rice.edu> wrote:
> Hi all,
> 
> I’m jumping on this thread to point out another Spark+GPU project for people 
to take a look at: https://github.com/agrippa/spark-swat
> 
> SWAT (Spark with Accelerated Tasks) is a third-party JAR sitting on top of 
> Spark that uses runtime code generation to convert user-written 
> transformations into OpenCL kernels. SWAT’s lightweight runtime supports 
> multi-GPU systems, managing each device and its memory automatically. You 
> write your own Spark programs, and the runtime takes care of offloading your 
> transformations to the GPUs in your system:
> 
> val rdd = CLWrapper.cl(sc.objectFile(inputPath))
> val next = rdd.map(i => 2 * i).collect
> 
> SWAT primarily distinguishes itself in programmability: an explicit goal of 
> this project is to have as few user-visible API changes as possible from what 
> people have come to know and love in Spark. There are a number of 
> fixed-function GPU libraries out there now, so we wanted to look instead at 
> something that could be used to build new but still well-performing Spark 
> apps.
> 
> SWAT is currently more of a research project than a production-ready system, 
> so there’s a chance it won’t work out-of-the-box on some systems. With that 
> said, it does have fairly comprehensive functional and code generation 
> testing. If you’re interested in trying it out and having trouble setting up, 
> feel free to contact me directly. And of course, any questions or feedback 
> from the community are always welcome.
> 
> Thanks,
> 
> Max
> 
>> On Jan 22, 2016, at 3:42 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
>> 
>> Hi Alexander,
>> The goal of our columnar storage is to effectively drive GPUs in Spark. One 
>> of the important items is to effectively and easily enable highly-tuned GPU 
>> libraries such as BIDMach.
>> 
>> We will enable BIDMach with our columnar storage. On the other hand, it is 
>> not an easy task to scale BIDMach with current Spark. I expect that this talk 
>> would help us.
>> http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565
>> 
>> We appreciate your great feedback.
>> 
>> Best Regards,
>> Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo
>> 
>> 
>> 
>> From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
>> To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
>> Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday <sam.halli...@gmail.com>
>> Date: 2016/01/22 04:20
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> 
>> 
>> 
>> Hi Kazuaki,
>>  
>> Indeed, moving data to/from the GPU is costly, and this benchmark summarizes 
>> the costs of moving different data sizes with regard to matrix multiplication. 
>> These costs are paid for the convenience of using the standard BLAS API that 
>> Nvidia NVBLAS provides. The point is that no code changes are required (in 
>> Spark); one just needs to reference the BLAS implementation with the system 
>> variable.

Re: Using CUDA within Spark / boosting linear algebra

2016-02-04 Thread Max Grossman
Hi all,

I’m jumping on this thread to point out another Spark+GPU project for people to 
take a look at: https://github.com/agrippa/spark-swat

SWAT (Spark with Accelerated Tasks) is a third-party JAR sitting on top of 
Spark that uses runtime code generation to convert user-written transformations 
into OpenCL kernels. SWAT’s lightweight runtime supports multi-GPU systems, 
managing each device and its memory automatically. You write your own Spark 
programs, and the runtime takes care of offloading your transformations to the 
GPUs in your system:

val rdd = CLWrapper.cl(sc.objectFile(inputPath))
val next = rdd.map(i => 2 * i).collect

SWAT primarily distinguishes itself in programmability: an explicit goal of 
this project is to have as few user-visible API changes as possible from what 
people have come to know and love in Spark. There are a number of 
fixed-function GPU libraries out there now, so we wanted to look instead at 
something that could be used to build new but still well-performing Spark apps.

SWAT is currently more of a research project than a production-ready system, so 
there’s a chance it won’t work out-of-the-box on some systems. With that said, 
it does have fairly comprehensive functional and code generation testing. If 
you’re interested in trying it out and having trouble setting up, feel free to 
contact me directly. And of course, any questions or feedback from the 
community are always welcome.

Thanks,

Max

> On Jan 22, 2016, at 3:42 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> 
> Hi Alexander,
> The goal of our columnar storage is to effectively drive GPUs in Spark. One of 
> the important items is to effectively and easily enable highly-tuned GPU 
> libraries such as BIDMach.
> 
> We will enable BIDMach with our columnar storage. On the other hand, it is 
> not an easy task to scale BIDMach with current Spark. I expect that this talk 
> would help us.
> http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565
> 
> We appreciate your great feedback.
> 
> Best Regards,
> Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo
> 
> 
> 
> From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
> To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
> Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday <sam.halli...@gmail.com>
> Date: 2016/01/22 04:20
> Subject: RE: Using CUDA within Spark / boosting linear algebra
> 
> 
> 
> Hi Kazuaki,
>  
> Indeed, moving data to/from the GPU is costly, and this benchmark summarizes 
> the costs of moving different data sizes with regard to matrix multiplication. 
> These costs are paid for the convenience of using the standard BLAS API that 
> Nvidia NVBLAS provides. The point is that no code changes are required (in 
> Spark); one just needs to reference the BLAS implementation with the system 
> variable. Naturally, a hardware-specific implementation will always be faster 
> than the default. The benchmark results show that by comparing jCuda (by means 
> of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS 
> for large matrices, because it can take advantage of several GPUs and will be 
> faster despite the copying overhead. That is also a known point advertised by 
> Nvidia.
>  
> By the way, I don’t think the column/row-friendly format is an issue, because 
> one can use transposed matrices to fit the required format. I believe that is 
> just a software preference.
>  
> My suggestion regarding your prototype would be to compare it with Spark’s 
> implementation of logistic regression (which does not take advantage of the 
> GPU) and also with BIDMach’s (which does). That will give users a better 
> understanding of your implementation’s performance. Currently you compare it 
> with Spark’s example logistic regression implementation, which is meant as a 
> learning reference rather than a performance benchmark.
>  
> Best regards, Alexander
>  
> From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
> Sent: Thursday, January 21, 2016 3:34 AM
> To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
> Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>  
> Dear all,
> 
> >>>> Hi Alexander,

RE: Using CUDA within Spark / boosting linear algebra

2016-01-22 Thread Kazuaki Ishizaki
Hi Allen,
Thank you for your feedback.
An API to launch GPU kernels with JCuda is our first step. One purpose of 
releasing our prototype is to get feedback. In the future, we may use 
other wrappers instead of JCuda.

We would really appreciate it if you would suggest or propose APIs to 
effectively exploit GPUs, such as BIDMat’s, in Spark.
If we ran BIDMat with our columnar storage, the performance boost 
would be as good as others have reported.

Best Regards,
Kazuaki Ishizaki,



From:   "Allen Zhang" <allenzhang...@126.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "Ulanov, Alexander" 
<alexander.ula...@hpe.com>, "Joseph Bradley" <jos...@databricks.com>, 
"John Canny" <ca...@berkeley.edu>, "Evan R. Sparks" 
<evan.spa...@gmail.com>, "Xiangrui Meng" <men...@gmail.com>, "Sam 
Halliday" <sam.halli...@gmail.com>
Date:   2016/01/21 21:05
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Kazuaki,

JCuda is actually a wrapper around **pure** CUDA, and as your wiki page 
shows, the 3.15x performance boost for logistic regression seems slower than 
BIDMat-cublas or pure CUDA.
Could you elaborate on why you chose JCuda rather than JNI to call CUDA 
directly?

Regards,
Allen Zhang






At 2016-01-21 19:34:14, "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partitions. We would really appreciate your 
comments, suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra



Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 
John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at least 
for non-image workloads in industry (and for image processing you would 
probably want a deep learning/SGD solution with convolution kernels). e.g. 
it was only relevant for 1/7 of our recent benchmarks, which should be a 
reasonable sample. What really matters is sparse BLAS performance. BIDMat 
is still an order of magnitude faster there. Those kernels are only in 
BIDMat, since NVIDIA’s sparse BLAS don’t perform well on power-law data. 

It’s also the case that the overall performance of an algorithm is determined 
by the slowest kernel, not the fastest.

RE: Using CUDA within Spark / boosting linear algebra

2016-01-22 Thread Kazuaki Ishizaki
Hi Alexander,
The goal of our columnar storage is to effectively drive GPUs in Spark. One 
of the important items is to effectively and easily enable highly-tuned GPU 
libraries such as BIDMach.

We will enable BIDMach with our columnar storage. On the other hand, it is 
not an easy task to scale BIDMach with current Spark. I expect that this 
talk would help us.
http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565

We appreciate your great feedback.

Best Regards,
Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - 
Tokyo



From:   "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" 
<dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" 
<evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday 
<sam.halli...@gmail.com>
Date:   2016/01/22 04:20
Subject: RE: Using CUDA within Spark / boosting linear algebra



Hi Kazuaki,
 
Indeed, moving data to/from the GPU is costly, and this benchmark summarizes 
the costs of moving different data sizes with regard to matrix 
multiplication. These costs are paid for the convenience of using the 
standard BLAS API that Nvidia NVBLAS provides. The point is that no code 
changes are required (in Spark); one just needs to reference the BLAS 
implementation with the system variable. Naturally, a hardware-specific 
implementation will always be faster than the default. The benchmark results 
show that by comparing jCuda (by means of BIDMat) and NVBLAS. However, they 
also show that it is worth using NVBLAS for large matrices, because it can 
take advantage of several GPUs and will be faster despite the copying 
overhead. That is also a known point advertised by Nvidia.
 
By the way, I don’t think the column/row-friendly format is an issue, 
because one can use transposed matrices to fit the required format. 
I believe that is just a software preference.
 
My suggestion regarding your prototype would be to compare it with Spark’s 
implementation of logistic regression (which does not take advantage of the 
GPU) and also with BIDMach’s (which does). That will give users a better 
understanding of your implementation’s performance. Currently you compare it 
with Spark’s example logistic regression implementation, which is meant as a 
learning reference rather than a performance benchmark.
 
Best regards, Alexander
 
From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
Sent: Thursday, January 21, 2016 3:34 AM
To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
Subject: RE: Using CUDA within Spark / boosting linear algebra
 
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partitions. We would really appreciate your 
comments, suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Allen Zhang


Hi Kazuaki,


JCuda is actually a wrapper around **pure** CUDA, and as your wiki page shows, 
the 3.15x performance boost for logistic regression seems slower than 
BIDMat-cublas or pure CUDA.
Could you elaborate on why you chose JCuda rather than JNI to call CUDA directly?


Regards,
Allen Zhang








At 2016-01-21 19:34:14, "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as another 
>>>> part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to efficiently 
exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and GPU-friendly 
column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by 
supporting data partition caching in GPU device memory and by providing binary 
column storage for data partitions. We would really appreciate your comments, 
suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 

John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS 
are not very important for most machine learning workloads: at least for 
non-image workloads in industry (and for image processing you would probably 
want a deep learning/SGD solution with convolution kernels). e.g. it was only 
relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. 
What really matters is sparse BLAS performance. BIDMat is still an order of 
magnitude faster there. Those kernels are only in BIDMat, since NVIDIA’s sparse 
BLAS don’t perform well on power-law data.

It’s also the case that the overall performance of an algorithm is determined by 
the slowest kernel, not the fastest. If the goal is to get closer to BIDMach’s 
performance on typical problems, you need to make sure that every kernel goes 
at comparable speed. So the real question is how much faster MLlib routines run 
on a complete problem with/without GPU acceleration. For BIDMach, it’s close to 
a factor of 10. But that required running entirely on the GPU, and making sure 
every kernel is close to its limit.

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks.
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU performance 
from breeze/netlib-java - meaning there's no compelling performance reason to 
switch out our current linear algebra library (at least as far as this 
benchmark is concerned).
 
Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make sense 
to finally ship openblas compiled for some common platforms (64-bit linux, 
windows, mac) directly with Spark, hopefully eliminating the jblas warnings 
once and for all for most users? (Licensing is BSD) Or am I missing something?

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Kazuaki Ishizaki
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partitions. We would really appreciate your 
comments, suggestions, or feedback.
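
Joseph’s amortization point above (issue 1) can be illustrated with a small 
sketch. This is not Spark or spark-gpu API; `Amortize`, `localIterations`, 
and the gradient function are hypothetical stand-ins for a kernel operating 
on data that has already been transferred once:

```scala
// Hypothetical sketch: amortize the one-time transfer cost by running k
// local iterations on data that stays resident (e.g. in GPU device memory).
object Amortize {
  // `grad` stands in for a kernel launched on already-transferred data.
  def localIterations(data: Array[Double], w0: Double, k: Int,
                      grad: (Array[Double], Double) => Double): Double = {
    var w = w0
    var i = 0
    while (i < k) {   // k iterations reuse the single copy of `data`
      w -= 0.1 * grad(data, w)
      i += 1
    }
    w                 // only the small model is sent back for aggregation
  }
}
// In Spark this would correspond to something like:
//   rdd.mapPartitions(p => Iterator(Amortize.localIterations(p.toArray, w, k, grad)))
//      .reduce(_ + _) / numPartitions
```

The point of the sketch is only the loop structure: one host-to-device copy 
serves k iterations instead of one, so the copy cost is divided by k.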

Best Regards
Kazuaki Ishizaki



From:   "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny 
<ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" 
<dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. 
Sparks" <evan.spa...@gmail.com>
Date:   2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra



Hi Everyone,
 
I’ve updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).
 
This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 
John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at least 
for non-image workloads in industry (and for image processing you would 
probably want a deep learning/SGD solution with convolution kernels). e.g. 
it was only relevant for 1/7 of our recent benchmarks, which should be a 
reasonable sample. What really matters is sparse BLAS performance. BIDMat 
is still an order of magnitude faster there. Those kernels are only in 
BIDMat, since NVIDIA’s sparse BLAS don’t perform well on power-law data. 

It’s also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach’s performance on typical problems, you need to make sure 
that every kernel goes at comparable speed. So the real question is how 
much faster MLlib routines run on a complete problem with/without GPU 
acceleration. For BIDMach, it’s close to a factor of 10. But that required 
running entirely on the GPU, and making sure every kernel is close to its 
limit.

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks. 
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU 
performance from breeze/netlib-java - meaning there's no compelling 
performance reason to switch out our current linear algebra library (at 
least as far as this benchmark is concerned). 
 
Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make 
sense to finally ship openblas compiled for some common platforms (64-bit 
linux, windows, mac) directly with Spark - hopefully eliminating the jblas 
warnings once and for all for most users? (Licensing is BSD) Or am I 
missing something?
 
On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I 
double-checked them. It turns out that nvblas did not do multip

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Ulanov, Alexander
Hi Kazuaki,

Indeed, moving data to/from the GPU is costly, and this benchmark summarizes the 
costs of moving different data sizes with regard to matrix multiplication. 
These costs are paid for the convenience of using the standard BLAS API that 
Nvidia NVBLAS provides. The point is that no code changes are required (in 
Spark); one just needs to reference the BLAS implementation with the system 
variable. Naturally, a hardware-specific implementation will always be faster 
than the default. The benchmark results show that by comparing jCuda (by means 
of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS for 
large matrices, because it can take advantage of several GPUs and will be faster 
despite the copying overhead. That is also a known point advertised by Nvidia.
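
For readers who want to try this route, here is a rough sketch of the setup 
being described. All paths and file locations below are illustrative 
assumptions, not tested values: NVBLAS intercepts standard BLAS3 calls, needs 
a config file naming a CPU BLAS for the routines it does not offload, and 
netlib-java is then pointed at the native "system" BLAS, which the preloaded 
libnvblas provides.

```shell
# Illustrative sketch; adjust library paths for your system.
# 1) NVBLAS requires a CPU BLAS to fall back to for non-offloaded routines.
cat > /tmp/nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
EOF
export NVBLAS_CONFIG_FILE=/tmp/nvblas.conf

# 2) Preload libnvblas so BLAS3 calls are intercepted, then tell
#    netlib-java to use the native system BLAS via a system property.
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
spark-submit \
  --conf "spark.executor.extraJavaOptions=-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS" \
  --class MyApp myapp.jar   # hypothetical application
```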

By the way, I don't think the column/row-friendly format is an issue, because 
one can use transposed matrices to fit the required format. I believe that is 
just a software preference.

My suggestion regarding your prototype would be to compare it with Spark's 
implementation of logistic regression (which does not take advantage of the 
GPU) and also with BIDMach's (which does). That will give users a better 
understanding of your implementation's performance. Currently you compare it 
with Spark's example logistic regression implementation, which is meant as a 
learning reference rather than a performance benchmark.

Best regards, Alexander

From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com]
Sent: Thursday, January 21, 2016 3:34 AM
To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
Subject: RE: Using CUDA within Spark / boosting linear algebra

Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as another 
>>>> part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to efficiently 
exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and GPU-friendly 
column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by 
supporting data partition caching in GPU device memory and by providing binary 
column storage for data partitions. We would really appreciate your comments, 
suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
To: Sam Halliday <sam.halli...@gmail.com>, John Canny <ca...@berkeley.edu>
Cc: Xiangrui Meng <men...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>, "Evan R. Sparks" <evan.spa...@gmail.com>
Date: 2016/01/21 11:07
Subject: RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,

I've updated the benchmark and run experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).

This time I computed the average and median of 10 runs for each experiment 
and approximated FLOPS.

Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
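For reference, the "approximated FLOPS" figure for an n×n dgemm is the standard 2·n³ floating-point operations divided by wall time; a minimal sketch of that bookkeeping, including the mean/median summary over repeated runs (the timing value is hypothetical):

```scala
// Sketch: approximating GFLOPS for a square matrix multiply from wall time,
// and summarizing repeated runs by mean and median as in the benchmark.
object Flops {
  // An n x n dgemm performs ~2*n^3 floating-point operations.
  def gflops(n: Long, seconds: Double): Double = 2.0 * n * n * n / seconds / 1e9

  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    if (s.length % 2 == 1) s(s.length / 2)
    else (s(s.length / 2 - 1) + s(s.length / 2)) / 2.0
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical run: a 10000x10000 multiply finishing in 2.5 s.
    println(f"${gflops(10000L, 2.5)}%.1f GFLOPS")
  }
}
```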

Best regards, Alexander


From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra


John, I have to disagree with you there. Dense matrices come up a lot in 
industry, although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:

RE: Using CUDA within Spark / boosting linear algebra

2016-01-20 Thread Ulanov, Alexander
Hi Everyone,

I’ve updated the benchmark and ran experiments on new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel 
E5-2650 v3 @ 2.30GHz).

This time I computed the average and median of 10 runs for each experiment and 
approximated FLOPS.

Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas

Best regards, Alexander


From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra


John, I have to disagree with you there. Dense matrices come up a lot in 
industry, although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS 
are not very important for most machine learning workloads: at least for 
non-image workloads in industry (and for image processing you would probably 
want a deep learning/SGD solution with convolution kernels). e.g. it was only 
relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. 
What really matters is sparse BLAS performance. BIDMat is still an order of 
magnitude faster there. Those kernels are only in BIDMat, since NVIDIA's sparse 
BLAS don't perform well on power-law data.

It's also the case that the overall performance of an algorithm is determined by 
the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's 
performance on typical problems, you need to make sure that every kernel goes 
at a comparable speed. So the real question is how much faster MLlib routines are 
on a complete problem with/without GPU acceleration. For BIDMach, it's close to 
a factor of 10. But that required running entirely on the GPU, and making sure 
every kernel is close to its limit.
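The "slowest kernel" argument above is Amdahl's law applied per kernel; a quick sketch with purely hypothetical timing fractions:

```scala
// Sketch: overall speedup is bounded by the un-accelerated kernels
// (Amdahl's law). The runtime fractions below are hypothetical.
object Amdahl {
  // fractions: each kernel's share of total runtime (sums to 1.0)
  // speedups:  per-kernel speedup achieved
  def overall(fractions: Seq[Double], speedups: Seq[Double]): Double =
    1.0 / fractions.zip(speedups).map { case (f, s) => f / s }.sum

  def main(args: Array[String]): Unit = {
    // Accelerating only a dense-GEMM kernel (say 40% of runtime) by 10x
    // yields ~1.56x overall...
    println(overall(Seq(0.4, 0.6), Seq(10.0, 1.0)))
    // ...whereas accelerating every kernel by 10x yields the full 10x.
    println(overall(Seq(0.4, 0.6), Seq(10.0, 10.0)))
  }
}
```

This is why speeding up one BLAS routine in isolation moves an end-to-end benchmark far less than the kernel-level numbers suggest.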

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks.
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU performance 
from breeze/netlib-java - meaning there's no compelling performance reason to 
switch out our current linear algebra library (at least as far as this 
benchmark is concerned).

Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make sense 
to finally ship openblas compiled for some common platforms (64-bit linux, 
windows, mac) directly with Spark - hopefully eliminating the jblas warnings 
once and for all for most users? (Licensing is BSD) Or am I missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I 
double-checked them. It turns out that nvblas did not do the multiplication, due 
to the NVBLAS_TILE_DIM parameter from "nvblas.conf", and returned a zero matrix. 
My previously posted results with nvblas reflect matrix copying only. The default 
NVBLAS_TILE_DIM == 2048 is too big for my graphics card/matrix size. I handpicked 
other values that worked. As a result, netlib+nvblas is on par with 
BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
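Until that how-to is posted, here is roughly what such an nvblas.conf might look like. The key names follow the NVBLAS documentation, but every value below (library path, GPU list, tile size) is an illustrative guess that must be adapted per system:

```
# nvblas.conf -- located via the NVBLAS_CONFIG_FILE environment variable.
# Key names follow the NVBLAS docs; all values here are illustrative only.

NVBLAS_LOGFILE       nvblas.log

# CPU BLAS used as fallback for calls nvblas does not route to the GPU
# (mandatory; the path is an example).
NVBLAS_CPU_BLAS_LIB  /usr/lib64/libopenblas.so

# GPUs to use: ALL, or a space-separated device list.
NVBLAS_GPU_LIST      ALL

# Tile size used to split large GEMMs across GPUs. The default of 2048 was
# too large for the card/matrix sizes above; a smaller value worked.
NVBLAS_TILE_DIM      1024
```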



-Original Message-
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional 
performance for big matrices with Double, faster than BIDMat-cuda with Float. 
But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL 
might be a better choice. This correlates with the original nvblas presentation at 
GPU conf 2013 (slide 21): 
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case, these tests are not meant to generalize the performance of the 
different libraries. I just want to pick the library that performs dense matrix 
multiplication best for my task.

P.S. My previous issue with nvblas was the following: it has Fortran BLAS 
functions, while netlib-java uses C cblas functions. So one needs a cblas 
shared library to use nvblas through netlib-java.

Re: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Xiangrui Meng
Hi Alex,

Since it is non-trivial to make nvblas work with netlib-java, it would
be great if you could send the instructions to netlib-java as part of
the README. Hopefully we don't need to modify the netlib-java code to use
nvblas.

Best,
Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen so...@cloudera.com wrote:
 The license issue is with libgfortran, rather than OpenBLAS.

 (FWIW I am going through the motions to get OpenBLAS set up by default
 on CDH in the near future, and the hard part is just handling
 libgfortran.)

 On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
 Alright Sam - you are the expert here. If the GPL issues are unavoidable,
 that's fine - what is the exact bit of code that is GPL?

 The suggestion to use OpenBLAS is not to say it's the best option, but that
 it's a *free, reasonable default* for many users - keep in mind the most
 common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
 Additionally, for many of the problems we're targeting, this reasonable
 default can provide a 1-2 orders of magnitude improvement in performance
 over the f2jblas implementation that netlib-java falls back on.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





RE: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Ulanov, Alexander
Hi Sam, 

What is the best way to do it? Should I clone netlib-java, edit readme.md and 
make a PR?

Best regards, Alexander


-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, March 30, 2015 2:43 PM
To: Sean Owen
Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; 
jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alex,

Since it is non-trivial to make nvblas work with netlib-java, it would be great 
if you could send the instructions to netlib-java as part of the README. 
Hopefully we don't need to modify the netlib-java code to use nvblas.

Best,
Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen so...@cloudera.com wrote:
 The license issue is with libgfortran, rather than OpenBLAS.

 (FWIW I am going through the motions to get OpenBLAS set up by default 
 on CDH in the near future, and the hard part is just handling
 libgfortran.)

 On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
 Alright Sam - you are the expert here. If the GPL issues are 
 unavoidable, that's fine - what is the exact bit of code that is GPL?

 The suggestion to use OpenBLAS is not to say it's the best option, 
 but that it's a *free, reasonable default* for many users - keep in 
 mind the most common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
 Additionally, for many of the problems we're targeting, this 
 reasonable default can provide a 1-2 orders of magnitude improvement 
 in performance over the f2jblas implementation that netlib-java falls back 
 on.




Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread John Canny
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at 
least for non-image workloads in industry (and for image processing you 
would probably want a deep learning/SGD solution with convolution 
kernels). e.g. it was only relevant for 1/7 of our recent benchmarks, 
which should be a reasonable sample. What really matters is sparse BLAS 
performance. BIDMat is still an order of magnitude faster there. Those 
kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well 
on power-law data.


It's also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach's performance on typical problems, you need to make 
sure that every kernel goes at a comparable speed. So the real question is 
how much faster MLlib routines are on a complete problem with/without GPU 
acceleration. For BIDMach, it's close to a factor of 10. But that 
required running entirely on the GPU, and making sure every kernel is 
close to its limit.


-John

If you think nvblas would be helpful, you should try it in some 
end-to-end benchmarks.

On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU 
performance from breeze/netlib-java - meaning there's no compelling 
performance reason to switch out our current linear algebra library 
(at least as far as this benchmark is concerned).


Instead, it looks like a user guide for configuring Spark/MLlib to use 
the right BLAS library will get us most of the way there. Or, would it 
make sense to finally ship openblas compiled for some common platforms 
(64-bit linux, windows, mac) directly with Spark - hopefully 
eliminating the jblas warnings once and for all for most users? 
(Licensing is BSD) Or am I missing something?


Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
I'm not at all surprised ;-) I fully expect the GPU performance to get
better automatically as the hardware improves.

Netlib natives still need to be shipped separately. I'd also oppose any
move to make Open BLAS the default - is not always better and I think
natives really need DevOps buy-in. It's not the right solution for
everybody.
On 26 Mar 2015 01:23, Evan R. Sparks evan.spa...@gmail.com wrote:

 Yeah, much more reasonable - nice to know that we can get full GPU
 performance from breeze/netlib-java - meaning there's no compelling
 performance reason to switch out our current linear algebra library (at
 least as far as this benchmark is concerned).

 Instead, it looks like a user guide for configuring Spark/MLlib to use the
 right BLAS library will get us most of the way there. Or, would it make
 sense to finally ship openblas compiled for some common platforms (64-bit
 linux, windows, mac) directly with Spark - hopefully eliminating the jblas
 warnings once and for all for most users? (Licensing is BSD) Or am I
 missing something?


Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
Btw, OpenBLAS requires GPL runtime binaries which are typically considered
system libraries (and these fall under something similar to the Java
classpath exception rule)... so it's basically impossible to distribute
OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
Spark right now to clear up something of this nature.

On a more technical level, I'd recommend watching my talk at ScalaX which
explains in detail why high performance only comes from machine optimised
binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
on the CPU, not OpenBLAS).

On an even deeper level, using natives has consequences to JIT and GC which
isn't suitable for everybody and we'd really like people to go into that
with their eyes wide open.

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional 
performance for big matrices with Double, faster than BIDMat-cuda with Float. 
But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL 
might be a better choice. This correlates with the original nvblas presentation at 
GPU conf 2013 (slide 21): 
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case, these tests are not meant to generalize the performance of the 
different libraries. I just want to pick the library that performs dense matrix 
multiplication best for my task.

P.S. My previous issue with nvblas was the following: it has Fortran BLAS 
functions, while netlib-java uses C cblas functions. So one needs a cblas 
shared library to use nvblas through netlib-java. Fedora does not have 
cblas (but Debian and Ubuntu do), so I needed to compile it. I could not use 
the cblas from ATLAS or OpenBLAS because they link to their own implementations 
and not to Fortran BLAS.

Best regards, Alexander

-Original Message-
From: Ulanov, Alexander 
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas functions should 
replace the current BLAS function calls after setting LD_PRELOAD, as suggested in 
http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. 
It seems to work for a simple Java example, but I cannot make it work with Spark. 
I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set to use the GPU:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873     C  bash                                          39MiB   |
|    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java              39MiB   |
+-----------------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded 
/tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and cblas is supposedly used. However, 
the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% 
GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.

Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander



From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on 
various pieces of hardware...
On 9 Mar 2015 21:08, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested: added the 
comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see 
support for Double in the current source code) and did the test with BIDMat and CPU 
Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <men...@gmail.com> writes:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x 
 slower than netlib-openblas. What is the overhead of using a GPU BLAS 
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
 Better documentation for linking

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
That would be a difficult task that would only benefit users of
netlib-java. MultiBLAS is easily implemented (although a lot of
boilerplate) and benefits all BLAS users on the system.

If anyone knows of a funding route for it, I'd love to hear from them,
because it's too much work for me to take on at the moment as a hobby.
On 25 Mar 2015 22:16, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Sam,

 would it be easier to hack netlib-java to allow multiple (configurable)
 library contexts, and so enable 3rd-party configurations and optimizers to
 make their own choices until then?

 On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday sam.halli...@gmail.com
 wrote:

 Yeah, MultiBLAS... it is dynamic.

 Except, I haven't written it yet :-P
 On 25 Mar 2015 22:06, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

  Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols
 from the provided libblas.so.3 library at runtime. So, you can switch
 at runtime by providing another library. Sam, please suggest if there
 is another way.



 *From:* Dmitriy Lyubimov [mailto:dlie...@gmail.com]
 *Sent:* Wednesday, March 25, 2015 2:55 PM
 *To:* Ulanov, Alexander
 *Cc:* Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph
 Bradley; Evan R. Sparks; jfcanny
 *Subject:* Re: Using CUDA within Spark / boosting linear algebra



 Alexander,



  does using netlib imply that one cannot switch between CPU and GPU BLAS
 alternatives at will at the same time? The choice is always determined by
 linking alternatives to libblas.so, right?




Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between CPU/GPU? I have a project called
MultiBLAS that was going to do this; it should be easy (but boring to
write).

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Sure, I will write a how-to after I re-check the results.


RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Netlib knows nothing about the GPU (or CPU); it just uses CBLAS symbols from the
provided libblas.so.3 library at runtime. So you can switch at runtime
by providing another library. Sam, please suggest if there is another way.

From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
Sent: Wednesday, March 25, 2015 2:55 PM
To: Ulanov, Alexander
Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. 
Sparks; jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Alexander,

does using netlib imply that one cannot switch between CPU and GPU BLAS
alternatives at will at the same time? The choice is always determined by
linking alternatives to libblas.so, right?


RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
As everyone suggested, the results were too good to be true, so I
double-checked them. It turns out that nvblas did not do the multiplication,
due to the NVBLAS_TILE_DIM parameter in nvblas.conf, and returned a zero
matrix. My previously posted nvblas results therefore measured only matrix
copying. The default NVBLAS_TILE_DIM==2048 is too big for my graphics
card/matrix size; I handpicked other values that worked. As a result,
netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a
how-to for nvblas configuration.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
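For reference, a minimal nvblas.conf sketch covering the settings discussed here. The CPU BLAS path and tile size are illustrative values for this thread's symptom, not recommendations; consult the NVBLAS documentation for the full key list.

```
# nvblas.conf -- point the NVBLAS_CONFIG_FILE environment variable at this
# file, or keep it in the working directory of the JVM process.

# CPU BLAS that nvblas falls back to for small or unsupported calls
# (illustrative path -- use whatever OpenBLAS/MKL build you have).
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so

# Which GPUs to use.
NVBLAS_GPU_LIST ALL

# Tile size used to split GEMM across the GPU. The default of 2048 was too
# large for the card/matrix sizes in this thread and silently produced a
# zero result; smaller values (handpicked per card) worked.
NVBLAS_TILE_DIM 1024
```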



-Original Message-
From: Ulanov, Alexander 
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional
performance for big matrices with Double, faster than BIDMat-cuda with Float.
But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL
might be a better choice. This correlates with the original nvblas presentation
at GPU conf 2013 (slide 21):
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
 
My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 

Just in case: these tests are not meant to generalize the performance of
different libraries. I just want to pick the library that performs dense
matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS
functions, while netlib-java uses C CBLAS functions. So one needs a CBLAS
shared library to use nvblas through netlib-java. Fedora does not ship CBLAS
(Debian and Ubuntu do), so I needed to compile it. I could not use the CBLAS
from ATLAS or OpenBLAS because they link to their own implementation and not
to the Fortran BLAS.

Best regards, Alexander

-Original Message-
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas functions should
replace the current BLAS function calls via LD_PRELOAD, as suggested in
http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java.
It seems to work for a simple Java example, but I cannot make it work with
Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
In nvidia-smi I observe that the Java process is attached to the GPU:
+-+
| Processes:   GPU Memory |
|  GPU   PID  Type  Process name   Usage  |
|=|
|0  8873C   bash39MiB |
|0  8910C   /usr/lib/jvm/java-1.7.0/bin/java39MiB |
+-+

In Spark shell I do matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded 
/tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and CBLAS is supposedly used. However,
matrix multiplication executes on the CPU, since I see 16% CPU usage and 0%
GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.

Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander



From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on 
various pieces of hardware...
On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi everyone, I've updated the benchmark as Xiangrui suggested: added a comment
that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for
Double in the current source code), and ran the test with BIDMat and CPU
Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
too good... did you check the results for correctness? - also, is it
possible that the unified memory model of nvblas is somehow hiding pci
transfer time?)

This last bit (getting nvblas + netlib-java to play together) sounds
non-trivial and like it took you a while to figure out! Would you mind posting
a gist of the shell scripts/exports you used to make this work? I can imagine
it being highly useful for others in the future.

Thanks!
Evan
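Evan's correctness question is worth automating. Below is a minimal, library-agnostic sketch (plain Python, helper names hypothetical) of the kind of sanity check that would catch a silently failing backend, such as the all-zero nvblas results reported elsewhere in this thread: probe the multiply routine on a small random input and compare against a naive reference.

```python
import random

def matmul_ref(a, b):
    """Naive triple-loop reference multiply for small probe matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def sanity_check(multiply, n=8, tol=1e-9):
    """Run `multiply` on a small random probe and compare to the reference.

    Catches silent failures such as an all-zero result (the NVBLAS_TILE_DIM
    symptom described in this thread) before trusting benchmark timings."""
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = multiply(a, b)
    if all(abs(x) < tol for row in c for x in row):
        return False  # all-zero output: the multiplication silently failed
    ref = matmul_ref(a, b)
    return all(abs(c[i][j] - ref[i][j]) < 1e-6
               for i in range(n) for j in range(n))

# A correct backend passes; a backend that returns zeros is caught.
assert sanity_check(matmul_ref)
assert not sanity_check(lambda a, b: [[0.0] * len(b[0]) for _ in a])
```

In a real benchmark the `multiply` argument would wrap the BLAS-backed call (e.g. netlib-java's gemm); the checker only assumes it takes and returns row-major lists.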


Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread jfcanny
Alex,
I think you should recheck your numbers. Both BIDMat and nvblas are
wrappers for cublas. The speeds are identical, except on machines that
have multiple GPUs, which nvblas exploits and cublas doesn't.

It would be a good idea to add a column with Gflop throughput. Your
numbers for the BIDMat 10kx10k multiply give about 300 single-float gflops,
which seems about right for a Quadro 4000 (current-generation devices
are >10x faster than a 4000).

Your numbers for netlib-nvblas would indicate a double-float throughput
of 8 tflops, which is physically impossible on that device.

It shouldn't matter which interface you use if you have a single GPU.

-John
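The throughput column John suggests is simple arithmetic: an n x n x n GEMM performs 2n^3 floating-point operations, so effective GFLOPS follow directly from the wall-clock time. A small sketch (helper name hypothetical):

```python
def gemm_gflops(n, seconds):
    """Effective throughput of an n x n by n x n multiply.

    A square GEMM performs 2*n^3 floating-point operations
    (n^3 multiplies plus n^3 adds)."""
    return 2.0 * n**3 / seconds / 1e9

# A 10k x 10k multiply finishing in ~6.7 s is ~300 GFLOPS -- plausible for
# a Quadro 4000 in single precision.
assert abs(gemm_gflops(10_000, 6.7) - 300) < 5

# The same multiply in 0.25 s would be 8 TFLOPS -- the physically
# impossible double-precision figure John flags, and the tell that the
# timing measured data copying rather than computation.
assert gemm_gflops(10_000, 0.25) == 8000.0
```

This is exactly the cross-check that exposes a benchmark measuring copies instead of compute: convert every timing to GFLOPS and compare against the device's peak.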


Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread Chester At Work
Reynold,

Prof. Canny gave me the slides yesterday. I will post the link to the
slides to both the SF Big Analytics and SF Machine Learning meetups.

Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin r...@databricks.com wrote:

 Thanks for chiming in, John. I missed your meetup last night - do you have
 any writeups or slides about roofline design? In particular, I'm curious
 about what optimizations are available for power-law dense * sparse? (I
 don't have any background in optimizations)
 
 
 
 On Thu, Mar 12, 2015 at 8:50 PM, jfcanny ca...@berkeley.edu wrote:
 
 If you're contemplating GPU acceleration in Spark, it's important to look
 beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
 datasets we've tested in BIDMach, and we've tried to make them
 representative of industry machine learning workloads. Unless you're
 crunching images or audio, the majority of data will be very sparse and
 power-law distributed. You need a good sparse BLAS, and in practice it
 seems like you need a sparse BLAS tailored for power-law data. We had to
 write our own, since the NVIDIA libraries didn't perform well on typical
 power-law data. The Intel MKL sparse BLAS also have issues, and we only use
 some of them.
 
 You also need 2D reductions, scan operations, slicing, element-wise
 transcendental functions and operators, many kinds of sort, random number
 generators, etc., and some kind of memory management strategy. Some of this
 was layered on top of Thrust in BIDMat, but most had to be written from
 scratch. It's all been rooflined, typically to the memory throughput of
 current GPUs (around 200 GB/s).
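The rooflining John describes bounds each kernel by the smaller of the device's compute peak and its memory bandwidth times the kernel's arithmetic intensity. A toy sketch (all numbers illustrative, not measurements from BIDMach):

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable throughput under the roofline model: the minimum of the
    compute peak and memory bandwidth * arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A streaming kernel doing 0.25 flops per byte on a 200 GB/s GPU is
# memory-bound at 50 GFLOPS, no matter how high the compute peak is.
assert roofline_gflops(4000, 200, 0.25) == 50
```

Kernels sitting on the memory-bandwidth roof, as described above, are as fast as the hardware allows; a kernel below the roof has a quantified gap worth profiling.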
 
 When you have all this you can write learning algorithms with the same
 high-level primitives available in Breeze or Numpy/Scipy. It's literally the
 same in BIDMat, since the generic matrix operations are implemented on both
 CPU and GPU, so the same code runs on either platform.
 
 A lesser-known fact is that GPUs are around 10x faster for *all* those
 operations, not just dense BLAS. It's mostly due to faster streaming memory
 speeds, but some kernels (random number generation and transcendentals) are
 more than an order of magnitude faster, thanks to some specialized hardware
 for power series on the GPU chip.
 
 When you have all this there is no need to move data back and forth across
 the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
 and feed them to the available GPUs. Most models fit comfortably in GPU
 memory these days (4-12 GB). With minibatch algorithms you can push TBs of
 data through the GPU this way.
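The streaming arithmetic is easy to check: with minibatches, a device holding a few GB can process TBs. A toy calculation (editor's sketch; all sizes here are illustrative assumptions, not BIDMach's actual configuration):

```scala
// Schematic minibatch flow: the host loop only loads and hands off
// fixed-size chunks; in the real pipeline all math would stay on the GPU.
val gpuMemBytes = 4L * 1024 * 1024 * 1024         // a 4 GB device
val batchBytes  = 64L * 1024 * 1024               // 64 MB minibatches
val totalBytes  = 1L * 1024 * 1024 * 1024 * 1024  // 1 TB of training data

val batches = totalBytes / batchBytes             // 16384 host->device transfers
// The model plus one minibatch never approaches 4 GB, so terabytes stream
// through a device that only holds gigabytes at any instant.
val batchesResidentAtOnce = gpuMemBytes / batchBytes  // 64 could fit at once
```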
 
 
 
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 




Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread jfcanny
Hi Reynold,
I left Chester with a copy of the slides, so I assume they'll be posted 
on the SF ML or Big Data sites. We have a draft paper under review. I 
can ask the co-authors about arXiv-ing it.

We have a few heuristics for power-law data. One of them is to keep the 
feature set sorted by frequency. Power-law data has roughly the same 
mass in each power-of-two range of feature frequency. By keeping the 
most frequent features together, you get a lot more value out of the 
caches on the device (even GPUs have them, albeit smaller ones). E.g., 
with 100 million features, 1/2 of the feature instances will be in the 
range 1,...,10,000. If they're consecutive, they will all hit a fast 
cache. Another 1/4 will be in 1,...,1,000,000, hitting the next cache, etc.
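The reordering heuristic can be sketched in a few lines of Scala (an editor's illustration with made-up feature ids, not BIDMach code): renumber features so the most frequent ones get the smallest ids, keeping the hot part of the model adjacent in memory.

```scala
// One feature id per nonzero in the dataset (toy data):
val featureIds = Array(7, 2, 7, 7, 5, 2, 9)
// Count occurrences of each feature:
val freq = featureIds.groupBy(identity).map { case (f, xs) => (f, xs.length) }
// Map old id -> new id, ordered by descending frequency:
val remap = freq.toSeq.sortBy(-_._2).map(_._1).zipWithIndex.toMap
val remapped = featureIds.map(remap)
// Feature 7 (3 occurrences) becomes id 0, feature 2 (2 occurrences) id 1,
// and the singleton features 5 and 9 take ids 2 and 3.
```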

Another is to subdivide sparse matrices using the vector of elements 
rather than rows or columns. Splitting power-law matrices by either rows 
or columns gives very uneven splits. That means we store sparse matrices 
in coordinate form rather than compressed row or column format.
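A sketch of that element-wise splitting on a coordinate-form matrix (illustrative code, not BIDMat's actual representation): chunks of the nonzero vector are equal-sized no matter how skewed the rows or columns are.

```scala
// Coordinate (COO) storage: one (row, col, value) triple per nonzero.
case class Coo(row: Int, col: Int, v: Double)
val nnz = Seq(Coo(0,0,1.0), Coo(0,1,2.0), Coo(0,2,3.0),
              Coo(5,0,4.0), Coo(9,3,5.0), Coo(9,7,6.0))
val parts = 3
// Split the element vector itself, not rows or columns:
val chunks = nnz.grouped((nnz.length + parts - 1) / parts).toSeq
// Every chunk holds the same number of nonzeros (here 2 each), even though
// row 0 has three elements and row 5 only one.
```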

Other than that, rooflining gives you a goal that you should be able to 
reach. If you aren't at the limit, just knowing that gives you a target 
to aim at. You can try profiling the kernel to figure out why it's slower 
than it should be. There are a few common reasons (low occupancy, 
imbalanced thread blocks, thread divergence) that you can discover with 
the profiler. Then hopefully you can solve them.
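The roofline bound itself is one line of arithmetic: attainable throughput is the minimum of the compute peak and bandwidth times arithmetic intensity. A sketch with illustrative numbers (editor's example, not measurements of any particular GPU):

```scala
// Roofline model: attainable GFLOP/s given a device's peak compute,
// memory bandwidth, and the kernel's flops-per-byte ratio.
def roofline(peakGflops: Double, bwGBs: Double, flopsPerByte: Double): Double =
  math.min(peakGflops, bwGBs * flopsPerByte)

// A streaming kernel like SpMV on power-law data might do ~0.25 flops/byte,
// so on a hypothetical 4 TFLOP/s, 200 GB/s device:
val spmv = roofline(peakGflops = 4000.0, bwGBs = 200.0, flopsPerByte = 0.25)
// min(4000, 200 * 0.25) = 50 GFLOP/s: memory-bound, so the 200 GB/s
// streaming bandwidth is the number to chase, exactly as described above.
```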

-John


On 3/12/2015 10:56 PM, rxin [via Apache Spark Developers List] wrote:
 Thanks for chiming in, John. I missed your meetup last night - do you 
 have
 any writeups or slides about roofline design? In particular, I'm curious
 about what optimizations are available for power-law dense * sparse? (I
 don't have any background in optimizations)




RE: Using CUDA within Spark / boosting linear algebra

2015-03-10 Thread Ulanov, Alexander
I can run the benchmark on another machine with an Nvidia Titan GPU and an Intel 
Xeon E5-2650 v2, although it runs Windows and I have to run the Linux tests in 
VirtualBox.

It would also be interesting to add results for netlib+nvblas; however, I am not 
sure I understand in detail how to build this, and I will appreciate any help 
from you ☺

RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Ulanov, Alexander
Hi Everyone, I've updated the benchmark as Xiangrui suggested: I added a comment 
that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for 
Double in the current source code) and ran the test with BIDMat and CPU Double 
matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander


RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Sam Halliday
Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on
various pieces of hardware...

Re: Using CUDA within Spark / boosting linear algebra

2015-03-03 Thread Sam Halliday
BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng men...@gmail.com writes:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:
 Better documentation for linking would be very helpful!  Here's a JIRA:
 https://issues.apache.org/jira/browse/SPARK-6019


 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Thanks for compiling all the data and running these benchmarks, Alex. The
 big takeaways here can be seen with this chart:

 https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

 1) A properly configured GPU matrix multiply implementation (e.g.
 BIDMat+GPU) can provide substantial (but less than an order of magnitude)
 benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
 netlib-java+openblas-compiled).
 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
 than a well-tuned CPU implementation, particularly for larger matrices.
 (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
 basically agrees with the author's own benchmarks (
 https://github.com/fommil/netlib-java)

 I think that most of our users are in a situation where using GPUs may not
 be practical - although we could consider having a good GPU backend
 available as an option. However, *ALL* users of MLlib could benefit
 (potentially tremendously) from using a well-tuned CPU-based BLAS
 implementation. Perhaps we should consider updating the mllib guide with a
 more complete section for enabling high performance binaries on OSX and
 Linux? Or better, figure out a way for the system to fetch these
 automatically.

 - Evan



 On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

 Just to summarize this thread, I was finally able to make all performance
 comparisons that we discussed. It turns out that:
 BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
 netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform copying
 to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Thanks, Evan! It seems that the ticket was marked as a duplicate, though the
 original one discusses a slightly different topic. I was able to link netlib
 with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
 60MB library.

 |A*B size             | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
 |100x100*100x100      | 0,00205596  | 0,000381    | 0,03810324  | 0,002556    |
 |1000x1000*1000x1000  | 0,018320947 | 0,038316857 | 0,51803557  | 1,638475459 |
 |1x1*1x1              | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |

 It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
 my machine. Probably I'll add two more columns with locally compiled
 OpenBLAS and CUDA.

 Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Monday, February 09, 2015 6:06 PM
 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Great - perhaps we can move this discussion off-list and onto a JIRA
 ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

 It seems like this is going to be somewhat exploratory for a while (and
 there's probably only a handful of us who really care about fast linear
 algebra!)

 - Evan

 On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:
 Hi Evan,

 Thank you for explanation and useful link. I am going to build OpenBLAS,
 link it with Netlib-java and perform benchmark again.

 Do I understand correctly that BIDMat binaries contain statically linked
 Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
 having MKL BLAS installed on my server. If it is true, I wonder if it is OK
 because Intel sells this library. Nevertheless, it seems that in my case
 precompiled MKL BLAS performs better than precompiled OpenBLAS given that
 BIDMat and Netlib-java are supposed to be on par

RE: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Ulanov, Alexander
Thanks, Sam, for the suggestion! I should try doing this. Now I suppose that 
netlib-java linked with cuBLAS does fall back at execution time to the cblas 
library on my system, which is ATLAS. If I remove ATLAS, netlib (linked with 
cuBLAS) fails with the message "undefined symbol: cblas_dgemm".

In the meantime, I have updated my spreadsheet with BIDMat-cuda results that 
copy from main memory to the GPU, multiply, and then copy the result back to 
main memory (similar to what Xiangrui did). Surprisingly (for myself), the 
copying overhead seems quite small, especially for the bigger matrices.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
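For anyone trying to reproduce a setup like the one above, a minimal sketch might look like the following. All paths, and the choice of OpenBLAS as the CPU fallback, are illustrative assumptions rather than the configuration used in this thread; see the NVIDIA NVBLAS documentation linked elsewhere in the thread for the authoritative keys.

```shell
# Editor's sketch of a minimal NVBLAS setup (paths are illustrative).
# nvblas.conf names the CPU BLAS that NVBLAS falls back to for the routines
# it does not intercept -- this fallback is where the missing cblas_dgemm
# symbol came from when ATLAS was removed.
cat > nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
EOF
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
# Preload NVBLAS so its GEMM intercepts the Level 3 calls netlib-java makes
# (the java command is a placeholder for your Spark driver):
# LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so java -jar my-spark-app.jar
```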

-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Monday, March 02, 2015 1:24 PM
To: Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra

That's correct. It's highly unusual for a libblas.so to only provide the 
Fortran API. Oh well... CBLAS sources are available in the netlib-java 
repository so you could simply compile them and link against whatever 
libblas.so[fortran] you like.

On 2 March 2015 at 21:04, Ulanov, Alexander alexander.ula...@hp.com wrote:
 Hi Xiangrui,

 Thanks for the link, I am currently trying to use nvblas. It seems that 
 netlib wrappers are implemented with C-BLAS interface and nvblas does not 
 have c-blas. I wonder how it is going to work. I'll keep you updated.

 Alexander

 -Original Message-
 From: Xiangrui Meng [mailto:men...@gmail.com]
 Sent: Monday, March 02, 2015 11:42 AM
 To: Sam Halliday
 Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote:
 Also, check the JNILoader output.

 Remember, for netlib-java to use your system libblas all you need to 
 do is setup libblas.so.3 like any native application would expect.

 I haven't ever used the cublas real BLAS  implementation, so I'd be 
 interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to 
 check that all the runtime links are in order.


 There are two shared libraries in this hybrid setup. nvblas.so must be 
 loaded before libblas.so to intercept level 3 routines using GPU. More 
 details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage

 Btw, I have some DGEMM wrappers in my netlib-java performance 
 module... and I also planned to write more in MultiBLAS (until I 
 mothballed the project for the hardware to catch up, which it 
 probably has, and now I just need a reason to look at it)

 On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

 Hey Sam,

 The running times are not big O estimates:

  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.

 I think there is something wrong with the netlib/cublas combination.
 Sam already mentioned that cuBLAS doesn't implement the CPU BLAS 
 interfaces. I checked the CUDA doc and it seems that to use GPU BLAS 
 through the CPU BLAS interface we need to use NVBLAS, which 
 intercepts some Level 3 CPU BLAS calls (including GEMM). So we need 
 to load nvblas.so first and then some CPU BLAS library in JNI. I 
 wonder whether the setup was correct.

 Alexander, could you check whether GPU is used in the netlib-cublas 
 experiments? You can tell it by watching CPU/GPU usage.

 Best,
 Xiangrui

 On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
 sam.halli...@gmail.com
 wrote:
  Don't use big O estimates, always measure. It used to work back 
  in the days when double multiplication was a bottleneck. The 
  computation cost is effectively free on both the CPU and GPU and 
  you're seeing pure copying costs. Also, I'm dubious that cublas is 
  doing what you think it is. Can you link me to the source code for 
  DGEMM?
 
  I show all of this in my talk, with explanations, I can't stress 
  enough how much I recommend that you watch it if you want to 
  understand high performance hardware acceleration for linear 
  algebra :-)
 
  On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:
 
  The copying overhead should be quadratic on n, while the 
  computation cost is cubic on n. I can understand that 
  netlib-cublas is slower than netlib-openblas on small problems.
  But I'm surprised to see that it is still 20x slower on 
  1x1. I did the following on a g2.2xlarge instance with BIDMat:
 
  val n = 1
 
  val f = rand(n, n)
  flip; f*f; val rf = flop
 
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val 
  rg = flop
 
  flip; g*g; val rgg = flop
 
  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.
 
  I'm not sure whether my CPU-GPU-CPU code simulates the 
  netlib-cublas path. But based on the result, the data copying 
  overhead is definitely not as big
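Xiangrui's point about quadratic copying versus cubic compute is easy to put in numbers. A sketch with assumed PCIe bandwidth and GEMM throughput (editor's illustration with made-up figures, not measurements from this thread):

```scala
// Copying moves O(n^2) bytes while GEMM does O(n^3) flops, so the copy
// share of total time shrinks as n grows.
def copyFraction(n: Long, bwBytesPerSec: Double, gflops: Double): Double = {
  val copySec = 3.0 * 8 * n * n / bwBytesPerSec   // A, B in; C out (doubles)
  val gemmSec = 2.0 * n * n * n / (gflops * 1e9)  // 2n^3 flops for GEMM
  copySec / (copySec + gemmSec)
}

// With an assumed ~6 GB/s over PCIe and ~1 TFLOP/s of GEMM, copying
// dominates for 1000x1000 matrices but is a modest overhead at 10000x10000:
val small = copyFraction(1000,  6e9, 1000.0)  // about 2/3 of time is copying
val big   = copyFraction(10000, 6e9, 1000.0)  // about 1/6 of time is copying
```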

RE: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Ulanov, Alexander
Hi Xiangrui,

Thanks for the link, I am currently trying to use nvblas. It seems that netlib 
wrappers are implemented with C-BLAS interface and nvblas does not have c-blas. 
I wonder how it is going to work. I'll keep you updated.

Alexander

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, March 02, 2015 11:42 AM
To: Sam Halliday
Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra

On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote:
 Also, check the JNILoader output.

 Remember, for netlib-java to use your system libblas all you need to 
 do is setup libblas.so.3 like any native application would expect.

 I haven't ever used the cublas real BLAS  implementation, so I'd be 
 interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to 
 check that all the runtime links are in order.


There are two shared libraries in this hybrid setup. nvblas.so must be loaded 
before libblas.so to intercept level 3 routines using GPU. More details are at: 
http://docs.nvidia.com/cuda/nvblas/index.html#Usage

 Btw, I have some DGEMM wrappers in my netlib-java performance 
 module... and I also planned to write more in MultiBLAS (until I 
 mothballed the project for the hardware to catch up, which it probably 
 has, and now I just need a reason to look at it)

 On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

 Hey Sam,

 The running times are not big O estimates:

  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.

 I think there is something wrong with the netlib/cublas combination.
 Sam already mentioned that cuBLAS doesn't implement the CPU BLAS 
 interfaces. I checked the CUDA doc and it seems that to use GPU BLAS 
 through the CPU BLAS interface we need to use NVBLAS, which 
 intercepts some Level 3 CPU BLAS calls (including GEMM). So we need 
 to load nvblas.so first and then some CPU BLAS library in JNI. I 
 wonder whether the setup was correct.

 Alexander, could you check whether GPU is used in the netlib-cublas 
 experiments? You can tell it by watching CPU/GPU usage.

 Best,
 Xiangrui

 On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
 sam.halli...@gmail.com
 wrote:
  Don't use big O estimates, always measure. It used to work back 
  in the days when double multiplication was a bottleneck. The 
  computation cost is effectively free on both the CPU and GPU and 
  you're seeing pure copying costs. Also, I'm dubious that cublas is 
  doing what you think it is. Can you link me to the source code for 
  DGEMM?
 
  I show all of this in my talk, with explanations, I can't stress 
  enough how much I recommend that you watch it if you want to 
  understand high performance hardware acceleration for linear 
  algebra :-)
 
  On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:
 
  The copying overhead should be quadratic on n, while the 
  computation cost is cubic on n. I can understand that 
  netlib-cublas is slower than netlib-openblas on small problems. 
   But I'm surprised to see that it is still 20x slower on 
   10000x10000. I did the following on a g2.2xlarge instance with BIDMat:
  
   val n = 10000
 
  val f = rand(n, n)
  flip; f*f; val rf = flop
 
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val 
  rg = flop
 
  flip; g*g; val rgg = flop
 
  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.
 
  I'm not sure whether my CPU-GPU-CPU code simulates the 
  netlib-cublas path. But based on the result, the data copying 
   overhead is definitely not as big as 20x at n = 10000.
 
  Best,
  Xiangrui
 
 
  On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
  sam.halli...@gmail.com
  wrote:
   I've had some email exchanges with the author of BIDMat: it does 
   exactly what you need to get the GPU benefit and writes higher 
   level algorithms entirely in the GPU kernels so that the memory 
   stays there as long as possible. The restriction with this 
   approach is that it is only offering high-level algorithms so is 
   not a toolkit for applied mathematics research and development 
   --- but it works well as a toolkit for higher level analysis 
   (e.g. for analysts and practitioners).
  
   I believe BIDMat's approach is the best way to get performance 
   out of GPU hardware at the moment but I also have strong 
   evidence to suggest that the hardware will catch up and the 
   memory transfer costs between CPU/GPU will disappear meaning 
   that there will be no need for custom GPU kernel 
   implementations. i.e. please continue to use BLAS primitives 
   when writing new algorithms and only go to the GPU for an 
   alternative optimised implementation.
  
    Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, 
    and offer an API that looks like BLAS but takes pointers to special 
    regions in GPU memory.

Re: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Xiangrui Meng
 the
   potential to eliminate the green line.
  
   Best regards,
   Sam
  
  
  
   Ulanov, Alexander alexander.ula...@hp.com writes:
  
    Evan, thank you for the summary. I would like to add some more
    observations. The GPU that I used is 2.5 times cheaper than the CPU
    ($250 vs $100). They both are 3 years old. I also did a small test
    with modern hardware, and the new GPU nVidia Titan was slightly more
    than 1 order of magnitude faster than Intel E5-2650 v2 for the same
    tests. However, it costs as much as the CPU ($1200). My takeaway is
    that GPUs are making better price/value progress.
  
  
  
    Xiangrui, I was also surprised that BIDMat-cuda was faster than
    netlib-cuda, and the most reasonable explanation is that it holds
    the result in GPU memory, as Sam suggested. At the same time, it is
    OK because you can copy the result back from the GPU only when
    needed. However, to be sure, I am going to ask the developer of
    BIDMat at his upcoming talk.
  
  
  
   Best regards, Alexander
  
  
   From: Sam Halliday [mailto:sam.halli...@gmail.com]
   Sent: Thursday, February 26, 2015 1:56 PM
   To: Xiangrui Meng
   Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
   Sparks
   Subject: Re: Using CUDA within Spark / boosting linear algebra
  
  
   Btw, I wish people would stop cheating when comparing CPU and GPU
   timings for things like matrix multiply :-P
  
    Please always compare apples with apples and include the time it takes
    to set up the matrices, send them to the processing unit, do the
    calculation AND copy the results back to where you need to see them.
  
   Ignoring this method will make you believe that your GPU is
   thousands
   of times faster than it really is. Again, jump to the end of my talk
   for
    graphs and more discussion & especially the bit about me being
   keen on
   funding to investigate APU hardware further ;-) (I believe it will
   solve the
   problem)
   On 26 Feb 2015 21:16, Xiangrui Meng
    men...@gmail.com wrote:
   Hey Alexander,
  
   I don't quite understand the part where netlib-cublas is about 20x
   slower than netlib-openblas. What is the overhead of using a GPU
   BLAS
   with netlib-java?
  
   CC'ed Sam, the author of netlib-java.
  
   Best,
   Xiangrui
  
   On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
    jos...@databricks.com wrote:
   Better documentation for linking would be very helpful!  Here's a
   JIRA:
   https://issues.apache.org/jira/browse/SPARK-6019
  
  
   On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
    evan.spa...@gmail.com
   wrote:
  
   Thanks for compiling all the data and running these benchmarks,
   Alex.
   The
   big takeaways here can be seen with this chart:
  
  
  
    https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
  
   1) A properly configured GPU matrix multiply implementation (e.g.
   BIDMat+GPU) can provide substantial (but less than an order of
   magnitude)
   benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
   netlib-java+openblas-compiled).
   2) A poorly tuned CPU implementation can be 1-2 orders of
   magnitude
   worse
   than a well-tuned CPU implementation, particularly for larger
   matrices.
   (netlib-f2jblas or netlib-ref) This is not to pick on netlib -
   this
    basically agrees with the author's own benchmarks (
   https://github.com/fommil/netlib-java)
  
   I think that most of our users are in a situation where using GPUs
   may not
   be practical - although we could consider having a good GPU
   backend
   available as an option. However, *ALL* users of MLlib could
   benefit
   (potentially tremendously) from using a well-tuned CPU-based BLAS
   implementation. Perhaps we should consider updating the mllib
   guide
   with a
   more complete section for enabling high performance binaries on
   OSX
   and
   Linux? Or better, figure out a way for the system to fetch these
   automatically.
  
   - Evan
  
  
  
   On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
    alexander.ula...@hp.com wrote:
  
   Just to summarize this thread, I was finally able to make all
   performance
   comparisons that we discussed. It turns out that:
    BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
    netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
  
   Below is the link to the spreadsheet with full results.
  
  
  
   https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
  
   One thing still needs exploration: does BIDMat-cublas perform
   copying
   to/from machine’s RAM?
  
    -----Original Message-----
   From: Ulanov, Alexander
   Sent: Tuesday, February 10, 2015 2:12 PM
   To: Evan R. Sparks
   Cc: Joseph Bradley;
    dev@spark.apache.org

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
 26, 2015 1:56 PM
  To: Xiangrui Meng
  Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
  Sparks
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
 
  Btw, I wish people would stop cheating when comparing CPU and GPU
  timings for things like matrix multiply :-P
 
   Please always compare apples with apples and include the time it takes
   to set up the matrices, send them to the processing unit, do the
   calculation AND copy the results back to where you need to see them.
 
  Ignoring this method will make you believe that your GPU is thousands
  of times faster than it really is. Again, jump to the end of my talk for
   graphs and more discussion & especially the bit about me being keen on
  funding to investigate APU hardware further ;-) (I believe it will solve 
  the
  problem)
  On 26 Feb 2015 21:16, Xiangrui Meng
   men...@gmail.com wrote:
  Hey Alexander,
 
  I don't quite understand the part where netlib-cublas is about 20x
  slower than netlib-openblas. What is the overhead of using a GPU BLAS
  with netlib-java?
 
  CC'ed Sam, the author of netlib-java.
 
  Best,
  Xiangrui
 
  On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
   jos...@databricks.com wrote:
  Better documentation for linking would be very helpful!  Here's a
  JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
   evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks, Alex.
  The
  big takeaways here can be seen with this chart:
 
 
   https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
 
  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide substantial (but less than an order of
  magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
  worse
  than a well-tuned CPU implementation, particularly for larger
  matrices.
  (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
   basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs
  may not
  be practical - although we could consider having a good GPU backend
  available as an option. However, *ALL* users of MLlib could benefit
  (potentially tremendously) from using a well-tuned CPU-based BLAS
  implementation. Perhaps we should consider updating the mllib guide
  with a
  more complete section for enabling high performance binaries on OSX
  and
  Linux? Or better, figure out a way for the system to fetch these
  automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
   alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all
  performance
  comparisons that we discussed. It turns out that:
   BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
   netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
 
  Below is the link to the spreadsheet with full results.
 
 
  https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 
  One thing still needs exploration: does BIDMat-cublas perform
  copying
  to/from machine’s RAM?
 
   -----Original Message-----
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley;
   dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
   Thanks, Evan! It seems that the ticket was marked as a duplicate though
   the original one discusses a slightly different topic. I was able to link
  netlib
  with MKL from BIDMat binaries. Indeed, MKL is statically linked
  inside a
  60MB library.
 
  |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
  Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
 
  +---+
  |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
   |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
   |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
 
   It turns out that pre-compiled MKL is faster than pre-compiled
  OpenBlas on
  my machine. Probably, I’ll add two more columns with locally
  compiled
  openblas and cuda.
 
  Alexander
 
  From: Evan R. Sparks
   [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley;
   dev@spark.apache.org
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
  Great - perhaps we can move

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Sam Halliday
. I also did a small test with
 modern
   hardware, and the new GPU nVidia Titan was slightly more than 1
 order of
   magnitude faster than Intel E5-2650 v2 for the same tests. However,
 it costs
    as much as the CPU ($1200). My takeaway is that GPUs are making better
   price/value progress.
  
  
  
   Xiangrui, I was also surprised that BIDMat-cuda was faster than
   netlib-cuda and the most reasonable explanation is that it holds the
 result
   in GPU memory, as Sam suggested. At the same time, it is OK because
 you can
   copy the result back from GPU only when needed. However, to be sure,
 I am
  going to ask the developer of BIDMat at his upcoming talk.
  
  
  
   Best regards, Alexander
  
  
   From: Sam Halliday [mailto:sam.halli...@gmail.com]
   Sent: Thursday, February 26, 2015 1:56 PM
   To: Xiangrui Meng
   Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
   Sparks
   Subject: Re: Using CUDA within Spark / boosting linear algebra
  
  
   Btw, I wish people would stop cheating when comparing CPU and GPU
   timings for things like matrix multiply :-P
  
    Please always compare apples with apples and include the time it takes
    to set up the matrices, send them to the processing unit, do the
    calculation AND copy the results back to where you need to see them.
  
   Ignoring this method will make you believe that your GPU is thousands
   of times faster than it really is. Again, jump to the end of my talk
 for
    graphs and more discussion & especially the bit about me being
 keen on
   funding to investigate APU hardware further ;-) (I believe it will
 solve the
   problem)
   On 26 Feb 2015 21:16, Xiangrui Meng
    men...@gmail.com wrote:
   Hey Alexander,
  
   I don't quite understand the part where netlib-cublas is about 20x
   slower than netlib-openblas. What is the overhead of using a GPU BLAS
   with netlib-java?
  
   CC'ed Sam, the author of netlib-java.
  
   Best,
   Xiangrui
  
   On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
    jos...@databricks.com wrote:
   Better documentation for linking would be very helpful!  Here's a
   JIRA:
   https://issues.apache.org/jira/browse/SPARK-6019
  
  
   On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
    evan.spa...@gmail.com
   wrote:
  
   Thanks for compiling all the data and running these benchmarks,
 Alex.
   The
   big takeaways here can be seen with this chart:
  
  
  
  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
  
   1) A properly configured GPU matrix multiply implementation (e.g.
   BIDMat+GPU) can provide substantial (but less than an order of
   magnitude)
   benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
   netlib-java+openblas-compiled).
   2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
   worse
   than a well-tuned CPU implementation, particularly for larger
   matrices.
   (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
    basically agrees with the author's own benchmarks (
   https://github.com/fommil/netlib-java)
  
   I think that most of our users are in a situation where using GPUs
   may not
   be practical - although we could consider having a good GPU backend
   available as an option. However, *ALL* users of MLlib could benefit
   (potentially tremendously) from using a well-tuned CPU-based BLAS
   implementation. Perhaps we should consider updating the mllib guide
   with a
   more complete section for enabling high performance binaries on OSX
   and
   Linux? Or better, figure out a way for the system to fetch these
   automatically.
  
   - Evan
  
  
  
   On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
    alexander.ula...@hp.com wrote:
  
   Just to summarize this thread, I was finally able to make all
   performance
   comparisons that we discussed. It turns out that:
    BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
    netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
  
   Below is the link to the spreadsheet with full results.
  
  
  
 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
  
   One thing still needs exploration: does BIDMat-cublas perform
   copying
   to/from machine’s RAM?
  
    -----Original Message-----
   From: Ulanov, Alexander
   Sent: Tuesday, February 10, 2015 2:12 PM
   To: Evan R. Sparks
   Cc: Joseph Bradley;
    dev@spark.apache.org
   Subject: RE: Using CUDA within Spark / boosting linear algebra
  
    Thanks, Evan! It seems that the ticket was marked as a duplicate though
    the original one discusses a slightly different topic. I was able to link
    netlib
   with MKL from BIDMat binaries. Indeed, MKL is statically linked
   inside a
   60MB library.
  
   |A*B  size | BIDMat MKL | Breeze+Netlib-MKL

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Don't use big O estimates, always measure. It used to work back in the
days when double multiplication was a bottleneck. The computation cost is
effectively free on both the CPU and GPU and you're seeing pure copying
costs. Also, I'm dubious that cublas is doing what you think it is. Can you
link me to the source code for DGEMM?

I show all of this in my talk, with explanations, I can't stress enough how
much I recommend that you watch it if you want to understand high
performance hardware acceleration for linear algebra :-)
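Sam's point about pure copying costs can be sanity-checked with a back-of-the-envelope sketch in the REPL (the bandwidth and throughput figures below are assumed round numbers, not measurements):

```scala
// For C = A*B with dense doubles, PCIe traffic grows as n^2 while DGEMM
// work grows as n^3, so at n = 10000 the transfer is a small fraction.
val n = 10000L
val bytesMoved = 3 * 8 * n * n        // A and B to the device, C back
val flops = 2 * n * n * n             // multiply-adds in a dense GEMM
// Assumed: ~6 GB/s effective PCIe bandwidth, ~1 TFLOP/s sustained DGEMM.
val copySeconds = bytesMoved / 6e9    // ~0.4 s
val computeSeconds = flops / 1e12     // ~2.0 s
println(f"copy ~ $copySeconds%.2f s, compute ~ $computeSeconds%.2f s")
```

Under these assumptions the copies add well under a second to roughly two seconds of compute, consistent with the 2.2 s versus 1.7 s figures quoted below and nowhere near a 20x penalty.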
On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:

 The copying overhead should be quadratic on n, while the computation
 cost is cubic on n. I can understand that netlib-cublas is slower than
 netlib-openblas on small problems. But I'm surprised to see that it is
 still 20x slower on 10000x10000. I did the following on a g2.2xlarge
 instance with BIDMat:

 val n = 10000

 val f = rand(n, n)
 flip; f*f; val rf = flop

 flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

 flip; g*g; val rgg = flop

 The CPU version finished in 12 seconds.
 The CPU-GPU-CPU version finished in 2.2 seconds.
 The GPU version finished in 1.7 seconds.

 I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
 path. But based on the result, the data copying overhead is definitely
 not as big as 20x at n = 10000.

 Best,
 Xiangrui


 On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com
 wrote:
  I've had some email exchanges with the author of BIDMat: it does exactly
  what you need to get the GPU benefit and writes higher level algorithms
  entirely in the GPU kernels so that the memory stays there as long as
  possible. The restriction with this approach is that it is only offering
  high-level algorithms so is not a toolkit for applied mathematics
  research and development --- but it works well as a toolkit for higher
  level analysis (e.g. for analysts and practitioners).
 
  I believe BIDMat's approach is the best way to get performance out of
  GPU hardware at the moment but I also have strong evidence to suggest
  that the hardware will catch up and the memory transfer costs between
  CPU/GPU will disappear meaning that there will be no need for custom GPU
  kernel implementations. i.e. please continue to use BLAS primitives when
  writing new algorithms and only go to the GPU for an alternative
  optimised implementation.
 
  Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
  an API that looks like BLAS but takes pointers to special regions in the
  GPU memory region. Somebody has written a wrapper around CUDA to create
  a proper BLAS library but it only gives marginal performance over the
  CPU because of the memory transfer overhead.
 
  This slide from my talk
 
http://fommil.github.io/scalax14/#/11/2
 
  says it all. X axis is matrix size, Y axis is logarithmic time to do
  DGEMM. Black line is the cheating time for the GPU and the green line
  is after copying the memory to/from the GPU memory. APUs have the
  potential to eliminate the green line.
 
  Best regards,
  Sam
 
 
 
  Ulanov, Alexander alexander.ula...@hp.com writes:
 
  Evan, thank you for the summary. I would like to add some more
 observations. The GPU that I used is 2.5 times cheaper than the CPU ($250
 vs $100). They both are 3 years old. I also did a small test with modern
 hardware, and the new GPU nVidia Titan was slightly more than 1 order of
 magnitude faster than Intel E5-2650 v2 for the same tests. However, it
 costs as much as the CPU ($1200). My takeaway is that GPUs are making better
 price/value progress.
 
 
 
  Xiangrui, I was also surprised that BIDMat-cuda was faster than
 netlib-cuda and the most reasonable explanation is that it holds the result
 in GPU memory, as Sam suggested. At the same time, it is OK because you can
 copy the result back from GPU only when needed. However, to be sure, I am
 going to ask the developer of BIDMat at his upcoming talk.
 
 
 
  Best regards, Alexander
 
 
  From: Sam Halliday [mailto:sam.halli...@gmail.com]
  Sent: Thursday, February 26, 2015 1:56 PM
  To: Xiangrui Meng
  Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
 Sparks
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
 
  Btw, I wish people would stop cheating when comparing CPU and GPU
 timings for things like matrix multiply :-P
 
   Please always compare apples with apples and include the time it takes
  to set up the matrices, send them to the processing unit, do the
  calculation AND copy the results back to where you need to see them.
 
  Ignoring this method will make you believe that your GPU is thousands
 of times faster than it really is. Again, jump to the end of my talk for
  graphs and more discussion & especially the bit about me being keen on
 funding to investigate APU hardware further ;-) (I believe it will solve
 the problem)
  On 26 Feb 2015 21:16, Xiangrui Meng men

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Btw, I wish people would stop cheating when comparing CPU and GPU timings
for things like matrix multiply :-P

Please always compare apples with apples and include the time it takes to
set up the matrices, send them to the processing unit, do the calculation
AND copy the results back to where you need to see them.

Ignoring this method will make you believe that your GPU is thousands of
times faster than it really is. Again, jump to the end of my talk for
graphs and more discussion & especially the bit about me being keen on
funding to investigate APU hardware further ;-) (I believe it will solve
the problem)
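In REPL terms, a fair harness times the whole round trip as one unit. A minimal sketch (the `toDevice`/`fromDevice` helpers are hypothetical stand-ins that merely copy arrays; a real benchmark would put cuBLAS/OpenCL transfer calls there):

```scala
// Time a block and return its result plus elapsed seconds.
def time[A](body: => A): (A, Double) = {
  val t0 = System.nanoTime()
  val r = body
  (r, (System.nanoTime() - t0) / 1e9)
}

// Hypothetical transfer stand-ins: plain copies here, real host<->device
// transfers in an actual GPU benchmark.
def toDevice(a: Array[Double]): Array[Double] = a.clone()
def fromDevice(a: Array[Double]): Array[Double] = a.clone()

val host = Array.tabulate(1000 * 1000)(_.toDouble)
// The honest number includes transfer in, the kernel, and transfer out:
val (result, total) = time {
  val dev = toDevice(host)
  val out = dev.map(_ * 2.0)      // kernel stand-in
  fromDevice(out)
}
println(f"end-to-end: $total%.4f s")
```

Reporting only the kernel step would be the "cheating" number; `total` is the one to compare against a CPU baseline.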
On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com
 wrote:
  Better documentation for linking would be very helpful!  Here's a JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks, Alex.
 The
  big takeaways here can be seen with this chart:
 
 
  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
 
  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide substantial (but less than an order of
 magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
 worse
  than a well-tuned CPU implementation, particularly for larger matrices.
  (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
   basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs may
 not
  be practical - although we could consider having a good GPU backend
  available as an option. However, *ALL* users of MLlib could benefit
  (potentially tremendously) from using a well-tuned CPU-based BLAS
  implementation. Perhaps we should consider updating the mllib guide
 with a
  more complete section for enabling high performance binaries on OSX and
  Linux? Or better, figure out a way for the system to fetch these
  automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all
 performance
  comparisons that we discussed. It turns out that:
   BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
   netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
 
  Below is the link to the spreadsheet with full results.
 
 
 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 
  One thing still needs exploration: does BIDMat-cublas perform copying
  to/from machine’s RAM?
 
   -----Original Message-----
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
   Thanks, Evan! It seems that the ticket was marked as a duplicate though the
   original one discusses a slightly different topic. I was able to link
 netlib
  with MKL from BIDMat binaries. Indeed, MKL is statically linked inside
 a
  60MB library.
 
  |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
  Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
 
 +---+
  |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
   |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
   |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
 
   It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas
 on
  my machine. Probably, I’ll add two more columns with locally compiled
  openblas and cuda.
 
  Alexander
 
  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
  Great - perhaps we can move this discussion off-list and onto a JIRA
  ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
 
  It seems like this is going to be somewhat exploratory for a while (and
  there's probably only a handful of us who really care about fast linear
  algebra!)
 
  - Evan
 
  On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
   alexander.ula...@hp.com wrote:
  Hi Evan,
 
  Thank you for explanation and useful link

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks
I couldn't agree with you more, Sam. The GPU/Matrix guys typically don't
count their copy times, but claim that you should be doing *as much as
possible* on the GPU - so, maybe for some applications where you can
generate the data on the GPU this makes sense. But, in the context of Spark
we should be *very* careful about enumerating the applications we want GPU
support for and deciding whether it's appropriate to measure the overheads
of getting the data to the GPU.

On Thu, Feb 26, 2015 at 1:55 PM, Sam Halliday sam.halli...@gmail.com
wrote:

 Btw, I wish people would stop cheating when comparing CPU and GPU timings
 for things like matrix multiply :-P

  Please always compare apples with apples and include the time it takes to
  set up the matrices, send them to the processing unit, do the calculation
  AND copy the results back to where you need to see them.

 Ignoring this method will make you believe that your GPU is thousands of
 times faster than it really is. Again, jump to the end of my talk for
  graphs and more discussion & especially the bit about me being keen on
 funding to investigate APU hardware further ;-) (I believe it will solve
 the problem)
 On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote:

 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com
 wrote:
  Better documentation for linking would be very helpful!  Here's a JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks, Alex.
 The
  big takeaways here can be seen with this chart:
 
 
  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
 
  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide substantial (but less than an order of
 magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
 worse
  than a well-tuned CPU implementation, particularly for larger matrices.
  (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
   basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs may
 not
  be practical - although we could consider having a good GPU backend
  available as an option. However, *ALL* users of MLlib could benefit
  (potentially tremendously) from using a well-tuned CPU-based BLAS
  implementation. Perhaps we should consider updating the mllib guide
 with a
  more complete section for enabling high performance binaries on OSX and
  Linux? Or better, figure out a way for the system to fetch these
  automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all
 performance
  comparisons that we discussed. It turns out that:
  BIDMat-cublasBIDMat
 
 MKL==netlib-mkl==netlib-openblas-compilednetlib-openblas-yum-repo==netlib-cublasnetlib-blasf2jblas
 
  Below is the link to the spreadsheet with full results.
 
 
 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
 
  One thing still needs exploration: does BIDMat-cublas perform copying
  to/from machine’s RAM?
 
  -Original Message-
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
  Thanks, Evan! It seems that ticket was marked as duplicate though the
  original one discusses slightly different topic. I was able to link
 netlib
  with MKL from BIDMat binaries. Indeed, MKL is statically linked
 inside a
  60MB library.
 
  |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
  Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
 
 +---+
  |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
  |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
  |1,638475459 |
  |1x1*1x1 | 23,78046632 | 32,94546697 |445,0935211 |
  1569,233228 |
 
  It turn out that pre-compiled MKL is faster than precompiled OpenBlas
 on
  my machine. Probably, I’ll add two more columns with locally compiled
  openblas and cuda.
 
  Alexander
 
  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
Hey Alexander,

I don't quite understand the part where netlib-cublas is about 20x
slower than netlib-openblas. What is the overhead of using a GPU BLAS
with netlib-java?

CC'ed Sam, the author of netlib-java.

Best,
Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:
 Better documentation for linking would be very helpful!  Here's a JIRA:
 https://issues.apache.org/jira/browse/SPARK-6019


 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Thanks for compiling all the data and running these benchmarks, Alex. The
 big takeaways here can be seen with this chart:

 https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119format=interactive

 1) A properly configured GPU matrix multiply implementation (e.g.
 BIDMat+GPU) can provide substantial (but less than an order of magnitude)
 benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
 netlib-java+openblas-compiled).
 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
 than a well-tuned CPU implementation, particularly for larger matrices.
 (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
  basically agrees with the author's own benchmarks (
 https://github.com/fommil/netlib-java)

 I think that most of our users are in a situation where using GPUs may not
 be practical - although we could consider having a good GPU backend
 available as an option. However, *ALL* users of MLlib could benefit
 (potentially tremendously) from using a well-tuned CPU-based BLAS
 implementation. Perhaps we should consider updating the mllib guide with a
 more complete section for enabling high performance binaries on OSX and
 Linux? Or better, figure out a way for the system to fetch these
 automatically.

 - Evan



 On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

 Just to summarize this thread, I was finally able to make all performance
 comparisons that we discussed. It turns out that:
  BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform copying
 to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

  Thanks, Evan! It seems that ticket was marked as a duplicate, though the
  original one discusses a slightly different topic. I was able to link netlib
  with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
  60MB library.

  |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
  |100x100*100x100         | 0.00205596  | 0.000381    | 0.03810324  | 0.002556    |
  |1000x1000*1000x1000     | 0.018320947 | 0.038316857 | 0.51803557  | 1.638475459 |
  |10000x10000*10000x10000 | 23.78046632 | 32.94546697 | 445.0935211 | 1569.233228 |

  It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
 my machine. Probably, I’ll add two more columns with locally compiled
 openblas and cuda.

 Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Monday, February 09, 2015 6:06 PM
 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Great - perhaps we can move this discussion off-list and onto a JIRA
 ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

 It seems like this is going to be somewhat exploratory for a while (and
 there's probably only a handful of us who really care about fast linear
 algebra!)

 - Evan

 On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 Hi Evan,

  Thank you for the explanation and the useful link. I am going to build OpenBLAS,
  link it with Netlib-java, and run the benchmarks again.

  Do I understand correctly that the BIDMat binaries contain statically linked
  Intel MKL BLAS? That might be why I am able to run BIDMat without having
  MKL BLAS installed on my server. If so, I wonder whether that is OK, given
  that Intel sells this library. Nevertheless, it seems that in my case the
  precompiled MKL BLAS performs better than the precompiled OpenBLAS, given that
  BIDMat and Netlib-java are supposed to be on par in JNI overhead.

  Though, it might be interesting to link Netlib-java with Intel MKL, as
  you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
  (Netlib-java) would be interested in comparing their libraries.

 Best regards, Alexander

 From: Evan R. Sparks [mailto:evan.spa

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Hi all,

I'm not surprised that the GPU is slow: the bottleneck is copying the
memory between host and device. Watch my talk, linked from the netlib-java
github page, to understand further. The only way to make good use of a GPU
at present is to do all the operations inside GPU kernels. You can find some
prepackaged high-level algorithms that do this, but it's extremely limiting.

I believe hardware will fix this problem eventually, so I still advocate
using the netlib primitives. I'm particularly interested in APU approaches
and I'm very interested in finding somebody to fund me to look into it.
It's too much work for a side project.

Look on the last few slides of my talk to see the potential performance
gains.

Best regards, Sam
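
Sam's apples-to-apples point can be sketched as a timing harness that measures the whole round trip, not just the kernel. This is a sketch, not anything from the thread: the "transfers" below are simulated with host-side array copies standing in for cudaMemcpy-style moves, and the kernel is a trivial daxpy.

```scala
// Time a whole pipeline, not just the kernel.
def timeSec[A](body: => A): (A, Double) = {
  val t0 = System.nanoTime()
  val r = body
  (r, (System.nanoTime() - t0) / 1e9)
}

// A trivial "kernel": y := a*x + y (daxpy), mutating and returning y.
def daxpy(a: Double, x: Array[Double], y: Array[Double]): Array[Double] = {
  var i = 0
  while (i < x.length) { y(i) += a * x(i); i += 1 }
  y
}

val n = 1 << 20
val x = Array.fill(n)(1.0)
val y = Array.fill(n)(2.0)

// "Cheating" timing: kernel only, data already in place.
val (_, kernelOnly) = timeSec(daxpy(3.0, x, y))

// Honest timing: copy in, compute, copy the result back out.
val (_, endToEnd) = timeSec {
  val dx = x.clone(); val dy = y.clone() // host -> device (simulated)
  val dz = daxpy(3.0, dx, dy)            // kernel
  dz.clone()                             // device -> host (simulated)
}
println(f"kernel-only: $kernelOnly%.4f s, end-to-end: $endToEnd%.4f s")
```

For a memory-bound kernel like daxpy, the copies typically cost more than the compute, which is exactly why element-wise GPU offload rarely pays off.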

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Typo - the CPU was 2.5 times cheaper (not the GPU!)


RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
I've had some email exchanges with the author of BIDMat: it does exactly
what you need to get the GPU benefit and writes higher level algorithms
entirely in the GPU kernels so that the memory stays there as long as
possible. The restriction of this approach is that it only offers
high-level algorithms, so it is not a toolkit for applied-mathematics
research and development --- but it works well as a toolkit for higher-level
analysis (e.g. for analysts and practitioners).

I believe BIDMat's approach is the best way to get performance out of
GPU hardware at the moment, but I also have strong evidence to suggest
that the hardware will catch up and the memory-transfer costs between
CPU and GPU will disappear, meaning there will be no need for custom GPU
kernel implementations. I.e. please continue to use BLAS primitives when
writing new algorithms, and only go to the GPU for an alternative
optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
an API that looks like BLAS but takes pointers to special regions in the
GPU memory region. Somebody has written a wrapper around CUDA to create
a proper BLAS library but it only gives marginal performance over the
CPU because of the memory transfer overhead.

This slide from my talk

  http://fommil.github.io/scalax14/#/11/2

says it all. The X axis is matrix size, the Y axis is (logarithmic) time to
do DGEMM. The black line is the "cheating" time for the GPU and the green
line is after copying the memory to/from the GPU. APUs have the
potential to eliminate the green line.
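
The shape of those two lines follows from a back-of-the-envelope cost model. The bandwidth and throughput figures below are illustrative assumptions for this sketch, not measurements from the thread:

```scala
// Illustrative, assumed figures: ~6 GB/s effective host<->device
// bandwidth and ~1 TFLOP/s double-precision GPU DGEMM throughput.
val busBytesPerSec = 6e9
val gpuFlopsPerSec = 1e12

// n x n DGEMM costs 2n^3 flops; shipping A and B in and C back out
// moves 3 * n^2 * 8 bytes across the bus.
def computeSec(n: Long): Double = 2.0 * n * n * n / gpuFlopsPerSec
def copySec(n: Long): Double = 3.0 * 8.0 * n * n / busBytesPerSec

for (n <- Seq(100L, 1000L, 10000L)) {
  val total = copySec(n) + computeSec(n)
  println(f"n=$n%6d  kernel-only=${computeSec(n)}%.2e s  with-copies=$total%.2e s")
}
// For small n the copy dominates (the gap between the black and green
// lines); for large n the O(n^3) kernel amortises the O(n^2) transfer.
```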

Best regards,
Sam


RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Evan, thank you for the summary. I would like to add some more observations. 
The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both 
are 3 years old. I also did a small test with modern hardware, and the new 
GPU, nVidia Titan, was slightly more than 1 order of magnitude faster than an 
Intel E5-2650 v2 for the same tests. However, it costs as much as the CPU 
($1200). My takeaway is that GPUs are making better price/performance progress.



Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and 
the most reasonable explanation is that it holds the result in GPU memory, as 
Sam suggested. At the same time, that is OK, because you can copy the result 
back from the GPU only when needed. To be sure, though, I am going to ask the 
developer of BIDMat at his upcoming talk.



Best regards, Alexander


From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, February 26, 2015 1:56 PM
To: Xiangrui Meng
Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra


Btw, I wish people would stop cheating when comparing CPU and GPU timings for 
things like matrix multiply :-P

Please always compare apples with apples and include the time it takes to set 
up the matrices, send it to the processing unit, doing the calculation AND 
copying it back to where you need to see the results.

Ignoring this method will make you believe that your GPU is thousands of times 
faster than it really is. Again, jump to the end of my talk for graphs and more 
discussion -- especially the bit about me being keen on funding to 
investigate APU hardware further ;-) (I believe it will solve the problem)

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
The copying overhead should be quadratic in n, while the computation
cost is cubic in n. I can understand that netlib-cublas is slower than
netlib-openblas on small problems. But I'm surprised to see that it is
still 20x slower on 10000x10000. I did the following on a g2.2xlarge
instance with BIDMat:

val n = 10000

// Pure-CPU multiply (flip/flop are BIDMat's timing helpers).
val f = rand(n, n)
flip; f*f; val rf = flop

// CPU -> GPU copy, multiply on the GPU, copy the result back.
flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

// GPU multiply only, with the data already resident on the device.
flip; g*g; val rgg = flop

The CPU version finished in 12 seconds.
The CPU-GPU-CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
path. But based on the result, the data copying overhead is definitely
not as big as 20x at n = 10000.
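
The quadratic-vs-cubic argument can be made concrete with simple arithmetic (a sketch; the actual bus traffic of netlib-cublas depends on how the wrapper stages its copies):

```scala
// For n x n DGEMM, bytes over the bus grow as O(n^2) while flops
// grow as O(n^3), so bytes-per-flop falls off like 12/n.
def bytesMoved(n: Long): Double = 3.0 * 8.0 * n * n // A, B in; C out
def flops(n: Long): Double = 2.0 * n * n * n
def bytesPerFlop(n: Long): Double = bytesMoved(n) / flops(n)

for (n <- Seq(100L, 1000L, 10000L))
  println(f"n=$n%6d  bytes/flop=${bytesPerFlop(n)}%.4f")
// At n = 10000 only ~0.0012 bytes cross the bus per flop, so a
// fixed-bandwidth transfer alone cannot explain a 20x slowdown.
```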

Best,
Xiangrui



Re: Using CUDA within Spark / boosting linear algebra

2015-02-25 Thread Evan R. Sparks
Thanks for compiling all the data and running these benchmarks, Alex. The
big takeaways here can be seen with this chart:
https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g.
BIDMat+GPU) can provide substantial (but less than an order of magnitude)
benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
netlib-java+openblas-compiled).
2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can be 1-2
orders of magnitude worse than a well-tuned CPU implementation, particularly
for larger matrices. This is not to pick on netlib - this basically agrees
with the author's own benchmarks (https://github.com/fommil/netlib-java)

I think that most of our users are in a situation where using GPUs may not
be practical - although we could consider having a good GPU backend
available as an option. However, *ALL* users of MLlib could benefit
(potentially tremendously) from using a well-tuned CPU-based BLAS
implementation. Perhaps we should consider updating the mllib guide with a
more complete section for enabling high performance binaries on OSX and
Linux? Or better, figure out a way for the system to fetch these
automatically.

- Evan
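
The gap in point 2 between poorly and well tuned CPU implementations comes largely from memory locality. Here is a hedged illustration in plain Java (this is *not* how OpenBLAS is implemented, which adds blocking, SIMD and threading on top): merely swapping the loop order of a naive multiply from cache-hostile to cache-friendly typically changes performance noticeably while computing the same result. Single-shot timing with JIT warmup ignored; illustrative only.

```java
// Same O(n^3) matrix multiply, two loop orders, very different locality.
public class Gemm {
    // Naive i-j-k order: the inner loop reads b with stride n (cache-hostile).
    static double[] multIJK(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int k = 0; k < n; k++) s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
        return c;
    }
    // i-k-j order: all inner-loop accesses are sequential (cache-friendly).
    static double[] multIKJ(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) c[i * n + j] += aik * b[k * n + j];
            }
        return c;
    }
    public static void main(String[] args) {
        int n = 512;
        double[] a = new double[n * n], b = new double[n * n];
        for (int i = 0; i < n * n; i++) { a[i] = i % 5; b[i] = i % 7; }
        long t0 = System.nanoTime(); double[] c1 = multIJK(a, b, n);
        long t1 = System.nanoTime(); double[] c2 = multIKJ(a, b, n);
        long t2 = System.nanoTime();
        if (c1[0] != c2[0]) throw new AssertionError("results differ");
        System.out.printf("ijk: %.3f s, ikj: %.3f s%n",
                (t1 - t0) / 1e9, (t2 - t1) / 1e9);
    }
}
```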



On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Just to summarize this thread, I was finally able to make all the performance
 comparisons that we discussed. It turns out that:
 BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas >> f2jblas

 Below is the link to the spreadsheet with full results.

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 One thing still needs exploration: does BIDMat-cublas perform copying
 to/from machine’s RAM?

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, February 10, 2015 2:12 PM
 To: Evan R. Sparks
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Thanks, Evan! It seems that ticket was marked as a duplicate, though the
 original one discusses a slightly different topic. I was able to link netlib
 with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
 60MB library.

 |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
 |100x100*100x100         | 0,00205596  | 0,000381    | 0,03810324  | 0,002556    |
 |1000x1000*1000x1000     | 0,018320947 | 0,038316857 | 0,51803557  | 1,638475459 |
 |10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |

 It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on
 my machine. Probably I’ll add two more columns with locally compiled
 openblas and cuda.

 Alexander
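
The times in the table above can be converted into throughput, since a dense n x n multiply costs about 2n^3 floating point operations. A small sketch (the times are copied from the table with decimal commas read as dots, and the largest size is assumed to be 10000x10000; both assumptions should be checked against the spreadsheet):

```java
// Convert measured GEMM times into GFLOPS using the ~2*n^3 flop count.
public class Gflops {
    static double gflops(int n, double seconds) {
        return 2.0 * n * n * n / seconds / 1e9;
    }
    public static void main(String[] args) {
        System.out.printf("BIDMat MKL, n=1000:  %.1f GFLOPS%n", gflops(1000, 0.018320947));
        System.out.printf("f2jblas,    n=1000:  %.1f GFLOPS%n", gflops(1000, 1.638475459));
        System.out.printf("BIDMat MKL, n=10000: %.1f GFLOPS%n", gflops(10000, 23.78046632));
    }
}
```

The two-orders-of-magnitude spread between MKL and f2jblas at n=1000 is visible directly in these numbers.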

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Monday, February 09, 2015 6:06 PM
 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Great - perhaps we can move this discussion off-list and onto a JIRA
 ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

 It seems like this is going to be somewhat exploratory for a while (and
 there's probably only a handful of us who really care about fast linear
 algebra!)

 - Evan

 On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
 Hi Evan,

 Thank you for explanation and useful link. I am going to build OpenBLAS,
 link it with Netlib-java and perform benchmark again.

 Do I understand correctly that BIDMat binaries contain statically linked
 Intel MKL BLAS? That might be the reason why I am able to run BIDMat without
 MKL BLAS installed on my server. If so, I wonder whether it is OK, because
 Intel sells this library. Nevertheless, it seems that in my case precompiled
 MKL BLAS performs better than precompiled OpenBLAS, given that BIDMat and
 Netlib-java are supposed to be on par in JNI overheads.

 Though, it might be interesting to link Netlib-java with Intel MKL, as you
 suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
 (Netlib-java) would be interested in comparing their libraries.

 Best regards, Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Friday, February 06, 2015 5:58 PM

 To: Ulanov, Alexander
 Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I would build OpenBLAS yourself, since good BLAS performance comes from
 getting cache sizes, etc. set up correctly for your particular hardware -
 this is often a very tricky process (see, e.g., ATLAS), but we found that on
 relatively modern Xeon chips, OpenBLAS builds quickly and yields performance
 competitive with MKL.

RE: Using CUDA within Spark / boosting linear algebra

2015-02-12 Thread Ulanov, Alexander
From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:58 PM

To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I would build OpenBLAS yourself, since good BLAS performance comes from getting 
cache sizes, etc. set up correctly for your particular hardware - this is often 
a very tricky process (see, e.g. ATLAS), but we found that on relatively modern 
Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's 
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will 
do the trick here.

For some examples of getting netlib-java setup on an ec2 node and some example 
benchmarking code we ran a while back, see: 
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library and 
set up symlinks correctly, and scala/run-netlib.sh shows you how to get the 
path setup and get that library picked up by netlib-java.

In this way - you could probably get cuBLAS set up to be used by netlib-java as 
well.

- Evan
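
When checking that a self-built `libopenblas.so` really is picked up first, it can help to print the search paths the JVM actually sees. A small generic sketch (not specific to netlib-java; the paths on any given machine will of course differ):

```java
// Print the native-library search paths the JVM will use, to debug which
// BLAS shared library gets loaded.
public class BlasPath {
    public static void main(String[] args) {
        String jlp = System.getProperty("java.library.path", "");
        String ldp = System.getenv("LD_LIBRARY_PATH");
        System.out.println("java.library.path entries:");
        for (String dir : jlp.split(java.io.File.pathSeparator))
            System.out.println("  " + dir);
        System.out.println("LD_LIBRARY_PATH = " + (ldp == null ? "(unset)" : ldp));
    }
}
```

If the directory containing the tuned BLAS is not first in these lists, a system-wide reference BLAS may silently win.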

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Evan, could you elaborate on how to force BIDMat and netlib-java to load the 
right blas? For netlib, there are a few JVM flags, such as 
-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can 
force it to use the Java implementation. I am not sure how to force the use of 
a specific blas (not a specific wrapper for blas).

Btw. I have installed openblas (yum install openblas), so I suppose that netlib 
is using it.
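
The `-D` flag works because netlib-java selects its implementation class by name at runtime. A self-contained sketch of that general mechanism follows; the interface, class, and property name (`demo.blas`) here are illustrative stand-ins, not netlib-java's real internals:

```java
// How a -D system property can select an implementation class at runtime,
// mimicking -Dcom.github.fommil.netlib.BLAS=...F2jBLAS.
public class BlasSelector {
    public interface Blas { double ddot(double[] x, double[] y); }

    // Pure-Java fallback implementation of a dot product.
    public static class JavaBlas implements Blas {
        public double ddot(double[] x, double[] y) {
            double s = 0;
            for (int i = 0; i < x.length; i++) s += x[i] * y[i];
            return s;
        }
    }

    // Load the class named by -Ddemo.blas=..., falling back to JavaBlas.
    public static Blas getInstance() {
        String cls = System.getProperty("demo.blas", JavaBlas.class.getName());
        try {
            return (Blas) Class.forName(cls).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            System.err.println("falling back to pure-Java BLAS: " + e);
            return new JavaBlas();
        }
    }

    public static void main(String[] args) {
        Blas blas = getInstance(); // run with -Ddemo.blas=<impl class> to override
        System.out.println(blas.ddot(new double[]{1, 2, 3}, new double[]{4, 5, 6})); // 32.0
    }
}
```

This is why the flag controls only the *wrapper* class: which native `.so` that wrapper then binds to is decided separately by the library search path.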

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org

RE: Using CUDA within Spark / boosting linear algebra

2015-02-10 Thread Ulanov, Alexander
From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org

Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right blas library is critical for performance. I 
recommend using OpenBLAS (or MKL, if you already have it). It might make sense 
to force BIDMat to use the same underlying BLAS library as well.

Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Evan R. Sparks
 On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

 Hi Evan, Joseph

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster 
than netlib-java+breeze (sorry for the weird table formatting):

 |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
 |100x100*100x100         | 0,00205596  | 0,03810324  | 0,002556    |
 |1000x1000*1000x1000     | 0,018320947 | 0,51803557  | 1,638475459 |
 |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will make tests with Cuda. I need to install new Cuda version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Chester @work
Maybe you can ask Prof. John Canny himself :-) as I invited him to give a talk 
at Alpine Data Labs at March's meetup (SF Big Analytics & SF Machine Learning 
joint meetup), 3/11. To be announced in the next day or so.

Chester

Sent from my iPhone


RE: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Ulanov, Alexander
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting.  Small comment: Concerning your 
question earlier about keeping data stored on the GPU rather than having to 
move it between main memory and GPU memory on each iteration, I would guess 
this would be critical to getting good performance.  If you could do multiple 
local iterations before aggregating results, then the cost of data movement to 
the GPU could be amortized (and I believe that is done in practice).  Having 
Spark be aware of the GPU and using it as another part of memory sounds like a 
much bigger undertaking.

Joseph
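
Joseph's amortization argument can be put in numbers: with one host-to-GPU transfer per aggregation step and k local iterations, the effective per-iteration cost is (transfer + k * compute) / k, which tends to the pure compute cost as k grows. A sketch with purely illustrative (assumed, not measured) costs:

```java
// Back-of-envelope amortization of a host->GPU transfer over k local
// iterations. The cost numbers are illustrative assumptions, not measurements.
public class Amortize {
    // Effective seconds per iteration: one transfer amortized over k iterations.
    static double perIter(double transfer, double compute, int k) {
        return (transfer + k * compute) / k;
    }
    public static void main(String[] args) {
        double transfer = 0.10, compute = 0.02; // assumed costs in seconds
        for (int k : new int[]{1, 5, 25}) {
            System.out.printf("k=%d: %.4f s/iteration%n",
                    k, perIter(transfer, compute, k));
        }
        // k=1 -> 0.12, k=5 -> 0.04, k=25 -> 0.024: approaching compute (0.02).
    }
}
```

With these assumed costs, even 5 local iterations cut the transfer overhead per iteration by 5x, which is the point of batching work before aggregating.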


Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
I would build OpenBLAS yourself, since good BLAS performance comes from
getting cache sizes, etc. set up correctly for your particular hardware -
this is often a very tricky process (see, e.g. ATLAS), but we found that on
relatively modern Xeon chips, OpenBLAS builds quickly and yields
performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so
will do the trick here.

For some examples of getting netlib-java setup on an ec2 node and some
example benchmarking code we ran a while back, see:
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library
and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
the path setup and get that library picked up by netlib-java.

In this way - you could probably get cuBLAS set up to be used by
netlib-java as well.

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Evan, could you elaborate on how to force BIDMat and netlib-java to
 force loading the right blas? For netlib, I there are few JVM flags, such
 as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
 can force it to use Java implementation. Not sure I understand how to force
 use a specific blas (not specific wrapper for blas).



 Btw. I have installed openblas (yum install openblas), so I suppose that
 netlib is using it.



 *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
 *Sent:* Friday, February 06, 2015 5:19 PM
 *To:* Ulanov, Alexander
 *Cc:* Joseph Bradley; dev@spark.apache.org

 *Subject:* Re: Using CUDA within Spark / boosting linear algebra



 Getting breeze to pick up the right blas library is critical for
 performance. I recommend using OpenBLAS (or MKL, if you already have it).
 It might make sense to force BIDMat to use the same underlying BLAS library
 as well.



 On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

 Hi Evan, Joseph

 I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
 than netlib-java+breeze (sorry for the weird table formatting):

 |A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
 |---------------------|-------------|------------------------------------------------|----------------------------|
 |100x100*100x100     | 0,00205596  | 0,03810324   | 0,002556    |
 |1000x1000*1000x1000 | 0,018320947 | 0,51803557   | 1,638475459 |
 |1x1*1x1             | 23,78046632 | 445,0935211  | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will run tests with CUDA. I need to install a new CUDA version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander

 From: Joseph Bradley [mailto:jos...@databricks.com]
 Sent: Thursday, February 05, 2015 5:29 PM
 To: Ulanov, Alexander
 Cc: Evan R. Sparks; dev@spark.apache.org

 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Hi Alexander,

 Using GPUs with Spark would be very exciting.  Small comment: Concerning
 your question earlier about keeping data stored on the GPU rather than
 having to move it between main memory and GPU memory on each iteration, I
 would guess this would be critical to getting good performance.  If you
 could do multiple local iterations before aggregating results, then the
 cost of data movement to the GPU could be amortized (and I believe that is
 done in practice).  Having Spark be aware of the GPU and using it as
 another part of memory sounds like a much bigger undertaking.

 Joseph
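Joseph's amortization point can be made concrete with a toy cost model (a sketch; the millisecond figures below are hypothetical placeholders, not measurements): if k local iterations share one host-to-GPU round trip, the per-iteration cost approaches the pure compute cost as k grows.

```scala
// Toy model of amortizing host<->GPU transfer cost over k local iterations.
object TransferAmortization {
  /** Average cost of one iteration when k iterations share one transfer. */
  def perIterationCost(computeMs: Double, transferMs: Double, k: Int): Double =
    computeMs + transferMs / k

  def main(args: Array[String]): Unit = {
    val c = 5.0   // hypothetical GPU compute time per iteration, ms
    val t = 20.0  // hypothetical transfer time per aggregation, ms
    for (k <- Seq(1, 4, 16))
      println(f"k=$k%2d -> ${perIterationCost(c, t, k)}%.2f ms/iteration")
  }
}
```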

 On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Thank you for the explanation! I’ve watched the BIDMach presentation by John
 Canny and I am really inspired by his talk and comparisons with Spark MLlib.

 I am very interested to find out which will be better within Spark: BIDMat
 or netlib-java with CPU or GPU natives. Could you suggest a fair way to
 benchmark them? Currently I do benchmarks on artificial neural networks in
 batch mode. While it is not a “pure” test of linear algebra, it involves
 some other things that are essential to machine learning.

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 1:29 PM
 To: Ulanov, Alexander
 Cc: dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I'd be surprised if BIDMat+OpenBLAS was significantly faster than
 netlib-java+OpenBLAS, but if it is much faster it's probably due to data
 layout and fewer levels of indirection - it's definitely a worthwhile
 experiment to run. The main speedups I've seen from using it come from
 highly optimized GPU code for linear algebra. I know that in the past Canny
 has gone as far as to write custom GPU kernels for performance-critical
 regions of code.[1]

 BIDMach is highly

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
Getting breeze to pick up the right blas library is critical for
performance. I recommend using OpenBLAS (or MKL, if you already have it).
It might make sense to force BIDMat to use the same underlying BLAS library
as well.

RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
Evan, could you elaborate on how to force BIDMat and netlib-java to load the 
right blas? For netlib, there are a few JVM flags, such as 
-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can 
force it to use the Java implementation. I'm not sure I understand how to force 
the use of a specific blas (not a specific wrapper for blas).

Btw. I have installed openblas (yum install openblas), so I suppose that netlib 
is using it.


RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
Hi Evan, Joseph

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than 
netlib-java+breeze (sorry for the weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|---------------------|-------------|------------------------------------------------|----------------------------|
|100x100*100x100     | 0,00205596  | 0,03810324   | 0,002556    |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557   | 1,638475459 |
|1x1*1x1             | 23,78046632 | 445,0935211  | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, 
Scala 2.11.

Later I will run tests with CUDA. I need to install a new CUDA version for this 
purpose.

Do you have any ideas why breeze-netlib with native blas is so much slower than 
BIDMat MKL?

Best regards, Alexander
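For context, timings like those in the table convert to throughput via the 2n^3 flops of a square GEMM. Here is a derived back-of-the-envelope calculation for the 1000x1000 row (assuming the reported times are seconds with comma decimal separators):

```scala
// Convert GEMM wall-clock times to GFLOPS: 2*n^3 flops per n x n multiply.
object Gflops {
  def gflops(n: Long, seconds: Double): Double =
    2.0 * n * n * n / seconds / 1e9

  def main(args: Array[String]): Unit = {
    // 1000x1000 timings from the table above, commas read as decimal points.
    val timings = Seq(
      "BIDMat MKL"            -> 0.018320947,
      "Breeze+Netlib native"  -> 0.51803557,
      "Breeze+Netlib f2jblas" -> 1.638475459
    )
    for ((name, t) <- timings)
      println(f"$name%-22s ${gflops(1000, t)}%8.2f GFLOPS")
  }
}
```

On those numbers MKL works out to roughly 110 GFLOPS versus about 4 for the native netlib path on that row, so the ~10x overall gap is plausible.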


Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Nicholas Chammas
Lemme butt in randomly here and say there is an interesting discussion on
this Spark PR https://github.com/apache/spark/pull/4448 about
netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all
may find interesting. Among the participants is the author of netlib-java.


RE: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
Hi Evan,

Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know 
what makes it faster than netlib-java?

The same group has the BIDMach library, which implements machine learning. For 
some examples they use the Caffe convolutional neural network library, developed 
by another group at Berkeley. Could you elaborate on how these all might be 
connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t 
you take BIDMach for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many 
cases.

You might consider taking a look at the codepaths that BIDMat 
(https://github.com/BIDData/BIDMat) takes and comparing them to 
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to 
make this work really fast from Scala. I've run it on my laptop and compared to 
MKL and in certain cases it's 10x faster at matrix multiply. There are a lot of 
layers of indirection here and you really want to avoid data copying as much as 
possible.

We could also consider swapping Breeze out for BIDMat, but that would be a big 
project, and if we can figure out how to get breeze+cublas to comparable 
performance that would be a big win.

On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One 
way of doing this is to use Scala Breeze library that is bundled with Spark. 
For matrix operations, it employs Netlib-java that has a Java wrapper for BLAS 
(basic linear algebra subprograms) and LAPACK native binaries if they are 
available on the worker node. It also has its own optimized Java implementation 
of BLAS. It is worth mentioning that native binaries provide better 
performance only for BLAS level 3, i.e. matrix-matrix operations or general 
matrix multiplication (GEMM). This is confirmed by GEMM test on Netlib-java 
page https://github.com/fommil/netlib-java. I also confirmed it with my 
experiments with training of artificial neural network 
https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I 
would like to boost performance more.

GPU is supposed to work fast with linear algebra and there is Nvidia CUDA 
implementation of BLAS, called cublas. I have one Linux server with Nvidia GPU 
and I was able to do the following. I linked cublas (instead of cpu-based blas) 
with Netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it. 
Then I did some performance measurements with regards to artificial neural 
network batch learning in Spark MLlib that involves matrix-matrix 
multiplications. It turns out that for matrices of size less than ~1000x780 GPU 
cublas has the same speed as CPU blas. Cublas becomes slower for bigger 
matrices. It is worth mentioning that this was not a test of ONLY multiplication, 
since there are other operations involved. One of the reasons for the slowdown 
might be the overhead of copying the matrices from main memory to graphics 
card memory and back.

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is copy overhead, are there any libraries that allow 
intermediate results to stay in graphics card memory, thus removing the 
overhead?
3) Any other options to speed up linear algebra in Spark?

Thank you, Alexander
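On question 2, a rough sketch of the underlying arithmetic (the bandwidth and throughput numbers below are illustrative assumptions, not measurements): a square GEMM moves O(n^2) data over PCIe but performs 2n^3 flops on the device, so the transfer share of the total time shrinks roughly as 1/n, consistent with the crossover around ~1000x780 reported above.

```scala
// Rough transfer-vs-compute estimate for an n x n double GEMM offload.
object CopyOverhead {
  /** Host<->device transfer time: two inputs and one result, 8-byte doubles. */
  def transferSec(n: Long, gbPerSec: Double): Double =
    3.0 * n * n * 8 / (gbPerSec * 1e9)

  /** Kernel time for one GEMM at a given sustained throughput. */
  def computeSec(n: Long, gflops: Double): Double =
    2.0 * n * n * n / (gflops * 1e9)

  def main(args: Array[String]): Unit = {
    // Hypothetical figures: 6 GB/s effective PCIe bandwidth, 300 GFLOPS DGEMM.
    for (n <- Seq(500L, 1000L, 4000L)) {
      val ratio = transferSec(n, 6.0) / computeSec(n, 300.0)
      println(f"n=$n%5d: transfer/compute = $ratio%.2f")
    }
  }
}
```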




Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data
layout and fewer levels of indirection - it's definitely a worthwhile
experiment to run. The main speedups I've seen from using it come from
highly optimized GPU code for linear algebra. I know that in the past Canny
has gone as far as to write custom GPU kernels for performance-critical
regions of code.[1]

BIDMach is highly optimized for single node performance or performance on
small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
batched in that way) the performance tends to fall off. Canny argues for
hardware/software codesign and as such prefers machine configurations that
are quite different than what we find in most commodity cluster nodes -
e.g. 10 disk channels and 4 GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address
slightly different use cases. That said, there may be bits of BIDMach we
could repurpose for MLlib - keep in mind we need to be careful about
maintaining cross-language compatibility for our Java and Python-users,
though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

RE: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny 
and I am really inspired by his talk and the comparisons with Spark MLlib.

I am very interested to find out what will be better within Spark: BIDMat or 
netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark 
them? Currently I do benchmarks on artificial neural networks in batch mode. 
While it is not a “pure” test of linear algebra, it involves some other things 
that are essential to machine learning.
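For a comparison like this to be fair on the JVM, warm-up and repetition matter (JIT compilation and GC can easily skew a single timed run). A minimal harness sketch follows; the naive triple-loop GEMM is only a stand-in for whichever BLAS backend is under test, and all names here are illustrative, not from any of the libraries discussed:

```scala
// Hypothetical micro-benchmark harness: discards warm-up runs so JIT
// compilation does not skew results, then reports the median of timed runs.
object GemmBench {
  // Naive triple-loop GEMM (C += A * B) over flat row-major n x n arrays,
  // standing in for the backend being benchmarked.
  def gemm(n: Int, a: Array[Double], b: Array[Double], c: Array[Double]): Unit = {
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) {
          c(i * n + j) += aik * b(k * n + j)
          j += 1
        }
        k += 1
      }
      i += 1
    }
  }

  // Median wall-clock time in ms over `reps` timed runs after `warmup` runs.
  def time(reps: Int, warmup: Int)(body: => Unit): Double = {
    (0 until warmup).foreach(_ => body)
    val samples = (0 until reps).map { _ =>
      val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e6
    }.sorted
    samples(reps / 2)
  }

  def main(args: Array[String]): Unit = {
    val n = 256
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    val c = new Array[Double](n * n)
    val ms = time(reps = 5, warmup = 2) {
      java.util.Arrays.fill(c, 0.0); gemm(n, a, b, c)
    }
    println(f"naive GEMM, n=$n: $ms%.1f ms")
  }
}
```

The same `time` wrapper can then be applied to each backend (netlib-java, BIDMat, GPU-backed BLAS) on identical inputs, so only the kernel under test varies between runs.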

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than 
netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout 
and fewer levels of indirection - it's definitely a worthwhile experiment to 
run. The main speedups I've seen from using it come from highly optimized GPU 
code for linear algebra. I know that in the past Canny has gone as far as to 
write custom GPU kernels for performance-critical regions of code.[1]

BIDMach is highly optimized for single node performance or performance on small 
clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in 
that way) the performance tends to fall off. Canny argues for hardware/software 
codesign and as such prefers machine configurations that are quite different 
than what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 
GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity 
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address slightly 
different use cases. That said, there may be bits of BIDMach we could repurpose 
for MLlib - keep in mind we need to be careful about maintaining cross-language 
compatibility for our Java and Python users, though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Hi Evan,

Thank you for suggestion! BIDMat seems to have terrific speed. Do you know what 
makes them faster than netlib-java?

The same group has BIDMach library that implements machine learning. For some 
examples they use Caffe convolutional neural network library owned by another 
group in Berkeley. Could you elaborate on how these all might be connected with 
Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach 
for optimization and learning?

Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Joseph Bradley
Hi Alexander,

Using GPUs with Spark would be very exciting. A small comment: concerning
your question earlier about keeping data stored on the GPU rather than
having to move it between main memory and GPU memory on each iteration, I
would guess this would be critical to getting good performance.  If you
could do multiple local iterations before aggregating results, then the
cost of data movement to the GPU could be amortized (and I believe that is
done in practice).  Having Spark be aware of the GPU and using it as
another part of memory sounds like a much bigger undertaking.

Joseph
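
The amortization argument above can be sketched numerically. The sketch below is only a back-of-the-envelope model; the transfer and kernel times are hypothetical constants, not measurements from this thread:

```scala
// Model of amortizing GPU transfer cost: if each aggregation round requires
// one host->device->host round trip, doing k local iterations per round
// spreads that fixed copy cost over k iterations.
object AmortizeTransfer {
  // Total time for one round: one round-trip copy plus k kernel executions.
  def roundMs(transferMs: Double, kernelMs: Double, k: Int): Double =
    transferMs + k * kernelMs

  // Effective cost per iteration; approaches kernelMs as k grows.
  def perIterMs(transferMs: Double, kernelMs: Double, k: Int): Double =
    roundMs(transferMs, kernelMs, k) / k

  def main(args: Array[String]): Unit = {
    val (transfer, kernel) = (8.0, 2.0) // hypothetical: 8 ms copy, 2 ms kernel
    for (k <- Seq(1, 4, 16, 64))
      println(f"k=$k%3d local iterations -> ${perIterMs(transfer, kernel, k)}%.2f ms/iter")
  }
}
```

With these assumed constants, a single iteration per copy pays the full transfer price every time, while 64 local iterations push the per-iteration cost close to the kernel time alone, which is the intuition behind doing multiple local iterations before aggregating.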


Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
many cases.

You might consider taking a look at the codepaths that BIDMat (
https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
to make this work really fast from Scala. I've run it on my laptop and
compared to MKL and in certain cases it's 10x faster at matrix multiply.
There are a lot of layers of indirection here and you really want to avoid
data copying as much as possible.

We could also consider swapping out BIDMat for Breeze, but that would be a
big project and if we can figure out how to get breeze+cublas to comparable
performance that would be a big win.
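
The data-layout and indirection point can be illustrated in plain Scala: a nested `Array[Array[Double]]` GEMM chases a pointer per row access, while a flat row-major array with an i-k-j loop order streams memory sequentially. This is only an illustration of the general idea, not BIDMat's actual implementation:

```scala
// Two layouts for the same GEMM. The nested version indirects through one
// row pointer per access and walks b column-wise (cache-unfriendly); the
// flat version uses one contiguous row-major array and an i-k-j loop order
// so the inner loop streams sequentially through b and c.
object Layout {
  def gemmNested(n: Int, a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val c = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
      c(i)(j) += a(i)(k) * b(k)(j) // b(k)(j): strided column walk
    c
  }

  def gemmFlat(n: Int, a: Array[Double], b: Array[Double]): Array[Double] = {
    val c = new Array[Double](n * n)
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) { c(i * n + j) += aik * b(k * n + j); j += 1 } // sequential walk
        k += 1
      }
      i += 1
    }
    c
  }
}
```

Both compute identical results; the difference only shows up in memory-access behavior, which is one reason a layout-conscious library can beat a generic one before GPUs even enter the picture.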


Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One 
way of doing this is to use the Scala Breeze library that is bundled with Spark. 
For matrix operations, it employs Netlib-java, which wraps native BLAS (basic 
linear algebra subprograms) and LAPACK binaries if they are available on the 
worker node, and also ships its own optimized Java implementation of BLAS. It is 
worth mentioning that native binaries provide better performance only for BLAS 
level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). 
This is confirmed by the GEMM test on the Netlib-java page 
https://github.com/fommil/netlib-java. I also confirmed it in my experiments 
with training an artificial neural network 
https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I 
would like to boost performance further.

GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA 
implementation of BLAS called cuBLAS. I have a Linux server with an Nvidia GPU 
and was able to do the following: I linked cuBLAS (instead of a CPU-based BLAS) 
with the Netlib-java wrapper and put it into Spark, so Breeze/Netlib uses it. 
Then I did some performance measurements of artificial neural network batch 
learning in Spark MLlib, which involves matrix-matrix multiplications. It turns 
out that for matrices smaller than ~1000x780, GPU cuBLAS has the same speed as 
CPU BLAS. For bigger matrices, cuBLAS becomes slower. It is worth mentioning 
that this was not a test of ONLY multiplication, since other operations are 
involved. One of the reasons for the slowdown might be the overhead of copying 
the matrices from main memory to graphics card memory and back.
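
One way to gauge how much that copy overhead can matter: an n x n GEMM offloaded per call moves O(n^2) bytes over the bus but performs O(n^3) flops on the device. The sketch below models this trade-off; the bandwidth and throughput constants are hypothetical, not measurements from the experiments described here:

```scala
// Back-of-the-envelope model of per-call copy overhead for an offloaded GEMM.
// Constants are assumed values chosen for illustration only.
object CopyOverheadModel {
  val pcieGBs = 1.5     // assumed effective host<->device bandwidth, GB/s
  val gpuGflops = 300.0 // assumed sustained GPU GEMM throughput, GFLOP/s

  // Copy two n x n inputs and one n x n result: 3 * n^2 doubles (8 B each), in ms.
  def copyMs(n: Long): Double = 3.0 * n * n * 8 / (pcieGBs * 1e9) * 1e3

  // Kernel time for C = A * B: 2 * n^3 flops, in ms.
  def gemmMs(n: Long): Double = 2.0 * n * n * n / (gpuGflops * 1e9) * 1e3

  // True when the round-trip copy costs more than the GEMM itself, i.e. when
  // keeping intermediate results on the device would pay off the most.
  def copyDominates(n: Long): Boolean = copyMs(n) > gemmMs(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(256L, 512L, 1024L, 2048L, 4096L))
      println(f"n=$n%5d copy=${copyMs(n)}%8.2f ms  gemm=${gemmMs(n)}%8.2f ms  copyDominates=${copyDominates(n)}")
}
```

Under these assumed constants the copy dominates for small matrices and only fades for large ones, which is why libraries that keep intermediates resident in device memory (the subject of question 2 below) are the natural thing to look for.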

So, a few questions:
1) Do these results with CUDA make sense? 
2) If the problem is copy overhead, are there any libraries that allow forcing 
intermediate results to stay in graphics card memory, thus removing the 
overhead?
3) Any other options to speed up linear algebra in Spark?

Thank you, Alexander

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org