RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander Wed, 25 Mar 2015 15:30:05 -0700

Sure, I will write a how-to after I re-check the results.

-----Original Message-----
From: Sam Halliday [mailto:[email protected]] 
Sent: Wednesday, March 25, 2015 3:04 PM
To: Evan R. Sparks; [email protected]
Subject: Re: Using CUDA within Spark / boosting linear algebra


If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between cpu/GPU? I've a project called 
MultiBLAS which was going to do this, it should be easy (but boring to
write)
On 25 Mar 2015 22:00, "Evan R. Sparks" <[email protected]> wrote:

> Alex - great stuff, and the nvblas numbers are pretty remarkable 
> (almost too good... did you check the results for correctness? - also, 
> is it possible that the "unified memory model" of nvblas is somehow 
> hiding pci transfer time?)
>
> this last bit (getting nvblas + netlib-java to play together) sounds 
> like it's non-trivial and took you a while to figure out! Would you 
> mind posting a gist or something of maybe the shell scripts/exports 
> you used to make this work - I can imagine it being highly useful for others 
> in the future.
>
> Thanks!
> Evan
>
> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander < 
> [email protected]> wrote:
>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has 
>> exceptional performance for big matrices with Double, faster than 
>> BIDMat-cuda with Float. But for smaller matrices, if you have to copy
>> them to and from the GPU, OpenBLAS or MKL might be a better choice. This
>> agrees with the original nvblas presentation from GPU conf 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> To be clear, these tests are not meant to generalize about the
>> performance of the different libraries. I just want to pick the library
>> that is best at dense matrix multiplication for my task.
>>
>> P.S. My previous issue with nvblas was the following: it exposes the
>> Fortran blas functions, whereas netlib-java calls the C cblas functions.
>> So one needs a cblas shared library to use nvblas through netlib-java.
>> Fedora does not ship cblas (Debian and Ubuntu do), so I had
>> to compile it. I could not use the cblas from Atlas or OpenBLAS because
>> they link to their own implementation and not to the Fortran blas.
>>
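A minimal sketch of the configuration side of such a setup, assuming the keyword names from NVIDIA's nvblas documentation (NVBLAS_CPU_BLAS_LIB, NVBLAS_GPU_LIST, NVBLAS_LOGFILE) and the NVBLAS_CONFIG_FILE environment variable. The function name and every path below are hypothetical; they are not the ones used for the results in this thread.

```shell
#!/bin/sh
# Write a small nvblas.conf into DIR and print its path.
# CPU_BLAS_LIB is the CPU BLAS that nvblas uses for calls it does not offload.
write_nvblas_conf() {
    dir=$1
    cpu_blas_lib=$2
    conf="$dir/nvblas.conf"
    cat > "$conf" <<EOF
# CPU BLAS used for calls that stay on the CPU
NVBLAS_CPU_BLAS_LIB $cpu_blas_lib
# Use every GPU that nvblas can see
NVBLAS_GPU_LIST ALL
# Log file for nvblas messages
NVBLAS_LOGFILE $dir/nvblas.log
EOF
    printf '%s\n' "$conf"
}
```

Usage might look like `export NVBLAS_CONFIG_FILE=$(write_nvblas_conf /tmp /usr/lib64/libblas.so)` before starting the JVM with LD_PRELOAD; the library path there is only a placeholder.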
>> Best regards, Alexander
>>
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: [email protected]; Xiangrui Meng; Joseph Bradley; Evan R. 
>> Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. The nvblas
>> functions should replace the current blas function calls once the
>> library is loaded with LD_PRELOAD, as suggested in
>> http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to
>> netlib-java. It seems to work for a simple Java example, but I cannot
>> make it work with Spark. I run the following:
>>
>>   export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>   env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>
>> In nvidia-smi I observe that Java is set to use the GPU:
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                               Usage      |
>> |=============================================================================|
>> |    0      8873    C   bash                                            39MiB |
>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
>> +-----------------------------------------------------------------------------+
>>
>> In Spark shell I do matrix multiplication and see the following:
>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded 
>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>> So I am sure that netlib-native is loaded and cblas is supposedly used.
>> However, matrix multiplication still executes on the CPU, since I see
>> 16% of CPU used and 0% of GPU used. I also checked different matrix
>> sizes, from 100x100 to 12000x12000.
>>
>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>
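One way to narrow this down is to check whether the preloaded library is actually mapped into the JVM process. A minimal sketch, assuming a Linux /proc filesystem; the function names are hypothetical, and 8910 is simply the java PID from the nvidia-smi listing above.

```shell
#!/bin/sh
# Succeed if the memory-map listing in MAPS_FILE mentions LIB_NAME.
maps_has_lib() {
    maps_file=$1
    lib_name=$2
    grep -q -F -- "$lib_name" "$maps_file" 2>/dev/null
}

# Print a one-line verdict for process PID (Linux /proc layout assumed).
# A missing or unreadable maps file is reported as "NOT mapped".
report_preload() {
    pid=$1
    lib_name=$2
    if maps_has_lib "/proc/$pid/maps" "$lib_name"; then
        echo "pid $pid: $lib_name is mapped"
    else
        echo "pid $pid: $lib_name is NOT mapped"
    fi
}
```

For example, `report_preload 8910 libnvblas` would show whether the Spark shell's JVM really has libnvblas loaded. If it does not, the variable was probably lost somewhere between the launching shell and the java process.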
>> Best regards, Alexander
>>
>>
>>
>> From: Sam Halliday [mailto:[email protected]]
>> Sent: Monday, March 09, 2015 6:01 PM
>> To: Ulanov, Alexander
>> Cc: [email protected]; Xiangrui Meng; Joseph Bradley; Evan R. 
>> Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>>
>> Thanks so much for following up on this!
>>
>> Hmm, I wonder if we should have a concerted effort to chart 
>> performance on various pieces of hardware...
>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[email protected]> wrote:
>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added
>> a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I
>> see support for Double in the current source code) and ran the test
>> with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with
>> netlib MKL.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Best regards, Alexander
>>
>> -----Original Message-----
>> From: Sam Halliday [mailto:[email protected]]
>> Sent: Tuesday, March 03, 2015 1:54 PM
>> To: Xiangrui Meng; Joseph Bradley
>> Cc: Evan R. Sparks; Ulanov, Alexander; [email protected]
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>
>>
>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>
>> Would be nice to meet other people working on the guts of Spark! :-)
>>
>>
>> Xiangrui Meng <[email protected]> writes:
>>
>> > Hey Alexander,
>> >
>> > I don't quite understand the part where netlib-cublas is about 20x 
>> > slower than netlib-openblas. What is the overhead of using a GPU 
>> > BLAS with netlib-java?
>> >
>> > CC'ed Sam, the author of netlib-java.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
>> > <[email protected]> wrote:
>> >> Better documentation for linking would be very helpful!  Here's a JIRA:
>> >> https://issues.apache.org/jira/browse/SPARK-6019
>> >>
>> >>
>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> >> <[email protected]> wrote:
>> >>
>> >>> Thanks for compiling all the data and running these benchmarks, 
>> >>> Alex. The big takeaways here can be seen with this chart:
>> >>>
>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >>>
>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >>> BIDMat+GPU) can provide substantial (but less than an order of
>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>> >>> BIDMat+MKL or netlib-java+openblas-compiled).
>> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>> >>> implementation, particularly for larger matrices. This is not to
>> >>> pick on netlib - this basically agrees with the author's own
>> >>> benchmarks (https://github.com/fommil/netlib-java).
>> >>>
>> >>> I think that most of our users are in a situation where using 
>> >>> GPUs may not be practical - although we could consider having a 
>> >>> good GPU backend available as an option. However, *ALL* users of 
>> >>> MLlib could benefit (potentially tremendously) from using a 
>> >>> well-tuned CPU-based BLAS implementation. Perhaps we should 
>> >>> consider updating the mllib guide with a more complete section 
>> >>> for enabling high performance binaries on OSX and Linux? Or 
>> >>> better, figure out a way for the system to fetch these automatically.
>> >>>
>> >>> - Evan
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> Just to summarize this thread, I was finally able to make all 
>> >>>> performance comparisons that we discussed. It turns out that:
>> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>> >>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>> >>>>
>> >>>> Below is the link to the spreadsheet with full results.
>> >>>>
>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >>>>
>> >>>> One thing still needs exploration: does BIDMat-cublas perform 
>> >>>> copying to/from machine’s RAM?
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Ulanov, Alexander
>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>>> To: Evan R. Sparks
>> >>>> Cc: Joseph Bradley; [email protected]
>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
>> >>>> though the original one discusses a slightly different topic. I
>> >>>> was able to link netlib with MKL from the BIDMat binaries. Indeed,
>> >>>> MKL is statically linked inside a 60MB library.
>> >>>>
>> >>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> >>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>> >>>> | 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>> >>>> | 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>> >>>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>> >>>>
>> >>>> It turns out that pre-compiled MKL is faster than pre-compiled
>> >>>> OpenBLAS on my machine. I will probably add two more columns, with
>> >>>> locally compiled OpenBLAS and CUDA.
>> >>>>
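As a rough check on the table above, the slowdown of each backend relative to BIDMat MKL can be computed directly from the reported timings. A minimal sketch, assuming the values keep their decimal commas as reported; the function name ratio is hypothetical.

```shell
#!/bin/sh
# Print A / B with one decimal place; both arguments may use a decimal comma.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        gsub(",", ".", a)   # 445,0935211 -> 445.0935211
        gsub(",", ".", b)
        printf "%.1f\n", a / b
    }'
}

# 10000x10000 row: each backend relative to BIDMat MKL
ratio 32,94546697 23,78046632   # Breeze+Netlib-MKL from BIDMat
ratio 445,0935211 23,78046632   # Breeze+Netlib-OpenBlas (native system)
ratio 1569,233228 23,78046632   # Breeze+Netlib-f2jblas
```

For the largest size this prints 1.4, 18.7 and 66.0, which puts netlib with MKL within a factor of 1.4 of BIDMat MKL, while the yum-installed OpenBLAS and f2jblas are far behind.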
>> >>>> Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[email protected]]
>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley; [email protected]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Great - perhaps we can move this discussion off-list and onto a 
>> >>>> JIRA ticket? (Here's one:
>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >>>>
>> >>>> It seems like this is going to be somewhat exploratory for a 
>> >>>> while (and there's probably only a handful of us who really care 
>> >>>> about fast linear
>> >>>> algebra!)
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> >>>> [email protected]> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for the explanation and the useful link. I am going to
>> >>>> build OpenBLAS, link it with Netlib-java and run the benchmark again.
>> >>>>
>> >>>> Do I understand correctly that the BIDMat binaries contain a
>> >>>> statically linked Intel MKL BLAS? That might be the reason why I
>> >>>> am able to run BIDMat without MKL BLAS installed on my
>> >>>> server. If so, I wonder whether that is OK, because Intel sells
>> >>>> this library. Nevertheless, it seems that in my case precompiled
>> >>>> MKL BLAS performs better than precompiled OpenBLAS, given that
>> >>>> BIDMat and Netlib-java are supposed to be on par in JNI overhead.
>> >>>>
>> >>>> Still, it might be interesting to link Netlib-java with Intel
>> >>>> MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam
>> >>>> Halliday (Netlib-java) would be interested in comparing their
>> >>>> libraries.
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[email protected]]
>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley; [email protected]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I would build OpenBLAS yourself, since good BLAS performance 
>> >>>> comes from getting cache sizes, etc. set up correctly for your 
>> >>>> particular hardware - this is often a very tricky process (see, 
>> >>>> e.g. ATLAS), but we found that on relatively modern Xeon chips, 
>> >>>> OpenBLAS builds quickly and yields performance competitive with MKL.
>> >>>>
>> >>>> To make sure the right library is getting used, you have to make
>> >>>> sure it's first on the search path - export
>> >>>> LD_LIBRARY_PATH=/path/to/blas/library/dir (the directory that holds
>> >>>> the .so) will do the trick here.
>> >>>>
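The dynamic linker treats LD_LIBRARY_PATH as a colon-separated list of directories that it searches in order, so the entry to add is the directory holding the library rather than the .so file itself. A minimal sketch of putting one directory first without losing the existing entries; the function name and the paths are hypothetical.

```shell
#!/bin/sh
# Print a search path with DIR placed first, followed by CURRENT if it is set.
lib_path_with_first() {
    dir=$1
    current=${2-}
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$dir" "$current"
    else
        printf '%s\n' "$dir"
    fi
}
```

Usage might look like `export LD_LIBRARY_PATH=$(lib_path_with_first /opt/OpenBLAS/lib "${LD_LIBRARY_PATH-}")`, run in the same shell that later starts the JVM.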
>> >>>> For some examples of getting netlib-java set up on an ec2 node
>> >>>> and some example benchmarking code we ran a while back, see:
>> >>>> https://github.com/shivaram/matrix-bench
>> >>>>
>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>> >>>> shows you how to get the path set up and get that library picked
>> >>>> up by netlib-java.
>> >>>>
>> >>>> In this way - you could probably get cuBLAS set up to be used by 
>> >>>> netlib-java as well.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> >>>> [email protected]> wrote:
>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java
>> >>>> to load the right blas? For netlib, there are a few JVM
>> >>>> flags, such as
>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>> >>>> so I can force it to use the Java implementation. I am not sure I
>> >>>> understand how to force the use of a specific blas (not a specific
>> >>>> wrapper for blas).
>> >>>>
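A minimal sketch of building that flag for each wrapper, assuming the implementation class names listed in the netlib-java README (F2jBLAS, NativeRefBLAS, NativeSystemBLAS). The function name and the short backend names are hypothetical.

```shell
#!/bin/sh
# Map a short backend name to netlib-java's -D flag for the BLAS wrapper.
netlib_blas_flag() {
    case $1 in
        f2j)    impl=F2jBLAS ;;
        ref)    impl=NativeRefBLAS ;;
        system) impl=NativeSystemBLAS ;;
        *)      echo "unknown backend: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.$impl"
}
```

Usage might look like `./spark-shell --driver-java-options "$(netlib_blas_flag system)"`. Note that the flag only selects the wrapper: which native library sits behind the system wrapper is still decided by the dynamic linker, which is the part this question is really about.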
>> >>>> Btw. I have installed openblas (yum install openblas), so I 
>> >>>> suppose that netlib is using it.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[email protected]]
>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley; [email protected]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Getting breeze to pick up the right blas library is critical for
>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
>> >>>> have it). It might make sense to force BIDMat to use the same
>> >>>> underlying BLAS library as well.
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> >>>> [email protected]> wrote:
>> >>>> Hi Evan, Joseph
>> >>>>
>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>> >>>>
>> >>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> >>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>> >>>> | 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>> >>>> | 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>> >>>> | 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>> >>>>
>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM,
>> >>>> Fedora 19 Linux, Scala 2.11.
>> >>>>
>> >>>> Later I will run tests with CUDA. I need to install a new CUDA
>> >>>> version for this purpose.
>> >>>>
>> >>>> Do you have any ideas why breeze-netlib with native blas is so 
>> >>>> much slower than BIDMat MKL?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Joseph Bradley [mailto:[email protected]]
>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Evan R. Sparks; [email protected]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Hi Alexander,
>> >>>>
>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >>>> Concerning your question earlier about keeping data stored on 
>> >>>> the GPU rather than having to move it between main memory and 
>> >>>> GPU memory on each iteration, I would guess this would be 
>> >>>> critical to getting good performance.  If you could do multiple 
>> >>>> local iterations before aggregating results, then the cost of 
>> >>>> data movement to the GPU could be amortized (and I believe that 
>> >>>> is done in practice).  Having Spark be aware of the GPU and
>> >>>> using it as another part of memory sounds like a much bigger
>> >>>> undertaking.
>> >>>>
>> >>>> Joseph
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>> >>>> [email protected]> wrote:
>> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation
>> >>>> by John Canny and I am really inspired by his talk and his
>> >>>> comparisons with Spark MLlib.
>> >>>>
>> >>>> I am very interested to find out what will be better within Spark:
>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest
>> >>>> a fair way to benchmark them? Currently I do benchmarks on
>> >>>> artificial neural networks in batch mode. While it is not a
>> >>>> “pure” test of linear algebra, it involves some other things
>> >>>> that are essential to machine learning.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[email protected]]
>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: [email protected]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster
>> >>>> than netlib-java+OpenBLAS, but if it is much faster it's probably
>> >>>> due to data layout and fewer levels of indirection - it's definitely a
>> >>>> worthwhile experiment to run. The main speedups I've seen from 
>> >>>> using it come from highly optimized GPU code for linear algebra. 
>> >>>> I know that in the past Canny has gone as far as to write custom 
>> >>>> GPU kernels for performance-critical regions of code.[1]
>> >>>>
>> >>>> BIDMach is highly optimized for single node performance or
>> >>>> performance on small clusters.[2] Once data doesn't fit easily
>> >>>> in GPU memory (or can't be batched in that way) the performance
>> >>>> tends to fall off. Canny argues for hardware/software codesign
>> >>>> and as such prefers machine configurations that are quite
>> >>>> different than what we find in most commodity cluster nodes -
>> >>>> e.g. 10 disk channels and 4 GPUs.
>> >>>>
>> >>>> In contrast, MLlib was designed for horizontal scalability on
>> >>>> commodity clusters and works best on very big datasets - order
>> >>>> of terabytes.
>> >>>>
>> >>>> For the most part, these projects developed concurrently to 
>> >>>> address slightly different use cases. That said, there may be 
>> >>>> bits of BIDMach we could repurpose for MLlib - keep in mind we 
>> >>>> need to be careful about maintaining cross-language 
>> >>>> compatibility for our Java and Python-users, though.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> [1] - http://arxiv.org/abs/1409.5402
>> >>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>> >>>> [email protected]> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed.
>> >>>> Do you know what makes it faster than netlib-java?
>> >>>>
>> >>>> The same group has the BIDMach library, which implements machine
>> >>>> learning. For some examples they use the Caffe convolutional neural
>> >>>> network library owned by another group at Berkeley. Could you
>> >>>> elaborate on how these all might be connected with Spark MLlib?
>> >>>> If you take BIDMat for linear algebra, why not take BIDMach
>> >>>> for optimization and learning?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[email protected]]
>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: [email protected]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU 
>> >>>> blas in many cases.
>> >>>>
>> >>>> You might consider taking a look at the codepaths that BIDMat (
>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to 
>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>> >>>> optimizing to make this work really fast from Scala. I've run it
>> >>>> on my laptop and compared to MKL and in certain cases it's 10x
>> >>>> faster at matrix multiply.
>> >>>> There are a lot of layers of indirection here and you really 
>> >>>> want to avoid data copying as much as possible.
>> >>>>
>> >>>> We could also consider swapping out Breeze for BIDMat, but that
>> >>>> would be a big project, and if we can figure out how to get
>> >>>> breeze+cublas to comparable performance that would be a big win.
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>> >>>> [email protected]> wrote:
>> >>>> Dear Spark developers,
>> >>>>
>> >>>> I am exploring how to make linear algebra operations faster
>> >>>> within Spark.
>> >>>> One way of doing this is to use the Scala Breeze library that is
>> >>>> bundled with Spark. For matrix operations, it employs
>> >>>> Netlib-java, which has a Java wrapper for BLAS (basic linear
>> >>>> algebra subprograms) and LAPACK native binaries if they are
>> >>>> available on the worker node. It also has its own optimized Java
>> >>>> implementation of BLAS. It is worth mentioning that native
>> >>>> binaries provide better performance only for BLAS level 3, i.e.
>> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
>> >>>> This is confirmed by the GEMM test on the Netlib-java page
>> >>>> https://github.com/fommil/netlib-java. I also confirmed it with
>> >>>> my experiments with training of an artificial neural network
>> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >>>> However, I would like to boost performance more.
>> >>>>
>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>> >>>> Nvidia CUDA implementation of BLAS called cublas. I have one
>> >>>> Linux server with an Nvidia GPU and I was able to do the following.
>> >>>> I linked cublas (instead of cpu-based blas) with the Netlib-java
>> >>>> wrapper and put it into Spark, so Breeze/Netlib is using it.
>> >>>> Then I did some performance measurements of artificial neural
>> >>>> network batch learning in Spark MLlib, which involves
>> >>>> matrix-matrix multiplications. It turns out that for
>> >>>> matrices of size less than ~1000x780, GPU cublas has the same
>> >>>> speed as CPU blas. Cublas becomes slower for bigger matrices.
>> >>>> It is worth mentioning that this was not a test of ONLY
>> >>>> multiplication, since there are other operations involved.
>> >>>> One of the reasons for the slowdown might be the overhead of copying
>> >>>> the matrices from computer memory to graphics card memory and back.
>> >>>>
>> >>>> So, a few questions:
>> >>>> 1) Do these results with CUDA make sense?
>> >>>> 2) If the problem is the copy overhead, are there any libraries
>> >>>> that allow forcing intermediate results to stay in graphics card
>> >>>> memory, thus removing the overhead?
>> >>>> 3) Are there any other options to speed up linear algebra in Spark?
>> >>>>
>> >>>> Thank you, Alexander
>> >>>>
>> >>>
>>
>> --
>> Best regards,
>> Sam
>>
>
>
