That would be a difficult task that would only benefit users of
netlib-java. MultiBLAS is easily implemented (although a lot of
boilerplate) and benefits all BLAS users on the system.

If anyone knows of a funding route for it, I'd love to hear from them,
because it's too much work for me to take on at the moment as a hobby.
On 25 Mar 2015 22:16, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:

> Sam,
>
> would it be easier to hack netlib-java to allow multiple (configurable)
> library contexts? And so enable 3rd-party configurations and optimizers to
> make their own choices until then?
>
> On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday <sam.halli...@gmail.com>
> wrote:
>
>> Yeah, MultiBLAS... it is dynamic.
>>
>> Except, I haven't written it yet :-P
>> On 25 Mar 2015 22:06, "Ulanov, Alexander" <alexander.ula...@hp.com>
>> wrote:
>>
>>> Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols
>>> from the provided libblas.so.3 library at runtime. So, you can switch
>>> at runtime by providing another library. Sam, please suggest if there
>>> is another way.
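>>>
>>> A quick way to check which library was picked up at runtime, a minimal
>>> sketch for the Scala shell (assuming only that netlib-java is on the
>>> classpath, as it is in Spark):
>>>
>>> // prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a native
>>> // libblas was found, or com.github.fommil.netlib.F2jBLAS as the fallback
>>> println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)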
>>>
>>>
>>>
>>> From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
>>> Sent: Wednesday, March 25, 2015 2:55 PM
>>> To: Ulanov, Alexander
>>> Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley;
>>> Evan R. Sparks; jfcanny
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>>
>>>
>>> Alexander,
>>>
>>>
>>>
>>> does using netlib imply that one cannot switch between CPU and GPU BLAS
>>> alternatives at will at runtime? The choice is always determined by
>>> linking alternatives to libblas.so, right?
>>>
>>>
>>>
>>> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>>
>>> Hi again,
>>>
>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>> exceptional performance for big matrices with Double, faster than
>>> BIDMat-cuda with Float. But for smaller matrices, if you copy them
>>> to/from the GPU, OpenBLAS or MKL might be a better choice. This
>>> correlates with the original nvblas presentation at the GPU Tech
>>> Conference 2013 (slide 21):
>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>
>>> My results:
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Just in case: these tests are not meant to generalize the performance of
>>> different libraries. I just want to pick the library that performs dense
>>> matrix multiplication best for my task.
>>>
>>> P.S. My previous issue with nvblas was the following: it provides Fortran
>>> BLAS functions, while netlib-java uses the C cblas functions. So,
>>> one needs a cblas shared library to use nvblas through netlib-java. Fedora
>>> does not have cblas (but Debian and Ubuntu do), so I needed to compile
>>> it. I could not use the cblas from ATLAS or OpenBLAS because they link to
>>> their own implementations and not to Fortran BLAS.
>>>
>>> Best regards, Alexander
>>>
>>> -----Original Message-----
>>> From: Ulanov, Alexander
>>>
>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>> To: Sam Halliday
>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi,
>>>
>>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>>> should replace the current BLAS calls once the library is LD_PRELOADed, as
>>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
>>> changes to netlib-java. It seems to work for a simple Java example, but I
>>> cannot make it work with Spark. I run the following:
>>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>> In nvidia-smi I observe that Java is using the GPU:
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                       GPU Memory |
>>> |  GPU       PID  Type  Process name                               Usage      |
>>> |=============================================================================|
>>> |    0      8873    C   bash                                         39MiB    |
>>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java             39MiB    |
>>> +-----------------------------------------------------------------------------+
>>>
>>> In the Spark shell I do matrix multiplication and see the following:
>>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>> So I am sure that netlib-native is loaded and cblas is supposedly used.
>>> However, matrix multiplication executes on the CPU, since I see 16% CPU
>>> usage and 0% GPU usage. I also checked different matrix sizes, from
>>> 100x100 to 12000x12000.
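>>>
>>> For reference, a minimal sketch of the kind of test I run in spark-shell
>>> (the size is illustrative; Breeze is bundled with Spark):
>>>
>>> import breeze.linalg.DenseMatrix
>>> val n = 4096 // illustrative size
>>> val a = DenseMatrix.rand(n, n)
>>> val b = DenseMatrix.rand(n, n)
>>> val t0 = System.nanoTime
>>> val c = a * b // dense Double multiply dispatches to netlib-java's dgemm
>>> println(s"multiply took ${(System.nanoTime - t0) / 1e9} s")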
>>>
>>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>>
>>> Best regards, Alexander
>>>
>>>
>>>
>>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>>> Sent: Monday, March 09, 2015 6:01 PM
>>> To: Ulanov, Alexander
>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>>
>>> Thanks so much for following up on this!
>>>
>>> Hmm, I wonder if we should have a concerted effort to chart performance
>>> on various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
>>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added the
>>> comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
>>> support for Double in the current source code) and ran the test with
>>> BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Best regards, Alexander
>>>
>>> -----Original Message-----
>>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>>> Sent: Tuesday, March 03, 2015 1:54 PM
>>> To: Xiangrui Meng; Joseph Bradley
>>> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>>
>>>
>>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>>
>>> Would be nice to meet other people working on the guts of Spark! :-)
>>>
>>>
>>> Xiangrui Meng <men...@gmail.com> writes:
>>>
>>> > Hey Alexander,
>>> >
>>> > I don't quite understand the part where netlib-cublas is about 20x
>>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>>> > with netlib-java?
>>> >
>>> > CC'ed Sam, the author of netlib-java.
>>> >
>>> > Best,
>>> > Xiangrui
>>> >
>>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
>>> > <jos...@databricks.com> wrote:
>>> >> Better documentation for linking would be very helpful!  Here's a
>>> JIRA:
>>> >> https://issues.apache.org/jira/browse/SPARK-6019
>>> >>
>>> >>
>>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>>> >> <evan.spa...@gmail.com> wrote:
>>> >>
>>> >>> Thanks for compiling all the data and running these benchmarks,
>>> >>> Alex. The big takeaways here can be seen with this chart:
>>> >>>
>>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>> >>>
>>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> >>> BIDMat+GPU) can provide a substantial (but less than an order of
>>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>> >>> BIDMat+MKL or netlib-java+openblas-compiled).
>>> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>>> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>>> >>> implementation, particularly for larger matrices. This is not to
>>> >>> pick on netlib - this basically agrees with the author's own
>>> >>> benchmarks (https://github.com/fommil/netlib-java)
>>> >>>
>>> >>> I think that most of our users are in a situation where using GPUs
>>> >>> may not be practical - although we could consider having a good GPU
>>> >>> backend available as an option. However, *ALL* users of MLlib could
>>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>>> >>> guide with a more complete section for enabling high performance
>>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>>> >>> system to fetch these automatically.
>>> >>>
>>> >>> - Evan
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander
>>> >>> <alexander.ula...@hp.com> wrote:
>>> >>>
>>> >>>> Just to summarize this thread, I was finally able to run all the
>>> >>>> performance comparisons that we discussed. It turns out that:
>>> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>> >>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>> >>>>
>>> >>>> Below is the link to the spreadsheet with full results.
>>> >>>>
>>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>> >>>>
>>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>>> >>>> copying to/from machine’s RAM?
>>> >>>>
>>> >>>> -----Original Message-----
>>> >>>> From: Ulanov, Alexander
>>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>> >>>> To: Evan R. Sparks
>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
>>> >>>> though the original one discusses a slightly different topic. I was
>>> >>>> able to link netlib with the MKL from the BIDMat binaries. Indeed,
>>> >>>> MKL is statically linked inside a 60MB library.
>>> >>>>
>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>> >>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>> >>>>
>>> >>>> It turns out that pre-compiled MKL is faster than pre-compiled
>>> >>>> OpenBLAS on my machine. I'll probably add two more columns with
>>> >>>> locally compiled OpenBLAS and CUDA.
>>> >>>>
>>> >>>> Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Great - perhaps we can move this discussion off-list and onto a
>>> >>>> JIRA ticket? (Here's one:
>>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>> >>>>
>>> >>>> It seems like this is going to be somewhat exploratory for a while
>>> >>>> (and there's probably only a handful of us who really care about
>>> >>>> fast linear
>>> >>>> algebra!)
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander
>>> >>>> <alexander.ula...@hp.com> wrote:
>>> >>>> Hi Evan,
>>> >>>>
>>> >>>> Thank you for explanation and useful link. I am going to build
>>> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>>> >>>>
>>> >>>> Do I understand correctly that the BIDMat binaries contain statically
>>> >>>> linked Intel MKL BLAS? That might be the reason why I am able to run
>>> >>>> BIDMat without having MKL BLAS installed on my server. If so, I
>>> >>>> wonder whether that is OK, because Intel sells this library.
>>> >>>> Nevertheless, it seems that in my case pre-compiled MKL BLAS performs
>>> >>>> better than pre-compiled OpenBLAS, given that BIDMat and netlib-java
>>> >>>> are supposed to be on par in JNI overhead.
>>> >>>>
>>> >>>> Though, it might be interesting to link netlib-java with Intel MKL,
>>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>> >>>> Halliday (netlib-java) are interested in comparing their libraries.
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>>> >>>>
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>> >>>> from getting cache sizes, etc. set up correctly for your particular
>>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>> >>>> quickly and yields performance competitive with MKL.
>>> >>>>
>>> >>>> To make sure the right library is getting used, you have to make
>>> >>>> sure it's first on the search path - export
>>> >>>> LD_LIBRARY_PATH=/path/to/blas (the directory containing the library)
>>> >>>> will do the trick here.
>>> >>>>
>>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>>> >>>> some example benchmarking code we ran a while back, see:
>>> >>>> https://github.com/shivaram/matrix-bench
>>> >>>>
>>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>> >>>> shows you how to get the path set up and that library picked up
>>> >>>> by netlib-java.
>>> >>>>
>>> >>>> In this way - you could probably get cuBLAS set up to be used by
>>> >>>> netlib-java as well.
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander
>>> >>>> <alexander.ula...@hp.com> wrote:
>>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>> >>>> load the right BLAS? For netlib, there are a few JVM flags, such as
>>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>> >>>> so I can force it to use the Java implementation. I am not sure I
>>> >>>> understand how to force the use of a specific BLAS (as opposed to a
>>> >>>> specific wrapper for BLAS).
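>>> >>>>
>>> >>>> As a sketch of what that flag does, the same choice can be made
>>> >>>> programmatically (assuming it runs before netlib-java's BLAS class is
>>> >>>> first loaded, since the property is read in a static initializer):
>>> >>>>
>>> >>>> // select the pure-Java wrapper, then confirm which one is active
>>> >>>> System.setProperty("com.github.fommil.netlib.BLAS",
>>> >>>>   "com.github.fommil.netlib.F2jBLAS")
>>> >>>> println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
>>> >>>> // expected: com.github.fommil.netlib.F2jBLAS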
>>> >>>>
>>> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>> >>>> that netlib is using it.
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>> >>>>
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Getting breeze to pick up the right blas library is critical for
>>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
>>> >>>> have it). It might make sense to force BIDMat to use the same
>>> >>>> underlying BLAS library as well.
>>> >>>>
>>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander
>>> >>>> <alexander.ula...@hp.com> wrote:
>>> >>>> Hi Evan, Joseph
>>> >>>>
>>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>> >>>> faster than netlib-java+breeze:
>>> >>>>
>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>> >>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>> >>>>
>>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>> >>>> 19 Linux, Scala 2.11.
>>> >>>>
>>> >>>> Later I will run tests with CUDA. I need to install a new CUDA
>>> >>>> version for this purpose.
>>> >>>>
>>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>>> >>>> slower than BIDMat MKL?
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Hi Alexander,
>>> >>>>
>>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>>> >>>> Concerning your question earlier about keeping data stored on the
>>> >>>> GPU rather than having to move it between main memory and GPU
>>> >>>> memory on each iteration, I would guess this would be critical to
>>> >>>> getting good performance.  If you could do multiple local
>>> >>>> iterations before aggregating results, then the cost of data
>>> >>>> movement to the GPU could be amortized (and I believe that is done
>>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>>> another part of memory sounds like a much bigger undertaking.
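>>> >>>>
>>> >>>> A rough sketch of that pattern in spark-shell (all names here are
>>> >>>> illustrative, not MLlib code; localStep stands in for a GPU-based
>>> >>>> update):
>>> >>>>
>>> >>>> // placeholder update and merge functions, purely for illustration
>>> >>>> def localStep(w: Array[Double], batch: Array[Array[Double]]): Array[Double] = w
>>> >>>> def combine(a: Array[Double], b: Array[Double]): Array[Double] =
>>> >>>>   a.zip(b).map { case (x, y) => (x + y) / 2 }
>>> >>>> val data = sc.parallelize(Seq.fill(1000)(Array.fill(10)(1.0)), 4)
>>> >>>> val bw = sc.broadcast(Array.fill(10)(0.0)) // current model
>>> >>>> val localIters = 5 // local steps per aggregation
>>> >>>> val next = data.mapPartitions { iter =>
>>> >>>>   val batch = iter.toArray // staged once, like a host-to-GPU copy
>>> >>>>   var w = bw.value.clone()
>>> >>>>   for (_ <- 1 to localIters) w = localStep(w, batch)
>>> >>>>   Iterator(w)
>>> >>>> }.reduce(combine) // one aggregation amortizes localIters steps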
>>> >>>>
>>> >>>> Joseph
>>> >>>>
>>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander
>>> >>>> <alexander.ula...@hp.com> wrote:
>>> >>>> Thank you for the explanation! I've watched the BIDMach presentation
>>> >>>> by John Canny and I am really inspired by his talk and his
>>> >>>> comparisons with Spark MLlib.
>>> >>>>
>>> >>>> I am very interested in finding out which will be better within Spark:
>>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>> >>>> fair way to benchmark them? Currently I run benchmarks on artificial
>>> >>>> neural networks in batch mode. While that is not a "pure" test of
>>> >>>> linear algebra, it involves other things that are essential to
>>> >>>> machine learning.
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: dev@spark.apache.org
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>> >>>> data layout and fewer levels of indirection - it's definitely a
>>> >>>> worthwhile experiment to run. The main speedups I've seen from
>>> >>>> using it come from highly optimized GPU code for linear algebra. I
>>> >>>> know that in the past Canny has gone as far as to write custom GPU
>>> >>>> kernels for performance-critical regions of code.[1]
>>> >>>>
>>> >>>> BIDMach is highly optimized for single-node performance or
>>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>>> >>>> GPU memory (or can't be batched in that way) the performance tends to
>>> >>>> fall off. Canny argues for hardware/software codesign and as such
>>> >>>> prefers machine configurations that are quite different from what
>>> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
>>> >>>> 4 GPUs.
>>> >>>>
>>> >>>> In contrast, MLlib was designed for horizontal scalability on
>>> >>>> commodity clusters and works best on very big datasets - order of
>>> >>>> terabytes.
>>> >>>>
>>> >>>> For the most part, these projects developed concurrently to address
>>> >>>> slightly different use cases. That said, there may be bits of
>>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>> >>>> careful about maintaining cross-language compatibility for our Java
>>> >>>> and Python users, though.
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> [1] http://arxiv.org/abs/1409.5402
>>> >>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>> >>>>
>>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander
>>> >>>> <alexander.ula...@hp.com> wrote:
>>> >>>> Hi Evan,
>>> >>>>
>>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>>> >>>> you know what makes it faster than netlib-java?
>>> >>>>
>>> >>>> The same group has the BIDMach library that implements machine
>>> >>>> learning. For some examples they use the Caffe convolutional neural
>>> >>>> network library, owned by another group in Berkeley. Could you
>>> >>>> elaborate on how all of these might be connected with Spark MLlib? If
>>> >>>> you take BIDMat for linear algebra, why not take BIDMach for
>>> >>>> optimization and learning?
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: dev@spark.apache.org
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>> >>>> blas in many cases.
>>> >>>>
>>> >>>> You might consider taking a look at the codepaths that BIDMat (
>>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>> >>>> optimizing to make this work really fast from Scala. I've run it on
>>> >>>> my laptop, compared to MKL, and in certain cases it's 10x faster at
>>> >>>> matrix multiply. There are a lot of layers of indirection here and
>>> >>>> you really want to avoid data copying as much as possible.
>>> >>>>
>>> >>>> We could also consider swapping out Breeze for BIDMat, but that
>>> >>>> would be a big project, and if we can figure out how to get
>>> >>>> breeze+cublas to comparable performance that would be a big win.
>>> >>>>
>>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander
>>> >>>> <alexander.ula...@hp.com> wrote:
>>> >>>> Dear Spark developers,
>>> >>>>
>>> >>>> I am exploring how to make linear algebra operations faster within
>>> >>>> Spark. One way of doing this is to use the Scala Breeze library that
>>> >>>> is bundled with Spark. For matrix operations, it employs netlib-java,
>>> >>>> which has Java wrappers for BLAS (basic linear algebra subprograms)
>>> >>>> and LAPACK native binaries if they are available on the worker
>>> >>>> node. It also has its own optimized Java implementation of BLAS. It
>>> >>>> is worth mentioning that native binaries provide better performance
>>> >>>> only for BLAS level 3, i.e. matrix-matrix operations such as general
>>> >>>> matrix multiplication (GEMM). This is confirmed by the GEMM test on
>>> >>>> the netlib-java page https://github.com/fommil/netlib-java. I also
>>> >>>> confirmed it with my experiments with training an artificial neural
>>> >>>> network https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>> >>>> However, I would like to boost performance more.
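>>> >>>>
>>> >>>> For concreteness, a direct GEMM call through netlib-java looks roughly
>>> >>>> like this (a minimal sketch; the size is illustrative):
>>> >>>>
>>> >>>> import com.github.fommil.netlib.BLAS
>>> >>>> val n = 1000 // illustrative
>>> >>>> val a = Array.fill(n * n)(math.random) // column-major n x n
>>> >>>> val b = Array.fill(n * n)(math.random)
>>> >>>> val c = new Array[Double](n * n)
>>> >>>> // C := 1.0 * A * B + 0.0 * C -- the BLAS level 3 call where native
>>> >>>> // implementations pay off
>>> >>>> BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)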
>>> >>>>
>>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>>> >>>> Nvidia CUDA implementation of BLAS called cublas. I have one Linux
>>> >>>> server with an Nvidia GPU and I was able to do the following. I linked
>>> >>>> cublas (instead of a CPU-based BLAS) with the netlib-java wrapper and
>>> >>>> put it into Spark, so Breeze/netlib is using it. Then I did some
>>> >>>> performance measurements with regards to artificial neural network
>>> >>>> batch learning in Spark MLlib that involves matrix-matrix
>>> >>>> multiplications. It turns out that for matrices of size less than
>>> >>>> ~1000x780 GPU cublas has the same speed as CPU BLAS. Cublas becomes
>>> >>>> slower for bigger matrices. It is worth mentioning that this was not
>>> >>>> a test of ONLY multiplication, since there are other operations involved.
>>> >>>> One of the reasons for the slowdown might be the overhead of copying
>>> >>>> the matrices from computer memory to graphics card memory and back.
>>> >>>>
>>> >>>> So, a few questions:
>>> >>>> 1) Do these results with CUDA make sense?
>>> >>>> 2) If the problem is the copy overhead, are there any libraries that
>>> >>>> allow forcing intermediate results to stay in graphics card memory,
>>> >>>> thus removing the overhead?
>>> >>>> 3) Any other options to speed up linear algebra in Spark?
>>> >>>>
>>> >>>> Thank you, Alexander
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>>
>>> --
>>> Best regards,
>>> Sam
>>>
>>>
>>>
>>
>
