Re: Using CUDA within Spark / boosting linear algebra

Sam Halliday Tue, 03 Mar 2015 13:55:14 -0800

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community


Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <men...@gmail.com> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>> Better documentation for linking would be very helpful!  Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com>
>> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks, Alex. The
>>> big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>>> netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation (e.g. netlib-f2jblas or netlib-ref) can
>>> be 1-2 orders of magnitude worse than a well-tuned CPU implementation,
>>> particularly for larger matrices. This is not to pick on netlib - this
>>> basically agrees with the author's own benchmarks (
>>> https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs may not
>>> be practical - although we could consider having a good GPU backend
>>> available as an option. However, *ALL* users of MLlib could benefit
>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>>> implementation. Perhaps we should consider updating the mllib guide with a
>>> more complete section for enabling high performance binaries on OSX and
>>> Linux? Or better, figure out a way for the system to fetch these
>>> automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to run all the performance
>>>> comparisons that we discussed. It turns out that:
>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled
>>>>   > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas copy data to and
>>>> from the machine’s RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that the ticket was marked as a duplicate, though the
>>>> original one discusses a slightly different topic. I was able to link netlib
>>>> with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
>>>> 60MB library.
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>> | 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>>> | 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>>
>>>> It turns out that the precompiled MKL is faster than the precompiled
>>>> OpenBLAS on my machine. I will probably add two more columns, for locally
>>>> compiled OpenBLAS and CUDA.
>>>>
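To read the table above as ratios rather than raw timings, here is a minimal Python sketch that parses the comma-decimal values for the largest case and reports each configuration's slowdown relative to BIDMat MKL. The numbers are copied from the table; the dictionary layout and the helper names `parse_decimal` and `slowdowns` are illustrative assumptions. The thread does not state the unit of the timings, so only ratios are computed.

```python
# Timings for 10000x10000 * 10000x10000, copied from the table above.
# The thread does not state the unit, so only ratios are computed here.
RAW_TIMINGS = {
    "BIDMat MKL": "23,78046632",
    "Breeze+Netlib-MKL from BIDMat": "32,94546697",
    "Breeze+Netlib-OpenBlas (native system)": "445,0935211",
    "Breeze+Netlib-f2jblas": "1569,233228",
}


def parse_decimal(text):
    """Convert a comma-decimal string such as '23,78' to a float."""
    return float(text.replace(",", "."))


def slowdowns(raw, baseline):
    """Return each entry's time divided by the baseline entry's time."""
    base = parse_decimal(raw[baseline])
    return {name: parse_decimal(value) / base for name, value in raw.items()}


if __name__ == "__main__":
    for name, ratio in slowdowns(RAW_TIMINGS, "BIDMat MKL").items():
        print(f"{name}: {ratio:.1f}x")
```

On these figures the MKL-backed netlib-java run is within about 1.4x of BIDMat MKL, while the precompiled OpenBLAS and f2jblas runs are roughly 19x and 66x slower.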
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while (and
>>>> there's probably only a handful of us who really care about fast linear
>>>> algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for the explanation and the useful link. I am going to build
>>>> OpenBLAS, link it with Netlib-java and run the benchmark again.
>>>>
>>>> Do I understand correctly that the BIDMat binaries contain a statically
>>>> linked Intel MKL BLAS? That might be the reason why I am able to run BIDMat
>>>> without having MKL BLAS installed on my server. If it is true, I wonder if
>>>> it is OK, because Intel sells this library. Nevertheless, it seems that in
>>>> my case precompiled MKL BLAS performs better than precompiled OpenBLAS,
>>>> given that BIDMat and Netlib-java are supposed to be on par in terms of JNI
>>>> overheads.
>>>>
>>>> Still, it might be interesting to link Netlib-java with Intel MKL, as
>>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>>>> (Netlib-java) would be interested in comparing their libraries.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes from
>>>> getting cache sizes, etc. set up correctly for your particular hardware -
>>>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>>>> performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make sure
>>>> it's first on the search path - putting the directory that contains it at
>>>> the front of LD_LIBRARY_PATH (e.g. export
>>>> LD_LIBRARY_PATH=/path/to/blas:$LD_LIBRARY_PATH) will do the trick here.
>>>>
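As a minimal Python sketch of the search-path idea: LD_LIBRARY_PATH is a colon-separated list of directories, so the directory holding the desired BLAS has to come before any other entry. The helper name `prepend_library_dir` and the directory `/opt/openblas/lib` are hypothetical and used only for illustration.

```python
import os


def prepend_library_dir(env, lib_dir):
    """Return a copy of `env` with `lib_dir` first in LD_LIBRARY_PATH."""
    updated = dict(env)
    existing = updated.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier occurrence of lib_dir.
    parts = [p for p in existing.split(":") if p and p != lib_dir]
    updated["LD_LIBRARY_PATH"] = ":".join([lib_dir] + parts)
    return updated


if __name__ == "__main__":
    # Hypothetical install location; the resulting mapping can be passed as
    # env= to subprocess.run when launching the JVM.
    env = prepend_library_dir(os.environ, "/opt/openblas/lib")
    print(env["LD_LIBRARY_PATH"])
```

Building a fresh environment mapping, rather than mutating `os.environ`, keeps the change scoped to the process being launched.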
>>>> For some examples of getting netlib-java set up on an ec2 node and some
>>>> example benchmarking code we ran a while back, see:
>>>> https://github.com/shivaram/matrix-bench
>>>>
>>>> In particular - build-openblas-ec2.sh shows you how to build the library
>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>>>> the path set up and get that library picked up by netlib-java.
>>>>
>>>> In this way - you could probably get cuBLAS set up to be used by
>>>> netlib-java as well.
>>>>
>>>> - Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to load
>>>> the right BLAS? For netlib, there are a few JVM flags, such as
>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>>>> force it to use the Java implementation. I am not sure I understand how to
>>>> force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
>>>>
>>>> Btw. I have installed openblas (yum install openblas), so I suppose that
>>>> netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for
>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>> It might make sense to force BIDMat to use the same underlying BLAS library
>>>> as well.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Hi Evan, Joseph
>>>>
>>>> I ran a few matrix multiplication tests and BIDMat seems to be ~10x faster
>>>> than netlib-java+breeze (sorry for the weird table formatting):
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>>>> | 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>>> | 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>>> | 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
>>>> Linux, Scala 2.11.
>>>>
>>>> Later I will run tests with CUDA. I need to install a new CUDA version for
>>>> this purpose.
>>>>
>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment: Concerning
>>>> your question earlier about keeping data stored on the GPU rather than
>>>> having to move it between main memory and GPU memory on each iteration, I
>>>> would guess this would be critical to getting good performance.  If you
>>>> could do multiple local iterations before aggregating results, then the
>>>> cost of data movement to the GPU could be amortized (and I believe that is
>>>> done in practice).  Having Spark be aware of the GPU and using it as
>>>> another part of memory sounds like a much bigger undertaking.
>>>>
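The amortization argument above can be sketched with a simple cost model. This is a minimal Python illustration only: the function name `per_iteration_cost` and the timing values are hypothetical, chosen to show the shape of the trade-off rather than measured on any GPU.

```python
def per_iteration_cost(transfer_s, compute_s, local_iterations):
    """Average cost per iteration when one host<->GPU round trip
    is shared by `local_iterations` GPU-side iterations."""
    if local_iterations < 1:
        raise ValueError("local_iterations must be at least 1")
    return compute_s + transfer_s / local_iterations


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 s to move the data, 0.1 s of GPU compute
    # per iteration.
    for k in (1, 5, 50):
        print(k, per_iteration_cost(0.5, 0.1, k))
```

With one iteration per transfer the data movement dominates; as the number of local iterations grows, the per-iteration cost approaches the pure compute time.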
>>>> Joseph
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
>>>> John Canny and I am really inspired by his talk and comparisons with Spark
>>>> MLlib.
>>>>
>>>> I am very interested to find out what will be better within Spark: BIDMat
>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>>>> benchmark them? Currently I do benchmarks on artificial neural networks in
>>>> batch mode. While it is not a “pure” test of linear algebra, it involves
>>>> some other things that are essential to machine learning.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>>>> layout and fewer levels of indirection - it's definitely a worthwhile
>>>> experiment to run. The main speedups I've seen from using it come from
>>>> highly optimized GPU code for linear algebra. I know that in the past Canny
>>>> has gone as far as to write custom GPU kernels for performance-critical
>>>> regions of code.[1]
>>>>
>>>> BIDMach is highly optimized for single node performance or performance on
>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
>>>> batched in that way) the performance tends to fall off. Canny argues for
>>>> hardware/software codesign and as such prefers machine configurations that
>>>> are quite different than what we find in most commodity cluster nodes -
>>>> e.g. 10 disk channels and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on commodity
>>>> clusters and works best on very big datasets - order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address
>>>> slightly different use cases. That said, there may be bits of BIDMach we
>>>> could repurpose for MLlib - keep in mind we need to be careful about
>>>> maintaining cross-language compatibility for our Java and Python users,
>>>> though.
>>>>
>>>> - Evan
>>>>
>>>> [1] - http://arxiv.org/abs/1409.5402
>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
>>>> know what makes it faster than netlib-java?
>>>>
>>>> The same group has the BIDMach library that implements machine learning.
>>>> For some examples they use the Caffe convolutional neural network library
>>>> owned by another group in Berkeley. Could you elaborate on how these all
>>>> might be connected with Spark MLlib? If you take BIDMat for linear algebra,
>>>> why don’t you take BIDMach for optimization and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
>>>> many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (
>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>>>> to make this work really fast from Scala. I've run it on my laptop and
>>>> compared to MKL and in certain cases it's 10x faster at matrix multiply.
>>>> There are a lot of layers of indirection here and you really want to avoid
>>>> data copying as much as possible.
>>>>
>>>> We could also consider swapping out Breeze for BIDMat, but that would be
>>>> a big project, and if we can figure out how to get breeze+cublas to
>>>> comparable performance that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>> One way of doing this is to use the Scala Breeze library that is bundled with
>>>> Spark. For matrix operations, it employs Netlib-java, which has a Java
>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>>>> binaries if they are available on the worker node. It also has its own
>>>> optimized Java implementation of BLAS. It is worth mentioning that native
>>>> binaries provide better performance only for BLAS level 3, i.e.
>>>> matrix-matrix operations or general matrix multiplication (GEMM). This is
>>>> confirmed by the GEMM test on the Netlib-java page
>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>> experiments with training an artificial neural network
>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> However, I would like to boost performance more.
>>>>
>>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
>>>> implementation of BLAS, called cublas. I have one Linux server with an Nvidia
>>>> GPU and I was able to do the following. I linked cublas (instead of
>>>> cpu-based blas) with the Netlib-java wrapper and put it into Spark, so
>>>> Breeze/Netlib is using it. Then I did some performance measurements with
>>>> regard to artificial neural network batch learning in Spark MLlib, which
>>>> involves matrix-matrix multiplications. It turns out that for matrices of
>>>> size less than ~1000x780 GPU cublas has the same speed as CPU blas. Cublas
>>>> becomes slower for bigger matrices. It is worth mentioning that it was not
>>>> a test for ONLY multiplication since there are other operations involved.
>>>> One of the reasons for the slowdown might be the overhead of copying the
>>>> matrices from computer memory to graphics card memory and back.
>>>>
>>>> So, a few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is with copy overhead, are there any libraries that
>>>> allow intermediate results to be kept in graphics card memory, thus
>>>> removing the overhead?
>>>> 3) Are there any other options to speed up linear algebra in Spark?
>>>>
>>>> Thank you, Alexander
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>

-- 
Best regards,
Sam

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
