Also, check the JNILoader output.

Remember, for netlib-java to use your system libblas, all you need to do is
set up libblas.so.3 the way any native application would expect.
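
For example, this minimal check (an illustrative sketch; it only assumes
netlib-java is on the classpath) prints which backend was actually resolved:

  // Sketch: report which BLAS implementation netlib-java loaded at runtime.
  // NativeSystemBLAS => your system libblas.so.3 is being used;
  // F2jBLAS => it fell back to the pure-Java implementation.
  object WhichBlas {
    def main(args: Array[String]): Unit = {
      val blas = com.github.fommil.netlib.BLAS.getInstance()
      println(s"netlib-java BLAS backend: ${blas.getClass.getName}")
    }
  }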

I've never used the cublas "real BLAS" implementation, so I'd be
interested to hear about this. Run 'ldd /usr/lib/libblas.so.3' to check
that all the runtime links are in order.

Btw, I have some DGEMM wrappers in my netlib-java performance module... and
I also planned to write more in MultiBLAS (until I mothballed the project to
wait for the hardware to catch up, which it probably has by now, so I just
need a reason to look at it again).
 On 27 Feb 2015 20:26, "Xiangrui Meng" <men...@gmail.com> wrote:

> Hey Sam,
>
> The running times are not "big O" estimates:
>
> > The CPU version finished in 12 seconds.
> > The CPU->GPU->CPU version finished in 2.2 seconds.
> > The GPU version finished in 1.7 seconds.
>
> I think there is something wrong with the netlib/cublas combination.
> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
> through the CPU BLAS interface we need to use NVBLAS, which intercepts
> some Level 3 CPU BLAS calls (including GEMM). So we need to load
> nvblas.so first and then some CPU BLAS library in JNI. I wonder
> whether the setup was correct.
>
> Alexander, could you check whether GPU is used in the netlib-cublas
> experiments? You can tell it by watching CPU/GPU usage.
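>
> If it helps, here is a minimal probe (a sketch, not something I have run;
> it assumes netlib-java is on the classpath and needs a large heap, e.g.
> -Xmx4g). Run it and watch nvidia-smi or top while it executes to see
> whether the work lands on the GPU or the CPU:
>
>   // Sketch: push one 10000x10000 DGEMM through whatever BLAS netlib-java loaded.
>   object GemmProbe {
>     def main(args: Array[String]): Unit = {
>       val n = 10000
>       val rnd = new scala.util.Random(42)
>       val a = Array.fill(n * n)(rnd.nextDouble())   // column-major n x n
>       val b = Array.fill(n * n)(rnd.nextDouble())
>       val c = new Array[Double](n * n)
>       val blas = com.github.fommil.netlib.BLAS.getInstance()
>       // C := 1.0 * A * B + 0.0 * C
>       blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
>       println(s"done, c(0) = ${c(0)}")
>     }
>   }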
>
> Best,
> Xiangrui
>
> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday <sam.halli...@gmail.com>
> wrote:
> > Don't use "big O" estimates, always measure. It used to work back in the
> > days when double multiplication was a bottleneck. The computation cost is
> > effectively free on both the CPU and GPU and you're seeing pure copying
> > costs. Also, I'm dubious that cublas is doing what you think it is. Can
> you
> > link me to the source code for DGEMM?
> >
> > I show all of this in my talk, with explanations, I can't stress enough
> how
> > much I recommend that you watch it if you want to understand high
> > performance hardware acceleration for linear algebra :-)
> >
> > On 27 Feb 2015 01:42, "Xiangrui Meng" <men...@gmail.com> wrote:
> >>
> >> The copying overhead should be quadratic on n, while the computation
> >> cost is cubic on n. I can understand that netlib-cublas is slower than
> >> netlib-openblas on small problems. But I'm surprised to see that it is
> >> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
> >> instance with BIDMat:
> >>
> >> val n = 10000
> >>
> >> val f = rand(n, n)
> >> flip; f*f; val rf = flop
> >>
> >> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
> >>
> >> flip; g*g; val rgg = flop
> >>
> >> The CPU version finished in 12 seconds.
> >> The CPU->GPU->CPU version finished in 2.2 seconds.
> >> The GPU version finished in 1.7 seconds.
> >>
> >> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
> >> path. But based on the result, the data copying overhead is definitely
> >> not as big as 20x at n = 10000.
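> >>
> >> For scale, a back-of-envelope sketch (not a measurement; double precision
> >> assumed, as in the netlib DGEMM path) of data moved vs. arithmetic at n =
> >> 10000:
> >>
> >>   // Rough copy volume vs. flop count for one n x n GEMM, n = 10000.
> >>   val n = 10000L
> >>   val bytesCopied = 3 * n * n * 8   // A, B in + C out as doubles: ~2.4e9 bytes
> >>   val flops = 2 * n * n * n         // GEMM is ~2*n^3 flops: ~2.0e12
> >>   println(f"copy ${bytesCopied / 1e9}%.1f GB vs. ${flops / 1e12}%.1f Tflop of compute")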
> >>
> >> Best,
> >> Xiangrui
> >>
> >>
> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com>
> >> wrote:
> >> > I've had some email exchanges with the author of BIDMat: it does exactly
> >> > what you need to get the GPU benefit and writes higher level algorithms
> >> > entirely in the GPU kernels so that the memory stays there as long as
> >> > possible. The restriction with this approach is that it is only offering
> >> > high-level algorithms, so it is not a toolkit for applied mathematics
> >> > research and development, but it works well as a toolkit for higher
> >> > level analysis (e.g. for analysts and practitioners).
> >> >
> >> > I believe BIDMat's approach is the best way to get performance out of
> >> > GPU hardware at the moment, but I also have strong evidence to suggest
> >> > that the hardware will catch up and the memory transfer costs between
> >> > CPU/GPU will disappear, meaning that there will be no need for custom GPU
> >> > kernel implementations. I.e. please continue to use BLAS primitives when
> >> > writing new algorithms and only go to the GPU for an alternative
> >> > optimised implementation.
> >> >
> >> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
> >> > an API that looks like BLAS but takes pointers to special regions of GPU
> >> > memory. Somebody has written a wrapper around CUDA to create a proper
> >> > BLAS library, but it only gives marginal performance over the CPU
> >> > because of the memory transfer overhead.
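> >> >
> >> > To make the distinction concrete, here is a small sketch using netlib-java
> >> > (the cuBLAS side is only described in the comments, since it cannot be
> >> > called through this interface directly): a CPU BLAS call works on ordinary
> >> > JVM/host arrays, whereas cuBLAS expects pointers into device memory, so a
> >> > wrapper exposing it behind the CPU BLAS interface has to allocate, copy in
> >> > and copy back on every call.
> >> >
> >> >   import com.github.fommil.netlib.BLAS
> >> >
> >> >   // 2x2 column-major matrices living in ordinary JVM heap arrays.
> >> >   val a = Array(1.0, 2.0, 3.0, 4.0)
> >> >   val b = Array(5.0, 6.0, 7.0, 8.0)
> >> >   val c = new Array[Double](4)
> >> >   // CPU BLAS: host arrays go straight in. A cuBLAS-backed libblas would
> >> >   // first have to cudaMalloc device buffers, copy a and b over, run
> >> >   // cublasDgemm on the device pointers, then copy c back, on every call.
> >> >   BLAS.getInstance().dgemm("N", "N", 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
> >> >   println(c.mkString(", "))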
> >> >
> >> > This slide from my talk
> >> >
> >> >   http://fommil.github.io/scalax14/#/11/2
> >> >
> >> > says it all. The X axis is matrix size, the Y axis is (logarithmic) time
> >> > to do DGEMM. The black line is the "cheating" time for the GPU and the
> >> > green line is after copying the memory to/from the GPU memory. APUs have
> >> > the potential to eliminate the green line.
> >> >
> >> > Best regards,
> >> > Sam
> >> >
> >> >
> >> >
> >> > "Ulanov, Alexander" <alexander.ula...@hp.com> writes:
> >> >
> >> >> Evan, thank you for the summary. I would like to add some more
> >> >> observations. The GPU that I used is 2.5 times cheaper than the CPU ($100
> >> >> vs $250). They are both 3 years old. I also did a small test with modern
> >> >> hardware, and the new GPU, an Nvidia Titan, was slightly more than one
> >> >> order of magnitude faster than an Intel E5-2650 v2 for the same tests.
> >> >> However, it costs as much as the CPU ($1200). My takeaway is that GPUs
> >> >> are making better price/performance progress.
> >> >>
> >> >>
> >> >>
> >> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than
> >> >> netlib-cuda, and the most reasonable explanation is that it holds the
> >> >> result in GPU memory, as Sam suggested. At the same time, it is OK
> >> >> because you can copy the result back from the GPU only when needed.
> >> >> However, to be sure, I am going to ask the developer of BIDMat at his
> >> >> upcoming talk.
> >> >>
> >> >>
> >> >>
> >> >> Best regards, Alexander
> >> >>
> >> >>
> >> >> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> >> >> Sent: Thursday, February 26, 2015 1:56 PM
> >> >> To: Xiangrui Meng
> >> >> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
> >> >> Sparks
> >> >> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>
> >> >>
> >> >> Btw, I wish people would stop cheating when comparing CPU and GPU
> >> >> timings for things like matrix multiply :-P
> >> >>
> >> >> Please always compare apples with apples and include the time it takes
> >> >> to set up the matrices, send them to the processing unit, do the
> >> >> calculation AND copy the results back to where you need to see them.
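> >> >>
> >> >> Something like this sketch (illustration only; it assumes netlib-java is
> >> >> on the classpath) is the fair measurement: wall-clock time around the
> >> >> whole thing, so if the BLAS underneath happens to be GPU-backed, the
> >> >> transfers get counted too.
> >> >>
> >> >>   // End-to-end: allocate, fill, multiply, and touch the result.
> >> >>   def timeGemm(n: Int): Double = {
> >> >>     val blas = com.github.fommil.netlib.BLAS.getInstance()
> >> >>     val start = System.nanoTime()
> >> >>     val rnd = new scala.util.Random(0)
> >> >>     val a = Array.fill(n * n)(rnd.nextDouble())
> >> >>     val b = Array.fill(n * n)(rnd.nextDouble())
> >> >>     val c = new Array[Double](n * n)
> >> >>     blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
> >> >>     require(!c(n * n - 1).isNaN)  // make sure the result was materialised
> >> >>     (System.nanoTime() - start) / 1e9
> >> >>   }
> >> >>   Seq(1000, 2000, 4000).foreach(n => println(s"n=$n took ${timeGemm(n)} s"))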
> >> >>
> >> >> Ignoring this will make you believe that your GPU is thousands of times
> >> >> faster than it really is. Again, jump to the end of my talk for graphs
> >> >> and more discussion... especially the bit about me being keen on funding
> >> >> to investigate APU hardware further ;-) (I believe it will solve the
> >> >> problem.)
> >> >> On 26 Feb 2015 21:16, "Xiangrui Meng" <men...@gmail.com> wrote:
> >> >> Hey Alexander,
> >> >>
> >> >> I don't quite understand the part where netlib-cublas is about 20x
> >> >> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> >> >> with netlib-java?
> >> >>
> >> >> CC'ed Sam, the author of netlib-java.
> >> >>
> >> >> Best,
> >> >> Xiangrui
> >> >>
> >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
> >> >>> Better documentation for linking would be very helpful!  Here's a
> >> >>> JIRA:
> >> >>> https://issues.apache.org/jira/browse/SPARK-6019
> >> >>>
> >> >>>
> >> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>> Thanks for compiling all the data and running these benchmarks, Alex.
> >> >>>> The big takeaways here can be seen with this chart:
> >> >>>>
> >> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >> >>>>
> >> >>>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >> >>>> BIDMat+GPU) can provide substantial (but less than an order of
> >> >>>> magnitude)
> >> >>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
> >> >>>> netlib-java+openblas-compiled).
> >> >>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
> >> >>>> worse
> >> >>>> than a well-tuned CPU implementation, particularly for larger
> >> >>>> matrices.
> >> >>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
> >> >>>> basically agrees with the author's own benchmarks (
> >> >>>> https://github.com/fommil/netlib-java)
> >> >>>>
> >> >>>> I think that most of our users are in a situation where using GPUs
> >> >>>> may not
> >> >>>> be practical - although we could consider having a good GPU backend
> >> >>>> available as an option. However, *ALL* users of MLlib could benefit
> >> >>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
> >> >>>> implementation. Perhaps we should consider updating the mllib guide
> >> >>>> with a
> >> >>>> more complete section for enabling high performance binaries on OSX
> >> >>>> and
> >> >>>> Linux? Or better, figure out a way for the system to fetch these
> >> >>>> automatically.
> >> >>>>
> >> >>>> - Evan
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander
> >> >>>> <alexander.ula...@hp.com> wrote:
> >> >>>>
> >> >>>>> Just to summarize this thread, I was finally able to make all the
> >> >>>>> performance comparisons that we discussed. It turns out that:
> >> >>>>>
> >> >>>>> BIDMat-cublas >> BIDMat
> >> >>>>> MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >> >>>>>
> >> >>>>> Below is the link to the spreadsheet with full results.
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >> >>>>>
> >> >>>>> One thing still needs exploration: does BIDMat-cublas perform
> >> >>>>> copying
> >> >>>>> to/from machine’s RAM?
> >> >>>>>
> >> >>>>> -----Original Message-----
> >> >>>>> From: Ulanov, Alexander
> >> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >> >>>>> To: Evan R. Sparks
> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >> >>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
> >> >>>>> though the original one discusses a slightly different topic. I was
> >> >>>>> able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is
> >> >>>>> statically linked inside a 60MB library.
> >> >>>>>
> >> >>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >> >>>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
> >> >>>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
> >> >>>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
> >> >>>>>
> >> >>>>> It turns out that the pre-compiled MKL is faster than the precompiled
> >> >>>>> OpenBlas on my machine. Probably, I'll add two more columns with
> >> >>>>> locally compiled openblas and cuda.
> >> >>>>>
> >> >>>>> Alexander
> >> >>>>>
> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >> >>>>> Sent: Monday, February 09, 2015 6:06 PM
> >> >>>>> To: Ulanov, Alexander
> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
> >> >>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
> >> >>>>>
> >> >>>>> It seems like this is going to be somewhat exploratory for a while
> >> >>>>> (and
> >> >>>>> there's probably only a handful of us who really care about fast
> >> >>>>> linear
> >> >>>>> algebra!)
> >> >>>>>
> >> >>>>> - Evan
> >> >>>>>
> >> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander
> >> >>>>> <alexander.ula...@hp.com> wrote:
> >> >>>>> Hi Evan,
> >> >>>>>
> >> >>>>> Thank you for the explanation and the useful link. I am going to build
> >> >>>>> OpenBLAS, link it with Netlib-java, and perform the benchmark again.
> >> >>>>>
> >> >>>>> Do I understand correctly that the BIDMat binaries contain statically
> >> >>>>> linked Intel MKL BLAS? It might be the reason why I am able to run
> >> >>>>> BIDMat without having MKL BLAS installed on my server. If it is true,
> >> >>>>> I wonder if it is OK, because Intel sells this library. Nevertheless,
> >> >>>>> it seems that in my case precompiled MKL BLAS performs better than
> >> >>>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed
> >> >>>>> to be on par with JNI overheads.
> >> >>>>>
> >> >>>>> Though, it might be interesting to link Netlib-java with Intel MKL, as
> >> >>>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
> >> >>>>> (Netlib-java) would be interested in comparing their libraries.
> >> >>>>>
> >> >>>>> Best regards, Alexander
> >> >>>>>
> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >> >>>>> Sent: Friday, February 06, 2015 5:58 PM
> >> >>>>> To: Ulanov, Alexander
> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> I would build OpenBLAS yourself, since good BLAS performance comes from
> >> >>>>> getting cache sizes, etc. set up correctly for your particular hardware -
> >> >>>>> this is often a very tricky process (see, e.g. ATLAS), but we found that
> >> >>>>> on relatively modern Xeon chips, OpenBLAS builds quickly and yields
> >> >>>>> performance competitive with MKL.
> >> >>>>>
> >> >>>>> To make sure the right library is getting used, you have to make sure
> >> >>>>> it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas
> >> >>>>> (the directory containing the built library) will do the trick here.
> >> >>>>>
> >> >>>>> For some examples of getting netlib-java set up on an EC2 node and some
> >> >>>>> example benchmarking code we ran a while back, see:
> >> >>>>> https://github.com/shivaram/matrix-bench
> >> >>>>>
> >> >>>>> In particular, build-openblas-ec2.sh shows you how to build the library
> >> >>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to
> >> >>>>> get the path set up and that library picked up by netlib-java.
> >> >>>>>
> >> >>>>> In this way - you could probably get cuBLAS set up to be used by
> >> >>>>> netlib-java as well.
> >> >>>>>
> >> >>>>> - Evan
> >> >>>>>
> >> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander
> >> >>>>> <alexander.ula...@hp.com> wrote:
> >> >>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >> >>>>> load the right blas? For netlib, there are a few JVM flags, such as
> >> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
> >> >>>>> can force it to use the Java implementation. I am not sure I understand
> >> >>>>> how to force the use of a specific blas (not a specific wrapper for blas).
> >> >>>>>
> >> >>>>> Btw, I have installed openblas (yum install openblas), so I suppose
> >> >>>>> that netlib is using it.
> >> >>>>>
> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >> >>>>> Sent: Friday, February 06, 2015 5:19 PM
> >> >>>>> To: Ulanov, Alexander
> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >> >>>>>
> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> Getting breeze to pick up the right blas library is critical for
> >> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
> >> >>>>> It might make sense to force BIDMat to use the same underlying BLAS
> >> >>>>> library as well.
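> >> >>>>>
> >> >>>>> For example (a sketch; the system property is the same netlib-java
> >> >>>>> switch mentioned above for F2jBLAS, and NativeSystemBLAS / NativeRefBLAS
> >> >>>>> are the other stock implementations): start the JVM with
> >> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS
> >> >>>>> and then check what breeze actually ends up using:
> >> >>>>>
> >> >>>>>   import breeze.linalg.DenseMatrix
> >> >>>>>   // Report the backend netlib-java resolved for this JVM.
> >> >>>>>   println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
> >> >>>>>   // Breeze dispatches dense double matrix products to that same BLAS.
> >> >>>>>   val a = DenseMatrix.rand(2000, 2000)
> >> >>>>>   val b = DenseMatrix.rand(2000, 2000)
> >> >>>>>   val t0 = System.nanoTime()
> >> >>>>>   val c = a * b
> >> >>>>>   println(s"2000x2000 multiply took ${(System.nanoTime() - t0) / 1e9} s")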
> >> >>>>>
> >> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander
> >> >>>>> <alexander.ula...@hp.com> wrote:
> >> >>>>> Hi Evan, Joseph
> >> >>>>>
> >> >>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
> >> >>>>> faster than netlib-java+breeze (sorry for the weird table formatting):
> >> >>>>>
> >> >>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >> >>>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
> >> >>>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
> >> >>>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
> >> >>>>>
> >> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> >> >>>>> Linux, Scala 2.11.
> >> >>>>>
> >> >>>>> Later I will run tests with Cuda. I need to install a new Cuda version
> >> >>>>> for this purpose.
> >> >>>>>
> >> >>>>> Do you have any ideas why breeze-netlib with native blas is so much
> >> >>>>> slower than BIDMat MKL?
> >> >>>>>
> >> >>>>> Best regards, Alexander
> >> >>>>>
> >> >>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
> >> >>>>> Sent: Thursday, February 05, 2015 5:29 PM
> >> >>>>> To: Ulanov, Alexander
> >> >>>>> Cc: Evan R. Sparks; dev@spark.apache.org
> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> Hi Alexander,
> >> >>>>>
> >> >>>>> Using GPUs with Spark would be very exciting. Small comment: Concerning
> >> >>>>> your question earlier about keeping data stored on the GPU rather than
> >> >>>>> having to move it between main memory and GPU memory on each iteration,
> >> >>>>> I would guess this would be critical to getting good performance. If you
> >> >>>>> could do multiple local iterations before aggregating results, then the
> >> >>>>> cost of data movement to the GPU could be amortized (and I believe that
> >> >>>>> is done in practice). Having Spark be aware of the GPU and using it as
> >> >>>>> another part of memory sounds like a much bigger undertaking.
> >> >>>>>
> >> >>>>> Joseph
> >> >>>>>
> >> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander
> >> >>>>> <alexander.ula...@hp.com> wrote:
> >> >>>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
> >> >>>>> John Canny and I am really inspired by his talk and comparisons with
> >> >>>>> Spark MLlib.
> >> >>>>>
> >> >>>>> I am very interested to find out what will be better within Spark:
> >> >>>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair
> >> >>>>> way to benchmark them? Currently I do benchmarks on artificial neural
> >> >>>>> networks in batch mode. While it is not a “pure” test of linear algebra,
> >> >>>>> it involves some other things that are essential to machine learning.
> >> >>>>>
> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >> >>>>> Sent: Thursday, February 05, 2015 1:29 PM
> >> >>>>> To: Ulanov, Alexander
> >> >>>>> Cc: dev@spark.apache.org
> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> >> >>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> >> >>>>> layout and fewer levels of indirection - it's definitely a worthwhile
> >> >>>>> experiment to run. The main speedups I've seen from using it come from
> >> >>>>> highly optimized GPU code for linear algebra. I know that in the past
> >> >>>>> Canny has gone as far as to write custom GPU kernels for
> >> >>>>> performance-critical regions of code. [1]
> >> >>>>>
> >> >>>>> BIDMach is highly optimized for single node performance or performance
> >> >>>>> on small clusters. [2] Once data doesn't fit easily in GPU memory (or
> >> >>>>> can be batched in that way) the performance tends to fall off. Canny
> >> >>>>> argues for hardware/software codesign and as such prefers machine
> >> >>>>> configurations that are quite different from what we find in most
> >> >>>>> commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
> >> >>>>>
> >> >>>>> In contrast, MLlib was designed for horizontal scalability on
> >> >>>>> commodity
> >> >>>>> clusters and works best on very big datasets - order of terabytes.
> >> >>>>>
> >> >>>>> For the most part, these projects developed concurrently to address
> >> >>>>> slightly different use cases. That said, there may be bits of BIDMach we
> >> >>>>> could repurpose for MLlib - keep in mind we need to be careful about
> >> >>>>> maintaining cross-language compatibility for our Java and Python users,
> >> >>>>> though.
> >> >>>>>
> >> >>>>> - Evan
> >> >>>>>
> >> >>>>> [1] - http://arxiv.org/abs/1409.5402
> >> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >> >>>>>
> >> >>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander
> >> >>>>> <alexander.ula...@hp.com> wrote:
> >> >>>>> Hi Evan,
> >> >>>>>
> >> >>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
> >> >>>>> you know what makes it faster than netlib-java?
> >> >>>>>
> >> >>>>> The same group has the BIDMach library that implements machine learning.
> >> >>>>> For some examples they use the Caffe convolutional neural network library
> >> >>>>> owned by another group in Berkeley. Could you elaborate on how these all
> >> >>>>> might be connected with Spark MLlib? If you take BIDMat for linear
> >> >>>>> algebra, why don’t you take BIDMach for optimization and learning?
> >> >>>>>
> >> >>>>> Best regards, Alexander
> >> >>>>>
> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >> >>>>> Sent: Thursday, February 05, 2015 12:09 PM
> >> >>>>> To: Ulanov, Alexander
> >> >>>>> Cc: dev@spark.apache.org
> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >> >>>>>
> >> >>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >> >>>>> blas in
> >> >>>>> many cases.
> >> >>>>>
> >> >>>>> You might consider taking a look at the codepaths that BIDMat
> >> >>>>> (https://github.com/BIDData/BIDMat) takes and comparing them to
> >> >>>>> netlib-java/breeze. John Canny et al. have done a bunch of work
> >> >>>>> optimizing to make this work really fast from Scala. I've run it on my
> >> >>>>> laptop and compared to MKL, and in certain cases it's 10x faster at
> >> >>>>> matrix multiply. There are a lot of layers of indirection here and you
> >> >>>>> really want to avoid data copying as much as possible.
> >> >>>>>
> >> >>>>> We could also consider swapping out BIDMat for Breeze, but that
> >> >>>>> would be
> >> >>>>> a big project and if we can figure out how to get breeze+cublas to
> >> >>>>> comparable performance that would be a big win.
> >> >>>>>
> >> >>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander
> >> >>>>> <alexander.ula...@hp.com> wrote:
> >> >>>>> Dear Spark developers,
> >> >>>>>
> >> >>>>> I am exploring how to make linear algebra operations faster within
> >> >>>>> Spark. One way of doing this is to use the Scala Breeze library that is
> >> >>>>> bundled with Spark. For matrix operations, it employs Netlib-java, which
> >> >>>>> has a Java wrapper for BLAS (basic linear algebra subprograms) and
> >> >>>>> LAPACK native binaries if they are available on the worker node. It also
> >> >>>>> has its own optimized Java implementation of BLAS. It is worth
> >> >>>>> mentioning that native binaries provide better performance only for BLAS
> >> >>>>> level 3, i.e. matrix-matrix operations or general matrix multiplication
> >> >>>>> (GEMM). This is confirmed by the GEMM test on the Netlib-java page
> >> >>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >> >>>>> experiments with training of an artificial neural network
> >> >>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >> >>>>> However, I would like to boost performance more.
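> >> >>>>>
> >> >>>>> A small sketch of why that is (illustration only, calling netlib-java
> >> >>>>> directly): a level-1 routine like ddot does about one flop per element
> >> >>>>> it reads, so the JNI boundary and memory traffic dominate and native
> >> >>>>> code cannot help much, while a level-3 dgemm does many flops per
> >> >>>>> element and easily pays for the crossing.
> >> >>>>>
> >> >>>>>   val blas = com.github.fommil.netlib.BLAS.getInstance()
> >> >>>>>   val n = 2000
> >> >>>>>   val x = Array.fill(n * n)(math.random)
> >> >>>>>   val y = Array.fill(n * n)(math.random)
> >> >>>>>   val z = new Array[Double](n * n)
> >> >>>>>   // Level 1: dot product of two length n*n vectors, ~2*n^2 flops.
> >> >>>>>   val d = blas.ddot(n * n, x, 1, y, 1)
> >> >>>>>   // Level 3: n x n matrix multiply over the same arrays, ~2*n^3 flops.
> >> >>>>>   blas.dgemm("N", "N", n, n, n, 1.0, x, n, y, n, 0.0, z, n)
> >> >>>>>   println(s"ddot = $d, dgemm c(0) = ${z(0)}")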
> >> >>>>>
> >> >>>>> GPU is supposed to work fast with linear algebra and there is an Nvidia
> >> >>>>> CUDA implementation of BLAS, called cublas. I have one Linux server with
> >> >>>>> an Nvidia GPU and I was able to do the following. I linked cublas
> >> >>>>> (instead of cpu-based blas) with the Netlib-java wrapper and put it into
> >> >>>>> Spark, so Breeze/Netlib is using it. Then I did some performance
> >> >>>>> measurements with regards to artificial neural network batch learning in
> >> >>>>> Spark MLlib that involves matrix-matrix multiplications. It turns out
> >> >>>>> that for matrices of size less than ~1000x780, GPU cublas has the same
> >> >>>>> speed as CPU blas. Cublas becomes slower for bigger matrices. It is
> >> >>>>> worth mentioning that it was not a test of ONLY multiplication since
> >> >>>>> there are other operations involved. One of the reasons for the slowdown
> >> >>>>> might be the overhead of copying the matrices from computer memory to
> >> >>>>> graphic card memory and back.
> >> >>>>>
> >> >>>>> So, a few questions:
> >> >>>>> 1) Do these results with CUDA make sense?
> >> >>>>> 2) If the problem is with copy overhead, are there any libraries that
> >> >>>>> allow forcing intermediate results to stay in graphic card memory, thus
> >> >>>>> removing the overhead?
> >> >>>>> 3) Any other options to speed up linear algebra in Spark?
> >> >>>>>
> >> >>>>> Thank you, Alexander
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> ---------------------------------------------------------------------
> >> >>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>
> >> >
> >> > --
> >> > Best regards,
> >> > Sam
> >> >
>
