Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
It seems like this is going to be somewhat exploratory for a while (and
there's probably only a handful of us who really care about fast linear
algebra!)

- Evan

On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi Evan,
>
> Thank you for the explanation and the useful link. I am going to build
> OpenBLAS, link it with Netlib-java, and run the benchmark again.
>
> Do I understand correctly that the BIDMat binaries contain statically
> linked Intel MKL BLAS? That might be why I am able to run BIDMat without
> having MKL installed on my server. If so, I wonder whether that is OK,
> given that Intel sells this library. In any case, it seems that in my
> setup the precompiled MKL BLAS performs better than the precompiled
> OpenBLAS, given that BIDMat and Netlib-java are supposed to have
> comparable JNI overhead.
>
> Still, it might be interesting to link Netlib-java with Intel MKL, as
> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
> (Netlib-java) would be interested in comparing their libraries.
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Friday, February 06, 2015 5:58 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I would build OpenBLAS yourself, since good BLAS performance comes from
> getting cache sizes, etc. set up correctly for your particular hardware.
> This is often a very tricky process (see, e.g., ATLAS), but we found
> that on relatively modern Xeon chips, OpenBLAS builds quickly and yields
> performance competitive with MKL.
>
> To make sure the right library is getting used, you have to make sure it
> is first on the search path - export LD_LIBRARY_PATH=/path/to/blas (the
> directory containing the shared library) will do the trick here.
>
> For some examples of getting netlib-java set up on an EC2 node, and some
> example benchmarking code we ran a while back, see
> https://github.com/shivaram/matrix-bench
>
> In particular, build-openblas-ec2.sh shows you how to build the library
> and set up symlinks correctly, and scala/run-netlib.sh shows you how to
> get the path set up and have that library picked up by netlib-java.
>
> In this way you could probably get cuBLAS set up to be used by
> netlib-java as well.
>
> - Evan
>
> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> Evan, could you elaborate on how to force BIDMat and netlib-java to load
> the right BLAS? For netlib there are a few JVM flags, such as
> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
> can force it to use the Java implementation. I am not sure how to force
> the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
>
> Btw, I have installed OpenBLAS (yum install openblas), so I suppose that
> netlib is using it.
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Friday, February 06, 2015 5:19 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Getting Breeze to pick up the right BLAS library is critical for
> performance. I recommend using OpenBLAS (or MKL, if you already have
> it). It might make sense to force BIDMat to use the same underlying BLAS
> library as well.
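A quick way to confirm which backend netlib-java actually resolved to at
runtime is to ask it directly - a minimal sketch, assuming the standard
netlib-java 1.x API (the class names are netlib-java's stock
implementations):

```scala
// Minimal sketch: print which BLAS backend netlib-java loaded.
// Assumes netlib-java 1.x (com.github.fommil.netlib) is on the classpath.
object BlasCheck {
  def main(args: Array[String]): Unit = {
    val blas = com.github.fommil.netlib.BLAS.getInstance()
    // Typically prints NativeSystemBLAS, NativeRefBLAS, or F2jBLAS.
    println(s"netlib-java BLAS backend: ${blas.getClass.getName}")
  }
}
```

If this prints F2jBLAS when you expected a native library, the
LD_LIBRARY_PATH / symlink setup described above is the first thing to
check.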
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi Evan, Joseph,
>
> I did a few matrix multiplication tests, and BIDMat seems to be ~10x
> faster than netlib-java+breeze:
>
> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> +-------------------------+-------------+-----------------------------------------------+----------------------------+
> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6 GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will run tests with CUDA; I need to install a new CUDA version
> for that.
>
> Do you have any ideas why breeze+netlib with native BLAS is so much
> slower than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting. Small comment: concerning
> your earlier question about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration,
> I would guess this is critical to getting good performance. If you could
> do multiple local iterations before aggregating results, then the cost
> of data movement to the GPU could be amortized (and I believe that is
> done in practice). Having Spark be aware of the GPU and use it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> Thank you for the explanation! I've watched the BIDMach presentation by
> John Canny, and I am really inspired by his talk and the comparisons
> with Spark MLlib.
>
> I am very interested to find out which will be better within Spark:
> BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a fair
> way to benchmark them? Currently I benchmark artificial neural networks
> in batch mode. While that is not a "pure" linear algebra test, it
> involves other operations that are essential to machine learning.
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS were significantly faster than
> netlib-java+OpenBLAS, but if it is much faster, that is probably due to
> data layout and fewer levels of indirection - it's definitely a
> worthwhile experiment to run. The main speedups I've seen from using it
> come from highly optimized GPU code for linear algebra. I know that in
> the past Canny has gone as far as writing custom GPU kernels for
> performance-critical regions of code. [1]
>
> BIDMach is highly optimized for single-node performance, or performance
> on small clusters. [2] Once the data doesn't fit easily in GPU memory
> (or can't be batched that way), performance tends to fall off. Canny
> argues for hardware/software codesign, and as such prefers machine
> configurations that are quite different from what we find in most
> commodity cluster nodes - e.g., 10 disk channels and 4 GPUs.
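Side note on reproducing the Breeze columns of Alexander's table above: a
timing harness along these lines should be close (an untested sketch; the
warm-up, iteration count, and use of wall-clock time are my assumptions,
and the 10000x10000 case will take a very long time on the f2j backend):

```scala
import breeze.linalg.DenseMatrix

// Sketch: time square GEMMs through Breeze (which delegates to netlib-java).
object GemmBench {
  def timeGemm(n: Int, iters: Int): Double = {
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)
    a * b // warm-up: triggers JIT compilation and native library loading
    val start = System.nanoTime()
    var i = 0
    while (i < iters) { a * b; i += 1 }
    (System.nanoTime() - start) / 1e9 / iters // mean seconds per multiply
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(100, 1000)) // add 10000 only with a fast native BLAS
      println(f"${n}x$n * ${n}x$n: ${timeGemm(n, iters = 5)}%.6f s")
}
```

Running the same harness once per JVM flag
(-Dcom.github.fommil.netlib.BLAS=...) keeps the comparison between
backends fair.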
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters, and works best on very big datasets - on the order of
> terabytes.
>
> For the most part, these projects were developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python users,
> though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
> you know what makes it faster than netlib-java?
>
> The same group has the BIDMach library that implements machine learning.
> For some examples they use the Caffe convolutional neural network
> library, maintained by another group at Berkeley. Could you elaborate on
> how all of these might be connected with Spark MLlib? If you take BIDMat
> for linear algebra, why not take BIDMach for optimization and learning?
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
> many cases.
>
> You might consider taking a look at the codepaths that BIDMat
> (https://github.com/BIDData/BIDMat) takes and comparing them to
> netlib-java/breeze. John Canny et al. have done a bunch of work
> optimizing to make this work really fast from Scala. I've run it on my
> laptop and compared to MKL, and in certain cases it's 10x faster at
> matrix multiply. There are a lot of layers of indirection here, and you
> really want to avoid data copying as much as possible.
>
> We could also consider swapping out Breeze for BIDMat, but that would be
> a big project, and if we can figure out how to get breeze+cublas to
> comparable performance, that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within
> Spark. One way of doing this is to use the Scala Breeze library that is
> bundled with Spark. For matrix operations, it employs Netlib-java, which
> wraps native BLAS (basic linear algebra subprograms) and LAPACK binaries
> if they are available on the worker node, and also provides its own
> optimized Java implementation of BLAS. It is worth mentioning that the
> native binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations, or general matrix multiplication (GEMM). This
> is confirmed by the GEMM test on the Netlib-java page
> (https://github.com/fommil/netlib-java). I also confirmed it in my
> experiments with training an artificial neural network
> (https://github.com/apache/spark/pull/1290#issuecomment-70313952).
> However, I would like to boost performance further.
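To make the "BLAS level 3" point concrete, here is roughly what a direct
GEMM call through netlib-java looks like (a hedged sketch: it assumes
netlib-java 1.x, whose dgemm mirrors the Fortran column-major signature):

```scala
import com.github.fommil.netlib.BLAS

// Sketch: C = alpha * A * B + beta * C via netlib-java's dgemm
// (BLAS level 3). Matrices are flat column-major double arrays,
// following the Fortran convention that netlib-java mirrors.
object GemmDirect {
  def main(args: Array[String]): Unit = {
    val m = 2; val n = 2; val k = 2
    val a = Array(1.0, 2.0, 3.0, 4.0) // 2x2, columns (1,2) and (3,4)
    val b = Array(5.0, 6.0, 7.0, 8.0) // 2x2, columns (5,6) and (7,8)
    val c = new Array[Double](m * n)
    BLAS.getInstance().dgemm("N", "N", m, n, k,
      1.0, a, m,  // alpha, A, lda
      b, k,       // B, ldb
      0.0, c, m)  // beta, C, ldc
    println(c.mkString(", ")) // expected: 23.0, 34.0, 31.0, 46.0
  }
}
```

Whether this single call runs in f2j or in a native library is decided
entirely by which backend netlib-java loaded, which is why the search-path
setup discussed earlier matters so much.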
> GPU is supposed to be fast at linear algebra, and there is an Nvidia
> CUDA implementation of BLAS called cublas. I have one Linux server with
> an Nvidia GPU, and I was able to do the following. I linked cublas
> (instead of CPU-based BLAS) with the Netlib-java wrapper and put it into
> Spark, so Breeze/Netlib is using it. Then I did some performance
> measurements of artificial neural network batch learning in Spark MLlib,
> which involves matrix-matrix multiplications. It turns out that for
> matrices of size less than ~1000x780, GPU cublas has the same speed as
> CPU BLAS, and cublas becomes slower for bigger matrices. It is worth
> mentioning that this was not a test of ONLY multiplication, since other
> operations are involved. One of the reasons for the slowdown might be
> the overhead of copying the matrices from main memory to graphics card
> memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is the copy overhead, are there any libraries that
> allow forcing intermediate results to stay in graphics card memory, thus
> removing the overhead?
> 3) Any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander
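On question 2, one pattern worth trying - sketched below, untested, and
assuming the JCuda/JCublas bindings with the legacy cublas helper API - is
to allocate device buffers once, copy the inputs over once, and run many
multiplies before copying anything back, so the transfer cost is
amortized:

```scala
import jcuda.Pointer
import jcuda.jcublas.JCublas

// Sketch: amortize host<->GPU copies by keeping matrices resident on the
// device across many GEMMs. Error handling is omitted for brevity.
object CublasResident {
  def main(args: Array[String]): Unit = {
    val n = 1000
    val elems = n * n
    val a = Array.fill(elems)(scala.util.Random.nextDouble())
    val b = Array.fill(elems)(scala.util.Random.nextDouble())
    val c = new Array[Double](elems)

    JCublas.cublasInit()
    val dA = new Pointer(); val dB = new Pointer(); val dC = new Pointer()
    JCublas.cublasAlloc(elems, 8, dA) // 8 bytes per double
    JCublas.cublasAlloc(elems, 8, dB)
    JCublas.cublasAlloc(elems, 8, dC)

    // Pay the host-to-device copy once ...
    JCublas.cublasSetVector(elems, 8, Pointer.to(a), 1, dA, 1)
    JCublas.cublasSetVector(elems, 8, Pointer.to(b), 1, dB, 1)

    // ... then run many iterations with no further host<->device traffic.
    for (_ <- 1 to 100)
      JCublas.cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n)

    // Copy the result back once at the end.
    JCublas.cublasGetVector(elems, 8, dC, 1, Pointer.to(c), 1)

    JCublas.cublasFree(dA); JCublas.cublasFree(dB); JCublas.cublasFree(dC)
    JCublas.cublasShutdown()
  }
}
```

Whether this helps inside MLlib depends on being able to batch several
local iterations between copies, along the lines Joseph suggested above.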