Re: Using CUDA within Spark / boosting linear algebra
Allen,

Currently it only supports OpenCL, because the code generator we extended targets OpenCL. There is no technical reason CUDA couldn't be supported if people were interested, but it would require rewriting part of the code generator, as well as some #ifdefs in the runtime so that it can be compiled with either OpenCL or CUDA support. A few components already support both OpenCL and CUDA, because they have been reused in other projects that did use CUDA, but not all of them do.

Thanks,
Max

> On Feb 4, 2016, at 9:42 AM, Allen Zhang <allenzhang...@126.com> wrote:
>
> Hi Max,
>
> I will look at it tomorrow, but a quick question: does it support CUDA from Nvidia, or only OpenCL?
>
> Thanks,
> Allen
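[Editor's note] To make the retargeting point concrete: a generator that emits OpenCL kernels differs from a CUDA one mostly in kernel qualifiers and thread-index computation. The sketch below is purely illustrative (it is not SWAT's actual code generator) and emits both flavors of kernel source for a simple scalar map such as i => 2 * i:

```python
# Hypothetical sketch, NOT SWAT's generator: emit equivalent OpenCL and
# CUDA kernel source for an element-wise map over an int array.
def gen_kernel(expr_c, target):
    if target == "opencl":
        return (
            "__kernel void map_kernel(__global const int *in,\n"
            "                         __global int *out, int n) {\n"
            "    int i = get_global_id(0);\n"
            f"    if (i < n) out[i] = {expr_c};\n"
            "}\n")
    elif target == "cuda":
        return (
            "__global__ void map_kernel(const int *in, int *out, int n) {\n"
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            f"    if (i < n) out[i] = {expr_c};\n"
            "}\n")
    raise ValueError(target)

opencl_src = gen_kernel("2 * in[i]", "opencl")
cuda_src = gen_kernel("2 * in[i]", "cuda")
```

The kernel body is shared; only the entry-point qualifier, address-space annotations, and global-index idiom change, which is roughly the scope of the rewrite described above.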
Re: Using CUDA within Spark / boosting linear algebra
Hi all,

I’m jumping on this thread to point out another Spark+GPU project for people to take a look at: https://github.com/agrippa/spark-swat

SWAT (Spark with Accelerated Tasks) is a third-party JAR that sits on top of Spark and uses runtime code generation to convert user-written transformations into OpenCL kernels. SWAT’s lightweight runtime supports multi-GPU systems, managing each device and its memory automatically. You write your own Spark programs, and the runtime takes care of offloading your transformations to the GPUs in your system:

    val rdd = CLWrapper.cl(sc.objectFile(inputPath))
    val next = rdd.map(i => 2 * i).collect

SWAT primarily distinguishes itself in programmability: an explicit goal of this project is to have as few user-visible API changes as possible from what people have come to know and love in Spark. There are a number of fixed-function GPU libraries out there now, so we wanted to look instead at something that could be used to build new but still well-performing Spark apps.

SWAT is currently more of a research project than a production-ready system, so there’s a chance it won’t work out of the box on some systems. With that said, it does have fairly comprehensive functional and code-generation testing. If you’re interested in trying it out and have trouble setting up, feel free to contact me directly. And of course, any questions or feedback from the community are always welcome.

Thanks,

Max
RE: Using CUDA within Spark / boosting linear algebra
Hi Allen,

Thank you for your feedback. An API to launch GPU kernels with JCuda is our first step. One purpose of releasing our prototype is to get feedback; in the future, we may use other wrappers instead of JCuda. We would very much appreciate it if you would suggest or propose APIs to effectively exploit GPUs in Spark, such as those in BIDMat. If we ran BIDMat on top of our columnar storage, the performance boost should be as good as others have reported.

Best Regards,
Kazuaki Ishizaki

From: "Allen Zhang" <allenzhang...@126.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "Ulanov, Alexander" <alexander.ula...@hpe.com>, "Joseph Bradley" <jos...@databricks.com>, "John Canny" <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, "Xiangrui Meng" <men...@gmail.com>, "Sam Halliday" <sam.halli...@gmail.com>
Date: 2016/01/21 21:05
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi Kazuaki,

JCuda is actually a wrapper around the **pure** CUDA API, and your wiki page shows a 3.15x performance boost for logistic regression, which seems slower than BIDMat-cublas or pure CUDA. Could you elaborate on why you chose JCuda rather than JNI to call CUDA directly?

Regards,
Allen Zhang
RE: Using CUDA within Spark / boosting linear algebra
Hi Alexander,

The goal of our columnar storage is to effectively drive GPUs in Spark. One of the important items is to effectively and easily enable highly-tuned GPU libraries such as BIDMach.

We will enable BIDMach with our columnar storage. On the other hand, it is not an easy task to scale BIDMach with the current Spark. I expect that this talk will help us:
http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565

We appreciate your great feedback.

Best Regards,
Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo
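[Editor's note] A minimal illustration of the row-to-column conversion this columnar storage is concerned with, using only the Python standard library (illustrative only, unrelated to the prototype's actual code): Spark's row format keeps one record per object, while a GPU-friendly layout keeps one packed, contiguous buffer per column that can be copied to device memory in a single transfer.

```python
from array import array

# Row format (as Spark sees it): one (id, value) record per element.
rows = [(1, 10.0), (2, 20.0), (3, 30.0)]

# GPU-friendly column format: one packed, contiguous buffer per field.
ids = array('i', (r[0] for r in rows))    # packed machine ints
vals = array('d', (r[1] for r in rows))   # packed 8-byte doubles

# vals.tobytes() is the exact binary blob a host-to-device copy would move.
```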
RE: Using CUDA within Spark / boosting linear algebra
Hi Kazuaki,

JCuda is actually a wrapper around the **pure** CUDA API, and your wiki page shows a 3.15x performance boost for logistic regression, which seems slower than BIDMat-cublas or pure CUDA. Could you elaborate on why you chose JCuda rather than JNI to call CUDA directly?

Regards,
Allen Zhang
RE: Using CUDA within Spark / boosting linear algebra
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues in efficiently exploiting GPUs in Spark:
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between the current row format and a GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by supporting data-partition caching in GPU device memory and by providing binary column storage for data partitions. We would really appreciate it if you would give us comments, suggestions, or feedback.

Best Regards,
Kazuaki Ishizaki

From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra

John, I have to disagree with you there. Dense matrices come up a lot in industry, although your personal experience may be different.

On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS are not very important for most machine learning workloads: at least for non-image workloads in industry (and for image processing you would probably want a deep learning/SGD solution with convolution kernels). E.g., it was only relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. What really matters is sparse BLAS performance, and BIDMat is still an order of magnitude faster there. Those kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well on power-law data. It's also the case that the overall performance of an algorithm is determined by the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's performance on typical problems, you need to make sure that every kernel goes at comparable speed. So the real question is how much faster MLlib routines run on a complete problem with/without GPU acceleration. For BIDMach, it's close to a factor of 10, but that required running entirely on the GPU and making sure every kernel is close to its limit. If you think nvblas would be helpful, you should try it in some end-to-end benchmarks.
-John

On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship OpenBLAS compiled for some common platforms (64-bit Linux, Windows, Mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD.) Or am I missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I double-checked them.
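[Editor's note] Joseph's amortization point can be put in back-of-envelope numbers. The figures below are made up purely for illustration; only the shape of the formula matters: one host-to-device copy that serves k local iterations divides the transfer cost per iteration by k.

```python
# Toy cost model: one host->device transfer amortized over k local
# iterations before results are aggregated. Numbers are illustrative.
def time_per_iteration(transfer_s, compute_s, k):
    """Average seconds per iteration when one copy serves k iterations."""
    return (transfer_s + k * compute_s) / k

naive = time_per_iteration(0.10, 0.01, 1)        # copy every iteration
amortized = time_per_iteration(0.10, 0.01, 100)  # copy once per 100
```

With these made-up costs the per-iteration time drops from 0.11 s to 0.011 s, i.e. the transfer overhead all but disappears once enough local work is done per copy, which is exactly why partition caching in device memory matters.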
RE: Using CUDA within Spark / boosting linear algebra
Hi Kazuaki,

Indeed, moving data to/from the GPU is costly, and this benchmark summarizes those costs for different data sizes with regard to matrix multiplication. These costs are paid for the convenience of using the standard BLAS API that Nvidia's NVBLAS provides. The point is that no code changes are required (in Spark); one just needs to reference the BLAS implementation with the system variable. Naturally, a hardware-specific implementation will always be faster than the default. The benchmark results show that fact by comparing jCuda (by means of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS for large matrices, because it can take advantage of several GPUs and will be faster despite the copying overhead. That is also a known result advertised by Nvidia.

By the way, I don't think that the column/row-friendly format is an issue, because one can use transposed matrices to fit the required format. I believe that is just a software preference.

My suggestion with regard to your prototype would be to make comparisons with Spark's implementation of logistic regression (which does not take advantage of the GPU) and also with BIDMach's (which does). That would give users a better understanding of your implementation's performance. Currently you compare it with Spark's example logistic regression implementation, which is meant as a reference for learning Spark rather than a performance benchmark.

Best regards,
Alexander
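[Editor's note] Alexander's transposed-matrices remark rests on a layout identity worth spelling out: storing A row-major is byte-for-byte the same as storing A^T column-major (the order BLAS expects), so "transposing" to fit the required format costs no copy at all. A stdlib-only sketch:

```python
# Row-major (C) layout of an m x n matrix puts element (i, j) at i*n + j.
# Column-major (Fortran/BLAS) layout puts element (i, j) at j*rows + i.
m, n = 2, 3
A = [[1, 2, 3],
     [4, 5, 6]]
AT = [[A[i][j] for i in range(m)] for j in range(n)]   # A transposed, n x m

buf_rowmajor_A = [A[i][j] for i in range(m) for j in range(n)]
buf_colmajor_AT = [AT[r][c] for c in range(m) for r in range(n)]  # column by column

# The two flat buffers are identical, so the same memory can be handed to a
# column-major BLAS routine simply by calling it "A^T" and swapping dimensions.
```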
RE: Using CUDA within Spark / boosting linear algebra
Hi Everyone, I've updated the benchmark and done experiments with new hardware: 2x Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel E5-2650 v3 @ 2.30GHz). This time I computed the average and median of 10 runs for each experiment and approximated FLOPS. Results are available on Google Docs (old experiments are in the other 2 sheets): https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Benchmark code: https://github.com/avulanov/scala-blas Best regards, Alexander

From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Thursday, March 26, 2015 9:27 AM To: John Canny Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear algebra

John, I have to disagree with you there. Dense matrices come up a lot in industry, although your personal experience may be different.

On 26 Mar 2015 16:20, "John Canny" <ca...@berkeley.edu> wrote: I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS are not very important for most machine learning workloads: at least for non-image workloads in industry (and for image processing you would probably want a deep learning/SGD solution with convolution kernels). E.g., it was only relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. What really matters is sparse BLAS performance. BIDMat is still an order of magnitude faster there. Those kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well on power-law data. It's also the case that the overall performance of an algorithm is determined by the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's performance on typical problems, you need to make sure that every kernel goes at comparable speed. So the real question is how much faster MLlib routines are on a complete problem with/without GPU acceleration.
For BIDMach, it's close to a factor of 10. But that required running entirely on the GPU, and making sure every kernel is close to its limit. -John If you think nvblas would be helpful, you should try it in some end-to-end benchmarks.

On 3/25/15, 6:23 PM, Evan R. Sparks wrote: Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship OpenBLAS compiled for some common platforms (64-bit Linux, Windows, Mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD.) Or am I missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote: As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not do the multiplication, due to the NVBLAS_TILE_DIM parameter in "nvblas.conf", and returned a zero matrix. My previously posted results with nvblas are matrix copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

-Original Message- From: Ulanov, Alexander Sent: Wednesday, March 25, 2015 2:31 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again, I finally managed to use nvblas within Spark+netlib-java.
It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at the 2013 GPU conference (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of different libraries; I just want to pick the library that does dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C CBLAS functions. So, one needs a CBLAS shared library to use nvblas through netlib-java.
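The NVBLAS_TILE_DIM issue above boils down to a few lines of configuration. A minimal nvblas.conf sketch follows, using keys from the NVBLAS documentation; the library path and tile size are illustrative (the tile dimension must be tuned to your card's memory and typical matrix sizes, per the experience reported above):

```
# nvblas.conf -- read from the working directory or NVBLAS_CONFIG_FILE
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so   # CPU fallback BLAS (illustrative path)
NVBLAS_GPU_LIST ALL                             # use every visible GPU
NVBLAS_TILE_DIM 1024                            # default 2048 was too large for the card above
NVBLAS_AUTOPIN_MEM_ENABLED                      # pin host memory for faster transfers
```

With the default tile size, as described above, the GEMM silently returned a zero matrix, so it is worth verifying results numerically after any configuration change.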
Re: Using CUDA within Spark / boosting linear algebra
Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you could send the instructions to netlib-java as part of the README. Hopefully we don't need to modify netlib-java code to use nvblas. Best, Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen <so...@cloudera.com> wrote: The license issue is with libgfortran, rather than OpenBLAS. (FWIW I am going through the motions to get OpenBLAS set up by default on CDH in the near future, and the hard part is just handling libgfortran.)

On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote: Alright Sam - you are the expert here. If the GPL issues are unavoidable, that's fine - what is the exact bit of code that is GPL? The suggestion to use OpenBLAS is not to say it's the best option, but that it's a *free, reasonable default* for many users - keep in mind the most common deployment for Spark/MLlib is on 64-bit Linux on EC2 [1]. Additionally, for many of the problems we're targeting, this reasonable default can provide a 1-2 orders of magnitude improvement in performance over the f2jblas implementation that netlib-java falls back on.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
RE: Using CUDA within Spark / boosting linear algebra
Hi Sam, What is the best way to do it? Should I clone netlib-java, edit readme.md, and make a PR? Best regards, Alexander

-Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, March 30, 2015 2:43 PM To: Sean Owen Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; jfcanny Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you could send the instructions to netlib-java as part of the README. Hopefully we don't need to modify netlib-java code to use nvblas. Best, Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen <so...@cloudera.com> wrote: The license issue is with libgfortran, rather than OpenBLAS. (FWIW I am going through the motions to get OpenBLAS set up by default on CDH in the near future, and the hard part is just handling libgfortran.)

On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote: Alright Sam - you are the expert here. If the GPL issues are unavoidable, that's fine - what is the exact bit of code that is GPL? The suggestion to use OpenBLAS is not to say it's the best option, but that it's a *free, reasonable default* for many users - keep in mind the most common deployment for Spark/MLlib is on 64-bit Linux on EC2 [1]. Additionally, for many of the problems we're targeting, this reasonable default can provide a 1-2 orders of magnitude improvement in performance over the f2jblas implementation that netlib-java falls back on.
Re: Using CUDA within Spark / boosting linear algebra
I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS are not very important for most machine learning workloads: at least for non-image workloads in industry (and for image processing you would probably want a deep learning/SGD solution with convolution kernels). E.g., it was only relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. What really matters is sparse BLAS performance. BIDMat is still an order of magnitude faster there. Those kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well on power-law data. It's also the case that the overall performance of an algorithm is determined by the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's performance on typical problems, you need to make sure that every kernel goes at comparable speed. So the real question is how much faster MLlib routines are on a complete problem with/without GPU acceleration. For BIDMach, it's close to a factor of 10. But that required running entirely on the GPU, and making sure every kernel is close to its limit. -John If you think nvblas would be helpful, you should try it in some end-to-end benchmarks.

On 3/25/15, 6:23 PM, Evan R. Sparks wrote: Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship OpenBLAS compiled for some common platforms (64-bit Linux, Windows, Mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD.) Or am I missing something?
On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote: As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not do the multiplication, due to the NVBLAS_TILE_DIM parameter in nvblas.conf, and returned a zero matrix. My previously posted results with nvblas are matrix copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

-Original Message- From: Ulanov, Alexander Sent: Wednesday, March 25, 2015 2:31 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again, I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at the 2013 GPU conference (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of different libraries; I just want to pick the library that does dense matrix multiplication best for my task. P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C CBLAS functions. So, one needs a CBLAS shared library to use nvblas through netlib-java.
Fedora does not have cblas (but Debian and Ubuntu do), so I needed to compile it. I could not use cblas from ATLAS or OpenBLAS because they link to their own implementations and not to Fortran BLAS. Best regards, Alexander

-Original Message- From: Ulanov, Alexander Sent: Tuesday, March 24, 2015 6:57 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi, I am trying to use nvblas with netlib-java from Spark. nvblas functions should replace the current BLAS function calls once LD_PRELOAD is set, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following: export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
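The preload-based setup described in this message can be collected into a single launch script. This is a sketch using the CUDA 6.5 paths quoted above; NVBLAS_CONFIG_FILE is a documented nvblas environment variable, and the final diagnostic line is a generic Linux technique for confirming the preload actually reached the JVM process, not something from the original thread:

```
#!/bin/bash
# Sketch: launch spark-shell with nvblas preloaded (paths per the message above).
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH
export NVBLAS_CONFIG_FILE=$PWD/nvblas.conf     # where nvblas finds its config
LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

# In another terminal, check whether the JVM actually mapped libnvblas
# (replace <jvm-pid> with the PID shown by nvidia-smi or jps):
#   grep nvblas /proc/<jvm-pid>/maps
```

If the grep comes back empty, LD_PRELOAD was dropped somewhere between the launcher script and the JVM, which would explain the symptoms reported later in the thread.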
Re: Using CUDA within Spark / boosting linear algebra
I'm not at all surprised ;-) I fully expect the GPU performance to get better automatically as the hardware improves. Netlib natives still need to be shipped separately. I'd also oppose any move to make OpenBLAS the default - it is not always better, and I think natives really need DevOps buy-in. It's not the right solution for everybody.

On 26 Mar 2015 01:23, Evan R. Sparks <evan.spa...@gmail.com> wrote: Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship OpenBLAS compiled for some common platforms (64-bit Linux, Windows, Mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD.) Or am I missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote: As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not do the multiplication, due to the NVBLAS_TILE_DIM parameter in nvblas.conf, and returned a zero matrix. My previously posted results with nvblas are matrix copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

-Original Message- From: Ulanov, Alexander Sent: Wednesday, March 25, 2015 2:31 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
Sparks; jfcanny Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again, I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at the 2013 GPU conference (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of different libraries; I just want to pick the library that does dense matrix multiplication best for my task. P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C CBLAS functions. So, one needs a CBLAS shared library to use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu do), so I needed to compile it. I could not use cblas from ATLAS or OpenBLAS because they link to their own implementations and not to Fortran BLAS. Best regards, Alexander

-Original Message- From: Ulanov, Alexander Sent: Tuesday, March 24, 2015 6:57 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi, I am trying to use nvblas with netlib-java from Spark. nvblas functions should replace the current BLAS function calls once LD_PRELOAD is set, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark.
I run the following:

export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set to use the GPU:

+------------------------------------------------------------------+
| Processes:                                            GPU Memory |
|  GPU   PID  Type  Process name                             Usage |
|==================================================================|
|    0  8873     C  bash                                     39MiB |
|    0  8910     C  /usr/lib/jvm/java-1.7.0/bin/java         39MiB |
+------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following: 15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000. Could you suggest why LD_PRELOAD might not affect the Spark shell?
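The earlier remark that smaller matrices may be faster on the CPU can be made concrete with a back-of-the-envelope model: GPU offload only pays off once the 2n^3 flops dwarf the ~3n^2 doubles moved over PCIe. The rates below are assumptions for illustration only, not measurements from this thread:

```java
// Sketch: when does offloading an n x n DGEMM to a GPU pay off?
public class OffloadModel {
    // All three rates are illustrative assumptions, not measured values.
    static final double PCIE_BYTES_PER_SEC = 8e9;  // host<->device bandwidth
    static final double GPU_FLOPS = 1e12;          // sustained GPU DGEMM rate
    static final double CPU_FLOPS = 1e11;          // sustained CPU DGEMM rate

    static double gpuSeconds(long n) {
        double flops = 2.0 * n * n * n;
        double bytes = 3.0 * n * n * 8;            // copy A and B in, C out (doubles)
        return flops / GPU_FLOPS + bytes / PCIE_BYTES_PER_SEC;
    }

    static double cpuSeconds(long n) {
        return 2.0 * n * n * n / CPU_FLOPS;
    }

    public static void main(String[] args) {
        for (long n : new long[]{100, 1000, 10000})
            System.out.printf("n=%d  gpu=%.2e s  cpu=%.2e s%n",
                              n, gpuSeconds(n), cpuSeconds(n));
    }
}
```

Under these assumed rates the CPU wins at n=100 and the GPU wins by n=1000, which matches the qualitative advice in the message above; the exact crossover depends entirely on the real hardware.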
Re: Using CUDA within Spark / boosting linear algebra
BTW, OpenBLAS requires GPL runtime binaries which are typically considered system libraries (these fall under something similar to the Java classpath exception rule)... so it's basically impossible to distribute OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in Spark right now to clear up something of this nature. On a more technical level, I'd recommend watching my talk at ScalaX, which explains in detail why high performance only comes from machine-optimised binaries, which requires DevOps buy-in (and I'd recommend using MKL on the CPU anyway, not OpenBLAS). On an even deeper level, using natives has consequences for JIT and GC that aren't suitable for everybody, and we'd really like people to go into that with their eyes wide open.

On 26 Mar 2015 07:43, Sam Halliday <sam.halli...@gmail.com> wrote: I'm not at all surprised ;-) I fully expect the GPU performance to get better automatically as the hardware improves. Netlib natives still need to be shipped separately. I'd also oppose any move to make OpenBLAS the default - it is not always better, and I think natives really need DevOps buy-in. It's not the right solution for everybody.

On 26 Mar 2015 01:23, Evan R. Sparks <evan.spa...@gmail.com> wrote: Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship OpenBLAS compiled for some common platforms (64-bit Linux, Windows, Mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD.) Or am I missing something?
RE: Using CUDA within Spark / boosting linear algebra
Hi again, I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at the 2013 GPU conference (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of different libraries; I just want to pick the library that does dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C CBLAS functions. So, one needs a CBLAS shared library to use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu do), so I needed to compile it. I could not use cblas from ATLAS or OpenBLAS because they link to their own implementations and not to Fortran BLAS. Best regards, Alexander

-Original Message- From: Ulanov, Alexander Sent: Tuesday, March 24, 2015 6:57 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi, I am trying to use nvblas with netlib-java from Spark. nvblas functions should replace the current BLAS function calls once LD_PRELOAD is set, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark.
I run the following:

export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set to use the GPU:

+------------------------------------------------------------------+
| Processes:                                            GPU Memory |
|  GPU   PID  Type  Process name                             Usage |
|==================================================================|
|    0  8873     C  bash                                     39MiB |
|    0  8910     C  /usr/lib/jvm/java-1.7.0/bin/java         39MiB |
+------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following: 15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000. Could you suggest why LD_PRELOAD might not affect the Spark shell? Best regards, Alexander

From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Monday, March 09, 2015 6:01 PM To: Ulanov, Alexander Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...

On 9 Mar 2015 21:08, Ulanov, Alexander <alexander.ula...@hp.com> wrote: Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander

-Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Tuesday, March 03, 2015 1:54 PM To: Xiangrui Meng; Joseph Bradley Cc: Evan R.
Sparks; Ulanov, Alexander; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng <men...@gmail.com> writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote: Better documentation for linking
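The Fortran-vs-CBLAS mismatch mentioned in the P.S. earlier comes down to two conventions: Fortran BLAS exposes symbols like dgemm_ and expects column-major arrays, while CBLAS is a C wrapper that can also accept row-major ones. The layout difference can be illustrated in plain Java (no BLAS involved; the helper names are invented for illustration):

```java
// Sketch: where element (i, j) of an m x n matrix lives in a flat array
// under the two storage conventions BLAS libraries disagree about.
public class Layouts {
    // Fortran/column-major: columns are contiguous, stride m between columns.
    static double colMajor(double[] a, int m, int i, int j) { return a[i + j * m]; }

    // C/row-major: rows are contiguous, stride n between rows.
    static double rowMajor(double[] a, int n, int i, int j) { return a[i * n + j]; }

    public static void main(String[] args) {
        // The 2x3 matrix [[1,2,3],[4,5,6]] in each layout:
        double[] col = {1, 4, 2, 5, 3, 6};   // stored column by column
        double[] row = {1, 2, 3, 4, 5, 6};   // stored row by row
        System.out.println(colMajor(col, 2, 1, 2) + " " + rowMajor(row, 3, 1, 2)); // prints "6.0 6.0"
    }
}
```

This is why a CBLAS shim compiled against the Fortran symbols is needed, as described in the thread: the shim translates the calling convention (and, for row-major inputs, the layout) before delegating to the Fortran entry points that nvblas intercepts.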
Re: Using CUDA within Spark / boosting linear algebra
That would be a difficult task that would only benefit users of netlib-java. MultiBLAS is easily implemented (although a lot of boilerplate) and benefits all BLAS users on the system. If anyone knows of a funding route for it, I'd love to hear from them, because it's too much work for me to take on at the moment as a hobby.

On 25 Mar 2015 22:16, Dmitriy Lyubimov <dlie...@gmail.com> wrote: Sam, would it be easier to hack netlib-java to allow multiple (configurable) library contexts? And so enable 3rd-party configurations and optimizers to make their own choices until then?

On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday <sam.halli...@gmail.com> wrote: Yeah, MultiBLAS... it is dynamic. Except, I haven't written it yet :-P

On 25 Mar 2015 22:06, Ulanov, Alexander <alexander.ula...@hp.com> wrote: Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols from the provided libblas.so.3 library at runtime. So, you can switch at runtime by providing another library. Sam, please suggest if there is another way.

*From:* Dmitriy Lyubimov [mailto:dlie...@gmail.com] *Sent:* Wednesday, March 25, 2015 2:55 PM *To:* Ulanov, Alexander *Cc:* Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny *Subject:* Re: Using CUDA within Spark / boosting linear algebra

Alexander, does using netlib imply that one cannot switch between CPU and GPU BLAS alternatives at will at the same time? The choice is always determined by linking alternatives to libblas.so, right?

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote: Hi again, I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice.
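The runtime-switching idea Dmitriy raises can be sketched as a thin dispatch layer: instead of committing to one backend at link time via libblas.so.3, pick a backend per call (for example, by matrix size). This toy Java sketch only illustrates the shape such a MultiBLAS-like layer might take; the interface, the naive multiply, and the 512 threshold are all invented for illustration:

```java
public class MultiBlasSketch {
    interface Gemm { double[] multiply(double[] a, double[] b, int n); }

    // Naive n x n row-major multiply standing in for a real CPU BLAS backend.
    static final Gemm CPU = (a, b, n) -> {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
        return c;
    };

    // A GPU-backed implementation would go here; we reuse CPU as a stand-in.
    static final Gemm GPU = CPU;

    // Dispatch: small inputs stay on the CPU (transfer cost), large ones offload.
    static Gemm pick(int n) { return n < 512 ? CPU : GPU; }

    public static void main(String[] args) {
        double[] id = {1, 0, 0, 1}, m = {1, 2, 3, 4};
        // identity * m == m, computed via whichever backend pick() chooses
        System.out.println(java.util.Arrays.toString(pick(2).multiply(id, m, 2)));
    }
}
```

The appeal of doing this below netlib-java, as Sam suggests with MultiBLAS, is that every BLAS consumer on the system would get the same dynamic choice, not just JVM users.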
Re: Using CUDA within Spark / boosting linear algebra
If you write it up I'll add it to the netlib-java wiki :-) BTW, does it automatically flip between CPU/GPU? I've a project called MultiBLAS which was going to do this; it should be easy (but boring to write).

On 25 Mar 2015 22:00, Evan R. Sparks <evan.spa...@gmail.com> wrote: Alex - great stuff, and the nvblas numbers are pretty remarkable (almost too good... did you check the results for correctness? Also, is it possible that the unified memory model of nvblas is somehow hiding PCI transfer time?) This last bit (getting nvblas + netlib-java to play together) sounds like it's non-trivial and took you a while to figure out! Would you mind posting a gist or something of maybe the shell scripts/exports you used to make this work? I can imagine it being highly useful for others in the future. Thanks! Evan

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote: Hi again, I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at the 2013 GPU conference (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Just in case: these tests are not meant to generalize the performance of different libraries; I just want to pick the library that does dense matrix multiplication best for my task. P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C CBLAS functions. So, one needs a CBLAS shared library to use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu do), so I needed to compile it.
I could not use cblas from ATLAS or OpenBLAS because they link to their own implementations and not to the Fortran BLAS. Best regards, Alexander

-----Original Message----- From: Ulanov, Alexander Sent: Tuesday, March 24, 2015 6:57 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi, I am trying to use nvblas with netlib-java from Spark. The nvblas functions should replace the current BLAS calls after executing LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:

export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is using the GPU:

+-----------------------------------------------------------+
| Processes:                                     GPU Memory |
|  GPU   PID  Type  Process name                 Usage      |
|===========================================================|
|    0  8873   C   bash                              39MiB  |
|    0  8910   C   /usr/lib/jvm/java-1.7.0/bin/java  39MiB  |
+-----------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following:

15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so

So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication still executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000. Could you suggest why LD_PRELOAD might not affect the Spark shell? Best regards, Alexander

From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Monday, March 09, 2015 6:01 PM To: Ulanov, Alexander Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Everyone, I've updated the benchmark as Xiangrui suggested: added the comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander
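The symbol-naming mismatch Alexander describes (nvblas exports the Fortran BLAS entry points such as dgemm_, while netlib-java calls the CBLAS interface, e.g. cblas_dgemm) can be verified directly before going through a whole Spark run. A minimal sketch using Python's ctypes, assuming a Linux system; the library paths in the comments are illustrative, not prescribed by the thread:

```python
import ctypes
import ctypes.util

def exports_symbol(library, name):
    """Return True if the shared library resolves the given symbol."""
    try:
        lib = ctypes.CDLL(library)
        getattr(lib, name)  # raises AttributeError if the symbol is absent
        return True
    except (OSError, AttributeError):
        return False

# Illustrative checks (paths depend on your CUDA install):
#   exports_symbol("/usr/local/cuda-6.5/lib64/libnvblas.so", "dgemm_")      # Fortran entry point
#   exports_symbol("/usr/local/cuda-6.5/lib64/libnvblas.so", "cblas_dgemm") # absent -> need the cblas shim
```

If the second check comes back False, netlib-java's CBLAS calls cannot land in nvblas directly, which is exactly why a separate cblas shim library had to be compiled on Fedora.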
RE: Using CUDA within Spark / boosting linear algebra
Sure, I will write a how-to after I re-check the results.
RE: Using CUDA within Spark / boosting linear algebra
Netlib knows nothing about the GPU (or CPU); it just uses the cblas symbols from whatever libblas.so.3 library is provided at runtime. So you can switch implementations at runtime by providing another library. Sam, please suggest if there is another way.

From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] Sent: Wednesday, March 25, 2015 2:55 PM To: Ulanov, Alexander Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny Subject: Re: Using CUDA within Spark / boosting linear algebra

Alexander, does using netlib imply that one cannot switch between CPU and GPU BLAS alternatives at will at the same time? Is the choice always determined by linking alternatives to libblas.so?
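The mechanism behind Alexander's answer is ordinary dynamic linking: the process binds to whatever shared object satisfies the soname when symbols are resolved, so repointing libblas.so.3 (or LD_PRELOADing another library) swaps the BLAS backend without recompiling anything. A small illustration of that resolution with ctypes, using libm as a stand-in for libblas.so.3 (assumption: a glibc Linux system):

```python
import ctypes

# Load a library by soname, exactly as the dynamic linker does for
# libblas.so.3: whichever file currently provides that soname supplies
# the implementation behind the symbol.
libm = ctypes.CDLL("libm.so.6")  # stand-in for "libblas.so.3"
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))  # 1.0
```

Swapping the file behind the soname (e.g. via Debian's alternatives system or LD_PRELOAD) changes which implementation answers the same symbol name, which is all that "switching between OpenBLAS, MKL, or nvblas" amounts to here.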
RE: Using CUDA within Spark / boosting linear algebra
As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not actually do the multiplication, due to the NVBLAS_TILE_DIM parameter in nvblas.conf, and returned a zero matrix; my previously posted nvblas results measured only matrix copying. The default NVBLAS_TILE_DIM=2048 is too big for my graphics card/matrix size, so I hand-picked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
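A cheap correctness check would have caught the zero-matrix result much earlier: compare the accelerated GEMM output against a naive reference multiply on small inputs, and reject trivially zero output. A pure-Python sketch; the helper names are my own, not from the thread:

```python
def naive_matmul(a, b):
    """Reference O(n^3) multiply for small matrices given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def check_gemm(a, b, c, tol=1e-9):
    """Validate an accelerated result c ?= a*b against the naive reference.
    Also rejects an all-zero c -- the failure mode the NVBLAS_TILE_DIM
    misconfiguration produced above."""
    ref = naive_matmul(a, b)
    close = all(abs(c[i][j] - ref[i][j]) <= tol
                for i in range(len(ref)) for j in range(len(ref[0])))
    nontrivial = any(abs(x) > tol for row in c for x in row)
    return close and nontrivial
```

Running this on a couple of small random matrices after every benchmark configuration change costs nothing compared to the benchmark itself and guards against measuring memory copies instead of compute.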
Re: Using CUDA within Spark / boosting linear algebra
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost too good... did you check the results for correctness? Also, is it possible that the unified memory model of nvblas is somehow hiding the PCI transfer time?). This last bit (getting nvblas + netlib-java to play together) sounds like it's non-trivial and took you a while to figure out. Would you mind posting a gist of the shell scripts/exports you used to make this work? I can imagine it being highly useful for others in the future. Thanks! Evan
Re: Using CUDA within Spark / boosting linear algebra
Alex, I think you should recheck your numbers. Both BIDMat and nvblas are wrappers for cublas: the speeds are identical, except on machines with multiple GPUs, which nvblas exploits and cublas doesn't. It would be a good idea to add a column with Gflop throughput. Your numbers for the BIDMat 10k x 10k multiply give about 300 single-precision Gflops, which seems about right for a Quadro 4000 (current-generation devices are 10x faster than a 4000). Your numbers for netlib-nvblas would indicate a double-precision throughput of 8 Tflops, which is physically impossible on that device. It shouldn't matter which interface you use if you have a single GPU. -John
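John's sanity check is simple arithmetic: an n x n by n x n dense multiply costs roughly 2n^3 floating-point operations, so dividing by the measured wall time gives a throughput to compare against the device's published peak. A tiny helper for the suggested Gflops column (my own sketch, not from the thread):

```python
def gemm_gflops(n, seconds):
    """Approximate Gflop/s for an n x n by n x n dense matrix multiply,
    using the standard 2*n^3 operation count."""
    return 2.0 * n ** 3 / seconds / 1e9
```

For example, a 10,000 x 10,000 multiply is about 2e12 flops, so finishing it in ~0.25 s would imply 8 Tflop/s in double precision, which is the physically impossible figure that exposed the zero-matrix bug.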
Re: Using CUDA within Spark / boosting linear algebra
Reynold, Prof. Canny gave me the slides yesterday; I will post the link to the slides to both the SF Big Analytics and SF Machine Learning meetups. Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin r...@databricks.com wrote:

Thanks for chiming in, John. I missed your meetup last night - do you have any writeups or slides about roofline design? In particular, I'm curious about what optimizations are available for power-law dense * sparse? (I don't have any background in optimizations)

On Thu, Mar 12, 2015 at 8:50 PM, jfcanny ca...@berkeley.edu wrote:

If you're contemplating GPU acceleration in Spark, it's important to look beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the datasets we've tested in BIDMach, and we've tried to make those representative of industry machine learning workloads. Unless you're crunching images or audio, the majority of data will be very sparse and power-law distributed. You need a good sparse BLAS, and in practice it seems like you need a sparse BLAS tailored for power-law data. We had to write our own, since the NVIDIA libraries didn't perform well on typical power-law data; Intel MKL's sparse BLAS also have issues, and we only use some of them. You also need 2D reductions, scan operations, slicing, element-wise transcendental functions and operators, many kinds of sort, random number generators etc., and some kind of memory-management strategy. Some of this was layered on top of Thrust in BIDMat, but most had to be written from scratch. It's all been rooflined, typically to the memory throughput of current GPUs (around 200 GB/s). When you have all this, you can write learning algorithms in the same high-level primitives available in Breeze or Numpy/Scipy. It's literally the same in BIDMat, since the generic matrix operations are implemented on both CPU and GPU, so the same code runs on either platform. A lesser-known fact is that GPUs are around 10x faster for *all* those operations, not just dense BLAS.
It's mostly due to faster streaming memory speeds, but some kernels (random number generation and transcendentals) are more than an order of magnitude faster, thanks to specialized hardware for power series on the GPU chip. When you have all this, there is no need to move data back and forth across the PCI bus: the CPU only has to pull chunks of data off disk, unpack them, and feed them to the available GPUs. Most models fit comfortably in GPU memory these days (4-12 GB). With minibatch algorithms you can push TBs of data through the GPU this way.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
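The "rooflining" John describes bounds a kernel's attainable throughput by the lesser of peak compute and memory bandwidth times arithmetic intensity. A sketch of that bound; the 200 GB/s figure is from his message, while the peak-compute and intensity numbers in the example are purely illustrative assumptions:

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Attainable Gflop/s under the roofline model: a kernel is capped
    either by raw compute or by how fast memory can feed it."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# A low-intensity (streaming) kernel on a ~200 GB/s GPU is memory-bound:
# roofline_gflops(3000.0, 200.0, 0.5) caps at 100 Gflop/s, far below peak.
```

This is why sparse and element-wise operations typically roofline to memory throughput: their flops-per-byte is low, so the bandwidth term, not peak compute, sets the target a tuned kernel should reach.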
Re: Using CUDA within Spark / boosting linear algebra
Hi Reynold, I left Chester with a copy of the slides, so I assume they'll be posted on the SF ML or Big Data sites. We have a draft paper under review; I can ask the co-authors about arXiv'ing it.

We have a few heuristics for power-law data. One of them is to keep the feature set sorted by frequency. Power-law data has roughly the same mass in each power-of-two range of feature frequency. By keeping the most frequent features together, you get a lot more value out of the caches on the device (even GPUs have them, albeit smaller ones). E.g. with 100 million features, 1/2 of the feature instances will be in the range 1,...,10,000; if those are consecutive, they will all hit a fast cache. Another 1/4 will be in 1,...,1,000,000, hitting the next cache, etc.

Another is to subdivide sparse matrices using the vector of elements rather than rows or columns. Splitting power-law matrices by either rows or columns gives very uneven splits. That means we store sparse matrices in coordinate form rather than compressed row or column format.

Other than that, rooflining gives you a goal that you should be able to reach. If you aren't at the limit, just knowing that gives you a target to aim at. You can try profiling the kernel to figure out why it's slower than it should be. There are a few common reasons (low occupancy, imbalanced thread blocks, thread divergence) that you can discover with the profiler, and then hopefully solve. -John
Dense BLAS probably account for only 10% of the cycles in the datasets we've tested in BIDMach, and we've tried to make them representative of industry machine learning workloads. Unless you're crunching images or audio, the majority of data will be very sparse and power law distributed. You need a good sparse BLAS, and in practice it seems like you need a sparse BLAS tailored for power-law data. We had to write our own since the NVIDIA libraries didnt perform well on typical power-law data. Intel MKL sparse BLAS also have issues and we only use some of them. You also need 2D reductions, scan operations, slicing, element-wise transcendental functions and operators, many kinds of sort, random number generators etc, and some kind of memory management strategy. Some of this was layered on top of Thrust in BIDMat, but most had to be written from scratch. Its all been rooflined, typically to memory throughput of current GPUs (around 200 GB/s). When you have all this you can write Learning Algorithms in the same high-level primitives available in Breeze or Numpy/Scipy. Its literally the same in BIDMat, since the generic matrix operations are implemented on both CPU and GPU, so the same code runs on either platform. A lesser known fact is that GPUs are around 10x faster for *all* those operations, not just dense BLAS. Its mostly due to faster streaming memory speeds, but some kernels (random number generation and transcendentals) are more than an order of magnitude thanks to some specialized hardware for power series on the GPU chip. When you have all this there is no need to move data back and forth across the PCI bus. The CPU only has to pull chunks of data off disk, unpack them, and feed them to the available GPUs. Most models fit comfortably in GPU memory these days (4-12 GB). With minibatch algorithms you can push TBs of data through the GPU this way. 
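The "rooflined to memory throughput" target John describes can be computed directly: a streaming kernel cannot finish faster than bytes moved divided by memory bandwidth. A sketch using the ~200 GB/s figure quoted above (the element count and the saxpy-like access pattern are illustrative assumptions):

```scala
// Sketch: roofline lower bound for a memory-bound elementwise kernel,
// e.g. y = a*x + y over 1e8 single-precision elements.
val elems = 100000000L
val bytesMoved = 12.0 * elems    // read x, read y, write y: 3 * 4 bytes each
val bandwidth = 200e9            // ~200 GB/s streaming bandwidth (from the post)
val minSeconds = bytesMoved / bandwidth
// A measured kernel time far above minSeconds points at occupancy,
// load-imbalance, or divergence problems worth profiling.
```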
RE: Using CUDA within Spark / boosting linear algebra
I can run benchmark on another machine with GPU nVidia Titan and Intel Xeon E5-2650 v2, although it runs Windows and I have to run Linux tests in VirtualBox. It would also be interesting to add results on netlib+nvblas; however, I am not sure I understand in detail how to build this and will appreciate any help from you ☺ From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Monday, March 09, 2015 6:01 PM To: Ulanov, Alexander Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware... On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander -Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Tuesday, March 03, 2015 1:54 PM To: Xiangrui Meng; Joseph Bradley Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-) Xiangrui Meng men...@gmail.com writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas.
What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled). 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this basically agrees with the author's own benchmarks ( https://github.com/fommil/netlib-java) I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Just to summarize this thread, I was finally able to make all performance comparisons that we discussed.
It turns out that: BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas. Below is the link to the spreadsheet with full results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing One thing still needs exploration: does BIDMat-cublas perform copying to/from machine’s RAM? -Original Message- From: Ulanov, Alexander Sent: Tuesday, February 10, 2015 2:12 PM To: Evan R. Sparks Cc: Joseph Bradley; dev@spark.apache.org Subject: RE: Using CUDA within Spark / boosting linear algebra Thanks, Evan! It seems that ticket was marked as duplicate though the original one discusses a slightly different topic. I was able to link netlib with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library. |A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat| Breeze+Netlib-OpenBlas(native
RE: Using CUDA within Spark / boosting linear algebra
Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander -Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Tuesday, March 03, 2015 1:54 PM To: Xiangrui Meng; Joseph Bradley Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-) Xiangrui Meng men...@gmail.com writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).
2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this basically agrees with the author's own benchmarks ( https://github.com/fommil/netlib-java) I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Just to summarize this thread, I was finally able to make all performance comparisons that we discussed. It turns out that: BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas. Below is the link to the spreadsheet with full results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing One thing still needs exploration: does BIDMat-cublas perform copying to/from machine’s RAM? -Original Message- From: Ulanov, Alexander Sent: Tuesday, February 10, 2015 2:12 PM To: Evan R. Sparks Cc: Joseph Bradley; dev@spark.apache.org Subject: RE: Using CUDA within Spark / boosting linear algebra Thanks, Evan! It seems that ticket was marked as duplicate though the original one discusses a slightly different topic. I was able to link netlib with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
|A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas(native system) | Breeze+Netlib-f2jblas |
+---+
|100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
|1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
|10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
It turns out that pre-compiled MKL is faster than precompiled OpenBlas on my machine. Probably, I’ll add two more columns with locally compiled openblas and cuda. Alexander From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Monday, February 09, 2015 6:06 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us
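The largest row of the table can be turned into effective throughput, which makes the spread between implementations concrete. A sketch, assuming (from the matrix sizes elsewhere in the thread) that the largest row is a 10000x10000 multiply and the entries are times in seconds:

```scala
// Sketch: effective GFLOP/s for an n x n dense multiply (2*n^3 flops),
// using the largest-row times from the table (assumed 10000^2, seconds).
val n = 10000.0
val flops = 2 * n * n * n                 // 2e12 flops
def gflops(seconds: Double) = flops / seconds / 1e9
val bidmatMkl = gflops(23.78046632)       // well-tuned CPU BLAS
val f2jblas   = gflops(1569.233228)       // pure-JVM fallback
```

This reproduces Evan's two takeaways numerically: the tuned CPU BLAS sustains tens of GFLOP/s, while the f2jblas fallback is roughly two orders of magnitude slower.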
RE: Using CUDA within Spark / boosting linear algebra
Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware... On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander [quoted thread trimmed]
Re: Using CUDA within Spark / boosting linear algebra
BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-) Xiangrui Meng men...@gmail.com writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui [quoted thread trimmed] From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Monday, February 09, 2015 6:06 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!) - Evan On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Thank you for explanation and useful link. I am going to build OpenBLAS, link it with Netlib-java and perform benchmark again. Do I understand correctly that BIDMat binaries contain statically linked Intel MKL BLAS? It might be the reason why I am able to run BIDMat not having MKL BLAS installed on my server. If it is true, I wonder if it is OK because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par
RE: Using CUDA within Spark / boosting linear algebra
Thanks, Sam, for the suggestion! I should try doing this. Now I suppose that netlib-java linked with cuBLAS falls back at execution time to the cblas library on my system, which is atlas. If I remove atlas, netlib (linked with cublas) fails with the message undefined symbol: cblas_dgemm. In the meantime, I have updated my spreadsheet with BIDMat-cuda results that do include the copy from main memory to GPU, the multiply, and the copy back to main memory (similar to what Xiangrui did). Surprisingly (for myself), the copying overhead seems quite small, especially for the bigger matrices. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing -Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Monday, March 02, 2015 1:24 PM To: Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear algebra That's correct. It's highly unusual for a libblas.so to only provide the Fortran API. Oh well... CBLAS sources are available in the netlib-java repository so you could simply compile them and link against whatever libblas.so[fortran] you like. On 2 March 2015 at 21:04, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Xiangrui, Thanks for the link, I am currently trying to use nvblas. It seems that netlib wrappers are implemented with the C-BLAS interface and nvblas does not have c-blas. I wonder how it is going to work. I'll keep you updated. Alexander -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, March 02, 2015 11:42 AM To: Sam Halliday Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks Subject: Re: Using CUDA within Spark / boosting linear algebra On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote: Also, check the JNILoader output. Remember, for netlib-java to use your system libblas all you need to do is set up libblas.so.3 like any native application would expect.
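For reference, the NVBLAS interception being discussed here is configured along these lines; the file paths and the application jar name are placeholders, and the variable names follow NVIDIA's NVBLAS documentation:

```shell
# Sketch: route BLAS level-3 calls through the GPU via NVBLAS.
# nvblas.conf must name a real CPU BLAS for everything NVBLAS does
# not intercept (paths below are illustrative).
cat > nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
EOF
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
# Preload NVBLAS so it is resolved before the CPU libblas.so.3:
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so java -jar myapp.jar
```

If the preload is missing or the config does not point at a working CPU BLAS, calls silently run on the CPU only, which would explain the netlib-cublas numbers questioned later in the thread.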
I haven't ever used the cublas real BLAS implementation, so I'd be interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check that all the runtime links are in order. There are two shared libraries in this hybrid setup. nvblas.so must be loaded before libblas.so to intercept level 3 routines using GPU. More details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage Btw, I have some DGEMM wrappers in my netlib-java performance module... and I also planned to write more in MultiBLAS (until I mothballed the project for the hardware to catch up, which it probably has, and now I just need a reason to look at it) On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote: Hey Sam, The running times are not big O estimates: The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I think there is something wrong with the netlib/cublas combination. Sam already mentioned that cuBLAS doesn't implement the CPU BLAS interfaces. I checked the CUDA doc and it seems that to use GPU BLAS through the CPU BLAS interface we need to use NVBLAS, which intercepts some Level 3 CPU BLAS calls (including GEMM). So we need to load nvblas.so first and then some CPU BLAS library in JNI. I wonder whether the setup was correct. Alexander, could you check whether GPU is used in the netlib-cublas experiments? You can tell it by watching CPU/GPU usage. Best, Xiangrui On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com wrote: Don't use big O estimates, always measure. It used to work back in the days when double multiplication was a bottleneck. The computation cost is effectively free on both the CPU and GPU and you're seeing pure copying costs. Also, I'm dubious that cublas is doing what you think it is. Can you link me to the source code for DGEMM?
I show all of this in my talk, with explanations; I can't stress enough how much I recommend that you watch it if you want to understand high performance hardware acceleration for linear algebra :-) On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote: The copying overhead should be quadratic on n, while the computation cost is cubic on n. I can understand that netlib-cublas is slower than netlib-openblas on small problems. But I'm surprised to see that it is still 20x slower on 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:
val n = 10000
val f = rand(n, n)
flip; f*f; val rf = flop
flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
flip; g*g; val rgg = flop
The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path. But based on the result, the data copying overhead is definitely not as big
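Xiangrui's scaling argument (copy cost quadratic in n, compute cost cubic) can be put in rough numbers. A sketch in which both rates are assumptions, not measurements from the thread (~8 GB/s effective PCIe transfer, ~1 TFLOP/s GPU GEMM):

```scala
// Sketch: PCIe copy cost grows as n^2 while GEMM compute grows as n^3,
// so the copy is amortized at large n.
def copySeconds(n: Double) = 3 * n * n * 8 / 8e9  // 3 double matrices over the bus
def gemmSeconds(n: Double) = 2 * n * n * n / 1e12 // 2*n^3 flops at ~1 TFLOP/s
val ratioSmall = copySeconds(1000) / gemmSeconds(1000)    // copying dominates
val ratioLarge = copySeconds(10000) / gemmSeconds(10000)  // copying is a minor cost
```

Under these assumed rates, at n = 1000 the transfer costs more than the multiply, while at n = 10000 it is about 15% of the compute time, consistent with Xiangrui's observation that the copy overhead cannot explain a 20x slowdown at large n.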
RE: Using CUDA within Spark / boosting linear algebra
Hi Xiangrui, Thanks for the link, I am currently trying to use nvblas. It seems that netlib wrappers are implemented with the C-BLAS interface and nvblas does not have c-blas. I wonder how it is going to work. I'll keep you updated. Alexander [quoted thread trimmed] On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote: ... I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path. But based on the result, the data copying overhead is definitely not as big as 20x at n = 10000. Best, Xiangrui On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com wrote: I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit and writes higher level algorithms entirely in the GPU kernels so that the memory stays there as long as possible.
The restriction with this approach is that it is only offering high-level algorithms so is not a toolkit for applied mathematics research and development --- but it works well as a toolkit for higher level analysis (e.g. for analysts and practitioners). I believe BIDMat's approach is the best way to get performance out of GPU hardware at the moment but I also have strong evidence to suggest that the hardware will catch up and the memory transfer costs between CPU/GPU will disappear meaning that there will be no need for custom GPU kernel implementations. i.e. please continue to use BLAS primitives when writing new algorithms and only go to the GPU for an alternative optimised implementation. Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer an API that looks like BLAS but takes
Re: Using CUDA within Spark / boosting linear algebra
the potential to eliminate the green line. Best regards, Sam Ulanov, Alexander alexander.ula...@hp.com writes: Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both are 3 years old. I also did a small test with modern hardware, and the new GPU nVidia Titan was slightly more than 1 order of magnitude faster than Intel E5-2650 v2 for the same tests. However, it costs as much as the CPU ($1200). My takeaway is that GPUs are making better price/value progress. Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, it is OK because you can copy the result back from GPU only when needed. However, to be sure, I am going to ask the developer of BIDMat at his upcoming talk. Best regards, Alexander From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Thursday, February 26, 2015 1:56 PM To: Xiangrui Meng Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks Subject: Re: Using CUDA within Spark / boosting linear algebra Btw, I wish people would stop cheating when comparing CPU and GPU timings for things like matrix multiply :-P Please always compare apples with apples and include the time it takes to set up the matrices, send them to the processing unit, do the calculation AND copy the result back to where you need to see it. Ignoring this will make you believe that your GPU is thousands of times faster than it really is. Again, jump to the end of my talk for graphs and more discussion, especially the bit about me being keen on funding to investigate APU hardware further ;-) (I believe it will solve the problem) On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas.
What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled). 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this basically agrees with the author's own benchmarks ( https://github.com/fommil/netlib-java) I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Just to summarize this thread, I was finally able to make all performance comparisons that we discussed.
It turns out that: BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas. Below is the link to the spreadsheet with the full results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM? -----Original Message----- From: Ulanov, Alexander Sent: Tuesday, February 10, 2015 2:12 PM To: Evan R. Sparks Cc: Joseph Bradley; dev@spark.apache.org
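Sam's "apples with apples" complaint above can be made concrete. Below is a minimal, hypothetical timing-harness sketch (Python purely for illustration; the four phase callables are stand-ins for allocation, host-to-device copy, kernel launch, and device-to-host copy, not real GPU calls) that reports both the compute-only number people like to quote and the honest end-to-end number:

```python
import time

def bench(setup, to_device, compute, from_device):
    """Time each phase of an offloaded computation separately.

    All four arguments are zero-argument callables -- hypothetical
    stand-ins for matrix allocation, host-to-device copy, the kernel
    itself, and the device-to-host copy of the result."""
    timings = {}
    for name, fn in [("setup", setup), ("to_device", to_device),
                     ("compute", compute), ("from_device", from_device)]:
        t0 = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - t0
    timings["compute_only"] = timings["compute"]
    timings["end_to_end"] = sum(timings[k] for k in
                                ("setup", "to_device", "compute", "from_device"))
    return timings

# Stand-in phases: the sleeps model a copy-bound workload, where the
# transfers dwarf the kernel time.
t = bench(lambda: time.sleep(0.01), lambda: time.sleep(0.02),
          lambda: time.sleep(0.005), lambda: time.sleep(0.02))
print(t["compute_only"], t["end_to_end"])
```

Quoting only `compute_only` is the "cheating" Sam objects to; `end_to_end` is what a Spark job would actually pay.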
Re: Using CUDA within Spark / boosting linear algebra
Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.

|A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
|100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
|1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
|10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |

It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on my machine. Probably I'll add two more columns with locally compiled OpenBlas and CUDA. Alexander
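For perspective, the timings in Alexander's table can be converted into effective throughput. The sketch below assumes the largest case is a 10000x10000 multiply (the archive garbled that row's size label) and that the timings are seconds written with European decimal commas; both are assumptions, not stated in the table:

```python
# Effective DGEMM throughput from Alexander's measurements.
# Assumptions: matrix size n = 10000, times are seconds with
# European decimal commas.
def gflops(n, seconds):
    # A dense n x n matrix multiply performs ~2*n^3 floating point ops.
    return 2.0 * n ** 3 / seconds / 1e9

times = {"BIDMat MKL": "23,78046632",
         "Netlib-MKL": "32,94546697",
         "Netlib-OpenBlas": "445,0935211",
         "Netlib-f2jblas": "1569,233228"}
for name, t in times.items():
    print(name, round(gflops(10000, float(t.replace(",", "."))), 1))
```

Under those assumptions the spread between the tuned and untuned implementations is roughly two orders of magnitude, matching Evan's takeaway.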
Re: Using CUDA within Spark / boosting linear algebra
Don't use big-O estimates: always measure. That used to work back in the days when double multiplication was the bottleneck; now the computation cost is effectively free on both the CPU and the GPU, and you're seeing pure copying costs. Also, I'm dubious that cublas is doing what you think it is. Can you link me to the source code for DGEMM? I show all of this in my talk, with explanations. I can't stress enough how much I recommend that you watch it if you want to understand high-performance hardware acceleration for linear algebra :-) On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote: The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems. But I'm surprised to see that it is still 20x slower at 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:

val n = 10000
val f = rand(n, n)
flip; f*f; val rf = flop
flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
flip; g*g; val rgg = flop

The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path, but based on the result, the data copying overhead is definitely not as big as 20x at n = 10000. Best, Xiangrui On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com wrote: I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit and writes higher-level algorithms entirely in the GPU kernels so that the memory stays there as long as possible. The restriction of this approach is that it only offers high-level algorithms, so it is not a toolkit for applied mathematics research and development --- but it works well as a toolkit for higher-level analysis (e.g. for analysts and practitioners).
I believe BIDMat's approach is the best way to get performance out of GPU hardware at the moment, but I also have strong evidence to suggest that the hardware will catch up and the memory transfer costs between CPU and GPU will disappear, meaning there will be no need for custom GPU kernel implementations. I.e. please continue to use BLAS primitives when writing new algorithms, and only go to the GPU for an alternative optimised implementation. Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, offering an API that looks like BLAS but takes pointers to special regions in the GPU's memory. Somebody has written a wrapper around CUDA to create a proper BLAS library, but it only gives marginal performance over the CPU because of the memory transfer overhead. This slide from my talk says it all: http://fommil.github.io/scalax14/#/11/2 The X axis is matrix size, the Y axis is (logarithmic) time to do a DGEMM. The black line is the "cheating" time for the GPU and the green line is the time after copying the memory to/from the GPU. APUs have the potential to eliminate the green line. Best regards, Sam
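Xiangrui's scaling argument above (copying is O(n^2) in the matrix dimension, compute is O(n^3)) can be illustrated with a toy cost model. The per-element and per-flop constants below are made up; only the exponents matter for the argument:

```python
# Toy cost model: transfer time scales with n^2 (bytes moved), compute
# time with n^3 (flops). Both cost constants are hypothetical.
def copy_fraction(n, copy_cost_per_elem=1e-9, flop_cost=1e-11):
    copy = 3 * n * n * copy_cost_per_elem   # two inputs in, one result out
    compute = 2 * n ** 3 * flop_cost        # dgemm performs ~2*n^3 flops
    return copy / (copy + compute)

# The copy overhead's share of total time shrinks as n grows, which is
# why a persistent 20x gap at large n cannot be pure transfer cost.
for n in (100, 1000, 10000):
    print(n, round(copy_fraction(n), 4))
```

Whatever the constants, the copy share tends to zero as n grows, consistent with Xiangrui's measurement that the round trip added far less than 20x at the largest size.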
Re: Using CUDA within Spark / boosting linear algebra
Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!) - Evan
Re: Using CUDA within Spark / boosting linear algebra
I couldn't agree with you more, Sam. The GPU/Matrix guys typically don't count their copy times, but claim that you should be doing *as much as possible* on the GPU - so, maybe for some applications where you can generate the data on the GPU this makes sense. But, in the context of Spark, we should be *very* careful about enumerating the applications we want GPU support for and deciding whether it's appropriate to measure the overheads of getting the data to the GPU.
Re: Using CUDA within Spark / boosting linear algebra
On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with Netlib-java, and run the benchmark again. Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be why I am able to run BIDMat without having MKL BLAS installed on my server. If so, I wonder whether that is OK, given that Intel sells this library. Nevertheless, it seems that in my case pre-compiled MKL BLAS performs better than pre-compiled OpenBLAS, given that BIDMat and Netlib-java are supposed to be on par with JNI overheads. It might also be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries. Best regards, Alexander
Re: Using CUDA within Spark / boosting linear algebra
Hi all, I'm not surprised if the GPU is slow. The bottleneck is copying the memory. Watch my talk, linked from the netlib-java GitHub page, to understand further. The only way to currently make use of a GPU is to do all the operations inside GPU kernels. You can find some prepackaged high-level algorithms that do this, but it's extremely limiting. I believe hardware will fix this problem eventually, so I still advocate using the netlib primitives. I'm particularly interested in APU approaches, and I'm very interested in finding somebody to fund me to look into it. It's too much work for a side project. Look at the last few slides of my talk to see the potential performance gains.

Best regards,
Sam

On 26 Feb 2015 21:16, Xiangrui Meng <men...@gmail.com> wrote:
Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java.
Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019
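Sam's advice to stick with the netlib primitives also means a BLAS backend can be swapped without touching algorithm code. As a small illustration (my own sketch, not code from this thread): the `com.github.fommil.netlib.BLAS` system property quoted later in the thread selects the netlib-java implementation; the `BlasSelection` object and method names below are hypothetical wrappers, and the property must be set before the first BLAS call, because netlib-java resolves its backend once.

```scala
// Sketch: choosing a netlib-java backend via the system property mentioned
// later in this thread. Names here (BlasSelection, forceF2j) are illustrative.
object BlasSelection {
  // Force the pure-Java reference implementation (useful as a slow baseline):
  def forceF2j(): Unit =
    System.setProperty("com.github.fommil.netlib.BLAS",
                       "com.github.fommil.netlib.F2jBLAS")

  // Report what has been requested; if unset, netlib-java tries native first.
  def selected(): String =
    Option(System.getProperty("com.github.fommil.netlib.BLAS"))
      .getOrElse("(default: native if available, else F2J)")
}
```

Launching the JVM with `-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS` achieves the same thing from the command line, as discussed near the end of this thread.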
RE: Using CUDA within Spark / boosting linear algebra
Typo - the CPU was 2.5 times cheaper (not the GPU!)

-Original Message-
From: Ulanov, Alexander
Sent: Thursday, February 26, 2015 2:01 PM
To: Sam Halliday; Xiangrui Meng
Cc: dev@spark.apache.org; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both are 3 years old.
RE: Using CUDA within Spark / boosting linear algebra
I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit and writes higher-level algorithms entirely in the GPU kernels so that the memory stays there as long as possible. The restriction of this approach is that it only offers high-level algorithms, so it is not a toolkit for applied mathematics research and development --- but it works well as a toolkit for higher-level analysis (e.g. for analysts and practitioners).

I believe BIDMat's approach is the best way to get performance out of GPU hardware at the moment, but I also have strong evidence to suggest that the hardware will catch up and the memory transfer costs between CPU/GPU will disappear, meaning there will be no need for custom GPU kernel implementations. I.e. please continue to use BLAS primitives when writing new algorithms and only go to the GPU for an alternative optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer an API that looks like BLAS but takes pointers to special regions of GPU memory. Somebody has written a wrapper around CUDA to create a proper BLAS library, but it gives only marginal performance over the CPU because of the memory transfer overhead. This slide from my talk http://fommil.github.io/scalax14/#/11/2 says it all: the X axis is matrix size, the Y axis is logarithmic time to do DGEMM. The black line is the cheating time for the GPU, and the green line is after copying the memory to/from the GPU. APUs have the potential to eliminate the green line.

Best regards,
Sam

Ulanov, Alexander <alexander.ula...@hp.com> writes:
Evan, thank you for the summary. I would like to add some more observations.
RE: Using CUDA within Spark / boosting linear algebra
Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both are 3 years old. I also did a small test with modern hardware: the new nVidia Titan GPU was slightly more than 1 order of magnitude faster than an Intel E5-2650 v2 for the same tests. However, it costs as much as the CPU ($1200). My takeaway is that GPUs are making better price/performance progress.

Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, that is OK, because you can copy the result back from the GPU only when needed. To be sure, though, I am going to ask the developer of BIDMat at his upcoming talk.

Best regards,
Alexander

From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Thursday, February 26, 2015 1:56 PM
To: Xiangrui Meng
Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra

Btw, I wish people would stop cheating when comparing CPU and GPU timings for things like matrix multiply :-P Please always compare apples with apples and include the time it takes to set up the matrices, send them to the processing unit, do the calculation, AND copy the result back to where you need to see it. Ignoring these steps will make you believe that your GPU is thousands of times faster than it really is. Again, jump to the end of my talk for graphs and more discussion, especially the bit about me being keen on funding to investigate APU hardware further ;-) (I believe it will solve the problem)
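To make the apples-with-apples rule above concrete, here is a minimal timing harness of my own (not code from this thread) that measures the staging copy and the multiply separately, so neither can be silently dropped from a reported number. A naive triple loop stands in for a real BLAS dgemm; the absolute timings are meaningless, only the methodology is the point.

```scala
// Sketch of "no cheating" benchmarking: time data movement and compute
// separately and report both. The plain array clone simulates the
// host-to-device transfer; the triple loop stands in for dgemm.
object FairTiming {
  def timeNs[A](body: => A): (A, Long) = {
    val t0 = System.nanoTime()
    val r = body
    (r, System.nanoTime() - t0)
  }

  // Row-major n x n matrix multiply: C = A * B.
  def multiply(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) { c(i * n + j) += aik * b(k * n + j); j += 1 }
        k += 1
      }
      i += 1
    }
    c
  }

  // Returns (copy time, multiply time) in nanoseconds for an n x n multiply.
  def run(n: Int): (Long, Long) = {
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    val (staged, copyNs) = timeNs((a.clone(), b.clone()))   // "send to device"
    val (_, mulNs)       = timeNs(multiply(staged._1, staged._2, n))
    (copyNs, mulNs)
  }
}
```

Quoting only the second number of `run(n)` is exactly the cheating Sam is complaining about.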
Re: Using CUDA within Spark / boosting linear algebra
The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems, but I'm surprised to see that it is still 20x slower on 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:

val n = 10000
val f = rand(n, n)
flip; f*f; val rf = flop
flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
flip; g*g; val rgg = flop

The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path, but based on the result, the data copying overhead is definitely not as big as 20x at n = 10000.

Best, Xiangrui

On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com> wrote:
I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit and writes higher-level algorithms entirely in the GPU kernels so that the memory stays there as long as possible.
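Xiangrui's asymptotic point can be made quantitative: for an n x n multiply, the data crossing the bus grows as n², while the arithmetic grows as n³, so the compute available to amortize each transferred byte grows linearly in n. A back-of-envelope sketch of my own (the three-matrix transfer and 2n³ flop counts are the standard dgemm accounting; no bandwidth figures are assumed):

```scala
// Model of copy overhead vs compute for C = A * B with n x n doubles.
// Transfer: A and B to the device, C back (3 matrices of n^2 doubles).
// Compute: the standard 2*n^3 flop count for dgemm.
object CopyVsCompute {
  val bytesPerDouble = 8L

  def transferBytes(n: Long): Long = 3 * n * n * bytesPerDouble
  def multiplyFlops(n: Long): Long = 2 * n * n * n

  // Flops of useful work per byte moved; grows linearly (n / 12 here),
  // which is why copy overhead should fade at large n.
  def flopsPerByte(n: Long): Double =
    multiplyFlops(n).toDouble / transferBytes(n)
}
```

At n = 100 the multiply offers only about 8 flops per transferred byte; at n = 10000 about 833. That linear growth is exactly why a 20x gap surviving to the largest size is surprising.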
Re: Using CUDA within Spark / boosting linear algebra
Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).

2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).

I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the MLlib guide with a more complete section on enabling high-performance binaries on OSX and Linux? Or, better, figure out a way for the system to fetch these automatically.

- Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed.
RE: Using CUDA within Spark / boosting linear algebra
Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:

BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

Below is the link to the spreadsheet with the full results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?
RE: Using CUDA within Spark / boosting linear algebra
Thanks, Evan! It seems that ticket was marked as duplicate though the original one discusses slightly different topic. I was able to link netlib with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library. |A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat| Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas | +---+ |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 | |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 |1,638475459 | |1x1*1x1 | 23,78046632 | 32,94546697 |445,0935211 | 1569,233228 | It turn out that pre-compiled MKL is faster than precompiled OpenBlas on my machine. Probably, I’ll add two more columns with locally compiled openblas and cuda. Alexander From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Monday, February 09, 2015 6:06 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!) - Evan On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander alexander.ula...@hp.commailto:alexander.ula...@hp.com wrote: Hi Evan, Thank you for explanation and useful link. I am going to build OpenBLAS, link it with Netlib-java and perform benchmark again. Do I understand correctly that BIDMat binaries contain statically linked Intel MKL BLAS? It might be the reason why I am able to run BIDMat not having MKL BLAS installed on my server. If it is true, I wonder if it is OK because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads. 
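For readers who want to reproduce this kind of timing without pulling in BIDMat or Breeze, a minimal harness might look like the sketch below. It is a plain-Java triple loop, so absolute numbers will be far slower than any of the BLAS libraries in the tables above; only the methodology (warm-up run, then wall-clock timing) carries over. The class name, sizes, and `multiply` helper are illustrative, not from the thread.

```java
import java.util.Random;

public class GemmBench {
    // Naive O(n^3) multiply; the benchmarks in this thread used MKL/OpenBLAS
    // via BIDMat or netlib-java instead of a hand-written loop like this.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < k; p++) {      // loop order keeps row access cache-friendly
                double aip = a[i][p];
                for (int j = 0; j < m; j++)
                    c[i][j] += aip * b[p][j];
            }
        return c;
    }

    static double[][] random(int n, Random r) {
        double[][] x = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                x[i][j] = r.nextDouble();
        return x;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int n : new int[]{100, 500}) {
            double[][] a = random(n, r), b = random(n, r);
            multiply(a, b);                    // warm-up so the JIT compiles the hot loop
            long t0 = System.nanoTime();
            multiply(a, b);
            System.out.printf("%dx%d*%dx%d: %.6f s%n", n, n, n, n,
                              (System.nanoTime() - t0) / 1e9);
        }
    }
}
```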
Re: Using CUDA within Spark / boosting linear algebra
Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)

- Evan
Re: Using CUDA within Spark / boosting linear algebra
Maybe you can ask Prof. John Canny himself :-) I invited him to give a talk at Alpine Data Labs' March meetup (the joint SF Big Analytics / SF Machine Learning meetup) on 3/11, to be announced in the next day or so.

Chester

Sent from my iPhone
RE: Using CUDA within Spark / boosting linear algebra
Hi Evan,

Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with Netlib-java, and run the benchmark again.

Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be why I am able to run BIDMat without having MKL installed on my server. If so, I wonder whether that is OK, given that Intel sells this library. Nevertheless, it seems that in my case the precompiled MKL performs better than the precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to be on par in JNI overhead. Though, it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries.

Best regards, Alexander
Re: Using CUDA within Spark / boosting linear algebra
I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.

For some examples of getting netlib-java set up on an EC2 node, and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and that library picked up by netlib-java. In this way you could probably get cuBLAS set up to be used by netlib-java as well.

- Evan
Re: Using CUDA within Spark / boosting linear algebra
Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
RE: Using CUDA within Spark / boosting linear algebra
Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure how to force the use of a specific blas (as opposed to a specific wrapper for blas). Btw, I have installed openblas (yum install openblas), so I suppose that netlib is using it.
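The JVM flag discussed here selects which netlib-java *wrapper* is used; the native library behind the wrapper is resolved separately through the loader's search path (LD_LIBRARY_PATH / java.library.path). A small diagnostic along the following lines can confirm what a given JVM launch will see. The property and variable names are the real netlib-java/JVM ones; the class itself is just an illustrative check, not part of netlib-java, and the fallback behavior described in the comment is my understanding of netlib-java's default dispatch.

```java
public class BlasConfigCheck {
    // Which BLAS wrapper netlib-java will instantiate; when the property is
    // unset, netlib-java tries a native system BLAS first and falls back to
    // the pure-Java F2J implementation.
    static String wrapperProperty() {
        return System.getProperty("com.github.fommil.netlib.BLAS",
                                  "(unset: native system BLAS, falling back to F2J)");
    }

    public static void main(String[] args) {
        System.out.println("BLAS wrapper property : " + wrapperProperty());
        // The native library itself is found via the dynamic loader's search
        // path - this is where the OpenBLAS/MKL symlinks and the
        // `export LD_LIBRARY_PATH=...` advice earlier in the thread matter.
        String ld = System.getenv("LD_LIBRARY_PATH");
        System.out.println("LD_LIBRARY_PATH       : " + (ld == null ? "(unset)" : ld));
        System.out.println("java.library.path     : "
                           + System.getProperty("java.library.path", "(unset)"));
    }
}
```

Running this with `-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS` should echo that class name back, confirming the flag was picked up.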
RE: Using CUDA within Spark / boosting linear algebra
Hi Evan, Joseph,

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+---+
|100x100*100x100     | 0,00205596  | 0,03810324  | 0,002556    |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557  | 1,638475459 |
|1x1*1x1             | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will run tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze+netlib with native blas is so much slower than BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out what will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode.
While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning. From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 1:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I'd be surprised of BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1] BIDMach is highly optimized for single node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in that way) the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different than what we find in most commodity cluster nodes - e.g. 10 disk cahnnels and 4 GPUs. In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - order of terabytes. For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python-users, though. - Evan [1] - http://arxiv.org/abs/1409.5402 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander alexander.ula...@hp.commailto:alexander.ula...@hp.com wrote: Hi Evan, Thank you for suggestion! BIDMat seems to have terrific speed. Do you know what makes them faster than netlib-java? 
The same group has BIDMach library that implements machine learning. For some examples they use Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how these all might be connected with Spark Mllib? If you take BIDMat for linear algebra why don’t you take BIDMach for optimization and learning? Best regards, Alexander From: Evan R. Sparks [mailto:evan.spa...@gmail.commailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 12:09 PM To: Ulanov, Alexander Cc: dev@spark.apache.orgmailto:dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many cases. You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et. al. have done a bunch of work optimizing
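For anyone who wants to reproduce this kind of measurement, here is a minimal, self-contained sketch of the timing methodology (not the MKL/netlib-java code paths being compared above — just plain Scala arrays, with a JIT warm-up pass before timing, which matters on the JVM):

```scala
object GemmBench {
  // Naive row-major double-precision GEMM: C = A * B, all matrices n x n.
  def gemm(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) {
          c(i * n + j) += aik * b(k * n + j)
          j += 1
        }
        k += 1
      }
      i += 1
    }
    c
  }

  // Time a block of code, returning (result, elapsed seconds).
  def time[T](body: => T): (T, Double) = {
    val t0 = System.nanoTime()
    val r = body
    (r, (System.nanoTime() - t0) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    val n = 256
    val rnd = new scala.util.Random(42)
    val a = Array.fill(n * n)(rnd.nextDouble())
    val b = Array.fill(n * n)(rnd.nextDouble())
    gemm(a, b, n)                       // warm up the JIT before timing
    val (_, secs) = time(gemm(a, b, n))
    println(f"$n%dx$n%d GEMM: $secs%.4f s")
  }
}
```

To benchmark BIDMat or Breeze instead, the `gemm` call would be replaced by the library's multiply; the warm-up-then-time structure stays the same.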
Re: Using CUDA within Spark / boosting linear algebra
Lemme butt in randomly here and say there is an interesting discussion on this Spark PR, https://github.com/apache/spark/pull/4448, about netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all may find interesting. Among the participants is the author of netlib-java.
RE: Using CUDA within Spark / boosting linear algebra
Hi Evan,

Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?

The same group has the BIDMach library, which implements machine learning. For some examples they use the Caffe convolutional neural network library, developed by another group at Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why not take BIDMach for optimization and learning?

Best regards, Alexander
Re: Using CUDA within Spark / boosting linear algebra
I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1]

BIDMach is highly optimized for single-node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be batched that way), the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs. In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes.

For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.

- Evan

[1] http://arxiv.org/abs/1409.5402
[2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
RE: Using CUDA within Spark / boosting linear algebra
Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out what will work better within Spark: BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.

Best regards, Alexander
Re: Using CUDA within Spark / boosting linear algebra
Hi Alexander,

Using GPUs with Spark would be very exciting. Small comment: concerning your earlier question about keeping data stored on the GPU rather than moving it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and use it as another part of memory sounds like a much bigger undertaking.

Joseph
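The amortization idea above can be sketched without a GPU or a Spark cluster: instead of paying a host-to-device transfer per record per iteration, batch a whole partition, transfer once, run several local iterations on the device, and transfer the result back once. The `copyToGpu`/`copyFromGpu`/`gpuStep` functions below are hypothetical stand-ins, implemented on the CPU purely to make the pattern runnable; in a real system they would wrap cuBLAS or OpenCL calls.

```scala
object AmortizedTransfer {
  // Hypothetical device buffer; stands in for GPU memory in this CPU-only sketch.
  final case class DeviceBuf(data: Array[Double])

  var transfers = 0  // count host<->device copies to show the amortization

  def copyToGpu(xs: Array[Double]): DeviceBuf =
    { transfers += 1; DeviceBuf(xs.clone()) }
  def copyFromGpu(buf: DeviceBuf): Array[Double] =
    { transfers += 1; buf.data.clone() }
  def gpuStep(buf: DeviceBuf): DeviceBuf =   // one "local iteration" on the device
    DeviceBuf(buf.data.map(_ * 0.5))

  // What a mapPartitions-style kernel would do: one transfer in,
  // k local iterations on the device, one transfer out.
  def processPartition(part: Iterator[Double], iterations: Int): Iterator[Double] = {
    var buf = copyToGpu(part.toArray)
    for (_ <- 1 to iterations) buf = gpuStep(buf)
    copyFromGpu(buf).iterator
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator.fill(1000)(1.0)
    val out = processPartition(partition, iterations = 10).toArray
    // 10 local iterations cost only 2 transfers total,
    // not one round trip per record per iteration.
    println(s"records=${out.length}, transfers=$transfers")
  }
}
```

In Spark terms the same shape falls out of `rdd.mapPartitions(processPartition(_, k))`, which is why per-partition batching is the natural unit for offloading.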
Re: Using CUDA within Spark / boosting linear algebra
I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases. You might consider taking a look at the code paths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/Breeze. John Canny et al. have done a bunch of optimization work to make this run really fast from Scala. I've run it on my laptop and compared to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here, and you really want to avoid data copying as much as possible.

We could also consider swapping in BIDMat for Breeze, but that would be a big project, and if we can figure out how to get Breeze+cuBLAS to comparable performance, that would be a big win.
Using CUDA within Spark / boosting linear algebra
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it employs netlib-java, which has a Java wrapper for BLAS (Basic Linear Algebra Subprograms) and LAPACK native binaries if they are available on the worker node. It also has its own optimized Java implementation of BLAS.

It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations, or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the netlib-java page, https://github.com/fommil/netlib-java. I also confirmed it with my experiments with training of an artificial neural network, https://github.com/apache/spark/pull/1290#issuecomment-70313952.

However, I would like to boost performance further. GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia GPU, and I was able to do the following. I linked cuBLAS (instead of CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so Breeze/netlib is using it. Then I did some performance measurements of artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices of size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS, and cuBLAS becomes slower for bigger matrices. It is worth mentioning that this was not a test of ONLY multiplication, since other operations are involved. One of the reasons for the slowdown might be the overhead of copying the matrices from main memory to graphics card memory and back.

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
3) Any other options to speed up linear algebra in Spark?

Thank you, Alexander

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
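One way to reason about question 2 is a back-of-envelope model: an n x n double-precision GEMM costs 2n^3 flops, while shipping A and B to the device and C back costs 3*n^2*8 bytes over PCIe. Since compute grows as n^3 and transfer only as n^2, the copy overhead dominates for small matrices and is amortized for large ones. All throughput numbers below are assumptions chosen for illustration, not measurements of any particular hardware:

```scala
object CopyOverheadModel {
  // Assumed sustained rates - illustrative only, not measured values.
  val cpuGflops = 50.0    // CPU DGEMM throughput
  val gpuGflops = 500.0   // GPU DGEMM throughput
  val pcieGBs   = 6.0     // effective host<->device bandwidth, GB/s

  // Time for an n x n double-precision GEMM (2n^3 flops).
  def cpuTime(n: Double): Double = 2 * n * n * n / (cpuGflops * 1e9)

  def gpuTime(n: Double): Double = {
    val compute  = 2 * n * n * n / (gpuGflops * 1e9)
    val transfer = 3 * n * n * 8 / (pcieGBs * 1e9)  // copy A, B in and C out
    compute + transfer
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(100, 500, 1000, 2000, 5000))
      println(f"n=$n%5d  cpu=${cpuTime(n)}%.5f s  gpu=${gpuTime(n)}%.5f s")
}
```

Under these assumed numbers the crossover sits around n in the low hundreds; the measured crossover near ~1000x780 suggests additional per-call overheads (JNI marshalling, allocation, non-GEMM operations) beyond raw PCIe bandwidth. Libraries that keep intermediate results resident on the device remove the n^2 transfer term from every call but the first and last.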