Hi all (I'm the author of netlib-java),
Interesting to see this discussion come to life again.

JNI is quite limiting: pinning (or critical array access) essentially disables the GC for the whole JVM for the duration of the native call. I can justify this for CPU-heavy tasks because, frankly, there are not going to be any free cycles to do anything other than BLAS while a dense matrix is being crunched. For GPU tasks, though, you can get into some hairy problems and hit OOM just by doing basic work. The other big problem with JNI is that the memory is either on the heap (and subject to the whims of the GC, with large pause times in tenured cleanups) or is a lightweight reference to a huge off-heap object that the GC might never clean up. There are hacks around this, but none are satisfactory. More in my talk at Scala Exchange: http://fommil.github.io/scalax14/#/

I have a roadmap to move netlib-java over to ByteBuffers, as they solve all the problems I have seen. It would effectively be a rewrite (down to the Fortran-to-JVM compiler) and would change the Java API in a systematic way, but it could support BLAS-like GPU backends at the same time. I would be willing to migrate all the major libraries that use netlib-java as part of this effort.

However, I have no commercial incentive to perform this work, so I would be seeking funding to do it. I will not be starting anything without funding. Please contact me if you would be a willing stakeholder. I estimate it as a six-month project: all major platforms, along with a CI build making it easy to update, with testing.
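To make the ByteBuffer point concrete, here is a minimal sketch of the off-heap hand-off, assuming a hypothetical NativeBlas binding (this is not netlib-java's current API). Direct buffers already live outside the Java heap, so a native implementation can read them via GetDirectBufferAddress without pinning arrays or stalling the collector:

    import java.nio.{ByteBuffer, ByteOrder, DoubleBuffer}

    // Hypothetical binding: a native dgemm taking direct buffers. A real
    // implementation would fetch the raw pointers on the C side with
    // JNIEnv->GetDirectBufferAddress and call the CPU or GPU BLAS.
    trait NativeBlas {
      def dgemm(transa: String, transb: String, m: Int, n: Int, k: Int,
                alpha: Double, a: DoubleBuffer, lda: Int,
                b: DoubleBuffer, ldb: Int,
                beta: Double, c: DoubleBuffer, ldc: Int): Unit
    }

    object OffHeapMatrices {
      // Allocate an m x n column-major matrix of doubles outside the Java heap.
      def alloc(m: Int, n: Int): DoubleBuffer = {
        val bytes = ByteBuffer.allocateDirect(m * n * java.lang.Double.BYTES)
        bytes.order(ByteOrder.nativeOrder()).asDoubleBuffer() // platform byte order
      }

      def main(args: Array[String]): Unit = {
        val (m, n, k) = (2, 2, 2)
        val a = alloc(m, k)
        val b = alloc(k, n)
        val c = alloc(m, n)
        a.put(Array(1.0, 0.0, 0.0, 1.0)) // A = I, column-major
        b.put(Array(1.0, 3.0, 2.0, 4.0)) // B = [[1, 2], [3, 4]], column-major
        // With a real binding, the call needs no copying and no GC pinning:
        // blas.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
      }
    }

The same buffers could be handed to a GPU BLAS, paying only the host-to-device copy.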
On 22 Jan 2016 3:48 p.m., "Rajesh Bordawekar" <bor...@us.ibm.com> wrote:

> Hi Alexander,
>
> We, at IBM Watson Research, are also working on GPU acceleration of Spark, but we have taken an approach that is complementary to Ishizaki-san's direction. Our focus is to develop runtime infrastructure to enable multi-node multi-GPU exploitation in the Spark environment. The key goal of our approach is to enable **transparent** invocation of GPUs, without requiring the user to change a single line of code. Users may need to add a Spark configuration flag to direct the system on GPU usage (the exact semantics are currently being debated).
>
> Currently, we have LBFGS-based Logistic Regression model building and prediction implemented on a multi-node multi-GPU environment (the model building is done on a single node). We are using our own implementation of LBFGS as a baseline for the GPU code. The GPU code uses cublas (I presume that's what you meant by NVBLAS) wherever possible, and indeed, we arrange the execution so that cublas operates on larger matrices. We are using JNI to invoke CUDA from Scala and we have not seen any performance degradation due to JNI-based invocation.
>
> We are in the process of implementing an ADMM-based distributed optimization function, which would build the model in parallel (it currently uses LBFGS as its individual kernel, which can be replaced by any other kernel as well). The ADMM function would also be accelerated in a multi-node multi-user environment. We are planning to shift to Datasets/DataFrames soon and to support other Logistic Regression kernels such as Quasi-Newton based approaches.
>
> We have also enabled the Spark MLlib ALS algorithm to run on a multi-node multi-GPU system (the ALS code also uses cublas/cusparse). Next, we will be covering additional functions for GPU exploitation, e.g., Word2Vec (CBOW and Skip-gram with negative sampling), GloVe, etc.
>
> Regarding the comparison to BIDMat/BIDMach, we have studied it in detail and have been using it as a guide on integrating GPU code with Scala. However, I think comparing end-to-end results would not be appropriate, as we are affected by Spark's runtime costs; specifically, a single Spark function to convert RDDs to arrays is very expensive and impacts our end-to-end performance severely (from a 200+ gain for the GPU kernel to 25+ for the Spark library function). In contrast, BIDMach has a very light and efficient layer between its GPU kernel and the user program.
>
> Finally, we are building a comprehensive multi-node multi-GPU resource management and discovery component in Spark. We are planning to augment the existing Spark resource management UI to include GPU resources.
>
> Please let me know if you have questions/comments! I will be attending Spark Summit East, and can meet in person to discuss any details.
>
> -regards,
> Rajesh
>
>
> ----- Forwarded by Randy Swanberg/Austin/IBM on 01/21/2016 09:31 PM -----
>
> From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
> To: Kazuaki Ishizaki <ishiz...@jp.ibm.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
> Cc: John Canny <ca...@berkeley.edu>, "Evan R. Sparks" <evan.spa...@gmail.com>, Xiangrui Meng <men...@gmail.com>, Sam Halliday <sam.halli...@gmail.com>
> Date: 01/21/2016 01:16 PM
> Subject: RE: Using CUDA within Spark / boosting linear algebra
> ------------------------------
>
> Hi Kazuaki,
>
> Indeed, moving data to/from the GPU is costly, and this benchmark summarizes those costs for different data sizes with regard to matrix multiplication. These costs are the price paid for the convenience of using the standard BLAS API that Nvidia NVBLAS provides. The point is that no code changes are required (in Spark); one just needs to reference the BLAS implementation with a system variable. Naturally, a hardware-specific implementation will always be faster than the default. The benchmark results show that fact by comparing jCuda (by means of BIDMat) and NVBLAS. However, they also show that it is worth using NVBLAS for large matrices, because it can take advantage of several GPUs and will be faster despite the copying overhead. That is also a known point advertised by Nvidia.
>
> By the way, I don’t think that the column-/row-friendly format is an issue, because one can use transposed matrices to fit the required format. I believe that is just a software preference.
>
> My suggestion with regard to your prototype would be to make comparisons with Spark’s implementation of logistic regression (which does not take advantage of the GPU) and also with BIDMach’s (which does). It would give users a better understanding of your implementation’s performance. Currently you compare it with Spark’s example logistic regression implementation, which is meant as a reference for learning Spark rather than as a performance benchmark.
>
> Best regards, Alexander
>
>
> ------------------------------------------------------
> Rajesh R. Bordawekar
> Research Staff Member
> IBM T. J. Watson Research Center
> bor...@us.ibm.com
> Office: 914-945-2097
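Regarding the "system variable" mentioned in the quoted thread: netlib-java selects its backend via a JVM system property rather than a code change, e.g. -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS binds it to the machine's system BLAS (which can in turn be NVBLAS; the nvblas.conf wiring is platform-specific and not shown here). A quick sanity check of which backend actually loaded:

    import com.github.fommil.netlib.BLAS

    object WhichBlas {
      def main(args: Array[String]): Unit = {
        // Prints e.g. com.github.fommil.netlib.F2jBLAS (the pure-JVM fallback)
        // or com.github.fommil.netlib.NativeSystemBLAS.
        println(BLAS.getInstance().getClass.getName)
      }
    }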