Thanks for chiming in, John. I missed your meetup last night. Do you have any writeups or slides about roofline design? In particular, I'm curious what optimizations are available for dense * sparse products on power-law data. (I don't have any background in optimization.)
On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <ca...@berkeley.edu> wrote:

> If you're contemplating GPU acceleration in Spark, it's important to look
> beyond BLAS. Dense BLAS probably accounts for only 10% of the cycles in the
> datasets we've tested in BIDMach, and we've tried to make them
> representative of industry machine learning workloads. Unless you're
> crunching images or audio, the majority of data will be very sparse and
> power-law distributed. You need a good sparse BLAS, and in practice it
> seems like you need a sparse BLAS tailored for power-law data. We had to
> write our own since the NVIDIA libraries didn't perform well on typical
> power-law data. Intel MKL sparse BLAS also has issues, and we only use
> parts of it.
>
> You also need 2D reductions, scan operations, slicing, element-wise
> transcendental functions and operators, many kinds of sort, random number
> generators etc., and some kind of memory management strategy. Some of this
> was layered on top of Thrust in BIDMat, but most had to be written from
> scratch. It's all been rooflined, typically to the memory throughput of
> current GPUs (around 200 GB/s).
>
> When you have all this you can write learning algorithms in the same
> high-level primitives available in Breeze or NumPy/SciPy. It's literally
> the same in BIDMat, since the generic matrix operations are implemented on
> both CPU and GPU, so the same code runs on either platform.
>
> A lesser-known fact is that GPUs are around 10x faster for *all* those
> operations, not just dense BLAS. It's mostly due to faster streaming memory
> speeds, but some kernels (random number generation and transcendentals) are
> more than an order of magnitude faster thanks to some specialized hardware
> for power series on the GPU chip.
>
> When you have all this there is no need to move data back and forth across
> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
> and feed them to the available GPUs. Most models fit comfortably in GPU
> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
> data through the GPU this way.
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
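
For anyone else reading along who is new to the term, here is my rough understanding of what "rooflined to memory throughput" means in numbers. This is only a back-of-the-envelope sketch in Python. The 200 GB/s figure is the one quoted above; the function names, the float32 element size, and the one-read-one-write traffic model are my own assumptions for illustration, not anything taken from BIDMat.

```python
# Back-of-the-envelope roofline estimate for a memory-bound kernel.
# Assumption: ~200 GB/s of streaming memory bandwidth (figure quoted above).
# Everything else here (names, float32, traffic model) is hypothetical.

BANDWIDTH_BYTES_PER_S = 200e9


def memory_bound_seconds(bytes_moved, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Lower bound on runtime when memory traffic is the only limit."""
    return bytes_moved / bandwidth


def elementwise_bytes(n_elements, bytes_per_element=4, reads=1, writes=1):
    """Bytes moved by an element-wise op: each element is read and written once."""
    return n_elements * bytes_per_element * (reads + writes)


if __name__ == "__main__":
    n = 100_000_000
    t = memory_bound_seconds(elementwise_bytes(n))
    print(f"element-wise op over {n:,} float32 values: at least {t * 1e3:.1f} ms")
    # Memory-bandwidth bound only; disk and PCI bus transfer are not modeled.
    print(f"one pass over 1 TB: at least {memory_bound_seconds(1e12):.0f} s")
```

A kernel counts as rooflined in this sense when its measured time is close to that lower bound. For operations that do little arithmetic per byte, such as element-wise functions or sparse products, the bandwidth term is usually the one that matters.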