Reynold,
Prof. Canny gave me the slides yesterday. I will post the link to the
slides to both the SF Big Analytics and SF Machine Learning meetups.
Chester
Sent from my iPad
On Mar 12, 2015, at 22:53, Reynold Xin <[email protected]> wrote:
> Thanks for chiming in, John. I missed your meetup last night - do you have
> any writeups or slides about roofline design? In particular, I'm curious
> what optimizations are available for power-law dense * sparse (I don't have
> any background in optimizations).
>
>
>
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <[email protected]> wrote:
>
>> If you're contemplating GPU acceleration in Spark, it's important to look
>> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
>> datasets we've tested in BIDMach, and we've tried to make them
>> representative of industry machine learning workloads. Unless you're
>> crunching images or audio, the majority of data will be very sparse and
>> power-law distributed. You need a good sparse BLAS, and in practice it
>> seems like you need a sparse BLAS tailored for power-law data. We had to
>> write our own, since the NVIDIA libraries didn't perform well on typical
>> power-law data. Intel MKL's sparse BLAS routines also have issues, and we
>> only use some of them.
>>
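>> For anyone who hasn't met the primitive in question: the pattern is a
>> sparse matrix (CSR) times a dense vector or matrix. Here is a minimal
>> CPU-side sketch of the vector case in plain Scala (no BIDMat; names
>> invented for illustration). On a GPU, balancing the wildly skewed row
>> lengths of power-law data across threads is the hard part.
>>
>> // Compressed sparse row storage: rowPtr(i)..rowPtr(i+1) indexes the
>> // nonzeros of row i in colIdx/values.
>> case class CSR(numRows: Int, numCols: Int,
>>                rowPtr: Array[Int], colIdx: Array[Int], values: Array[Float])
>>
>> // y = A * x, with A sparse (CSR) and x dense.
>> def spmv(a: CSR, x: Array[Float]): Array[Float] = {
>>   val y = new Array[Float](a.numRows)
>>   var i = 0
>>   while (i < a.numRows) {
>>     var sum = 0f
>>     var j = a.rowPtr(i)
>>     while (j < a.rowPtr(i + 1)) {
>>       sum += a.values(j) * x(a.colIdx(j))
>>       j += 1
>>     }
>>     y(i) = sum
>>     i += 1
>>   }
>>   y
>> }
>>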
>> You also need 2D reductions, scan operations, slicing, element-wise
>> transcendental functions and operators, many kinds of sort, random number
>> generators, etc., and some kind of memory management strategy. Some of
>> this was layered on top of Thrust in BIDMat, but most had to be written
>> from scratch. It's all been rooflined, typically to the memory throughput
>> of current GPUs (around 200 GB/s).
>>
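>> To make "rooflined" concrete: the roofline model bounds a kernel's
>> attainable rate by min(peak compute, arithmetic intensity x memory
>> bandwidth). A toy calculator (illustrative numbers, not BIDMat
>> measurements):
>>
>> // Attainable GFLOP/s for a kernel, given its arithmetic intensity
>> // (flops per byte moved) and the hardware's peak compute and bandwidth.
>> def rooflineGflops(peakGflops: Double, bandwidthGBs: Double,
>>                    flopsPerByte: Double): Double =
>>   math.min(peakGflops, bandwidthGBs * flopsPerByte)
>>
>> // An element-wise op does ~1 flop per 8 bytes (read + write one float),
>> // so at 200 GB/s it is memory-bound at ~25 GFLOP/s whatever the peak is.
>> val bound = rooflineGflops(peakGflops = 4000, bandwidthGBs = 200,
>>                            flopsPerByte = 1.0 / 8)
>>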
>> When you have all this, you can write learning algorithms with the same
>> high-level primitives available in Breeze or Numpy/Scipy. It's literally
>> the same in BIDMat, since the generic matrix operations are implemented on
>> both CPU and GPU, so the same code runs on either platform.
>>
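>> The "same code on either platform" point is just generic dispatch over a
>> matrix trait; a stripped-down sketch of the idea (not BIDMat's actual
>> class hierarchy, method names invented):
>>
>> // Algorithm code is written only against the trait; each backend
>> // supplies its own storage and kernels.
>> trait Mat {
>>   def *(b: Mat): Mat   // matrix multiply
>>   def +(b: Mat): Mat   // element-wise add
>>   def exp: Mat         // element-wise transcendental
>> }
>>
>> class CPUMat extends Mat {      // host arrays, CPU/MKL kernels
>>   def *(b: Mat): Mat = ???
>>   def +(b: Mat): Mat = ???
>>   def exp: Mat = ???
>> }
>>
>> class GPUMat extends Mat {      // device pointers, CUDA kernels
>>   def *(b: Mat): Mat = ???
>>   def +(b: Mat): Mat = ???
>>   def exp: Mat = ???
>> }
>>
>> // The same expression runs unchanged on either backend.
>> def score(x: Mat, w: Mat): Mat = (x * w).exp
>>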
>> A lesser-known fact is that GPUs are around 10x faster for *all* of those
>> operations, not just dense BLAS. It's mostly due to faster streaming
>> memory speeds, but some kernels (random number generation and
>> transcendentals) are more than an order of magnitude faster, thanks to
>> specialized hardware for power series on the GPU chip.
>>
>> When you have all this, there is no need to move data back and forth
>> across the PCI bus. The CPU only has to pull chunks of data off disk,
>> unpack them, and feed them to the available GPUs. Most models fit
>> comfortably in GPU memory these days (4-12 GB). With minibatch algorithms
>> you can push TBs of data through the GPU this way.
>>
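>> The data pipeline above is just a producer/consumer loop; schematically
>> (the file format, helpers and GPU upload are stand-ins, not the actual
>> BIDMach loader):
>>
>> import java.io.File
>>
>> // Placeholder types and helpers, invented for illustration.
>> type HostBatch   = Array[Float]
>> type DeviceBatch = Array[Float]
>> def loadAndUnpack(f: File): Iterator[HostBatch] = Iterator.empty // disk -> host
>> def uploadToGpu(b: HostBatch): DeviceBatch = b                   // host -> device
>> def updateModel(b: DeviceBatch): Unit = ()                       // GPU kernels
>>
>> // The CPU only reads, unpacks and feeds minibatches; the model stays in
>> // GPU memory, so only minibatch-sized chunks ever cross the PCI bus.
>> def trainOnDisk(dir: File): Unit =
>>   for (chunk <- dir.listFiles.sorted; batch <- loadAndUnpack(chunk))
>>     updateModel(uploadToGpu(batch))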
>>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]