Now try multiplying a 1 million by 1 million sparse matrix with 100
non-zeros in each row by another such matrix.

Also try a 16k x 16k dense matrix.

And a 10 x 10 dense matrix.
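
For concreteness, a scaled-down harness for those three cases might look
like the sketch below (class name and the reduced dimensions are mine; at
full scale the sparse-sparse product fills in to roughly 10,000 non-zeros
per row and won't fit in a single JVM heap, which is part of the point):

import java.util.Random;

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SparseRowMatrix;

// Scaled-down sketch of the three regimes. Dimensions are stand-ins:
// the real cases are 1M x 1M sparse, 16k x 16k dense, and 10 x 10 dense.
public class RegimeBench {
  public static void main(String[] args) {
    Random rnd = new Random(42);

    // Sparse, ~100 non-zeros per row (n = 1000 here, not 1,000,000).
    int n = 1000;
    Matrix sparse = new SparseRowMatrix(n, n);
    for (int row = 0; row < n; row++) {
      for (int k = 0; k < 100; k++) {
        sparse.setQuick(row, rnd.nextInt(n), rnd.nextGaussian());
      }
    }
    bench("sparse x sparse", sparse, sparse);

    // Large dense (1024 here instead of 16384).
    bench("large dense", randomDense(1024, rnd), randomDense(1024, rnd));

    // Small dense: cost is dominated by call overhead, not arithmetic.
    bench("small dense", randomDense(10, rnd), randomDense(10, rnd));
  }

  private static void bench(String label, Matrix a, Matrix b) {
    long start = System.currentTimeMillis();
    a.times(b);
    System.out.println(label + " (ms) = "
        + (System.currentTimeMillis() - start));
  }

  private static Matrix randomDense(int n, Random rnd) {
    Matrix m = new DenseMatrix(n, n);
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        m.setQuick(i, j, rnd.nextGaussian());
      }
    }
    return m;
  }
}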

The moral is that jBLAS and similar libraries are great for medium-sized
dense matrices.  Sparse systems aren't helped.  Large dense systems have
problems on GPUs but work great with native BLAS.  Small dense systems have
problems with JNI boundaries and GPU memory architectures.

So far, much of the Mahout work has involved large sparse systems, so it
has been worthwhile to build a sparse optimizer, but not so worthwhile to
build fancy stuff for the dense cases.

That may have changed with the higher profile of things like ALS and
random-projection decompositions.  Even k-means can be recast, using random
projection, as a dense-matrix-heavy algorithm.
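
As an illustration of that recasting (class and method names are mine, not
Mahout API): project the possibly sparse n x d data matrix through a dense
Gaussian d x k matrix; the n x k result is dense, so the distance
computations in every k-means iteration become dense matrix work.

import java.util.Random;

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

// Sketch: random projection of a (possibly sparse) data matrix A.
public class RandomProjection {
  // Returns the dense n x k projection Y = A * Omega, where Omega is a
  // dense d x k matrix of Gaussian random values.
  public static Matrix project(Matrix a, int k, long seed) {
    Random rnd = new Random(seed);
    Matrix omega = new DenseMatrix(a.numCols(), k);
    for (int i = 0; i < a.numCols(); i++) {
      for (int j = 0; j < k; j++) {
        omega.setQuick(i, j, rnd.nextGaussian());
      }
    }
    return a.times(omega);  // dense, GEMM-shaped product
  }
}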

What do you think is the right course?




On Wed, Aug 13, 2014 at 3:39 PM, Anand Avati <[email protected]> wrote:

> On Fri, Jul 18, 2014 at 12:01 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > On Fri, Jul 18, 2014 at 11:54 AM, Anand Avati <[email protected]> wrote:
> >
> > > On Fri, Jul 18, 2014 at 11:42 AM, Dmitriy Lyubimov <[email protected]>
> > > wrote:
> > >
> > >
> > > Coincidentally, I was wildly imagining/exploring integration with the
> > > Fortran BLAS behind the in-core DSL using JNI. I had not come across
> > > these BIDData projects. I'm happy to reorient that effort towards
> > > exploring them.
> > >
> >
> > Well, it's both: jBLAS and jCublas. Neither should be too expensive.
> >
> > If I had to choose, I'd say integrate jCublas first; it seems to have a
> > bit of an edge here. We already know from Sebastien's work with jBLAS
> > that its integration for sparse methods is not that interesting.
> >
> > However, even vector-vector operations over views of GPU-stored data
> > become somewhat interesting in the context of DSL operators.
> >
>
> FYI, I was toying around with a jBLAS backend for Matrix / Vector (at
> https://github.com/apache/mahout/pull/44). I started with jBLAS only
> because I found better documentation. Testing a 1024x1024 matrix
> multiplication of random numbers on my laptop, I found a solid 56x faster
> runtime:
>
> Run starting. Expected test count is: 1
> DiscoverySuite:
> JBlasSuite:
> Normal multiplication (ms) = 15900
> jBLAS multiplication (ms) = 284
> - matrix multiplication
> Run completed in 16 seconds, 793 milliseconds.
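>
> For readers without the PR handy, the jBLAS side of that comparison is
> roughly the following sketch (the actual test suite lives in the PR; only
> the timing shape is shown here):
>
> import org.jblas.DoubleMatrix;
>
> // Time a 1024x1024 native-BLAS multiply via jBLAS.
> public class JBlasTiming {
>   public static void main(String[] args) {
>     DoubleMatrix a = DoubleMatrix.rand(1024, 1024);
>     DoubleMatrix b = DoubleMatrix.rand(1024, 1024);
>     long start = System.currentTimeMillis();
>     a.mmul(b);  // dispatches to the native BLAS gemm
>     System.out.println("jBLAS multiplication (ms) = "
>         + (System.currentTimeMillis() - start));
>   }
> }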
>
>
> This is a very trivial implementation with only matrix multiplication
> optimized. Better vector integration is possible along the same lines.
> However, for deeper integration (e.g., transparent offloading of
> decompositions into jBLAS), some restructuring of the API would make
> things simple and easy for consumers. For example, instead of the public
> CholeskyDecomposition(Matrix A) constructor, have a public
> CholeskyDecomposition choleskyDecompose() method on the Matrix interface.
> That way JBlasMatrix can transparently insert its own optimized
> decomposition code and return it as a subclass of CholeskyDecomposition.
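>
> Concretely, the restructuring might look like this (excerpted;
> JBlasCholeskyDecomposition is a hypothetical name for the jBLAS-backed
> subclass):
>
> // In the Matrix interface:
> CholeskyDecomposition choleskyDecompose();
>
> // In AbstractMatrix, the stock implementation all matrices inherit:
> @Override
> public CholeskyDecomposition choleskyDecompose() {
>   return new CholeskyDecomposition(this);
> }
>
> // In JBlasMatrix, a transparent native-backed override:
> @Override
> public CholeskyDecomposition choleskyDecompose() {
>   // Runs the decomposition in jBLAS but returns it as the common
>   // supertype, so callers never know which implementation they got.
>   return new JBlasCholeskyDecomposition(this);
> }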
>
> Comments/feedback welcome.
>
> I also discovered other common refactorings that can be done (the iterator
> and non-zero iterator code is repeated in many places); I'll send separate
> PRs for those.
>
