Yes, obviously this works well only for dense matrices. I had even contemplated inheriting JBlasMatrix from DenseMatrix.
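To make that concrete, here is a rough sketch of the shape such a subclass could take. This is only an illustration, not the code in the PR; it assumes Mahout's Matrix/DenseMatrix API (rowSize(), columnSize(), getQuick(), setQuick(), times()) and jBLAS's org.jblas.DoubleMatrix, and it skips dimension checks:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.jblas.DoubleMatrix;

// Sketch only: a dense Matrix whose multiplication is offloaded to jBLAS.
public class JBlasMatrix extends DenseMatrix {

  public JBlasMatrix(int rows, int columns) {
    super(rows, columns);
  }

  @Override
  public Matrix times(Matrix other) {
    // Copy both operands across the JNI boundary, multiply natively, copy back.
    // (This copy cost is exactly the small-matrix overhead Ted mentions below.)
    DoubleMatrix a = toJBlas(this);
    DoubleMatrix b = toJBlas(other);
    DoubleMatrix c = a.mmul(b);
    Matrix result = new DenseMatrix(c.rows, c.columns);
    for (int i = 0; i < c.rows; i++) {
      for (int j = 0; j < c.columns; j++) {
        result.setQuick(i, j, c.get(i, j));
      }
    }
    return result;
  }

  // Dense copy of an arbitrary Mahout Matrix into a jBLAS buffer.
  private static DoubleMatrix toJBlas(Matrix m) {
    DoubleMatrix d = new DoubleMatrix(m.rowSize(), m.columnSize());
    for (int i = 0; i < m.rowSize(); i++) {
      for (int j = 0; j < m.columnSize(); j++) {
        d.put(i, j, m.getQuick(i, j));
      }
    }
    return d;
  }
}

A real integration would presumably keep the data in jBLAS's own layout instead of copying per call, which is where inheriting from DenseMatrix (and owning the backing array) becomes interesting.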
Is there a roadmap (or a collection of thoughts that approximates a roadmap), so that there is some sort of guideline as to which lines of investigation for contributions make sense?

On Wed, Aug 13, 2014 at 5:06 PM, Ted Dunning <[email protected]> wrote:

> Now try multiplying a 1 million by 1 million sparse matrix with 100
> non-zeros in each row by another such matrix.
>
> Also try a 16k x 16k dense matrix.
>
> And a 10 x 10 dense matrix.
>
> The moral is that jBlas and similar things are great for medium-sized
> dense matrices. Sparse systems aren't helped. Large dense systems have
> problems on GPUs but work great with native BLAS. Small dense systems
> have problems with JNI boundaries and GPU memory architectures.
>
> So far, much of the Mahout work has been large sparse systems, so it has
> been worthwhile to build a sparse optimizer, but not so very worthwhile
> to build fancy stuff for the dense cases.
>
> That may have changed with the higher profile of things like ALS and
> random projection decompositions. Even k-means can be recast using random
> projection to be a dense-matrix-heavy algorithm.
>
> What do you think is the right course?
>
>
> On Wed, Aug 13, 2014 at 3:39 PM, Anand Avati <[email protected]> wrote:
>
> > On Fri, Jul 18, 2014 at 12:01 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > On Fri, Jul 18, 2014 at 11:54 AM, Anand Avati <[email protected]>
> > > wrote:
> > >
> > > > On Fri, Jul 18, 2014 at 11:42 AM, Dmitriy Lyubimov
> > > > <[email protected]> wrote:
> > > >
> > > > Coincidentally I was wildly imagining/exploring integration with
> > > > the fortran blas behind the in-core DSL using jni. I had not come
> > > > across these BIDData projects. I'm happy to reorient that effort
> > > > towards exploring these.
> > >
> > > Well, it's both. JBlas & JCublas. should be too expensive.
> > >
> > > if i had to choose, i'd say integrate jCublas first, seems to be a
> > > bit of an edge here. We already know from Sebastien's work with jblas
> > > that its integration for sparse methods is not that interesting.
> > >
> > > However, even vector-vector operations over views of gpu-stored data
> > > become somewhat interesting in context of dsl operators.
> >
> > FYI, I was toying around with a jBLAS backend for Matrix / Vector (at
> > https://github.com/apache/mahout/pull/44). Started with jBLAS only
> > because I found better documentation. Testing a 1024x1024 matrix
> > multiplication of random numbers on my laptop, I found a solid 56x
> > faster runtime:
> >
> > Run starting. Expected test count is: 1
> > DiscoverySuite:
> > JBlasSuite:
> > Normal multiplication (ms) = 15900
> > jBLAS multiplication (ms) = 284
> > - matrix multiplication
> > Run completed in 16 seconds, 793 milliseconds.
> >
> > This is a very trivial implementation with only matrix multiplication
> > optimized. Better vector integration is possible along the same lines.
> > However, for deeper integration (e.g. transparent offloading of
> > decompositions into jblas), some restructuring of the API would make it
> > simple and easy for consumers. For example, what I mean is: instead of
> > a public CholeskyDecomposition(Matrix A) constructor, have a public
> > CholeskyDecomposition choleskydecompose() method in the Matrix
> > interface. This way JBlasMatrix can transparently insert its own
> > optimized decomp code and return it as an inherited object of the
> > CholeskyDecomposition class.
> >
> > Comments/feedback welcome.
> > I also discovered there are other common code refactorings which can
> > be done (iterator, non-zero iterator code etc. repeated in many
> > places) - separate PRs for them.
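Coming back to the decomposition idea in my quoted mail above, here is a minimal sketch of the dispatch it would enable. The types below are simplified stand-ins rather than the real org.apache.mahout.math classes, and org.jblas.Decompose.cholesky() is only mentioned in a comment as the likely native entry point:

// Model of the proposed API shape: the decomposition is requested from the
// matrix itself, so a backend can substitute an optimized implementation
// while callers still get a CholeskyDecomposition back.

interface Matrix {
  CholeskyDecomposition choleskyDecompose();
}

class CholeskyDecomposition {
  // The existing pure-Java factorization would live behind this type.
}

class DenseMatrix implements Matrix {
  @Override
  public CholeskyDecomposition choleskyDecompose() {
    return new CholeskyDecomposition();        // default Java code path
  }
}

class JBlasCholeskyDecomposition extends CholeskyDecomposition {
  // Would wrap a factor computed natively, e.g. via org.jblas.Decompose.cholesky().
}

class JBlasMatrix extends DenseMatrix {
  @Override
  public CholeskyDecomposition choleskyDecompose() {
    return new JBlasCholeskyDecomposition();   // optimized decomp, same return type
  }
}

Consumers would then just write a.choleskyDecompose() and get whichever implementation the concrete Matrix provides, instead of constructing the decomposition themselves.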
