As I indicated, I think it is a worthy move. As I said before (including on
the Spark list), it is true that sparse algebra is by far more compelling
for us than dense algebra; however, there are some considerations that make
this work very much worthwhile. To sum up my motivations:

(1) Even in the methods currently in place, dense multiplications and
decompositions are happening, and native backends may actually speed things
up in certain cases.

(2) Since the main idea is ease of customization, how useful this is for
what's already inside should be a fairly low consideration; what matters is
potential use. I have developed, internally, methods using that algebra
that by sheer number outnumber those present in Mahout. Assuming other
power users will do the same (which is still largely just a hope at this
point), we'd look like cavemen if we did not provide jCuda and jBlas
bindings.

So that sums up the motivation.

Re: the pull request. That's a good start.

As was mentioned in previous discussions, we are lacking a cost-based
optimizer for binary matrix operators, the same way it was done for vectors.

E.g. we need some sort of generic entry point into matrix-matrix operations
that makes the specific algorithm selection based on operand types. For
sparse types, some algorithms were already added by Ted, but they were not
connected to this decision tree properly. For dense types, we will probably
need to run some empirical cost calibration analysis (i.e. if arg A has
native type-T multiplication and arg B does not, will it be faster to
convert B to native T and proceed natively, or vice versa, given the
geometry and number of elements, etc.). IMO this stuff presents pretty
unique architectural opportunities for matrix-centric operations.
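To make the idea concrete, here is a rough sketch of what such a type-driven
entry point could look like. All names here (MatrixMultDispatch,
hasNativeBacking, the algorithm labels, the flop threshold) are hypothetical
illustrations, not Mahout API; the actual selection logic would come from the
empirical calibration described above.

```java
// Sketch only: a type-driven entry point for matrix-matrix multiply.
// Names and thresholds are illustrative, not Mahout API.
public final class MatrixMultDispatch {

    interface Matrix {
        int rows();
        int cols();
        boolean isDense();
        boolean hasNativeBacking(); // e.g. jBLAS- or jCublas-resident storage
    }

    // Minimal stand-in factory so the dispatch logic can be exercised.
    static Matrix mat(final int r, final int c, final boolean dense, final boolean nat) {
        return new Matrix() {
            public int rows() { return r; }
            public int cols() { return c; }
            public boolean isDense() { return dense; }
            public boolean hasNativeBacking() { return nat; }
        };
    }

    static String chooseAlgorithm(Matrix a, Matrix b) {
        // Both operands already native: hand straight to native gemm.
        if (a.hasNativeBacking() && b.hasNativeBacking()) return "native-gemm";
        // Exactly one native operand: an empirically calibrated cost model would
        // decide whether converting the other side and multiplying natively beats
        // staying in the JVM, based on geometry and element counts.
        if (a.hasNativeBacking() || b.hasNativeBacking()) {
            long flops = (long) a.rows() * a.cols() * b.cols();
            return flops > 1_000_000L ? "convert-then-native-gemm" : "jvm-multiply";
        }
        // Both sparse: route to one of the sparse-aware kernels.
        if (!a.isDense() && !b.isDense()) return "sparse-sparse";
        return "jvm-multiply";
    }
}
```

The point is only that operand-type inspection plus a calibrated cost model
lives behind one entry point, so callers never pick kernels themselves.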

On another note, I think it is not worthwhile to support LAPACK/CUDA
operations for vectors.




On Wed, Aug 13, 2014 at 3:39 PM, Anand Avati <[email protected]> wrote:

> On Fri, Jul 18, 2014 at 12:01 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > On Fri, Jul 18, 2014 at 11:54 AM, Anand Avati <[email protected]> wrote:
> >
> > > On Fri, Jul 18, 2014 at 11:42 AM, Dmitriy Lyubimov <[email protected]>
> > > wrote:
> > >
> > >
> > > Coincidentally, I was wildly imagining/exploring integration with the
> > > Fortran BLAS behind the in-core DSL using JNI. I had not come across
> > these
> > > BIDData projects. I'm happy to reorient that effort towards exploring
> > > these.
> > >
> >
> > Well, it's both: JBlas & JCublas. Shouldn't be too expensive.
> >
> > If I had to choose, I'd say integrate jCublas first; it seems to have a
> > bit of an edge here. We already know from Sebastien's work with jblas
> > that its integration for sparse methods is not that interesting.
> >
> > However, even vector-vector operations over views of GPU-stored data
> > become somewhat interesting in the context of DSL operators.
> >
>
> FYI, I was toying around a jBLAS backend for Matrix / Vector (at
> https://github.com/apache/mahout/pull/44). Started with jBLAS only because
> I found better documentation. Testing on my laptop a 1024x1024 matrix
> multiplication of random numbers, found a solid 56x faster runtime:
>
> Run starting. Expected test count is: 1
> DiscoverySuite:
> JBlasSuite:
> Normal multiplication (ms) = 15900
> jBLAS multiplication (ms) = 284
> - matrix multiplication
> Run completed in 16 seconds, 793 milliseconds.
>
>
> This is a very trivial implementation with only matrix multiplication
> optimized. Better vector integration is possible along the same steps.
> However, for deeper integration (e.g. transparent offloading of
> decompositions into jblas), some restructuring of the API will make it
> simple and easy for consumers. For example, what I mean is: instead of the
> public CholeskyDecomposition(Matrix A) constructor, have a public
> CholeskyDecomposition choleskydecompose() method in the Matrix interface.
> This way JBlasMatrix can transparently insert its own optimized
> decomposition code and return it as an inherited object of the
> CholeskyDecomposition class.
>
> Comments/feedback welcome.
>
> I also discovered there are other common code refactorings which can be
> done (iterator and non-zero iterator code, etc., repeated in many places);
> separate PRs for them.
>
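Re: the decomposition-as-method idea quoted above, a rough sketch of the shape
it could take: the base class supplies a generic JVM Cholesky, and a
native-backed subclass overrides it. All class and method names here are
illustrative only, not the current Mahout API.

```java
// Sketch: decomposition exposed as an overridable method on the matrix type.
// Names are illustrative, not the existing Mahout API.
interface Cholesky {
    double[][] l(); // lower-triangular factor L with A = L * L^T
}

abstract class BaseMatrix {
    abstract double[][] data();

    // Generic JVM implementation; a hypothetical JBlasMatrix/JCublasMatrix
    // subclass would override this, call the native routine (e.g. LAPACK
    // dpotrf), and return its own Cholesky subtype transparently.
    Cholesky choleskyDecompose() {
        double[][] a = data();
        int n = a.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = 0;
                for (int k = 0; k < j; k++) sum += l[i][k] * l[j][k];
                l[i][j] = (i == j)
                    ? Math.sqrt(a[i][i] - sum)
                    : (a[i][j] - sum) / l[j][j];
            }
        }
        return () -> l;
    }
}
```

Callers then just write m.choleskyDecompose() and get whatever backend-optimized
implementation the concrete matrix type provides.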
