On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> That looks great Dmitry!
>
>
> The thing about Breeze that drives the complexity in it is partly
> specialization for Float, Double and Int matrices, and partly getting the
> syntax to "just work" for all combinations of matrix types and operands,
> etc. Mostly it does "just work", but occasionally not.

Yes, I noticed that, but since I am wrapping Mahout matrices, there's only a
choice of double-filled matrices and vectors. Actually, I would argue that's
the way it is supposed to be, in the interest of the KISS principle. I am not
sure I see value in "int" matrices for any problem I've ever worked on, and
skimping on precision to save space is an even more far-fetched notion, since
in real life the numbers don't take as much space as their pre-vectorized
features and annotations. In fact, model training and linear algebra are not
where the memory bottleneck fattens up at all, in my experience. There's
often exponentially growing CPU-bound behavior, yes, but not RAM.



>
>
> I am surprised that dense * sparse matrix multiplication doesn't work, but
> I guess, as I previously mentioned, the sparse matrix support is a bit shaky.
>
This is solely based on eyeballing the trait architecture; I did not
actually attempt it. But there's no single unifying trait, for sure.

>
>
> David Hall is pretty happy both to look into enhancements and to help out
> with contributions (e.g. I'm hoping to find time to look into a proper
> Diagonal matrix implementation, and he was very helpful with pointers), so
> please do drop things into the Google Group mailing list. Hopefully wider
> adoption, especially by this type of community, will drive Breeze
> development.
>
>
> On another note, I also really like Scalding's matrix API, so Scala-ish
> wrappers for Mahout would be cool - another pet project of mine is a port
> of that API to Spark too :)
>
>
> N
>
>
>
> —
> Sent from Mailbox for iPhone
>
> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
>
> > Yeah, I'm totally on board with a pretty Scala DSL on top of some of our
> > stuff.
> >
> > In particular, I've been experimenting with wrapping the
> > DistributedRowMatrix in a Scalding wrapper, so we can do things like
> >
> > val matrixAsTypedPipe =
> >   DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))
> >
> > // e.g. L1 normalize:
> > matrixAsTypedPipe.map { case (idx: Int, v: Vector) => (idx, v.normalize(1)) }
> >   .write(new DistributedRowMatrixPipe(outputPath, conf))
> >
> > // and anything else you would want to do with a Scalding TypedPipe[(Int, Vector)]
> >
> > Currently I've been doing this with a package structure directly in
> > Mahout, in:
> >
> >   mahout/contrib/scalding
> >
> > What do people think about having this be something real, after 0.8 goes
> > out?  Are we ready for contrib modules which fold in diverse external
> > projects in new ways?  Integrating directly with Pig and Scalding is a
> > bit too wide of a tent for Mahout core, but putting these integrations
> > in entirely new projects is maybe a bit too far away.
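To make the shape of that wrapper concrete: below is a self-contained sketch that models the pipe as a plain Seq[(Int, Vec)], since Scalding's TypedPipe and Mahout's classes aren't on the classpath here. Vec is a hypothetical stand-in for Mahout's Vector, not a real API.

```scala
// Stand-in for Mahout's Vector: a dense double array with an Lp normalize.
// Illustrative only; a real wrapper would use org.apache.mahout.math.Vector
// and Scalding's TypedPipe.
case class Vec(values: Array[Double]) {
  def normalize(p: Double): Vec = {
    val norm = math.pow(values.map(v => math.pow(math.abs(v), p)).sum, 1.0 / p)
    Vec(values.map(_ / norm))
  }
}

// The "pipe" is modeled as a plain Seq of (rowIndex, rowVector) pairs,
// which is exactly the shape of a TypedPipe[(Int, Vector)].
val matrixAsPipe: Seq[(Int, Vec)] =
  Seq(0 -> Vec(Array(1.0, 3.0)), 1 -> Vec(Array(2.0, 2.0)))

// e.g. L1 normalize every row, as in the email above
val l1Normalized = matrixAsPipe.map { case (idx, v) => (idx, v.normalize(1)) }

l1Normalized.foreach { case (idx, v) =>
  println(s"$idx: ${v.values.mkString(", ")}")
}
```

The `.write(...)` step is omitted because it only makes sense against a real Scalding sink; everything up to it is ordinary map-over-rows logic.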
> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> >> Dmitriy,
> >>
> >> This is very pretty.
> >>
> >>
> >>
> >>
> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >> wrote:
> >>
> >> > Ok, so I was fairly easily able to build some DSL for our matrix
> >> > manipulation (similar to Breeze) in Scala:
> >> >
> >> > inline matrix or vector:
> >> >
> >> > val  a = dense((1, 2, 3), (3, 4, 5))
> >> >
> >> > val b:Vector = (1,2,3)
> >> >
> >> > block views and assignments (element/row/vector/block/block of row or
> >> > vector)
> >> >
> >> >
> >> > a(::, 0)
> >> > a(1, ::)
> >> > a(0 to 1, 1 to 2)
> >> >
> >> > assignments
> >> >
> >> > a(0, ::) :=(3, 5, 7)
> >> > a(0, 0 to 1) :=(3, 5)
> >> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >> >
> >> > operators
> >> >
> >> > // hadamard
> >> > val c = a * b
> >> >  a *= b
> >> >
> >> > // matrix mul
> >> >  val m = a %*% b
> >> >
> >> > and a bunch of other little things like sum, mean, colMeans, etc. That
> >> > much is easy.
> >> >
> >> > Also stuff like the ones found in Breeze, along the lines of
> >> >
> >> > val (u,v,s) = svd(a)
> >> >
> >> > diag ((1,2,3))
> >> >
> >> > and Cholesky in similar ways.
> >> >
> >> > I don't have "inline" initialization for sparse things (yet) simply
> >> > because I don't need them, but of course all regular Java constructors
> >> > and methods are retained; all of that is just syntactic sugar, in the
> >> > spirit of DSLs, in the hope of making things a bit more readable.
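Sugar like the above usually boils down to a few implicit classes over the wrapped type. A minimal, self-contained sketch of that pattern — DoubleMatrix here is a hypothetical stand-in for Mahout's Matrix, and only two of the operators are shown:

```scala
// Stand-in for the wrapped matrix type (illustrative, not Mahout's Matrix).
case class DoubleMatrix(rows: Array[Array[Double]])

// An implicit class is enough to bolt operator syntax onto an existing type.
implicit class MatrixOps(val m: DoubleMatrix) {
  // Hadamard (element-wise) product, spelled `*` as in the DSL above
  def *(other: DoubleMatrix): DoubleMatrix =
    DoubleMatrix(m.rows.zip(other.rows).map { case (a, b) =>
      a.zip(b).map { case (x, y) => x * y }
    })

  // Matrix multiplication, spelled `%*%` so it cannot clash with Hadamard `*`
  def %*%(other: DoubleMatrix): DoubleMatrix = {
    val n = m.rows.length
    val k = other.rows.headOption.fold(0)(_.length)
    DoubleMatrix(Array.tabulate(n, k) { (i, j) =>
      m.rows(i).indices.map(c => m.rows(i)(c) * other.rows(c)(j)).sum
    })
  }
}

// `dense((1, 2), (3, 4))`-style constructor, here for the two-column case
def dense(rs: (Double, Double)*): DoubleMatrix =
  DoubleMatrix(rs.map(t => Array(t._1, t._2)).toArray)

val a = dense((1.0, 2.0), (3.0, 4.0))
val h = a * a    // Hadamard product
val p = a %*% a  // matrix product
```

The same trick covers `:=` assignments and slicing (`a(0, ::)`), via `apply`/`update` overloads on the implicit wrapper.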
> >> >
> >> > My (very little, and very insignificantly opinionated, really)
> >> > criticism of Breeze in this context is its inconsistency between dense
> >> > and sparse representations, namely, the lack of consistent overarching
> >> > trait(s), so that building structure-agnostic solvers like Mahout's
> >> > Cholesky solver is impossible, as is cross-type matrix use (say, the
> >> > way I understand it, it is pretty much impossible to multiply a sparse
> >> > matrix by a dense matrix).
> >> >
> >> > I suspect these problems stem from the fact that the authors for
> >> > whatever reason decided to hardwire dense things to JBlas solvers,
> >> > whereas I don't believe matrix storage structures must be. But these
> >> > problems do appear to be serious enough for me to ignore Breeze for
> >> > now. If I decide to plug in JBlas dense solvers, I guess I will just
> >> > have them as yet another top-level routine interface taking any
> >> > Matrix, e.g.
> >> >
> >> > val (u, v, s) = svd(m, jblas = true)
> >> >
> >> >
> >> >
> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >> > wrote:
> >> >
> >> > > Thank you.
> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <ted.dunn...@gmail.com>
> wrote:
> >> > >
> >> > >> I think that this contract has migrated a bit from the first
> >> > >> starting point.
> >> > >>
> >> > >> My feeling is that there is a de facto contract now that the matrix
> >> > >> slice is a single row.
> >> > >>
> >> > >> Sent from my iPhone
> >> > >>
> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dlie...@gmail.com>
> >> wrote:
> >> > >>
> >> > >> > What does Matrix.iterateAll() contractually do? Practically it
> >> > >> > seems to be row-wise iteration for some implementations, but it
> >> > >> > doesn't contractually state so in the Javadoc. What is a
> >> > >> > MatrixSlice if it is neither a row nor a column? How can I tell
> >> > >> > what exactly it is I am iterating over?
> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunn...@gmail.com>
> >> > wrote:
> >> > >> >
> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
> >> jake.man...@gmail.com>
> >> > >> >> wrote:
> >> > >> >>
> >> > >> >>>> Question #2: which in-core solvers are available for Mahout
> >> > >> >>>> matrices? I know there's SSVD, probably Cholesky; is there
> >> > >> >>>> something else? In particular, I need to be solving linear
> >> > >> >>>> systems; I guess Cholesky should be equipped enough to do
> >> > >> >>>> just that?
> >> > >> >>>>
> >> > >> >>>> Question #3: why did we try to import Colt solvers rather than
> >> > >> >>>> actually depend on Colt in the first place? Why did we not
> >> > >> >>>> accept Colt's sparse matrices and create native ones instead?
> >> > >> >>>>
> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices too,
> >> > >> >>>> and seems like a well-rounded solution. However, it doesn't
> >> > >> >>>> seem to be actively supported, whereas I know Mahout's in-core
> >> > >> >>>> matrix support has seen continued enhancements.
> >> > >> >>>>
> >> > >> >>>
> >> > >> >>> Colt was totally abandoned, and I talked to the original
> >> > >> >>> author, and he blessed its adoption.  When we pulled it in, we
> >> > >> >>> found it was woefully undertested, and we tried our best to
> >> > >> >>> hook it in with proper tests and use APIs that fit the use
> >> > >> >>> cases we had.  Plus, we already had the start of some linear
> >> > >> >>> APIs (i.e. the Vector interface), and dropping the API
> >> > >> >>> completely seemed not terribly worth it at the time.
> >> > >> >>>
> >> > >> >>
> >> > >> >> There was even more to it than that.
> >> > >> >>
> >> > >> >> Colt was under-tested, and there have been warts that had to be
> >> > >> >> pulled out in much of the code.
> >> > >> >>
> >> > >> >> But, worse than that, Colt's matrix and vector structure was a
> >> > >> >> real bugger to extend or change.  It also had all kinds of cruft
> >> > >> >> where it pretended to support matrices of arbitrary things, but
> >> > >> >> in fact only supported matrices of doubles and floats.
> >> > >> >>
> >> > >> >> So using Colt as it was (and is, since it is largely abandoned)
> >> > >> >> was a non-starter.
> >> > >> >>
> >> > >> >> As far as in-memory solvers, we have:
> >> > >> >>
> >> > >> >> 1) LR decomposition (tested and kinda fast)
> >> > >> >>
> >> > >> >> 2) Cholesky decomposition (tested)
> >> > >> >>
> >> > >> >> 3) SVD (tested)
> >> > >> >>
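On the earlier question of whether Cholesky is enough to solve linear systems: yes, for a symmetric positive-definite A, decompose A = LL^T, then forward-substitute L y = b and back-substitute L^T x = y. A plain-array sketch of that procedure (not Mahout's CholeskyDecomposition API, which wraps the same math):

```scala
// Solve A x = b for symmetric positive-definite A via Cholesky.
def choleskySolve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
  val n = a.length
  // Decompose: A = L L^T, L lower-triangular
  val l = Array.ofDim[Double](n, n)
  for (i <- 0 until n; j <- 0 to i) {
    val s = (0 until j).map(k => l(i)(k) * l(j)(k)).sum
    if (i == j) l(i)(j) = math.sqrt(a(i)(i) - s)
    else l(i)(j) = (a(i)(j) - s) / l(j)(j)
  }
  // Forward substitution: L y = b
  val y = new Array[Double](n)
  for (i <- 0 until n)
    y(i) = (b(i) - (0 until i).map(k => l(i)(k) * y(k)).sum) / l(i)(i)
  // Back substitution: L^T x = y
  val x = new Array[Double](n)
  for (i <- n - 1 to 0 by -1)
    x(i) = (y(i) - (i + 1 until n).map(k => l(k)(i) * x(k)).sum) / l(i)(i)
  x
}

// 2x2 SPD example: [[4, 2], [2, 3]] x = [10, 9]
val x = choleskySolve(Array(Array(4.0, 2.0), Array(2.0, 3.0)),
                      Array(10.0, 9.0))
```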
> >> > >>
> >> > >
> >> >
> >>
> > --
> >   -jake
>
