RE: [jira] Commented: (MAHOUT-6) Need a matrix implementation

Jeff Eastman Tue, 26 Feb 2008 17:24:44 -0800

I'm always happy to delay the really hard problems. Often it means you
never actually have to solve them at all.


It sounds like SparseBinaryVector could use a Map<Integer, Boolean> to
save some space. Having to widen every value to double to sneak it
through the API is annoying. Is it desirable to use generics to avoid
this? Trying it out, a few of the operators don't apply but it seems to
work otherwise. Perhaps it's another good thing to delay?

Even though I got hung up on side-effects <grin>, I'm no purist either.
Since we are inventing a new package though, perhaps we ought to stick
with the Matrix1D terminology. It is harder to type but it is more
uniform as you noted. It is also much easier to change now than later
once it is committed.

Finally, a few stories about the remaining DoubleDoubleFunction and
DoubleFunction operations I've procrastinated on would be helpful.
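
For reference, a minimal sketch of what those operations usually look
like. It assumes the Colt-style convention of an in-place assign plus a
map-then-fold aggregate. The interfaces are declared locally here, and
DenseVectorSketch and its method signatures are hypothetical, not the
API in the attached patches.

```java
// Hypothetical stand-ins for the two function types discussed in the thread.
interface DoubleFunction {
    double apply(double x);
}

interface DoubleDoubleFunction {
    double apply(double x, double y);
}

// Illustrative dense vector; not the class from the MAHOUT-6 patches.
class DenseVectorSketch {
    private final double[] values;

    DenseVectorSketch(double... values) {
        this.values = values.clone();
    }

    int size() {
        return values.length;
    }

    double get(int index) {
        return values[index];
    }

    // In place: x[i] = f(x[i]). This is the side-effecting style.
    DenseVectorSketch assign(DoubleFunction f) {
        for (int i = 0; i < values.length; i++) {
            values[i] = f.apply(values[i]);
        }
        return this;
    }

    // In place: x[i] = f(x[i], y[i]).
    DenseVectorSketch assign(DenseVectorSketch other, DoubleDoubleFunction f) {
        if (other.size() != size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        for (int i = 0; i < values.length; i++) {
            values[i] = f.apply(values[i], other.values[i]);
        }
        return this;
    }

    // Fold: aggr(... aggr(map(x[0]), map(x[1])) ..., map(x[n-1])).
    // Returns NaN for an empty vector.
    double aggregate(DoubleDoubleFunction aggr, DoubleFunction map) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double result = map.apply(values[0]);
        for (int i = 1; i < values.length; i++) {
            result = aggr.apply(result, map.apply(values[i]));
        }
        return result;
    }
}
```

With this shape, a sum of squares is aggregate((a, b) -> a + b,
x -> x * x), and u + v is u.assign(v, (x, y) -> x + y), which
overwrites u.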

Jeff

-----Original Message-----
From: Ted Dunning (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 26, 2008 3:49 PM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-6) Need a matrix implementation


    [ https://issues.apache.org/jira/browse/MAHOUT-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572720#action_12572720 ]

Ted Dunning commented on MAHOUT-6:
----------------------------------


A hash map is a great first implementation for a sparse vector.
Ultimately, it will need to be replaced, but delaying that day is a
good thing. Also, a really efficient structure is a pain in the ass to
get exactly right. The hash map you have will work right off the bat.
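
A minimal sketch of the hash-map approach, for readers who have not
opened the patch. HashSparseVector and its methods are hypothetical
names chosen for illustration, not the code in the attached diffs. Only
non-zero entries are stored, and a missing key reads as 0.0.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only; not the class from the MAHOUT-6 patches.
class HashSparseVector {
    private final int cardinality;
    private final Map<Integer, Double> values = new HashMap<>();

    HashSparseVector(int cardinality) {
        this.cardinality = cardinality;
    }

    double get(int index) {
        checkIndex(index);
        Double value = values.get(index);
        return value == null ? 0.0 : value;
    }

    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index); // keep the map free of explicit zeros
        } else {
            values.put(index, value);
        }
    }

    int nonZeroCount() {
        return values.size();
    }

    // Iterates over the vector with fewer stored entries and probes the other.
    double dot(HashSparseVector other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        HashSparseVector small = values.size() <= other.values.size() ? this : other;
        HashSparseVector large = small == this ? other : this;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : small.values.entrySet()) {
            sum += entry.getValue() * large.get(entry.getKey());
        }
        return sum;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

The boxed Integer and Double objects are the main cost of this design,
which is the reason it eventually needs replacing.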

The primary use of SparseBinaryVector is as a row or column of a
SparseBinaryMatrix. A binary matrix is useful in cases where reduction
to binary values makes sense (many behavioral analysis cases are good
for that, as are many text analysis cases). It only makes sense,
however, when memory pressure is becoming serious, since its virtue is
that you save 8 bytes per value. That can be 2/3 of the storage of some
matrices. For some of my key programs, I need fast row and column
access to very large binary matrices, and getting 3x larger matrices to
fit in memory (and buying more memory) really helped.
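
The 2/3 figure is consistent with a stored entry costing a 4-byte int
index plus an 8-byte double: dropping the double leaves 4 of 12 bytes,
so about 3x as many entries fit. That accounting is an assumption here,
and it ignores any per-object overhead. A minimal sketch of a binary
vector that stores only the positions of its 1 entries follows. The
class name and its sorted-array layout are hypothetical, not the
SparseBinaryVector implementation discussed above.

```java
import java.util.Arrays;

// Illustrative only: stores the indices of the 1 entries, 4 bytes each.
class SortedIndexBinaryVector {
    private final int cardinality;
    private final int[] indices; // sorted, distinct positions of the 1 entries

    SortedIndexBinaryVector(int cardinality, int... ones) {
        this.cardinality = cardinality;
        this.indices = Arrays.stream(ones).sorted().distinct().toArray();
        for (int index : indices) {
            checkIndex(index);
        }
    }

    boolean get(int index) {
        checkIndex(index);
        return Arrays.binarySearch(indices, index) >= 0;
    }

    int nonZeroCount() {
        return indices.length;
    }

    // For binary vectors the dot product is the size of the intersection,
    // computed here by walking both sorted index arrays in step.
    int dot(SortedIndexBinaryVector other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                count++;
                i++;
                j++;
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```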

I used Matrix1D out of inertia from Colt. The only virtue to the
notation is that it makes sense to go eventually to Matrix3D and
Matrix4D, but the vector terminology is so well known that I wouldn't
think it a problem. Nobody is ever going to be confused. Some purists
might object that a vector is an object from linear algebra whereas
what we have is a single-indexed array with a few linear algebra
operations tacked on. I am not a purist.





> Need a matrix implementation
> ----------------------------
>
>                 Key: MAHOUT-6
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-6
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff, MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff
>
>
> We need matrices for Mahout.
> An initial set of basic requirements includes:
> a) sparse and dense support are required
> b) row and column labels are important
> c) serialization for hadoop use is required
> d) reasonable floating point performance is required, but awesome FP is not
> e) the API should be simple enough to understand
> f) it should be easy to carve out sub-matrices for sending to different reducers
> g) a reasonable set of matrix operations should be supported, these should eventually include:
>     simple matrix-matrix and matrix-vector and matrix-scalar linear algebra operations, A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
>     row and column sums
>     generalized level 2 and 3 BLAS primitives, alpha A B + beta C and A u + beta v
> h) easy and efficient iteration constructs, especially for sparse matrices
> i) easy to extend with new implementations

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
