I'm going to respond to my own questions, to give everybody an idea of
my latest thoughts. I welcome additional perspectives, as this is a team
discussion.
1. While I find the notation awkward and tedious, I think making the
Matrix interfaces use generics is the best solution for supporting a
family of efficient implementations: from Boolean through Integer to
Double. I note that boxed types would need to replace the primitives but
also that the sparse implementations already use them so the main impact
would be on the dense implementations. I'm an old Smalltalk guy, and the
idea of everything being an object just does not bother me.
2. I've flip-flopped on the name question and am now leaning towards
renaming Matrix1D to Vector. It is more intuitive and easier to type, and
imposing name uniformity because we might someday add Matrix3D/4D
seems to be premature abstraction. I am less fervent about renaming
Matrix2D to just Matrix, but lean the same way there too. I think the
most common terms make the most sense.
3. I'm going to make an educated guess that this is the intent:
public interface DoubleFunction {
    public double apply(double arg1);
}
public interface DoubleDoubleFunction {
    public double apply(double arg1, double arg2);
}
... and that Matrix1D assign(DoubleFunction function) has the effect of
applying the function to each element of the vector destructively; and
that Matrix1D assign(Matrix1D y, DoubleDoubleFunction function) has the
effect of applying the function to each pair of corresponding elements
of the receiver and the argument, also destructively.
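A minimal sketch of the semantics guessed at in item 3 may help the discussion. DenseVectorSketch is a hypothetical stand-in for a dense Matrix1D, not the class in the patches, and passing the receiver's element as the first argument of the two-argument function is an assumption:

```java
// Interfaces as guessed in item 3 above.
interface DoubleFunction {
    double apply(double arg1);
}

interface DoubleDoubleFunction {
    double apply(double arg1, double arg2);
}

// Hypothetical stand-in for a dense Matrix1D; not the class from the patches.
final class DenseVectorSketch {
    private final double[] values;

    DenseVectorSketch(double... values) {
        this.values = values.clone(); // defensive copy of the caller's array
    }

    int size() {
        return values.length;
    }

    double get(int index) {
        return values[index];
    }

    // Destructive: overwrites each element x[i] with function(x[i]).
    DenseVectorSketch assign(DoubleFunction function) {
        for (int i = 0; i < values.length; i++) {
            values[i] = function.apply(values[i]);
        }
        return this;
    }

    // Destructive on the receiver only: x[i] = function(x[i], y[i]).
    // Passing the receiver's element as the first argument is an assumption.
    DenseVectorSketch assign(DenseVectorSketch y, DoubleDoubleFunction function) {
        if (y.size() != size()) {
            throw new IllegalArgumentException(
                "size mismatch: " + size() + " vs " + y.size());
        }
        for (int i = 0; i < values.length; i++) {
            values[i] = function.apply(values[i], y.values[i]);
        }
        return this;
    }
}
```

Under these assumptions, x.assign(y, plus) with a function that adds its two arguments would leave x holding the element-wise sum while y stays unchanged. Returning the receiver allows calls to be chained.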
Ted, how am I doing <grin>?
Jeff
-----Original Message-----
From: Jeff Eastman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 26, 2008 5:24 PM
To: [email protected]
Subject: RE: [jira] Commented: (MAHOUT-6) Need a matrix implementation
I'm always happy to delay the really hard problems. Often it means you
never actually have to solve them at all.
It sounds like SparseBinaryVector could use a Map<Integer, Boolean> to
save some space. Having to widen every value to double to sneak it
through the API is annoying. Is it desirable to use generics to avoid
this? Trying it out, a few of the operators don't apply but it seems to
work otherwise. Perhaps it's another good thing to delay?
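As a rough illustration of the generics idea, the sketch below backs one map-based vector with either Boolean or Double values. GenericVector and SparseMapVector are hypothetical names chosen for this example, and treating a caller-supplied "zero" as the value of every unstored entry is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical generic interface; the name and methods are illustrative only.
interface GenericVector<T> {
    int size();
    T get(int index);
    void set(int index, T value);
}

// Sparse vector backed by a hash map; only non-zero entries are stored.
final class SparseMapVector<T> implements GenericVector<T> {
    private final int size;
    private final T zero; // value reported for entries that are not stored
    private final Map<Integer, T> entries = new HashMap<Integer, T>();

    SparseMapVector(int size, T zero) {
        if (zero == null) {
            throw new IllegalArgumentException("zero must not be null");
        }
        this.size = size;
        this.zero = zero;
    }

    public int size() {
        return size;
    }

    public T get(int index) {
        checkIndex(index);
        T value = entries.get(Integer.valueOf(index));
        return value == null ? zero : value;
    }

    public void set(int index, T value) {
        checkIndex(index);
        if (value == null || zero.equals(value)) {
            entries.remove(Integer.valueOf(index)); // keep the map sparse
        } else {
            entries.put(Integer.valueOf(index), value);
        }
    }

    // Number of entries actually held in the map.
    int storedEntries() {
        return entries.size();
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

A SparseMapVector<Boolean> created with Boolean.FALSE as its zero never widens a value to double, while a SparseMapVector<Double> created with 0.0 behaves like the existing sparse case. Arithmetic operators are deliberately left out, since those are the operations that do not apply uniformly across element types.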
Even though I got hung up on side-effects <grin>, I'm no purist either.
Since we are inventing a new package though, perhaps we ought to stick
with the Matrix1D terminology. It is harder to type but it is more
uniform as you noted. It is also much easier to change now than later
once it is committed.
Finally, a few stories about the remaining DoubleDoubleFunction and
DoubleFunction operations I've procrastinated on would be helpful.
Jeff
-----Original Message-----
From: Ted Dunning (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 26, 2008 3:49 PM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-6) Need a matrix implementation
[ https://issues.apache.org/jira/browse/MAHOUT-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572720#action_12572720 ]
Ted Dunning commented on MAHOUT-6:
----------------------------------
A hash map is a great first implementation for a sparse vector.
Ultimately, it will need to be replaced, but delaying that day is a good
thing. Also, a really efficient structure is a pain in the ass to get
exactly right. The hash map you have will work right off the bat.
The primary use of SparseBinaryVector is as a row or column of a
SparseBinaryMatrix. A binary matrix is useful in cases where reduction
to binary values makes sense (many behavioral analysis cases are good
for that, as are many text analysis cases). It only makes sense,
however, when there is beginning to be serious memory pressure, since
its virtue is that you save 8 bytes per value. That can be 2/3 of the
storage of some matrices. For some of my key programs, I need fast row
and column access to very large binary matrices, and getting 3x larger
matrices to fit in memory (and buying more memory) really helped.
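One way the 2/3 figure can arise, sketched under an assumption that is not stated in the comment: each stored entry of a sparse double matrix costs a 4-byte int index plus an 8-byte double value, and a binary matrix keeps only the index. IndexOnlyBinaryVector is a hypothetical class for illustration, and object and array overhead are ignored:

```java
import java.util.Arrays;

// Hypothetical binary sparse vector that stores only the indices of its set
// entries. The caller is assumed to pass distinct indices.
final class IndexOnlyBinaryVector {
    private final int[] sortedIndices;

    IndexOnlyBinaryVector(int... setIndices) {
        sortedIndices = setIndices.clone();
        Arrays.sort(sortedIndices); // sorted so that get() can binary-search
    }

    // True when the given index is one of the stored (set) entries.
    boolean get(int index) {
        return Arrays.binarySearch(sortedIndices, index) >= 0;
    }

    // Number of set entries.
    int cardinality() {
        return sortedIndices.length;
    }

    // Payload bytes when each entry keeps a 4-byte index and an 8-byte double.
    static long bytesWithValues(long nonZeros) {
        return nonZeros * (4 + 8);
    }

    // Payload bytes when each entry keeps only the 4-byte index.
    static long bytesIndexOnly(long nonZeros) {
        return nonZeros * 4;
    }
}
```

With these assumed sizes, a million non-zero entries take 12,000,000 payload bytes with values and 4,000,000 without, so dropping the 8-byte value removes 8/12 = 2/3 of the storage and lets a matrix three times larger fit in the same memory.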
I used Matrix1D out of inertia from Colt. The only virtue to the
notation is that it makes sense to go eventually to Matrix3D and
Matrix4D, but the vector terminology is so well known that I wouldn't
think it a problem. Nobody is ever going to be confused. Some purists
might object that a vector is an object from linear algebra whereas what
we have is a single-indexed array with a few linear algebra operations
tacked on. I am not a purist.
> Need a matrix implementation
> ----------------------------
>
> Key: MAHOUT-6
> URL: https://issues.apache.org/jira/browse/MAHOUT-6
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ted Dunning
> Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff, MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff
>
>
> We need matrices for Mahout.
> An initial set of basic requirements includes:
> a) sparse and dense support are required
> b) row and column labels are important
> c) serialization for hadoop use is required
> d) reasonable floating point performance is required, but awesome FP is not
> e) the API should be simple enough to understand
> f) it should be easy to carve out sub-matrices for sending to different reducers
> g) a reasonable set of matrix operations should be supported, these should eventually include:
>    simple matrix-matrix and matrix-vector and matrix-scalar linear algebra operations, A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
>    row and column sums
>    generalized level 2 and 3 BLAS primitives, alpha A B + beta C and A u + beta v
> h) easy and efficient iteration constructs, especially for sparse matrices
> i) easy to extend with new implementations
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.