I'm always happy to delay the really hard problems. Often it means you never actually have to solve them at all.
It sounds like SparseBinaryVector could use a Map<Integer, Boolean> to save some space. Having to widen every value to double to sneak it through the API is annoying. Is it desirable to use generics to avoid this? Trying it out, a few of the operators don't apply, but it seems to work otherwise. Perhaps it's another good thing to delay?

Even though I got hung up on side-effects <grin>, I'm no purist either. Since we are inventing a new package, though, perhaps we ought to stick with the Matrix1D terminology. It is harder to type, but it is more uniform, as you noted. It is also much easier to change now than later, once it is committed.

Finally, a few stories about the remaining DoubleDoubleFunction and DoubleFunction operations I've procrastinated on would be helpful.

Jeff

-----Original Message-----
From: Ted Dunning (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 26, 2008 3:49 PM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-6) Need a matrix implementation

[ https://issues.apache.org/jira/browse/MAHOUT-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572720#action_12572720 ]

Ted Dunning commented on MAHOUT-6:
----------------------------------

A hash map is a great first implementation for a sparse vector. Ultimately, it will need to be replaced, but delaying that day is a good thing. Also, a really efficient structure is a pain in the ass to get exactly right. The hash map you have will work right off the bat.

The primary use of SparseBinaryVector is as a row or column of a SparseBinaryMatrix. A binary matrix is useful in cases where reduction to binary values makes sense (many behavioral analysis cases are good for that, as are many text analysis cases). It only makes sense, however, once memory pressure starts to become serious, since its virtue is that you save 8 bytes per value. That can be 2/3 of the storage of some matrices.
For some of my key programs, I need fast row and column access to very large binary matrices, and getting 3x larger matrices to fit in memory (and buying more memory) really helped.

I used Matrix1D out of inertia from Colt. The only virtue of the notation is that it makes sense to go eventually to Matrix3D and Matrix4D, but the vector terminology is so well known that I wouldn't think it a problem. Nobody is ever going to be confused. Some purists might object that a vector is an object from linear algebra whereas what we have is a single-indexed array with a few linear algebra operations tacked on. I am not a purist.

> Need a matrix implementation
> ----------------------------
>
>                 Key: MAHOUT-6
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-6
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff, MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff
>
>
> We need matrices for Mahout.
> An initial set of basic requirements includes:
> a) sparse and dense support are required
> b) row and column labels are important
> c) serialization for hadoop use is required
> d) reasonable floating point performance is required, but awesome FP is not
> e) the API should be simple enough to understand
> f) it should be easy to carve out sub-matrices for sending to different reducers
> g) a reasonable set of matrix operations should be supported, these should eventually include:
>     simple matrix-matrix and matrix-vector and matrix-scalar linear algebra operations, A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
>     row and column sums
>     generalized level 2 and 3 BLAS primitives, alpha A B + beta C and A u + beta v
> h) easy and efficient iteration constructs, especially for sparse matrices
> i) easy to extend with new implementations
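
To make the generics question concrete, here is a rough sketch of what a hash-map-backed sparse vector with a type parameter might look like. The class name HashSparseVector and all of its methods are made up for illustration; none of this comes from the MAHOUT-6 patches or from Colt. It also shows why "a few of the operators don't apply": arithmetic such as dot() only makes sense for a numeric element type, so it ends up as a non-generic helper.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only: a sparse vector backed by a hash map. */
class HashSparseVector<T> {
    private final int cardinality;              // logical length of the vector
    private final T zero;                       // value implied for absent indices
    private final Map<Integer, T> values = new HashMap<>();

    HashSparseVector(int cardinality, T zero) {
        this.cardinality = cardinality;
        this.zero = zero;
    }

    T get(int index) {
        checkIndex(index);
        T v = values.get(index);
        return v == null ? zero : v;
    }

    void set(int index, T value) {
        checkIndex(index);
        if (value == null || value.equals(zero)) {
            values.remove(index);               // never store the implied value
        } else {
            values.put(index, value);
        }
    }

    int size() { return cardinality; }

    int storedCount() { return values.size(); }

    /** Arithmetic needs a numeric element type, so this operator is not generic. */
    static double dot(HashSparseVector<Double> a, HashSparseVector<Double> b) {
        if (a.cardinality != b.cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        double sum = 0.0;
        // Only indices stored in both vectors can contribute to the sum.
        for (Map.Entry<Integer, Double> e : a.values.entrySet()) {
            Double other = b.values.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

A HashSparseVector<Boolean> created with Boolean.FALSE as the implied value behaves like a binary vector without widening anything to double. Note that this only tidies up the API: a HashMap of boxed Integer and Boolean objects carries per-entry object overhead, so the 8-bytes-per-value saving discussed above would still need a primitive-based structure later.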
