[ https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Mannix updated MAHOUT-263:
-------------------------------

    Attachment:     (was: MAHOUT-263.diff)

> Matrix interface should extend Iterable<Vector> for better integration with distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes
> over the data, but don't need to see the entire matrix at once. The way they
> would be implemented currently is:
> {code}
>   Matrix m = getInputCorpus();
>   for(int i=0; i<m.numRows(); i++) {
>     Vector v = m.getRow(i);
>     doStuffWithRow(v);
>   }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>,
> this algorithm outline doesn't make sense, because every getRow(i) call
> becomes a random-access read, even though the rows are visited in order.
> What makes more sense, and works for in-memory matrices too, is something
> like the following:
> {code}
>   public interface Matrix extends Iterable<Vector> {
> {code}
> which allows algorithms which only need iterators over Vectors to use
> them as such:
> {code}
>   Matrix m = getInputCorpus();
>   Iterator<Vector> it = m.iterator();
>   Vector v;
>   while(it.hasNext() && (v = it.next()) != null) {
>     doStuffWithRow(v);
>   }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base
> class, so implementing this idea would be transparent to all current Mahout
> code.
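A minimal sketch of how a base class might supply iterator() on top of the existing numRows() and getRow() methods. This is not the attached patch: SimpleMatrix, AbstractSimpleMatrix, DenseRows and sumAll are hypothetical names made up for illustration, and a plain double[] stands in for Mahout's Vector so the example compiles on its own.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class RowIterableSketch {

    /** Hypothetical stand-in for Matrix; a row is a double[] rather than a Mahout Vector. */
    interface SimpleMatrix extends Iterable<double[]> {
        int numRows();
        double[] getRow(int row);
    }

    /** Hypothetical base class: derives iterator() from numRows() and getRow(). */
    abstract static class AbstractSimpleMatrix implements SimpleMatrix {
        @Override
        public Iterator<double[]> iterator() {
            return new Iterator<double[]>() {
                private int nextRow = 0; // index of the row returned by the next call to next()

                @Override
                public boolean hasNext() {
                    return nextRow < numRows();
                }

                @Override
                public double[] next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    return getRow(nextRow++);
                }
            };
        }
    }

    /** In-memory implementation; it inherits iterator() without any extra code. */
    static final class DenseRows extends AbstractSimpleMatrix {
        private final double[][] rows;

        DenseRows(double[][] rows) {
            this.rows = rows;
        }

        @Override
        public int numRows() {
            return rows.length;
        }

        @Override
        public double[] getRow(int row) {
            return rows[row];
        }
    }

    /** A single-pass algorithm that only needs Iterable, never random access. */
    static double sumAll(Iterable<double[]> rows) {
        double total = 0.0;
        for (double[] row : rows) {
            for (double value : row) {
                total += value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        SimpleMatrix m = new DenseRows(new double[][] {{1, 2}, {3, 4}, {5, 6}});
        System.out.println(sumAll(m));
    }
}
```

Because sumAll only asks for an Iterable, a matrix backed by a sequential file could presumably override iterator() with a streaming reader and reuse the same algorithm unchanged.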
> Additionally, AbstractMatrix could be pulled apart into two layers. The
> first (which could be called AbstractVectorIterable, say) would only know
> how to do the things which can be done using iterators (like times(Vector),
> timesSquared(Vector), plus(Matrix), assignRow(), etc...), and would be the
> direct base class for DistributedMatrix (or HDFSMatrix). All the
> random-access matrix methods currently in AbstractMatrix would go in a
> second abstract class extending the first one.
>
> I think Iterable<Vector> could be made more flexible by extending it to a
> new interface VectorIterable, which provides iterateAll() and
> iterateNonEmpty(), in case document Ids are sparse, and which could also
> allow for adding other methods (things like skipTo(int rowNum), perhaps).
>
> Question is: should this go for all Matrices, or just SparseRowMatrix? It's
> really tricky to have a matrix which is iterable both as sparse rows *and*
> sparse columns. I guess the point would be that by default, it iterates over
> rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
>
> Thoughts? Having to rely on random access to a distributed-backed matrix is
> making me jump through silly extra hoops on some of the stuff I'm working on
> patches for.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.