[jira] Commented: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Ted Dunning (JIRA) Wed, 20 Jan 2010 11:59:17 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802960#action_12802960
 ]


Ted Dunning commented on MAHOUT-263:
------------------------------------

The idea of iterating through inputs sequentially is absolutely key to good 
performance on sequential algorithms with good abstraction.

Some algorithms need to permute inputs to some degree, but that is easily 
handled to a moderate degree by buffering some number of rows and presenting 
them in randomized order.

{quote}
Question is: should this go for all Matrices, or just SparseRowMatrix?  It's 
really tricky to have a matrix which is iterable both as sparse rows *and* 
sparse columns.  I guess the point would be that by default, it iterates over 
rows, unless it's SparseColumnMatrix, which obviously iterates over columns.

Thoughts?  Having to rely on random-access to a distributed-backed matrix is 
making me jump through silly extra hoops on some of the stuff I'm working on 
patches for.
{quote}
My feeling is that we don't need to support iteration both ways.  It would be 
nice, but the performance hit is so prodigious that it just isn't worth it.  In 
the past, where I needed to support column and row access, generally stored two 
copies each optimized for a different kind of access.  This is very rarely 
needed since most algorithms are very row or column centric and the data can be 
transposed (and thus permuted to match the desired access pattern) ahead of 
time to accommodate the need.  In many cases, the loop nesting can be 
rearranged as well to allow sequential row access to serve as well as column 
access, especially if map-reduce can be used to rearrange intermediate products.
  

> Matrix interface should extend Iterable<Vector> for better integration with 
> distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes 
> over the data, but don't need to see the entire matrix at once.  The way they 
> would be implemented currently is:
> {code}
> Matrix m = getInputCorpus();
> for(int i=0; i<m.numRows(); i++) {
>   Vector v = m.getRow(i);
>   doStuffWithRow(v); 
> }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, 
> this algorithm outline doesn't make sense, because it requires lots of 
> sequential random access reads.  What makes more sense, and works for 
> in-memory matrices too, is something like the following:
> {code}
> public interface Matrix extends Iterable<Vector> { 
> {code}
> which allows for algorithms which only need iterators over Vectors do use 
> them as such:
> {code}
> Matrix m = getInputCorpus();
> Iterator<Vector> it = m.iterator();
> Vector v;
> while(it.hasNext() && (v = it.next()) != null) {
>   doStuffWithRow(v); 
> }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base 
> class, so implementing this idea would be transparent to all current Mahout 
> code.  Additionally, pulling out two layers of AbstractMatrix - one which 
> only knows how to do the things which can be done using iterators (like 
> times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), 
> which would be the direct base class for DistributedMatrix (or HDFSMatrix), 
> but all the random-access matrix methods currently in AbstractMatrix would go 
> in another abstract base class of the first one (which could be called 
> AbstractVectorIterable, say).
> I think Iteratable<Vector> could be made more flexible by extending that to a 
> new interface VectorIterable, which provided iterateAll() and 
> iterateNonEmpty(), in case document Ids were sparse, and could also allow for 
> the possibility of adding other methods (things like skipTo(int rowNum), 
> perhaps).  
> Question is: should this go for all Matrices, or just SparseRowMatrix?  It's 
> really tricky to have a matrix which is iterable both as sparse rows *and* 
> sparse columns.  I guess the point would be that by default, it iterates over 
> rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
> Thoughts?  Having to rely on random-access to a distributed-backed matrix is 
> making me jump through silly extra hoops on some of the stuff I'm working on 
> patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Reply via email to