[ 
https://issues.apache.org/jira/browse/MAHOUT-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504406#comment-14504406
 ] 

Andrew Palumbo edited comment on MAHOUT-1693 at 4/21/15 4:57 PM:
-----------------------------------------------------------------

bq. "The question is why do we need so much memory? A 5000x5000 matrix of doubles 
should only take up ~200MB of space?"

So it seems like the real memory hog here is:

{code:title=AbstractMatrix.java|borderStyle=solid}
  public String toString() {
    StringBuilder s = new StringBuilder("{\n");
    Iterator<MatrixSlice> it = iterator();
    while (it.hasNext()) {
      MatrixSlice next = it.next();
      s.append("  ").append(next.index()).append("  =>\t").append(next.vector()).append('\n');
    }
    s.append("}");
    return s.toString();
  }
{code}

I.e., each time a large in-core matrix is the result of an operation or a 
function within the spark-shell, its .toString() method is called (though 
the output is truncated by the shell itself).

So if the result of an operation or function is, e.g., a dense matrix of 5000 x 
5000 doubles, the spark-shell actually tries to create a String representation 
of all 25,000,000 doubles.
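For scale, a back-of-the-envelope sketch (the ~20-characters-per-rendered-double figure is an assumption, not a measured value):

```java
public class MatrixSizeEstimate {
  public static void main(String[] args) {
    int n = 5000;
    long elements = (long) n * n;   // 25,000,000 doubles
    long rawBytes = elements * 8;   // 200,000,000 bytes, i.e. ~200MB
    // Assume ~20 chars per rendered double; Java chars are 2 bytes (UTF-16),
    // so the String form alone approaches 1GB, before counting the
    // intermediate copies StringBuilder makes while resizing.
    long stringBytesEstimate = elements * 20 * 2;
    System.out.println(elements + " " + rawBytes + " " + stringBytesEstimate);
  }
}
```

This is consistent with the raw matrix fitting comfortably in a few hundred MB while the string rendering blows past the default heap.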

I'm not sure that this .toString() method was ever intended to be called on a 
large matrix. A better fix might be to display only an upper-left block of a 
reasonable number of rows and columns.
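One possible shape for such a fix, as a self-contained sketch only: the MAX_ROWS/MAX_COLS limits and the toShortString name are illustrative assumptions, and a plain double[][] stands in for Mahout's Matrix so the example runs on its own (in AbstractMatrix itself one would use rowSize()/columnSize()/get(i, j) instead).

```java
public class TruncatedToString {
  // Illustrative limits; a real fix in AbstractMatrix would pick sensible defaults.
  static final int MAX_ROWS = 10;
  static final int MAX_COLS = 10;

  // Render only the upper-left block; output size is bounded
  // no matter how large the matrix is.
  static String toShortString(double[][] m) {
    StringBuilder s = new StringBuilder("{\n");
    int rows = Math.min(m.length, MAX_ROWS);
    for (int i = 0; i < rows; i++) {
      int cols = Math.min(m[i].length, MAX_COLS);
      s.append("  ").append(i).append("  =>\t{");
      for (int j = 0; j < cols; j++) {
        s.append(m[i][j]);
        if (j < cols - 1) s.append(", ");
      }
      if (cols < m[i].length) s.append(", ...");
      s.append("}\n");
    }
    if (rows < m.length) s.append("  ...\n");
    s.append("}");
    return s.toString();
  }

  public static void main(String[] args) {
    // A 5000x5000 matrix: the rendering stays a few hundred chars,
    // not hundreds of MB.
    System.out.println(toShortString(new double[5000][5000]));
  }
}
```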





> FunctionalMatrixView materializes row vectors in scala shell
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1693
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1693
>             Project: Mahout
>          Issue Type: Bug
>          Components: Mahout spark shell, Math
>    Affects Versions: 0.10.0
>            Reporter: Suneel Marthi
>            Assignee: Andrew Palumbo
>            Priority: Blocker
>             Fix For: 0.10.1
>
>
> FunctionalMatrixView materializes row vectors in scala shell.
> Problem first reported by a user Michael Alton, Intel:
> "When I first tried to make a large matrix, I got an out of Java heap space 
> error. I increased the memory incrementally until I got it to work. “export 
> MAHOUT_HEAPSIZE=8000” didn’t work, but “export MAHOUT_HEAPSIZE=64000” did. 
> The question is why do we need so much memory? A 5000x5000 matrix of doubles 
> should only take up ~200MB of space?"
> Problem has been narrowed down to the non-overridden toString() method in 
> FunctionalMatrixView, which causes it to materialize all of the row vectors 
> when run in the Mahout Spark Shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
