[
https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288243#comment-13288243
]
Shannon Quinn commented on MAHOUT-986:
--------------------------------------
Ahhh, excellent catch, Sebastian. I suspect that's the problem.
The test case was on my own setup: 4 Ubuntu VMs, each with 4GB memory (2GB
heap, if I recall correctly) running a fully distributed Hadoop environment. I
can't replicate that setup at the moment since I have not yet reassembled my
computer after moving, but I've attached the patch that implements the fix
Sebastian suggested. Given the parallel constructor signatures of LanczosState
and DistributedLanczosSolver, I'm confident this is what we want.
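
As a rough back-of-the-envelope illustration of why that argument matters (this
is not LanczosState's actual internals, and the rank value is only an assumed
example), compare the heap needed for a rank-by-n basis against an n-by-n
matrix at the dbpedia scale:

{code:java}
// Illustrative only: estimates the heap a dense double matrix would need,
// comparing a rank-by-n basis against the n-by-n allocation seen in the
// stack trace below. Not Mahout code; the rank of 200 is an assumed value.
public class BasisMemoryEstimate {

  static double gigabytes(long rows, long cols) {
    return rows * cols * 8.0 / (1L << 30); // 8 bytes per double, in GiB
  }

  public static void main(String[] args) {
    long n = 4192499L;  // nodes in the dbpedia graph from this issue
    long rank = 200L;   // a plausible desired decomposition rank (assumption)

    System.out.printf("rank x n basis : %,.1f GB%n", gigabytes(rank, n)); // ~6 GB
    System.out.printf("n x n matrix   : %,.1f GB%n", gigabytes(n, n));    // ~131,000 GB
  }
}
{code}

Even a generous rank keeps the basis in the single-digit-GB range, while the
full-dimension allocation is beyond any JVM heap, which matches the
OutOfMemoryError in the trace below.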
> OutOfMemoryError in LanczosState by way of SpectralKMeans
> ---------------------------------------------------------
>
> Key: MAHOUT-986
> URL: https://issues.apache.org/jira/browse/MAHOUT-986
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.6
> Environment: Ubuntu 11.10 (64-bit)
> Reporter: Shannon Quinn
> Assignee: Shannon Quinn
> Priority: Minor
> Fix For: 0.7
>
> Attachments: MAHOUT-986.patch, MAHOUT-986.patch
>
>
> Dan Brickley and I have been testing SpectralKMeans with a dbpedia dataset (
> http://danbri.org/2012/spectral/dbpedia/ ); effectively, a graph with
> 4,192,499 nodes. Not surprisingly, the LanczosSolver throws an
> OutOfMemoryError when it attempts to instantiate a DenseMatrix of dimensions
> 4192499-by-4192499 (~17.5 trillion double-precision floating point values).
> Here's the full stack trace:
> {quote}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:50)
> at org.apache.mahout.math.decomposer.lanczos.LanczosState.<init>(LanczosState.java:45)
> at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:146)
> at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:86)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.main(SpectralKMeansDriver.java:53)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {quote}
> Obviously SKM needs a more sustainable and memory-efficient way of performing
> an eigen-decomposition of the graph Laplacian. For those more knowledgeable
> about Mahout's linear solvers than I am: can the Lanczos parameters be tweaked
> to avoid the need for a full DenseMatrix? Or should SKM move to SSVD instead?
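
On the Lanczos question above: as a sketch (plain Java, not Mahout's solver),
the textbook Lanczos recurrence only ever stores k+1 basis vectors of length n
plus a k-by-k tridiagonal matrix, never an n-by-n DenseMatrix. The tiny dense
test matrix below stands in for the (sparse, distributed) graph Laplacian:

{code:java}
import java.util.Arrays;
import java.util.Random;

// Toy Lanczos iteration -- a sketch, not Mahout's implementation. The point is
// the memory profile: k+1 basis vectors of length n and two short arrays for
// the tridiagonal matrix, never an n-by-n dense matrix.
public class LanczosSketch {

  // Dense matrix-vector product; in SpectralKMeans this would be a sparse,
  // distributed multiply against the graph Laplacian.
  static double[] multiply(double[][] a, double[] v) {
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      double s = 0.0;
      for (int j = 0; j < v.length; j++) {
        s += a[i][j] * v[j];
      }
      out[i] = s;
    }
    return out;
  }

  static double dot(double[] x, double[] y) {
    double s = 0.0;
    for (int i = 0; i < x.length; i++) {
      s += x[i] * y[i];
    }
    return s;
  }

  public static void main(String[] args) {
    int n = 6; // tiny here; ~4.2 million in the dbpedia case
    int k = 4; // desired rank -- the only dimension that needs to stay small

    // Random symmetric test matrix.
    Random rnd = new Random(42);
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j <= i; j++) {
        a[i][j] = a[j][i] = rnd.nextDouble();
      }
    }

    double[][] basis = new double[k + 1][n]; // k+1 vectors of length n
    double[] alpha = new double[k];          // tridiagonal: diagonal
    double[] beta = new double[k + 1];       // tridiagonal: off-diagonal

    // Normalized random starting vector.
    double[] start = new double[n];
    for (int i = 0; i < n; i++) {
      start[i] = rnd.nextDouble();
    }
    double norm = Math.sqrt(dot(start, start));
    for (int i = 0; i < n; i++) {
      basis[0][i] = start[i] / norm;
    }

    // Three-term recurrence: w = A*q_j - alpha_j*q_j - beta_j*q_{j-1}.
    for (int j = 0; j < k; j++) {
      double[] w = multiply(a, basis[j]);
      alpha[j] = dot(w, basis[j]);
      for (int i = 0; i < n; i++) {
        w[i] -= alpha[j] * basis[j][i];
        if (j > 0) {
          w[i] -= beta[j] * basis[j - 1][i];
        }
      }
      beta[j + 1] = Math.sqrt(dot(w, w));
      for (int i = 0; i < n; i++) {
        basis[j + 1][i] = w[i] / beta[j + 1];
      }
    }

    // Eigenvalues of the small tridiagonal matrix (alpha, beta) approximate
    // the extreme eigenvalues of A; that is a trivially small problem.
    System.out.println("alpha = " + Arrays.toString(alpha));
    System.out.println("beta  = " + Arrays.toString(Arrays.copyOfRange(beta, 1, k + 1)));
  }
}
{code}

This is only the textbook recurrence (no re-orthogonalization), but it shows
that nothing in the method itself requires an n-by-n allocation.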
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira