[ https://issues.apache.org/jira/browse/MAHOUT-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Traupman updated MAHOUT-666:
-------------------------------------
Status: Patch Available (was: Open)
Here is a patch for this issue. I added a new configuration property that
enables eager deletion of temp files. If this property is set to true,
times(Vector) and timesSquared(Vector) delete their temp files before
returning the resulting vector. If the property is left unset, it defaults to
false, which preserves the old behavior for backward compatibility.
I did not change the transpose() or times(Matrix) methods, since it seems
you'll generally want to keep around a matrix result.
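To illustrate the opt-in cleanup described above, here is a minimal sketch. It is
not the code from the attached patch: the property name
"example.eager.temp.delete", the class EagerTempCleanup, and the method
timesWithCleanup are all hypothetical, and it uses the local filesystem through
java.nio.file (Java 8 or later) rather than the Hadoop FileSystem API that
DistributedSparseMatrix works against.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

public class EagerTempCleanup {
    // Hypothetical property name; the key used by the actual patch is not given here.
    static final String EAGER_DELETE_PROPERTY = "example.eager.temp.delete";

    // Unset means false, so the old keep-the-files behavior stays the default.
    static boolean eagerDeleteEnabled() {
        return Boolean.parseBoolean(System.getProperty(EAGER_DELETE_PROPERTY, "false"));
    }

    // Deletes a directory tree, children before parents, so the directory
    // itself is removed and not only the files inside it.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Stand-in for times(Vector): writes the product to a temp directory,
    // reads it back once, then optionally deletes the directory before returning.
    static double[] timesWithCleanup(double[][] rows, double[] v, Path tempDir)
            throws IOException {
        Files.createDirectories(tempDir);
        Path part = tempDir.resolve("part-00000");

        List<String> lines = new ArrayList<>();
        for (double[] row : rows) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            lines.add(Double.toString(dot));
        }
        Files.write(part, lines);

        // The result is read exactly once; after this the temp files are dead weight.
        List<String> stored = Files.readAllLines(part);
        double[] result = new double[stored.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = Double.parseDouble(stored.get(i));
        }

        if (eagerDeleteEnabled()) {
            deleteRecursively(tempDir);
        }
        return result;
    }
}
```

With the property unset, the temp directory is still there after the call, as
before. With it set to true, the directory is gone by the time the vector is
returned, and the returned values are the same in both cases.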
I added two unit tests to verify both the old and the new behavior for both
methods and to check that the change does not affect the results of the
computation.
All unit tests pass.
> DistributedSparseMatrix should clean up after itself when doing times(Vector)
> and timesSquared(Vector)
> ------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-666
> URL: https://issues.apache.org/jira/browse/MAHOUT-666
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.5
> Environment: Linux x86_64 2.6.18, Mac OS 10.6 64-bit, Hadoop 0.20.2,
> Java 1.6
> Reporter: Jonathan Traupman
> Priority: Minor
> Fix For: 0.5
>
> Attachments: mahout-666.patch
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> The directories created during the times() and timesSquared() methods in
> DistributedSparseMatrix leave behind a lot of cruft. The individual files
> are tagged with deleteOnExit, but the directories are not. Also, by not
> deleting them until JVM exit, a job that does repeated matrix/vector
> multiplies, like DistributedLanczosSolver, creates a lot of temp files that
> stick around for the whole run, even though the results they contain are read
> once and then never again.
> Our cluster admins enforce both file count and size quotas, and because 5 temp
> files/directories are created on each iteration of DistributedLanczosSolver,
> we're constantly bumping into the quota with large SVDs.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira