rowsimilarityjob doesn't clean its temp dir, and fails when seeing it again
---------------------------------------------------------------------------
Key: MAHOUT-834
URL: https://issues.apache.org/jira/browse/MAHOUT-834
Project: Mahout
Issue Type: Bug
Components: Integration
Reporter: Dan Brickley
Priority: Minor
If I do this:
    mahout rowsimilarity --input matrixified/matrix --output sims/ \
        --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
        --excludeSelfSimilarity
then clean my output,
    rm -rf sims/ # (though this step doesn't even seem needed)
and run the same command again:
    mahout rowsimilarity --input matrixified/matrix --output sims/ \
        --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
        --excludeSelfSimilarity
The temp files left over from the first run make a re-run impossible. The
second run fails with:
    Exception in thread "main"
    org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
    temp/weights already exists
Manually deleting the temp directory fixes this.
I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:
    mahout rowsimilarity --input matrixified/matrix --output sims/ \
        --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
        --excludeSelfSimilarity --tempDir tmp2/
Presumably something like HadoopUtil.delete(getConf(), tempDirPath) is needed
somewhere (and maybe an --overwrite option too)?
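Until the job cleans up after itself, the manual workaround can be scripted.
The following is only a rough sketch, not a fix for the job: it assumes the
temp directory lives on the local filesystem (as sims/ does in the commands
above) and defaults to "temp", the directory named in the exception. The
function names clean_temp_dir and run_rowsimilarity are made up for
illustration. On HDFS the rm call would need to be replaced by the equivalent
hadoop fs removal command.

```shell
#!/bin/sh
# Workaround sketch: remove the stale temp directory before (re)running.

# clean_temp_dir [DIR]
# Deletes DIR (default: "temp") if it exists. Assumes a local filesystem path.
clean_temp_dir() {
    dir="${1:-temp}"
    # Refuse the root directory so a bad argument cannot wipe the filesystem.
    if [ "$dir" = "/" ]; then
        echo "refusing to delete '$dir'" >&2
        return 1
    fi
    # Nothing to do when the directory is absent, so repeated calls are safe.
    if [ -d "$dir" ]; then
        rm -rf -- "$dir"
    fi
}

# run_rowsimilarity [DIR]
# Hypothetical wrapper: clean DIR, then run the command from this report
# with --tempDir pointing at the same directory.
run_rowsimilarity() {
    temp_dir="${1:-temp}"
    clean_temp_dir "$temp_dir" || return 1
    mahout rowsimilarity --input matrixified/matrix --output sims/ \
        --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
        --excludeSelfSimilarity --tempDir "$temp_dir"
}
```

With these definitions sourced into a shell, `run_rowsimilarity tmp2` would
clear tmp2 and then start the job, so a leftover tmp2/weights from an earlier
run should no longer trigger the FileAlreadyExistsException.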