[ https://issues.apache.org/jira/browse/MAHOUT-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195934#comment-13195934 ]
Suneel Marthi commented on MAHOUT-834:
--------------------------------------
1. What was the outcome of this thread? I am having the same issue and had
opened a Jira ticket, MAHOUT-964, well before I saw this thread. I agree
that RowSimilarityJob needs an overwrite option to clean up the output and
temp folders left by a previous run.
2. Another concern is what happens when the similarity measure specified is
not a valid one, for example:
mahout rowsimilarity --input matrixified/matrix --output sims_foo/
--numberOfColumns 27684 --similarityClassname SIMILARITY_COS
--excludeSelfSimilarity
In that case RowSimilarityJob should exit immediately instead of going on to
run the Normalizer, CooccurrencesMapper and UnsymmetrifyMapper.
3. The 'excludeSelfSimilarity' option has to be given an explicit value of
'true' or 'false'; otherwise it always defaults to 'false', as in this run:
mahout rowsimilarity --input matrixified/matrix --output sims_foo/
--numberOfColumns 27684 --similarityClassname SIMILARITY_COSINE
--excludeSelfSimilarity
This is inconsistent with the way the --overwrite option works. Merely
specifying --excludeSelfSimilarity on the command line does not set it to
'true'.
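
Points 2 and 3 could be handled by checking the arguments before any job is
launched. The following is only a rough sketch of that idea, not Mahout code:
the class RowSimilarityArgsSketch and both of its methods are hypothetical,
and the set of measure names holds only the two names that appear in this
thread (the real list is longer, and the job may also accept other forms of
--similarityClassname).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Mahout: validates arguments up front. */
final class RowSimilarityArgsSketch {

  // Assumption: only the two measure names mentioned in this thread are listed.
  private static final Set<String> KNOWN_MEASURES =
      new HashSet<>(Arrays.asList("SIMILARITY_COSINE", "SIMILARITY_LOGLIKELIHOOD"));

  private RowSimilarityArgsSketch() {}

  /** Fails fast, before any work is scheduled, if the name is not recognised. */
  static String requireKnownMeasure(String name) {
    if (name == null || !KNOWN_MEASURES.contains(name)) {
      throw new IllegalArgumentException("Unknown similarity measure: " + name);
    }
    return name;
  }

  /**
   * Returns true for a bare flag (nothing after it, or another option after it),
   * parses an explicit "true"/"false" value, and returns false if the flag is absent.
   */
  static boolean flagValue(String[] args, String flag) {
    for (int i = 0; i < args.length; i++) {
      if (!args[i].equals(flag)) {
        continue;
      }
      boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("--");
      if (!hasValue) {
        return true; // bare flag behaves like --overwrite
      }
      String value = args[i + 1];
      if (value.equalsIgnoreCase("true")) {
        return true;
      }
      if (value.equalsIgnoreCase("false")) {
        return false;
      }
      throw new IllegalArgumentException("Expected true or false after " + flag + ": " + value);
    }
    return false; // flag not given at all
  }
}
```

With a check like this, SIMILARITY_COS would be rejected before the first
phase starts, and a bare --excludeSelfSimilarity would be read as 'true'.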
> rowsimilarityjob doesn't clean it's temp dir, and fails when seeing it again
> ----------------------------------------------------------------------------
>
> Key: MAHOUT-834
> URL: https://issues.apache.org/jira/browse/MAHOUT-834
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Reporter: Dan Brickley
> Priority: Minor
>
> If I do this:
> mahout rowsimilarity --input matrixified/matrix --output sims/
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD
> --excludeSelfSimilarity
> then clean my output and rerun,
> rm -rf sims/ # (though this step doesn't even seem needed)
> then try again:
> mahout rowsimilarity --input matrixified/matrix --output sims/
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD
> --excludeSelfSimilarity
> The temp files left from the first run make a re-run impossible - we get:
> "Exception in thread "main"
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> temp/weights already exists".
> Manually deleting the temp directory fixes this.
> I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:
> mahout rowsimilarity --input matrixified/matrix --output sims/
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD
> --excludeSelfSimilarity --tempDir tmp2/
> Presumably something like HadoopUtil.delete(getConf(),tempDirPath) is needed
> somewhere? (and maybe --overwrite too ?)
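
The cleanup suggested at the end of the description could behave roughly as
sketched below. This is a local-filesystem analogue written with java.nio
only, so that it runs without a cluster; it is not the Hadoop or Mahout API,
and the class TempDirCleanupSketch and its prepare method are hypothetical
names. In the real job the equivalent step would presumably go through the
Hadoop FileSystem, for instance via the HadoopUtil.delete call mentioned in
the description.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical sketch: clear a leftover directory only when overwrite is requested. */
final class TempDirCleanupSketch {

  private TempDirCleanupSketch() {}

  /**
   * Does nothing if dir is absent. If dir exists, deletes it recursively when
   * overwrite is true, and otherwise refuses to continue.
   */
  static void prepare(Path dir, boolean overwrite) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    if (!overwrite) {
      throw new FileAlreadyExistsException(dir.toString());
    }
    Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file); // remove files first
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path d, IOException e) throws IOException {
        if (e != null) {
          throw e;
        }
        Files.delete(d); // then the now-empty directory
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

Calling prepare on both the temp and the output directory before the first
phase would let a re-run succeed when --overwrite is given, and fail early
with a clear message when it is not.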