[ 
https://issues.apache.org/jira/browse/MAHOUT-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134007#comment-13134007
 ] 

Dan Brickley commented on MAHOUT-834:
-------------------------------------

I've just gone back to run some tests. I composed a new command line:

mahout rowsimilarity --input matrixified/matrix --output sims_foo/ 
--numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
--excludeSelfSimilarity

...where sims_foo/ is a new home for output. However, because the previous run 
was from the same base directory, we still get "Exception in thread "main" 
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
temp/weights already exists".

This leads me to think that a cleaner pattern would be for these 'temp' files 
to be packaged inside the output directory. That seems quite common in other 
jobs: e.g. with clustering you get all the iterations in different subdirs, and 
when generating sparse vectors, the output directory contains something along 
these lines: "df-count dictionary.file-0 frequency.file-0 tf-vectors 
tfidf-vectors tokenized-documents wordcount". In both cases, the mess is 
constrained to live within the output dir.

To answer your question, "11/10/24 14:04:20 ERROR common.AbstractJob: 
Unexpected --overwrite while processing Job-Specific Options:" ... it seems 
rowsimilarity doesn't offer this option. Neither does the rowid job - my 
(mistakenly designed) patch for MAHOUT-839 also tried to sneak in --overwrite 
there.

As a user, it does seem counter-intuitive for any use of 'mahout rowsimilarity' 
in a given directory to create a temp/ right there, blocking any re-runs of 
that same-named job unless the intermediate files are wiped; perhaps I might 
want to run the job twice over the same data, using different similarity 
measures? For the reasons you give above, keeping the files around makes sense, 
but I'd suggest having them live beneath the output dir by default.
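
In the meantime, a workaround along the following lines might help. This is 
only a sketch, assuming local filesystem paths (as with the rm -rf in the 
original report) and assuming that a fresh --tempDir per output directory is 
enough to avoid the clash. The function names and the MAHOUT variable are 
hypothetical helpers, not part of Mahout. The temp dir is placed next to the 
output dir rather than inside it, on the assumption that Hadoop would refuse 
to write the final output into a directory that already exists.

```shell
# Hypothetical helper: name a temp directory after the output directory,
# e.g. "sims_foo/" -> "sims_foo.tmp".
temp_dir_for() {
  printf '%s.tmp\n' "${1%/}"
}

# Hypothetical wrapper: wipe stale intermediates for this output, then run
# the job with a temp dir that no other output directory shares.
# MAHOUT is an assumed override for the launcher; it defaults to "mahout".
run_rowsimilarity() {
  output=$1
  similarity=$2
  tmp=$(temp_dir_for "$output")
  rm -rf "$tmp"   # local filesystem only; HDFS paths need the Hadoop tools
  "${MAHOUT:-mahout}" rowsimilarity --input matrixified/matrix \
    --output "$output" --numberOfColumns 27684 \
    --similarityClassname "$similarity" \
    --excludeSelfSimilarity --tempDir "$tmp"
}

# Example: run_rowsimilarity sims_foo/ SIMILARITY_LOGLIKELIHOOD
```

With something like this, two runs over the same data with different output 
directories would each get their own intermediates, and a re-run into the same 
output directory would start from a clean temp dir.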

                
> rowsimilarityjob doesn't clean it's temp dir, and fails when seeing it again
> ----------------------------------------------------------------------------
>
>                 Key: MAHOUT-834
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-834
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>            Reporter: Dan Brickley
>            Priority: Minor
>
> If I do this:
> mahout rowsimilarity --input matrixified/matrix --output sims/ 
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
> --excludeSelfSimilarity
> then clean my output and rerun,
> rm -rf sims/ # (though this step doesn't even seem needed)
> then try again:
> mahout rowsimilarity --input matrixified/matrix --output sims/ 
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
> --excludeSelfSimilarity
> The temp files left from the first run make a re-run impossible - we get: 
> "Exception in thread "main" 
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
> temp/weights already exists".
> Manually deleting the temp directory fixes this.
> I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:
> mahout rowsimilarity --input matrixified/matrix --output sims/ 
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
> --excludeSelfSimilarity --tempDir tmp2/
> Presumably something like HadoopUtil.delete(getConf(),tempDirPath) is needed 
> somewhere?  (and maybe --overwrite too ?)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
