[
https://issues.apache.org/jira/browse/MAHOUT-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990318#comment-12990318
]
Sebastian Schelter commented on MAHOUT-577:
-------------------------------------------
RowSimilarityJob has the nice feature that it only computes similarities
for rows that have at least one element in common (i.e. there is at least one
column in which both rows have an entry). It tries to avoid comparing each row
with every other row, so I'd say it is meant to work on sparse matrices only. On
dense matrices it will be slower than the naive approach of comparing each row
with every other row, so it should not be used as described in this issue.
I agree with you that a lot of small tweaks might be applicable and that
intelligent sampling techniques could help a lot, depending on the use case.
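
To illustrate why sparsity matters here, below is a minimal Python sketch of the
general idea of pairing only rows that share a column. It is not Mahout's actual
implementation; the function names (cooccurring_pairs, emitted_pairs_per_column)
and the toy data are made up for this example, and the per-column pair count is
a simplified model of the work involved.

```python
from collections import defaultdict
from itertools import combinations


def cooccurring_pairs(rows):
    """Return the set of row pairs that share at least one column.

    `rows` maps a row id to the set of column ids that hold an entry.
    (Hypothetical helper, for illustration only.)
    """
    # Invert the matrix: column id -> ids of the rows with an entry in it.
    rows_per_column = defaultdict(list)
    for row_id, columns in rows.items():
        for column in columns:
            rows_per_column[column].append(row_id)

    pairs = set()
    for row_ids in rows_per_column.values():
        # Each column contributes every pair of rows that appear in it.
        for a, b in combinations(sorted(row_ids), 2):
            pairs.add((a, b))
    return pairs


def emitted_pairs_per_column(entries_in_column):
    """Number of row pairs generated by one column with n entries: n(n-1)/2."""
    n = entries_in_column
    return n * (n - 1) // 2


# Toy sparse matrix: rows 0 and 1 share column 1, rows 2 and 3 share column 5.
sparse = {0: {0, 1}, 1: {1}, 2: {5}, 3: {5, 6}}
print(cooccurring_pairs(sparse))

# Pairs produced by a single column if all 146682 rows had an entry in it.
print(emitted_pairs_per_column(146682))
```

On the toy sparse input only two of the six possible row pairs are ever
generated. If a column were fully populated for the 146682 rows mentioned in the
issue, that single column alone would yield on the order of 10^10 pairs, which is
why the approach degrades badly on dense input.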
> RowSimilarityJob hangs during CooccurrencesMapper
> -------------------------------------------------
>
> Key: MAHOUT-577
> URL: https://issues.apache.org/jira/browse/MAHOUT-577
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Environment: Linux Debian 5.0.5, 12GB Ram, Hadoop 20.3 installation
> Reporter: Maya Hristakeva
> Fix For: 0.5
>
>
> Hello,
> When trying to run a RowSimilarityJob on a matrix ( 146682 x 138351 ), the
> job gets through the RowWeightMapper and WeightedOccurrencesPerColumnReducer,
> and hangs during the CooccurrencesMapper although it shows that the map tasks
> are 100% complete.
> The command I use to run the job is:
> hadoop jar mahout-core-0.4-job.jar
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob
> -Dmapred.input.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaCompressedDocumentsMatrix
>
> -Dmapred.output.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaDocumentSimilarityMatrix
> -Dmapred.reduce.tasks=8 -Dmapred.map.tasks=200
> -Dmapred.job.name=LDA_ROW_SIMILARITY_TEST --tempDir
> /user/maya.hristakeva/temp/lda/5 --numberOfColumns 138351
> --similarityClassname
> org.apache.mahout.math.hadoop.similarity.vector.DistributedEuclideanDistanceVectorSimilarity
> --maxSimilaritiesPerRow 10
> And the output of the mappers, which are 100% complete but hanging, is:
> syslog logs
> 01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: bufstart =
> 29085149; bufend = 39038598; bufvoid = 99614720
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: kvstart =
> 65461; kvend = 327605; length = 327680
> 2011-01-05 18:30:06,241 INFO org.apache.hadoop.mapred.MapTask: Finished spill
> 94
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: Spilling map
> output: record full = true
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: bufstart =
> 39038598; bufend = 48983989; bufvoid = 99614720
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: kvstart =
> 327605; kvend = 262068; length = 327680
> 2011-01-05 18:30:14,528 INFO org.apache.hadoop.mapred.MapTask: Finished spill
> 95
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: Spilling map
> output: record full = true
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: bufstart =
> 48983989; bufend = 58929384; bufvoid = 99614720
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: kvstart =
> 262068; kvend = 196531; length = 327680
> 2011-01-05 18:30:22,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill
> 96
> .
> .
> .
> This problem does not occur when I use a toy matrix of 100 x 100, but once I
> give it the original matrix of ..... the problem is always reproducible.
> Any ideas on what could be causing this?
> Thanks,
> Maya Hristakeva