[ 
https://issues.apache.org/jira/browse/MAHOUT-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979540#action_12979540
 ] 

Sebastian Schelter commented on MAHOUT-577:
-------------------------------------------

Hi Maya,

RowSimilarityJob is definitely not suited for dense matrices. I'll try to 
summarize what I think is happening in detail here:

 * The original matrix is 150k x 5. In the first step of RowSimilarityJob an 
inverted index from columns to rows is created (or, mathematically speaking, 
the matrix is transposed).
 * The second step now works on a 5 x 150k matrix and maps out all cooccurring 
pairs of each row, which is n (n - 1) / 2 pairs per row, with n being the 
number of non-zero entries in that row.
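
The two steps above can be sketched on a toy matrix. This is only an 
illustrative Python sketch, not Mahout's actual MapReduce code: the data and 
the names invert and cooccurring_pairs are hypothetical and chosen just to 
show where the n (n - 1) / 2 term comes from.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy sparse matrix: row id -> {column id: value}.
matrix = {
    0: {0: 1.0, 2: 3.0},
    1: {0: 2.0, 1: 1.0},
    2: {0: 4.0, 2: 5.0},
}

def invert(rows):
    """Step 1: build the column -> {row: value} index (the transpose)."""
    index = defaultdict(dict)
    for row_id, entries in rows.items():
        for col_id, value in entries.items():
            index[col_id][row_id] = value
    return dict(index)

def cooccurring_pairs(index):
    """Step 2: for each row of the transposed matrix, emit every pair of
    its non-zero entries, i.e. n * (n - 1) / 2 pairs for n entries."""
    for col_id, entries in index.items():
        for row_a, row_b in combinations(sorted(entries), 2):
            yield col_id, row_a, row_b

index = invert(matrix)
pairs = list(cooccurring_pairs(index))
print(len(pairs))
```

Column 0 has three non-zero entries and contributes 3 pairs, column 2 has two 
and contributes 1, and column 1 has one and contributes none.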
 
For this matrix, where every one of the 5 rows has n = 150000 non-zero 
entries, this results in 5 * 150000 * (150000 - 1) / 2 ~ 56 billion pairs 
being mapped out.
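
The arithmetic can be checked with a few lines of Python. The helper name 
cooccurrence_pairs is made up for this check and assumes every row of the 
transposed matrix has the same number of non-zero entries.

```python
def cooccurrence_pairs(num_rows: int, nonzeros_per_row: int) -> int:
    """Pairs mapped out when each row has the same number of non-zeros."""
    n = nonzeros_per_row
    # n * (n - 1) is always even, so integer division is exact.
    return num_rows * n * (n - 1) // 2

dense_pairs = cooccurrence_pairs(5, 150_000)
print(dense_pairs)  # 56249625000
```

Because the count grows quadratically in n, a few very dense rows are enough 
to dominate the work of the mappers.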

> RowSimilarityJob hangs during CooccurrencesMapper
> -------------------------------------------------
>
>                 Key: MAHOUT-577
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-577
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>         Environment: Linux Debian 5.0.5, 12GB Ram, Hadoop 20.3 installation 
>            Reporter: Maya Hristakeva
>             Fix For: 0.5
>
>
> Hello,
> When trying to run a RowSimilarityJob on a matrix ( 146682 x 138351 ), the 
> job gets through the RowWeightMapper and WeightedOccurrencesPerColumnReducer, 
> and hangs during the CooccurrencesMapper although it shows that the map tasks 
> are 100% complete. 
> The command I use to run the job is: 
> hadoop jar mahout-core-0.4-job.jar 
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob 
> -Dmapred.input.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaCompressedDocumentsMatrix
>  
> -Dmapred.output.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaDocumentSimilarityMatrix
>  -Dmapred.reduce.tasks=8 -Dmapred.map.tasks=200 
> -Dmapred.job.name=LDA_ROW_SIMILARITY_TEST --tempDir 
> /user/maya.hristakeva/temp/lda/5 --numberOfColumns 138351 
> --similarityClassname 
> org.apache.mahout.math.hadoop.similarity.vector.DistributedEuclideanDistanceVectorSimilarity
>  --maxSimilaritiesPerRow 10
> And the output of the mappers which are 100% complete, but hanging is: 
> syslog logs
> 01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 29085149; bufend = 39038598; bufvoid = 99614720
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 65461; kvend = 327605; length = 327680
> 2011-01-05 18:30:06,241 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 94
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: Spilling map 
> output: record full = true
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 39038598; bufend = 48983989; bufvoid = 99614720
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 327605; kvend = 262068; length = 327680
> 2011-01-05 18:30:14,528 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 95
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: Spilling map 
> output: record full = true
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 48983989; bufend = 58929384; bufvoid = 99614720
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 262068; kvend = 196531; length = 327680
> 2011-01-05 18:30:22,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 96
> .
> .
> .
> This problem does not occur when I use a toy matrix of 100 x 100, but once I 
> give it the original matrix of ..... the problem is always reproducible. 
> Any ideas on what could be causing this? 
> Thanks, 
> Maya Hristakeva

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
