[ 
https://issues.apache.org/jira/browse/MAHOUT-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978416#action_12978416
 ] 

Joris Geessels commented on MAHOUT-577:
---------------------------------------

Could you give an indication of the size of the input in itemUserMatrix? Since 
it works for a toy example and the map tasks keep spitting out bytes, it seems 
to me that Sebastian's guess is right and the data is indeed dense. At this 
point, that looks like the only plausible explanation to me. However, I don't 
have an explanation for why the problem also occurs with the 146682x5 matrix. 
It might help to start with a smaller subset of the data and see if the problem 
still occurs. I can't think of anything better than that for the moment.
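
A rough way to gauge what "dense" would cost here is sketched below in Python. 
It rests on an assumption that is not verified against the 0.4 source: that 
CooccurrencesMapper emits one record for every pair of non-zero entries sharing 
a column (self-pairs included). The function name and the fully dense 146682x5 
input are illustrative only.

```python
def cooccurrence_records(nonzeros_per_column):
    """Estimate emitted records, ASSUMING one record per unordered pair of
    non-zero entries in each column (self-pairs included)."""
    return sum(n * (n + 1) // 2 for n in nonzeros_per_column)

# Toy case from the report: a 100 x 100 matrix, taken here as fully dense.
toy = cooccurrence_records([100] * 100)

# Hypothetical fully dense 146682 x 5 matrix: 5 columns, 146682 non-zeros each.
dense_small = cooccurrence_records([146682] * 5)

print(toy)
print(dense_small)
```

If the assumption holds, the record count grows with the square of the number 
of non-zeros per column, so running the same estimate on a smaller subset of 
the rows should show quickly whether density is the culprit.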

> RowSimilarityJob hangs during CooccurrencesMapper
> -------------------------------------------------
>
>                 Key: MAHOUT-577
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-577
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>         Environment: Linux Debian 5.0.5, 12GB Ram, Hadoop 20.3 installation 
>            Reporter: Maya Hristakeva
>            Priority: Blocker
>
> Hello,
> When trying to run a RowSimilarityJob on a matrix ( 146682 x 138351 ), the 
> job gets through the RowWeightMapper and WeightedOccurrencesPerColumnReducer, 
> and hangs during the CooccurrencesMapper although it shows that the map tasks 
> are 100% complete. 
> The command I use to run the job is: 
> hadoop jar mahout-core-0.4-job.jar 
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob 
> -Dmapred.input.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaCompressedDocumentsMatrix
>  
> -Dmapred.output.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaDocumentSimilarityMatrix
>  -Dmapred.reduce.tasks=8 -Dmapred.map.tasks=200 
> -Dmapred.job.name=LDA_ROW_SIMILARITY_TEST --tempDir 
> /user/maya.hristakeva/temp/lda/5 --numberOfColumns 138351 
> --similarityClassname 
> org.apache.mahout.math.hadoop.similarity.vector.DistributedEuclideanDistanceVectorSimilarity
>  --maxSimilaritiesPerRow 10
> And the output of the mappers, which are 100% complete but hanging, is: 
> syslog logs
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 29085149; bufend = 39038598; bufvoid = 99614720
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 65461; kvend = 327605; length = 327680
> 2011-01-05 18:30:06,241 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 94
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: Spilling map 
> output: record full = true
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 39038598; bufend = 48983989; bufvoid = 99614720
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 327605; kvend = 262068; length = 327680
> 2011-01-05 18:30:14,528 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 95
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: Spilling map 
> output: record full = true
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 48983989; bufend = 58929384; bufvoid = 99614720
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 262068; kvend = 196531; length = 327680
> 2011-01-05 18:30:22,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 96
> .
> .
> .
> This problem does not occur when I use a toy matrix of 100 x 100, but once I 
> give it the original matrix of ..... the problem is always reproducible. 
> Any ideas on what could be causing this? 
> Thanks, 
> Maya Hristakeva
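
The spill figures in the quoted syslog excerpt can be turned into per-spill 
sizes with a little arithmetic. The sketch below assumes that bufstart/bufend 
are byte offsets into a circular buffer of size bufvoid, and that kvstart/kvend 
are record indices into a circular buffer of size length. The helper name 
ring_distance is hypothetical; the numbers are copied from the log above.

```python
def ring_distance(start, end, capacity):
    """Distance from start to end in a circular buffer of the given capacity."""
    return (end - start) % capacity

BUFVOID = 99614720   # "bufvoid" from the log
KV_LENGTH = 327680   # "length" from the log

# The three spill reports shown in the excerpt, in order.
spills = [
    {"bufstart": 29085149, "bufend": 39038598, "kvstart": 65461, "kvend": 327605},
    {"bufstart": 39038598, "bufend": 48983989, "kvstart": 327605, "kvend": 262068},
    {"bufstart": 48983989, "bufend": 58929384, "kvstart": 262068, "kvend": 196531},
]

results = []
for s in spills:
    nbytes = ring_distance(s["bufstart"], s["bufend"], BUFVOID)
    nrecords = ring_distance(s["kvstart"], s["kvend"], KV_LENGTH)
    results.append((nbytes, nrecords))
    print(nbytes, nrecords, round(nbytes / nrecords, 1))
```

Under those assumptions each spill holds roughly 262,000 records in a little 
under 10 MB, i.e. a large number of small records, which is consistent with a 
mapper that is still emitting output rather than one that is idle.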

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
