[ https://issues.apache.org/jira/browse/MAHOUT-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664465#comment-13664465 ]

Jakub commented on MAHOUT-1226:
-------------------------------

yes, I'm able to solve it without any optimizations, but:
- it takes 32 minutes (!)
- it makes lots of spills (8 GB)

After a small tuning of parameters (io.sort.mb), the time is down to 17
minutes and the spills are down to 900 MB.
I hope to get it under 10 minutes on one node with no spills (spills take a
lot of time); then I'll introduce a cluster and run it on bigger data sets.
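For reference, here is a minimal sketch of how the sort buffer can be raised
when submitting a hadoop-0.20.x job programmatically. The 1024 MB buffer, the
2048 MB heap, and the job name are illustrative assumptions, not values from
this report:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortBufferDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // io.sort.mb is the in-memory buffer (in MB) that holds map output
    // before it is sorted and spilled; a larger buffer means fewer spills.
    conf.setInt("io.sort.mb", 1024);                 // illustrative value
    // The buffer lives in the map task heap, so the heap must be big enough.
    conf.set("mapred.child.java.opts", "-Xmx2048m"); // illustrative value
    Job job = new Job(conf, "spill-tuning-demo");    // hypothetical job name
    // ... set mapper/reducer/input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}

The same settings can also be passed on the command line of a ToolRunner-based
driver via -Dio.sort.mb=1024, assuming the driver honors generic options.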

Using libraries like ssvdlibc or gensim solves these tiny problems in under 5
minutes, but later I'll need to solve this for much bigger data.
                
> mahout ssvd Bt-job bug
> ----------------------
>
>                 Key: MAHOUT-1226
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1226
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: mahout-0.7
> hadoop-0.20.205.0
>            Reporter: Jakub
>         Attachments: core-site.xml, hdfs-site.xml, mapred-site.xml
>
>
> when using the mahout ssvd job, Bt-job creates lots of spills to disk.
> Those can be minimized by tuning the hadoop io.sort.mb parameter.
> However, when io.sort.mb is bigger than ~1100 (e.g. 1500), I get this
> exception:
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper$1.collect(BtJob.java:261)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper$1.collect(BtJob.java:255)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockAccumulator.flushBlock(SparseRowBlockAccumulator.java:65)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockAccumulator.collect(SparseRowBlockAccumulator.java:75)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.map(BtJob.java:158)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.map(BtJob.java:102)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.RuntimeException: next value iterator failed
>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:166)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$OuterProductCombiner.reduce(BtJob.java:322)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$OuterProductCombiner.reduce(BtJob.java:302)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>     at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1502)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1436)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>     at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.readFields(SparseRowBlockWritable.java:60)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
>     ... 7 more
> By changing this value I've already managed to reduce the number of spills
> from 100 (for the default io.sort.mb value) to 10, and disk usage dropped
> from around 7 GB for my small data set to around 900 MB. Fixing this issue
> might bring big performance improvements.
> I've got lots of free RAM, so this is not an out-of-memory issue.
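To make the "Caused by" chain above concrete: the EOFException comes from
DataInputStream.readByte while Varint.readUnsignedVarInt is decoding a
SparseRowBlockWritable, i.e. the deserializer runs off the end of the spill
buffer mid-record. A self-contained sketch of that failure mode, using my own
minimal varint reader rather than Mahout's actual code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class VarintEofDemo {

  // Minimal unsigned varint decoder, the same idea as
  // org.apache.mahout.math.Varint.readUnsignedVarInt (simplified here).
  static int readUnsignedVarInt(DataInput in) throws IOException {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = in.readByte();            // EOFException if the stream is truncated
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeByte(0x80);            // continuation bit set, next byte missing
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    readUnsignedVarInt(in);         // throws java.io.EOFException, as above
  }
}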

