[ https://issues.apache.org/jira/browse/MAHOUT-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664606#comment-13664606 ]

Dmitriy Lyubimov edited comment on MAHOUT-1226 at 5/22/13 10:21 PM:
--------------------------------------------------------------------

Use a smaller decomposition rank. Whatever you do, I bet 100 is more than 
reasonable. For LSA, the decomposition errors are dwarfed by variation in the 
training corpus itself; well before the inference gets more exact, the tail of 
the decomposition will likely start encoding nothing but noise in the data 
anyway. The running time also does not scale well with the -k parameter (it 
will remain roughly O(k^(3/2)) even if you manage to have enough task 
resources available). Bottom line: it scales much better with input size than 
with -k.
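
As a rough illustration only (the paths below are placeholders and the exact 
ssvd flag names should be checked against your Mahout build), a lower-rank 
run would look something like this:

    # hypothetical invocation sketch; input/output paths are made up
    # -k is the decomposition rank discussed above; ~100 is usually plenty
    # for LSA-style work, and a power iteration (-q 1) is a cheaper way to
    # tighten accuracy than raising -k further
    mahout ssvd \
      -i /path/to/input-matrix \
      -o /path/to/ssvd-output \
      -k 100 \
      -q 1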

But generally I agree: MapReduce doesn't lend itself well to blockwise matrix 
operations. It incurs sorting overhead for data redistribution, which is, in 
my opinion, where most of the overhead comes from. Bagel and GraphLab are much 
happier paths in that regard.
                
> mahout ssvd Bt-job bug
> ----------------------
>
>                 Key: MAHOUT-1226
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1226
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: mahout-0.7
> hadoop-0.20.205.0
>            Reporter: Jakub
>         Attachments: core-site.xml, hdfs-site.xml, mapred-site.xml
>
>
> When using the mahout ssvd job, the Bt-job creates lots of spills to disk.
> Those can be minimized by tuning the Hadoop io.sort.mb parameter.
> However, when io.sort.mb is larger than roughly 1100 (e.g. 1500), I get the
> following exception:
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper$1.collect(BtJob.java:261)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper$1.collect(BtJob.java:255)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockAccumulator.flushBlock(SparseRowBlockAccumulator.java:65)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockAccumulator.collect(SparseRowBlockAccumulator.java:75)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.map(BtJob.java:158)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.map(BtJob.java:102)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.RuntimeException: next value iterator failed
>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:166)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$OuterProductCombiner.reduce(BtJob.java:322)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$OuterProductCombiner.reduce(BtJob.java:302)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>     at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1502)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1436)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>     at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.readFields(SparseRowBlockWritable.java:60)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
>     ... 7 more
> By changing this value I've already managed to reduce the number of spills
> from 100 (with the default io.sort.mb value) to 10, and disk usage dropped
> from around 7 GB for my small data set to around 900 MB. Fixing this issue
> could bring big performance improvements.
> I've got lots of free RAM, so this is not an out-of-memory issue.
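
For reference, the io.sort.mb tuning described in the report above would 
normally go into mapred-site.xml or be passed as a generic Hadoop option on 
the command line. A minimal sketch, assuming the ssvd driver accepts -D 
properties through ToolRunner like other Mahout jobs (paths and the buffer 
size are placeholders, kept below the ~1100 MB region where the failure shows 
up):

    # hypothetical: raise the map-side sort buffer to cut down Bt-job spills
    # without crossing the point where the "Spill failed" / EOFException above
    # starts to appear
    mahout ssvd \
      -D io.sort.mb=1024 \
      -i /path/to/input-matrix \
      -o /path/to/ssvd-output \
      -k 100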

