[
https://issues.apache.org/jira/browse/MAHOUT-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664342#comment-13664342
]
Jakub commented on MAHOUT-1226:
-------------------------------
I was working with this data:
http://student.agh.edu.pl/~pjakub/bigmatrix/
It's a small (50 000 x 80 000) matrix for natural language processing of
Polish text, prepared for SSVD.
My setup is a single node on Ubuntu 13.04; the Hadoop configuration will
follow in an attachment.
I'm running my job like this:
mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3 --input
${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/
Other Mahout parameters are left at their defaults; only Hadoop was tuned.
I intend to use it on much bigger data on a cluster, but first I wanted to
tune it on small data.
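In case it helps reproduce this: the io.sort.mb tuning can also be passed on
the command line instead of via mapred-site.xml. A sketch, assuming the SSVD
driver forwards Hadoop generic options through ToolRunner (so the -D option
must come before the job arguments):

mahout ssvd -Dio.sort.mb=1024 \
  --rank 400 --computeU true --computeV true --reduceTasks 3 \
  --input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/

Values up to ~1100 work for me; anything above that hits the exception quoted
below.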
> mahout ssvd Bt-job bug
> ----------------------
>
> Key: MAHOUT-1226
> URL: https://issues.apache.org/jira/browse/MAHOUT-1226
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.7
> Environment: mahout-0.7
> hadoop-0.20.205.0
> Reporter: Jakub
> Attachments: core-site.xml, hdfs-site.xml, mapred-site.xml
>
>
> When using the mahout ssvd job, the Bt-job creates lots of spills to disk.
> These can be minimized by tuning the Hadoop io.sort.mb parameter.
> However, when io.sort.mb is bigger than ~1100 (e.g. 1500), I get this
> exception:
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper$1.collect(BtJob.java:261)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper$1.collect(BtJob.java:255)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockAccumulator.flushBlock(SparseRowBlockAccumulator.java:65)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockAccumulator.collect(SparseRowBlockAccumulator.java:75)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.map(BtJob.java:158)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.map(BtJob.java:102)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.RuntimeException: next value iterator failed
>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:166)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$OuterProductCombiner.reduce(BtJob.java:322)
>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$OuterProductCombiner.reduce(BtJob.java:302)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>     at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1502)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1436)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>     at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.readFields(SparseRowBlockWritable.java:60)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
>     ... 7 more
> By changing this value I've already managed to reduce the number of spills
> from 100 (with the default io.sort.mb value) to 10, and disk usage dropped
> from around 7 GB for my small data set to around 900 MB, so fixing this
> issue might bring big performance improvements.
> I've got lots of free RAM, so this is not an out-of-memory issue.
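> For reference, the innermost frames show where the read dies:
> Varint.readUnsignedVarInt() calls DataInputStream.readByte(), which throws
> EOFException once the record bytes run out early. A minimal sketch of that
> failure mode, not Mahout code, just exercising the public
> org.apache.mahout.math.Varint API:
>
> import java.io.ByteArrayInputStream;
> import java.io.ByteArrayOutputStream;
> import java.io.DataInputStream;
> import java.io.DataOutputStream;
> import org.apache.mahout.math.Varint;
>
> public class VarintEofDemo {
>   public static void main(String[] args) throws Exception {
>     // Write one varint (300 encodes to two bytes) ...
>     ByteArrayOutputStream out = new ByteArrayOutputStream();
>     Varint.writeUnsignedVarInt(300, new DataOutputStream(out));
>     byte[] full = out.toByteArray();
>     // ... then hand the reader a stream that is one byte short, the same
>     // situation the combiner's value iterator apparently runs into.
>     DataInputStream in = new DataInputStream(
>         new ByteArrayInputStream(full, 0, full.length - 1));
>     Varint.readUnsignedVarInt(in); // throws java.io.EOFException
>   }
> }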