[https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589902#comment-13589902]
Sara Del Río García commented on MAHOUT-145:
--------------------------------------------
Hello Deneche A. Hakim,
I'm testing the partial implementation of Random Forests on Hadoop 2.0.0-cdh4.1.1.
I'm trying to modify the algorithm; all I do is add more information to the
leaves of the tree. Currently a leaf stores only the label, and I want to add
one more field:
@Override
public void readFields(DataInput in) throws IOException {
  label = in.readDouble();
  leafWeight = in.readDouble();
}

@Override
protected void writeNode(DataOutput out) throws IOException {
  out.writeDouble(label);
  out.writeDouble(leafWeight);
}
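For reference, the rule such a change must obey is that readFields consumes exactly the fields, in exactly the order, that writeNode produced. Below is a minimal, self-contained sketch of that symmetry, using plain java.io streams instead of Hadoop's Writable machinery; LeafSketch is a hypothetical stand-in for Mahout's Leaf, not the actual class:

```java
import java.io.*;

// Hypothetical stand-in for org.apache.mahout.classifier.df.node.Leaf,
// round-tripped through plain java.io streams (no Hadoop dependency).
public class LeafSketch {
  double label;
  double leafWeight; // the extra field added in the patch above

  // Write side: two doubles, in a fixed order.
  void writeNode(DataOutput out) throws IOException {
    out.writeDouble(label);
    out.writeDouble(leafWeight);
  }

  // Read side: must mirror writeNode field for field.
  void readFields(DataInput in) throws IOException {
    label = in.readDouble();
    leafWeight = in.readDouble();
  }

  public static void main(String[] args) throws IOException {
    LeafSketch written = new LeafSketch();
    written.label = 1.0;
    written.leafWeight = 0.25;

    // Round-trip through an in-memory buffer.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    written.writeNode(new DataOutputStream(buf));

    LeafSketch read = new LeafSketch();
    read.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    System.out.println(read.label + " " + read.leafWeight); // 1.0 0.25
  }
}
```

If a round trip like this succeeds in isolation but the job still fails, one possibility is that the bytes being read were not produced by the patched writeNode (for example, output left over from a run with the old one-field format, or a stale jar on the cluster).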
And I get the following error:
13/02/27 06:53:27 INFO mapreduce.BuildForest: Partial Mapred implementation
13/02/27 06:53:27 INFO mapreduce.BuildForest: Building the forest...
13/02/27 06:53:27 INFO mapreduce.BuildForest: Weights Estimation: IR
13/02/27 06:53:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/02/27 06:53:39 INFO input.FileInputFormat: Total input paths to process : 1
13/02/27 06:53:39 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
13/02/27 06:53:39 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/27 06:53:39 INFO mapred.JobClient: Running job: job_201302270205_0013
13/02/27 06:53:40 INFO mapred.JobClient: map 0% reduce 0%
13/02/27 06:54:18 INFO mapred.JobClient: map 20% reduce 0%
13/02/27 06:54:42 INFO mapred.JobClient: map 40% reduce 0%
13/02/27 06:55:03 INFO mapred.JobClient: map 60% reduce 0%
13/02/27 06:55:26 INFO mapred.JobClient: map 70% reduce 0%
13/02/27 06:55:27 INFO mapred.JobClient: map 80% reduce 0%
13/02/27 06:55:49 INFO mapred.JobClient: map 100% reduce 0%
13/02/27 06:56:04 INFO mapred.JobClient: Job complete: job_201302270205_0013
13/02/27 06:56:04 INFO mapred.JobClient: Counters: 24
13/02/27 06:56:04 INFO mapred.JobClient: File System Counters
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of bytes read=0
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of bytes written=1828230
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of read operations=0
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of large read operations=0
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of write operations=0
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of bytes read=1381649
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of bytes written=1680
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of read operations=30
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of write operations=10
13/02/27 06:56:04 INFO mapred.JobClient: Job Counters
13/02/27 06:56:04 INFO mapred.JobClient: Launched map tasks=10
13/02/27 06:56:04 INFO mapred.JobClient: Data-local map tasks=10
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=254707
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient: Map-Reduce Framework
13/02/27 06:56:04 INFO mapred.JobClient: Map input records=20
13/02/27 06:56:04 INFO mapred.JobClient: Map output records=10
13/02/27 06:56:04 INFO mapred.JobClient: Input split bytes=1540
13/02/27 06:56:04 INFO mapred.JobClient: Spilled Records=0
13/02/27 06:56:04 INFO mapred.JobClient: CPU time spent (ms)=12070
13/02/27 06:56:04 INFO mapred.JobClient: Physical memory (bytes) snapshot=949579776
13/02/27 06:56:04 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8412340224
13/02/27 06:56:04 INFO mapred.JobClient: Total committed heap usage (bytes)=478412800
READ
nodetype: 0
Exception in thread "main" java.lang.IllegalStateException: java.io.EOFException
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:104)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:129)
    at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:96)
    at org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:312)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:246)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:200)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:270)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at java.io.DataInputStream.readDouble(DataInputStream.java:451)
    at org.apache.mahout.classifier.df.node.Leaf.readFields(Leaf.java:136)
    at org.apache.mahout.classifier.df.node.Node.read(Node.java:85)
    at org.apache.mahout.classifier.df.mapreduce.MapredOutput.readFields(MapredOutput.java:64)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
    ... 10 more
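This kind of EOFException can be reproduced in isolation: when a record contains fewer bytes than readFields tries to consume (here, one double where two are expected), DataInputStream.readDouble runs off the end of the stream. The following is only a sketch of that failure mode with a hypothetical class, not Mahout code:

```java
import java.io.*;

// Sketch of the failure mode: a record serialized with ONE double (old
// format: label only) being deserialized by code that expects TWO.
public class EofRepro {

  static String readTwoDoublesFrom(byte[] record) {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
    try {
      double label = in.readDouble();      // succeeds: 8 bytes available
      double leafWeight = in.readDouble(); // fails: the stream is exhausted
      return "ok: " + label + " " + leafWeight;
    } catch (EOFException e) {
      return "EOFException";
    } catch (IOException e) {
      return "IOException";
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeDouble(0.5); // old on-disk format: label only
    out.close();

    System.out.println(readTwoDoublesFrom(buf.toByteArray())); // EOFException
  }
}
```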
What is the problem?
Could you try writing something extra to the leaves of the tree yourself? Anything at all.
Thank you very much.
Best regards,
Sara
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.2
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
> Priority: Minor
> Fix For: 0.2
>
> Attachments: partial_August_10.patch, partial_August_13.patch,
> partial_August_15.patch, partial_August_17.patch, partial_August_19.patch,
> partial_August_24.patch, partial_August_27.patch, partial_August_2.patch,
> partial_August_31.patch, partial_August_9.patch, partial_Sep_15.patch,
> partial_Sep_30.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira