[ https://issues.apache.org/jira/browse/MAHOUT-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187691#comment-13187691 ]

Bernhard Lehner commented on MAHOUT-932:
----------------------------------------

Hello,
I only started with MAHOUT recently, but hopefully I can contribute to this 
topic:
I have the same problem with the ArrayIndexOutOfBoundsException, but the 
exact behaviour differs in detail from version to version:
+++++++++++++++++++++++++++++++++++++++++
MAHOUT-0.5 + HADOOP-0.20.2 (running local)

The aforementioned KDDTrain+ (100%) dataset from the partial-implementation 
example can be used for training without any problems whatsoever. BUT using a 
larger dataset (~10 times the size, approx. 180MB) leads to the exception 
(the index equals the -t parameter), regardless of the value of the 
-Dmapred.max.split.size parameter.
To avoid partitioning I even set the parameter to more than the actual input 
data size; nevertheless the PartialBuilder gets involved and the exception is 
thrown. Only by not passing the parameter at all was I able to train the RF.
But this led to another interesting result: when I compared the performance 
of the resulting RF model with a model trained in WEKA, I noticed a ~2% drop 
in accuracy on every single one of my ten different train and test sets. 
Using somewhat smaller datasets (approx. 150MB for training) did not lead to 
such a difference between MAHOUT and WEKA.
It seems that partitioning is responsible for the 2% drop in accuracy, as is 
explicitly mentioned on

https://cwiki.apache.org/MAHOUT/partial-implementation.html

"...
IMPORTANT: using less partitions should give better classification results, but 
needs a lot of memory. So if the Jobs are failing, try increasing the number of 
partitions.
..."

The weird thing is, I certainly have enough memory (8GB) available to train 
the RF, but I didn't find a way to prevent partitioning...


+++++++++++++++++++++++++++++++++++++++++
MAHOUT-0.6-trunk + HADOOP-0.20.204 | HADOOP-1.0.0 (running local)

Now I am able to reproduce the same results as WEKA by not using the -p 
option, hence invoking the InMem mapreduce implementation (i.e. no more 2% 
drop in accuracy). But, of course, it takes some time to finish training...
However, I still haven't found a combination of the -p option and the 
-Dmapred.max.split.size parameter that doesn't lead to the exception thrown 
by PartialBuilder with large datasets.

I agree with Ikumasa-san that there is some discrepancy regarding the 
expected number of subtrees. What I'm still not sure about is:
- Is this a problem with MAHOUT or with HADOOP?
- Why does the -Dmapred.max.split.size parameter not have any impact on the 
behaviour, even when it is set so high that no partitioning should happen at 
all? (A sketch of HADOOP's split-size rule follows below.)
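
For what it's worth, regarding the second question: in HADOOP 0.20.x the 
new-API FileInputFormat clamps the split size between the configured min/max 
and the file's block size, so raising only mapred.max.split.size never 
produces splits larger than one block. If that reading is right, it would 
explain why a huge maximum alone doesn't prevent partitioning, and why 
raising mapred.min.split.size instead might. A minimal sketch (the clamp 
formula follows org.apache.hadoop.mapreduce.lib.input.FileInputFormat; the 
demo values are assumptions):

public class SplitSizeSketch {

  // Split-size rule used by the new-API FileInputFormat:
  // the block size, bounded below by mapred.min.split.size
  // and above by mapred.max.split.size.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L << 20; // assume a 64MB block size
    long minSize = 1L;          // mapred.min.split.size left at its default
    long maxSize = 512L << 20;  // mapred.max.split.size set far above the input

    // Even with maxSize larger than the whole input, the result is still
    // the block size, so an approx. 180MB input is split into partitions.
    System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
  }
}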

This behaviour should be easy to reproduce by concatenating the content of 
the KDDTrain+ file over and over. To speed up the whole process I would 
advise using a small number of trees and attributes (I used -t 5 -sl 10), 
since this has no effect on whether the exception is thrown.


Regarding the -t parameter:
From the user's point of view I think it would be best to just give the total 
number of trees as the parameter, regardless of the number of mappers 
involved. Given that HADOOP reports the number of invoked mappers, I would 
definitely vote for this solution.
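
To make the suspected discrepancy concrete, here is a purely illustrative 
sketch (made-up names, not Mahout's actual code): if every mapper derives its 
share of trees from the mapper count that HADOOP actually invoked, the shares 
sum to exactly -t; if the shares are derived from an assumed count that 
differs from the real one, more than -t trees can come back, which would 
overrun an output array sized to -t (cf. the index in 
PartialBuilder.processOutput equalling the -t value):

public class TreeAllocationSketch {

  // Trees that mapper 'partition' grows: an even share of the total,
  // with the remainder going to the first partition so shares sum up exactly.
  static int treesForPartition(int totalTrees, int numMappers, int partition) {
    int base = totalTrees / numMappers;
    return partition == 0 ? base + totalTrees % numMappers : base;
  }

  // Total number of trees emitted when shares are computed from
  // 'assumedMappers' but 'actualMappers' map tasks really run.
  static int totalGrown(int totalTrees, int assumedMappers, int actualMappers) {
    int sum = 0;
    for (int p = 0; p < actualMappers; p++) {
      sum += treesForPartition(totalTrees, assumedMappers, p);
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(totalGrown(100, 7, 7));  // 100: counts agree, no overflow
    System.out.println(totalGrown(100, 7, 10)); // 142: too many trees for an
                                                // array sized to -t = 100
  }
}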

Best regards,

> RandomForest quits with ArrayIndexOutOfBoundsException while running sample
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-932
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-932
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.6
>         Environment: Mac OS X, current Mac OS shipped Java version, latest 
> checkout from 17.12.2011
> Dual Core MacBook Pro 2009, 8 Gb, SSD
>            Reporter: Berttenfall M.
>            Priority: Minor
>              Labels: Classifier, DecisionForest, RandomForest
>
> Hello,
> when running the example under 
> https://cwiki.apache.org/MAHOUT/partial-implementation.html with the 
> recommended data sets, several issues occur.
> First: ARFF files no longer seem to be supported; I've been using the UCI 
> format as recommended here 
> (https://cwiki.apache.org/MAHOUT/breiman-example.html). Using ARFF files, 
> Mahout quits when creating the description file (wrong number of attributes 
> in the string); using the UCI format it works.
> The main error happens during the BuildForest step (I could not test 
> TestForest, due to the missing tree).
> Running:
> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.df.mapreduce.BuildForest 
> -Dmapred.max.split.size=1874231 -d convertedData/data.data -ds KDDTrain+.info 
> -sl 5 -p -t 100 -o nsl-forest
> I tested different split.size values: 1874231, 187423, and 18742 give the 
> following error; 1874 does not finish on my machine (Dual Core MacBook Pro 
> 2009, 8 GB, SSD).
> It quits after a while (map is almost done) with the following message:
> 11/12/17 16:23:24 INFO mapred.Task: Task 'attempt_local_0001_m_000998_0' done.
> 11/12/17 16:23:24 INFO mapred.Task: Task:attempt_local_0001_m_000999_0 is 
> done. And is in the process of commiting
> 11/12/17 16:23:24 INFO mapred.LocalJobRunner: 
> 11/12/17 16:23:24 INFO mapred.Task: Task attempt_local_0001_m_000999_0 is 
> allowed to commit now
> 11/12/17 16:23:24 INFO output.FileOutputCommitter: Saved output of task 
> 'attempt_local_0001_m_000999_0' to 
> file:/Users/martin/Documents/Studium/Master/LargeScaleProcessing/Repository/mahout_algorithms_evaluation/testingRandomForests/nsl-forest
> 11/12/17 16:23:27 INFO mapred.LocalJobRunner: 
> 11/12/17 16:23:27 INFO mapred.Task: Task 'attempt_local_0001_m_000999_0' done.
> 11/12/17 16:23:28 INFO mapred.JobClient:  map 100% reduce 0%
> 11/12/17 16:23:28 INFO mapred.JobClient: Job complete: job_local_0001
> 11/12/17 16:23:28 INFO mapred.JobClient: Counters: 8
> 11/12/17 16:23:28 INFO mapred.JobClient:   File Output Format Counters 
> 11/12/17 16:23:28 INFO mapred.JobClient:     Bytes Written=41869032
> 11/12/17 16:23:28 INFO mapred.JobClient:   FileSystemCounters
> 11/12/17 16:23:28 INFO mapred.JobClient:     FILE_BYTES_READ=37443033225
> 11/12/17 16:23:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44946910704
> 11/12/17 16:23:28 INFO mapred.JobClient:   File Input Format Counters 
> 11/12/17 16:23:28 INFO mapred.JobClient:     Bytes Read=20478569
> 11/12/17 16:23:28 INFO mapred.JobClient:   Map-Reduce Framework
> 11/12/17 16:23:28 INFO mapred.JobClient:     Map input records=125973
> 11/12/17 16:23:28 INFO mapred.JobClient:     Spilled Records=0
> 11/12/17 16:23:28 INFO mapred.JobClient:     Map output records=100000
> 11/12/17 16:23:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=215000
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 100
>       at 
> org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:126)
>       at 
> org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:89)
>       at 
> org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:303)
>       at 
> org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:201)
>       at 
> org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:163)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at 
> org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:225)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> PS: I adjusted the class to .classifier.df. and removed -oop

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira