Joseph K. Bradley created SPARK-13868:
-----------------------------------------
Summary: Random forest accuracy exploration
Key: SPARK-13868
URL: https://issues.apache.org/jira/browse/SPARK-13868
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
This is a JIRA for exploring accuracy improvements for Random Forests.
h2. Background
Initial exploration was based on reports of poor accuracy from
[http://datascience.la/benchmarking-random-forest-implementations/]
Essentially, Spark 1.2 showed poor accuracy (AUC) relative to other libraries for
training set sizes of 1M and 10M.
h3. Initial improvements
The biggest issue was that the benchmark metric was AUC, but Spark 1.2 was
producing hard 0/1 predictions rather than class probabilities, which depresses
AUC. This was fixed in [SPARK-9528], and that brought Spark up to accuracy
parity with scikit-learn, Vowpal Wabbit, and R for the training set size of 1M.
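For reference, here is a minimal sketch of scoring AUC from the probability-based
scores rather than hard predictions, using the spark.ml API. The {{train}} and
{{test}} DataFrames (with "label"/"features" columns) and the tree count are
placeholders, not the benchmark's exact setup:
{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// train/test: placeholder DataFrames with "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(500)  // illustrative; not the benchmark's exact setting

val model = rf.fit(train)
val predictions = model.transform(test)

// AUC must be computed from the continuous scores (the "rawPrediction" /
// "probability" columns), not the thresholded "prediction" column;
// thresholding first is what depressed the Spark 1.2 numbers.
val evaluator = new BinaryClassificationEvaluator()
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
println(s"AUC = ${evaluator.evaluate(predictions)}")
{code}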
h3. Remaining issues
For training set size 10M, Spark does not yet match the AUC of the other two
libraries benchmarked (H2O and xgboost).
Note that, on 1M instances, these two libraries also show better results than
scikit-learn, VW, and R. I'm not too familiar with the H2O implementation and
how it differs, but xgboost is a very different algorithm (boosting rather than
bagging), so it is not surprising that it behaves differently.
h2. My explorations
I've run Spark on the 10M-instance test. (Note that the benchmark linked above
used somewhat different settings for the different algorithms, such as gini vs.
entropy impurity and limits on splitting nodes, but those settings turn out not
to matter much for this problem.)
I've tried adjusting the following (see the sketch after this list):
* maxDepth: Past depth 20, going deeper does not seem to matter.
* maxBins: I've gone up to 500, but this too does not seem to matter. However,
this is hard to verify, since slight differences in discretization could become
significant in a large tree.
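For concreteness, a sketch of those knobs on the spark.ml API (the values shown
are just the ones explored above, not recommendations):
{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier

// Parameter settings explored above; maxBins controls how finely continuous
// features are discretized into candidate split thresholds, which is why
// small changes in it are hard to attribute in a deep tree.
val rf = new RandomForestClassifier()
  .setMaxDepth(20)      // deeper than 20 did not seem to matter
  .setMaxBins(500)      // up to 500 bins also did not seem to matter
  .setImpurity("gini")  // vs. "entropy"; not important for this problem
{code}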
h2. Current questions
* H2O: It would be good to understand how this implementation differs from
standard RF implementations (in R, VW, scikit-learn, and Spark).
* xgboost: There's a JIRA for it: [SPARK-8547]. It would be great to see the
Spark package linked from that JIRA tested vs. MLlib on the benchmark data (or
other data). From what I've heard/read, xgboost is sometimes better and
sometimes worse in accuracy (but of course faster, given its more localized
training).
* Based on the above explorations, are there changes we should make to Spark
RFs?