Github user bgreeven commented on the pull request:
https://github.com/apache/spark/pull/1290#issuecomment-68241121
I have compared the ANN with Support Vector Machine (SVM) and Logistic
Regression.
I have tested using a master "local[5]" configuration, and applied the
MNIST dataset, using 60000 training examples and 10000 test examples.
Since SVM and Logistic Regression are binary classifiers, I applied two
methods to turn them into multiclass classifiers: majority vote (one-vs-rest)
and an ad-hoc tree.
For the majority vote, I trained 10 different models, each distinguishing a
single class from the rest. Classification then picks the class whose model
gives the highest output. I performed 100 iterations per class, i.e. 1000
iterations in total.
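The voting rule above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `ToDoubleFunction` stands in for a trained binary model's raw output, and `OneVsRestVote` is a hypothetical name.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of the one-vs-rest "majority vote" scheme: one binary scorer per
// class; the class whose scorer gives the highest output wins.
public class OneVsRestVote {
    // Return the index (class label) of the scorer with the highest output.
    public static int classify(List<ToDoubleFunction<double[]>> scorers,
                               double[] features) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scorers.size(); i++) {
            double s = scorers.get(i).applyAsDouble(features);
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }
}
```

For MNIST this would mean 10 scorers, one per digit, each trained to separate that digit from the other nine.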
For ANN, I used a single hidden layer with 32 nodes (not counting the bias
nodes). I performed 100 iterations.
For LBFGS I used tolerance 1e-5.
Because of the poor performance of SVM+SGD, I re-ran it with 1000
iterations per class (10000 in total). The performance was similar.
I found the following results for the test set:
```
+-----------------------------+----------+-----------+-----------+-------------+
| Algorithm                   | Accuracy | Time      | # correct | # incorrect |
+-----------------------------+----------+-----------+-----------+-------------+
| ANN (LBFGS)                 | 95.1%    | 665s      | 9510      | 490         |
+-----------------------------+----------+-----------+-----------+-------------+
| Logistic Regression (SGD)   | 72.0%    | 1325s     | 7202      | 2798        |
+-----------------------------+----------+-----------+-----------+-------------+
| Logistic Regression (LBFGS) | 86.6%    | 1635s     | 8658      | 1342        |
+-----------------------------+----------+-----------+-----------+-------------+
| SVM (SGD)                   | 18.6%    | 1294s     | 1860      | 8140        |
+-----------------------------+----------+-----------+-----------+-------------+
| (SVM (SGD) 1000 iterations) | 18.5%    | 12658s    | 1850      | 8150        |
+-----------------------------+----------+-----------+-----------+-------------+
| SVM (LBFGS)                 | 86.2%    | 1453s     | 8622      | 1378        |
+-----------------------------+----------+-----------+-----------+-------------+
```
I also created an ad-hoc tree model. This splits the collection of
training examples into two approximately equal-size partitions, where I tried
to separate the digits by how different they look. I then split each partition
recursively in the same way, until each output class corresponded to a single
digit. The partitioning was chosen manually and intuitively, as follows:
```
0123456789 -> (04689, 12357)
04689      -> (068, 49)
068        -> (0, 68)
68         -> (6, 8)
49         -> (4, 9)
12357      -> (17, 235)
17         -> (1, 7)
235        -> (2, 35)
35         -> (3, 5)
```
Notice that this leads to only nine classification runs, not ten as in the
voting scheme.
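The tree scheme above can be sketched as follows. This is a hypothetical illustration, not code from this PR: each internal node holds one binary classifier (modelled here as a `Predicate` over the feature vector) that routes an example towards one of its two digit subsets, and classification walks the tree until a single digit remains.

```java
import java.util.function.Predicate;

// Sketch of the ad-hoc tree: nine internal nodes (one binary model each)
// suffice to separate the ten digit classes.
public class AdHocTree {
    public interface Node {
        int classify(double[] features);
    }

    // A leaf: a single remaining digit.
    public static Node leaf(int digit) {
        return f -> digit;
    }

    // An internal node: route to the left or right subset and recurse.
    public static Node split(Predicate<double[]> goLeft, Node left, Node right) {
        return f -> (goLeft.test(f) ? left : right).classify(f);
    }
}
```

For example, the subtree `068 -> (0, 68)`, `68 -> (6, 8)` would be built from two `split` nodes and three leaves.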
After training, I used the trained models to classify the test set. I got
the following results (same parameters as with the voting scheme):
```
+-----------------------------+----------+-----------+-----------+-------------+
| Algorithm                   | Accuracy | Time      | # correct | # incorrect |
+-----------------------------+----------+-----------+-----------+-------------+
| ANN (LBFGS)                 | 95.1%    | 665s      | 9510      | 490         |
+-----------------------------+----------+-----------+-----------+-------------+
| Logistic Regression (SGD)   | 82.3%    | 1146s     | 8228      | 1772        |
+-----------------------------+----------+-----------+-----------+-------------+
| Logistic Regression (LBFGS) | 87.2%    | 1273s     | 8719      | 1281        |
+-----------------------------+----------+-----------+-----------+-------------+
| SVM (SGD)                   | 61.1%    | 1148s     | 6113      | 3887        |
+-----------------------------+----------+-----------+-----------+-------------+
| SVM (LBFGS)                 | 87.5%    | 1182s     | 8753      | 1247        |
+-----------------------------+----------+-----------+-----------+-------------+
```
Notice that I kept ANN in the table, since the purpose is to compare ANN
with the other algorithms. As ANN is a multiclass classifier by nature, it did
not use the ad-hoc tree, so its results are unchanged from the first table.
It would be great if someone could verify my results. I am particularly
surprised by the low performance of SVM+SGD with voting, and by its improvement
with the ad-hoc tree. I used the same code for SGD and LBFGS, changing only the
optimiser and related parameters.