On 10/01/2014 04:23 PM, Gavin Hackeling wrote:
Hi all,

I am working on a character recognition problem with the Chars74K data set. I am reshaping the images to 30x30 pixels, and using the 900 pixels' intensities as features. I am classifying the images using an SVC with an RBF kernel.

...
    pipeline = Pipeline([
        ('clf', SVC(kernel='rbf'))
    ])
    parameters = {
        'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
        'clf__C': (0.1, 0.3, 1, 3, 10, 30),
    }
...

On CrunchBang 11 with scikit-learn 0.15.2, grid search yields the following results:

Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=3)]: Done   1 jobs       | elapsed:  1.6min
[Parallel(n_jobs=3)]: Done  50 jobs       | elapsed: 34.8min
[Parallel(n_jobs=3)]: Done 86 out of 90 | elapsed: 69.4min remaining: 3.2min
[Parallel(n_jobs=3)]: Done  90 out of  90 | elapsed: 71.6min finished
Best score: 0.559
Best parameters set:
clf__C: 3
clf__gamma: 0.03
             precision    recall  f1-score   support

        001       0.00      0.00      0.00         6
        002       1.00      0.20      0.33         5
        ...
        061       0.00      0.00      0.00         4
        062       0.00      0.00      0.00         4

avg / total       0.56      0.58      0.53       532

On Ubuntu 14.04 and OS X with scikit-learn 0.15.1 and 0.15.2, the same model performs horribly. The following are the results of the script for Ubuntu 14.04 with NumPy 1.8.2 and 0.14.0.

avg / total       0.09      0.07      0.02       532

Switching to a polynomial kernel on these platforms yields better performance, but the RBF kernel still performs best.

It appears that the performance depends on the platform. What might be the problem here?

Have you fixed the random seed in the GridSearchCV? The dataset seems much too small for this number of classes, so the cross-validation results will be very noisy. If you look at the "good" result, the scores are 0 for every visible class except 002, and that class has only 5 samples. Another source of non-determinism could be using "probability=True" in the SVC.
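
To make the seeding suggestion concrete, here is a minimal sketch of one way to pin down the randomness. It is not the original script: it assumes a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search as in 0.15), uses the bundled digits data as a stand-in for Chars74K, and uses a smaller grid to keep it quick. SEED and run_search are names made up for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

SEED = 0  # hypothetical choice; any fixed integer works


def run_search(seed=SEED):
    # Stand-in data (assumption): 8x8 digit images instead of Chars74K.
    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # digits pixel values range from 0 to 16

    # random_state on SVC only matters when probability=True,
    # but setting it does no harm.
    pipeline = Pipeline([
        ('clf', SVC(kernel='rbf', random_state=seed))
    ])
    parameters = {
        'clf__gamma': (0.01, 0.1),
        'clf__C': (1, 10),
    }

    # Passing an explicit, seeded splitter fixes the fold assignment,
    # so repeated runs score every candidate on the same folds.
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, parameters, cv=cv)
    search.fit(X, y)
    return search.best_params_, search.best_score_


if __name__ == '__main__':
    params, score = run_search()
    print(params, score)
```

If two runs with the same seed on the same machine agree but the platforms still disagree, the cause is probably something other than the seeding of the splits.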