I have worked on something similar. Instead of using algorithms that deal with unbalanced data directly, you can also try to create a balanced dataset, either by oversampling or by downsampling. scikit-learn-contrib already has a project for dealing with unbalanced data: https://github.com/scikit-learn-contrib/imbalanced-learn.
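For example, a minimal sketch of random oversampling with imbalanced-learn (the names below follow the current imbalanced-learn API; older releases called the method fit_sample instead of fit_resample, so check your installed version):

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    # Toy stand-in for the data in this thread: 300 normal windows, 5 speed bumps.
    rng = np.random.RandomState(0)
    X = rng.randn(305, 6)
    y = np.array([0] * 300 + [1] * 5)

    # Duplicate minority-class rows until both classes have equal counts.
    ros = RandomOverSampler(random_state=0)
    X_res, y_res = ros.fit_resample(X, y)
    print(np.bincount(y_res))  # -> [300 300]

With only 5 positives, random duplication is safer than synthetic methods such as SMOTE, whose default k_neighbors=5 requires more minority samples than that.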
Whether you treat it as a classification problem or as an anomaly detection problem (I would try classification first), in either case you need to find a good set of features, in the time domain or the frequency domain.

On Fri, Aug 5, 2016 at 7:09 AM, Dale T Smith <dale.t.sm...@macys.com> wrote:

> To analyze unbalanced classifiers, use
>
>     from sklearn.metrics import classification_report
>
> On Fri, Aug 5, 2016 at 9:33 AM, Pedro Pazzini wrote:
>
> Just to add a few things to the discussion:
>
> 1. For unbalanced problems, as far as I know, one of the best scores for
> evaluating a classifier is the area under the ROC curve:
> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
> For that you will have to use clf.predict_proba(X_test) instead of
> clf.predict(X_test). I also think that using the 'sample_weight' parameter
> as Smith said is a promising choice.
> 2. It is usually recommended to normalize each time series before
> comparing them; z-score normalization is one of the most widely used
> [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf].
> 3. There are interesting dissimilarity measures for comparing time
> series, such as DTW (Dynamic Time Warping) and CID (Complexity-Invariant
> Distance) [Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf],
> and there are also frequency-domain approaches based on the FFT and DWT
> [Ref: http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].
>
> I hope it helps.
>
> 2016-08-05 9:26 GMT-03:00 Dale T Smith <dale.t.sm...@macys.com>:
>
> I don't think you should treat this as an outlier detection problem. Why
> not try it as a classification problem? The dataset is highly unbalanced.
> Try
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
>
> Use sample_weight to tell the fit method about the class imbalance. But
> be sure to read up on unbalanced classification and the class_weight
> parameter to ExtraTreesClassifier. You cannot use the accuracy to find the
> best model, so read up on model validation in the sklearn User's Guide.
> And when you do cross-validation to get the best hyperparameters, be sure
> you pass the sample weights as well.
>
> Time series data is a bit different to use with cross-validation. You may
> want to add features such as minutes since midnight, day of week, and
> weekday/weekend. And make sure your cross-validation folds respect the
> time series nature of the problem:
>
> http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
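To make Dale's suggestions concrete, here is a minimal sketch (the data is a random placeholder standing in for the windowed accelerometer features, and the parameter values are illustrative, not tuned):

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import classification_report, roc_auc_score

    # Placeholder data: time-ordered feature windows, label 1 = speed bump.
    rng = np.random.RandomState(0)
    X = rng.randn(305, 6)
    y = np.zeros(305, dtype=int)
    y[::60] = 1  # a handful of positives spread across the series

    # Respect the time order: train on the past, test on the future.
    # (Newer scikit-learn versions also ship model_selection.TimeSeriesSplit
    # for cross-validation with this structure.)
    split = 200
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # class_weight='balanced' reweights samples inversely to class frequency,
    # one way to tell the fit about the imbalance.
    clf = ExtraTreesClassifier(n_estimators=100, class_weight='balanced',
                               random_state=0)
    clf.fit(X_train, y_train)

    # Score with probabilities, not hard labels: ROC AUC needs a ranking.
    print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    print(classification_report(y_test, clf.predict(X_test)))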
> On Thu, Aug 4, 2016 at 9:13 PM, Nicolas Goix wrote:
>
> There are different ways of aggregating estimators. A possibility can be
> to take the majority vote, or to average the decision functions.

(A sketch of this aggregation follows at the end of this message.)

> On Aug 4, 2016 8:44 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
>
> If I train multiple algorithms on different subsamples, then how do I get
> the final classifier that predicts unseen data?
>
> I have very few positive samples, since this is speed bump detection and
> we have very few speed bumps in a drive. However, I think that unseen new
> data would be quite similar to what I have in the training data, so if I
> can correctly learn a classifier for these 5, I hope it will work well for
> unseen speed bumps.
>
> Thanks,
> Amita
>
> On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:
>
> You can evaluate the accuracy of your hyper-parameters on a few samples.
> Just don't use the accuracy as your performance measure.
>
> For supervised classification, training multiple algorithms on small
> balanced subsamples usually works well, but 5 anomalies does indeed seem
> very little.
>
> Nicolas
>
> On Aug 4, 2016 7:51 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
>
> Subsampling would remove a lot of information from the negative class.
> I have more than 500 samples of the negative class and just 5 samples of
> the positive class.
>
> Amita
>
> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:
>
> Hi,
>
> Yes, you can use your labeled data to learn your hyper-parameters through
> CV (you will need to sub-sample your normal class to get similar
> normal-abnormal proportions).
>
> You can also try supervised classification algorithms on `not too highly
> unbalanced' sub-samples.
>
> Nicolas
>
> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra <amis...@ucsc.edu> wrote:
>
> Hi,
>
> I am currently exploring the problem of speed bump detection using
> accelerometer time series data. I have extracted some features based on
> mean, standard deviation, etc. within a time window.
>
> Since the dataset is highly skewed (I have just 5 positive samples for
> every 300 samples), I was looking into
>
> OneClassSVM
> covariance.EllipticEnvelope
> sklearn.ensemble.IsolationForest
>
> but I am not sure how to use them. What I get from the docs is: separate
> out the positive examples, train using only the negative examples with
>
>     clf.fit(X_train)
>
> and then predict on the positive examples using
>
>     clf.predict(X_test)
>
> I am not sure what the role of the positive examples in my training
> dataset is then, or how I can use them to improve my classifier so that I
> can predict better on new samples. Can we do something like cross
> validation to learn the parameters, as in normal binary SVM
> classification?
>
> Thanks,
> Amita
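Regarding Nicolas's point above about aggregating estimators trained on balanced subsamples, here is a minimal sketch of the averaging variant (the ensemble size, base classifier, and toy data are all illustrative, not a prescription):

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier

    def fit_balanced_ensemble(X, y, n_members=10, random_state=0):
        # One classifier per balanced subsample: all positives plus an
        # equally sized random draw from the abundant negative class.
        rng = np.random.RandomState(random_state)
        pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        members = []
        for _ in range(n_members):
            idx = np.concatenate([pos, rng.choice(neg, len(pos), replace=False)])
            clf = ExtraTreesClassifier(n_estimators=100,
                                       random_state=rng.randint(2 ** 31 - 1))
            members.append(clf.fit(X[idx], y[idx]))
        return members

    def ensemble_scores(members, X):
        # Aggregate by averaging the members' probability estimates
        # (soft voting); threshold the result for hard labels.
        return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)

    # Toy usage: 300 normal windows, 5 shifted (hence separable) speed bumps.
    rng = np.random.RandomState(0)
    X = np.r_[rng.randn(300, 6), rng.randn(5, 6) + 3.0]
    y = np.r_[np.zeros(300, dtype=int), np.ones(5, dtype=int)]
    members = fit_balanced_ensemble(X, y)
    print(ensemble_scores(members, X[-5:]))  # scores near 1 for the bumps

The averaged score is itself the "final classifier" Amita asks about: apply ensemble_scores to unseen windows and threshold the result.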
--
Qingkai KONG
Ph.D. Candidate
Seismological Lab
289 McCone Hall
University of California, Berkeley
http://seismo.berkeley.edu/qingkaikong