I don’t think you should treat this as an outlier detection problem. Why not
treat it as a classification problem with a highly unbalanced dataset? Try

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

Use sample_weight to tell the fit method about the class imbalance, and be sure
to read up on unbalanced classification and the class_weight parameter of
ExtraTreesClassifier. You cannot use accuracy to find the best model, so read up
on model evaluation in the scikit-learn User's Guide. And when you
cross-validate to choose hyperparameters, be sure to pass the sample weights
there as well.
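A minimal sketch of the idea, with synthetic stand-in data and untuned
settings:

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.utils.class_weight import compute_sample_weight

    # Toy stand-in for the real features: 5 positives vs. 300 negatives.
    rng = np.random.RandomState(0)
    X = rng.randn(305, 10)
    y = np.r_[np.ones(5), np.zeros(300)].astype(int)

    # Option 1: let the estimator weight classes inversely to frequency.
    clf = ExtraTreesClassifier(n_estimators=200, class_weight="balanced",
                               random_state=0)
    clf.fit(X, y)

    # Option 2: pass explicit per-sample weights to fit().
    sw = compute_sample_weight("balanced", y)
    clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y, sample_weight=sw)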

Time series data needs special care with cross-validation. You may want to add
features such as minutes since midnight, day of week, and a weekday/weekend
flag. And make sure your cross-validation folds respect the temporal ordering
of the problem; see the sketch after the link below.

http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
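Recent scikit-learn versions also ship model_selection.TimeSeriesSplit, whose
folds always train on the past and validate on the future. A sketch with toy
data and a made-up grid, assuming sample weights passed to GridSearchCV.fit
are forwarded to the underlying estimator and sliced per fold:

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
    from sklearn.utils.class_weight import compute_sample_weight

    # Toy time-ordered data with rare, regularly spaced positives.
    rng = np.random.RandomState(0)
    X = rng.randn(305, 10)
    y = np.zeros(305, dtype=int)
    y[::60] = 1

    sw = compute_sample_weight("balanced", y)
    cv = TimeSeriesSplit(n_splits=5)  # each fold trains only on earlier data
    grid = GridSearchCV(
        ExtraTreesClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
        scoring="average_precision",  # accuracy is misleading here
        cv=cv,
    )
    grid.fit(X, y, sample_weight=sw)  # weights are sliced per CV fold
    print(grid.best_params_)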


__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and 
Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com

From: scikit-learn 
[mailto:scikit-learn-bounces+dale.t.smith=macys....@python.org] On Behalf Of 
Nicolas Goix
Sent: Thursday, August 4, 2016 9:13 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Supervised anomaly detection in time series

There are different ways of aggregating estimators. One possibility is to take
a majority vote over their predictions; another is to average their decision
functions.
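A rough sketch of the decision-function-averaging variant, with one classifier
per balanced subsample (the estimator choice, helper names, and toy data below
are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    def fit_subsampled_ensemble(X, y, n_models=10, seed=0):
        # One SVC per subsample: all positives + an equal number of negatives.
        rng = np.random.RandomState(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        models = []
        for _ in range(n_models):
            idx = np.r_[pos, rng.choice(neg, size=len(pos), replace=False)]
            models.append(SVC(kernel="rbf", gamma=0.1).fit(X[idx], y[idx]))
        return models

    def average_decision(models, X):
        # Mean of the per-model decision functions; > 0 leans positive.
        return np.mean([m.decision_function(X) for m in models], axis=0)

    # Demo on toy data: 300 normals, 5 shifted "anomalies".
    rng = np.random.RandomState(1)
    X = np.r_[rng.randn(300, 4), rng.randn(5, 4) + 2.0]
    y = np.r_[np.zeros(300, int), np.ones(5, int)]
    print(average_decision(fit_subsampled_ensemble(X, y), X[:3]))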

On Aug 4, 2016 8:44 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
If I train multiple algorithms on different subsamples, then how do I get the 
final classifier that predicts unseen data?

I have very few positive samples because the task is speed bump detection, and
a drive contains very few speed bumps.
However, I think unseen data will be quite similar to my training data, so if I
can learn a classifier that is correct on these 5, I hope it will also work
well on unseen speed bumps.
Thanks,
Amita

On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:

You can evaluate your hyper-parameters on a few samples; just don't use
accuracy as the performance measure.

For supervised classification, training multiple algorithms on small balanced
subsamples usually works well, but 5 anomalies does indeed seem very few.

Nicolas

On Aug 4, 2016 7:51 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
Subsampling would throw away a lot of information from the negative class.
I have more than 500 negative samples and just 5 positive samples.
Amita

On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:
Hi,

Yes, you can use your labeled data to learn your hyper-parameters through CV
(you will need to sub-sample your normal class so that the normal and abnormal
proportions are similar).

You can also try supervised classification algorithms on 'not too highly
unbalanced' sub-samples.
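As a sketch of the first suggestion: score OneClassSVM hyper-parameters against
the labels on a held-out, rebalanced validation set (toy data; the grid values
and the choice of F1 are made up for illustration):

    import numpy as np
    from itertools import product
    from sklearn.svm import OneClassSVM
    from sklearn.metrics import f1_score

    # Toy data: 300 normal windows, 5 speed-bump windows.
    rng = np.random.RandomState(0)
    X_neg = rng.randn(300, 6)
    X_pos = rng.randn(5, 6) + 3.0

    X_val = np.r_[X_neg[:50], X_pos]                   # held-out normals + bumps
    y_val = np.r_[np.zeros(50, int), np.ones(5, int)]

    best, best_f1 = None, -1.0
    for nu, gamma in product([0.01, 0.05, 0.1], [0.01, 0.1, 1.0]):
        clf = OneClassSVM(nu=nu, gamma=gamma).fit(X_neg[50:])  # normals only
        pred = (clf.predict(X_val) == -1).astype(int)  # -1 means outlier
        score = f1_score(y_val, pred)
        if score > best_f1:
            best, best_f1 = (nu, gamma), score
    print(best, best_f1)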

Nicolas

On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra <amis...@ucsc.edu> wrote:
Hi,

I am currently exploring the problem of speed bump detection using
accelerometer time series data.
I have extracted features such as the mean and standard deviation within a time
window, along the lines of the sketch below.
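(For reference, a sketch of such windowed features with pandas; the window
length and sampling here are made up:)

    import numpy as np
    import pandas as pd

    def window_features(accel, window=50):
        # Rolling mean/std/min/max over a fixed-size window of readings.
        r = accel.rolling(window)
        return pd.DataFrame({"mean": r.mean(), "std": r.std(),
                             "min": r.min(), "max": r.max()}).dropna()

    accel = pd.Series(np.random.randn(1000))  # one accelerometer axis
    feats = window_features(accel)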
Since the dataset is highly skewed (I have just 5 positive samples out of more
than 300), I was looking into

svm.OneClassSVM
covariance.EllipticEnvelope
ensemble.IsolationForest

but I am not sure how to use them.

What I gather from the docs is: set aside the positive examples and train using
only the negative examples,

clf.fit(X_train)

and then predict the positive examples using

clf.predict(X_test)
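In code, that recipe would look roughly like this (toy data and illustrative
hyper-parameters):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X_train = rng.randn(300, 6)                  # negative (normal) windows only
    X_test = np.r_[rng.randn(10, 6), rng.randn(5, 6) + 3.0]

    clf = OneClassSVM(nu=0.05, gamma=0.1)
    clf.fit(X_train)                             # fit on negatives only
    pred = clf.predict(X_test)                   # +1 = inlier, -1 = outlier
    print(pred)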

I am not sure what role the positive examples then play in my training dataset,
or how I can use them to improve my classifier so that it predicts better on
new samples.

Can we do something like cross-validation to learn the parameters, as in
ordinary binary SVM classification?

Thanks,
Amita

Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz





--
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz


--
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz


--
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
