-----Original Message-----
From: scikit-learn
[mailto:[email protected]] On Behalf Of
[email protected]
Sent: Monday, March 20, 2017 6:06 AM
To: [email protected]
Subject: scikit-learn Digest, Vol 12, Issue 42
Send scikit-learn mailing list submissions to
[email protected]
To subscribe or unsubscribe via the World Wide Web, visit
https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
[email protected]
You can reach the person managing the list at
[email protected]
When replying, please edit your Subject line so it is more specific than "Re:
Contents of scikit-learn digest..."
Today's Topics:
1. recommended feature selection method to train an MLPRegressor
(Thomas Evangelidis)
2. Re: recommended feature selection method to train an
MLPRegressor (Andreas Mueller)
3. Re: recommended feature selection method to train an
MLPRegressor (Sebastian Raschka)
4. Anomaly/Outlier detection based on user access for a large
application (John Doe)
----------------------------------------------------------------------
Message: 1
Date: Sun, 19 Mar 2017 20:47:36 +0100
From: Thomas Evangelidis <[email protected]>
To: Scikit-learn user and developer mailing list
<[email protected]>
Subject: [scikit-learn] recommended feature selection method to train
an MLPRegressor
Message-ID:
<CAACvdx17Ev3jr0ds2bLyJc0RqZkqJH7Rtx=s1zaodmuvckc...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Which of the following methods would you recommend to select good features
(<=50) from a set of 534 features in order to train an MLPRegressor? Please
take into account that the datasets I use for training are small.
http://scikit-learn.org/stable/modules/feature_selection.html
And please don't tell me to use a neural network that supports dropout or any
other algorithm for feature elimination. That is not applicable in my case,
because I want to know the best 50 features in order to append them to other
types of features that I am confident are important.
Cheers,
Thomas
--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology, Masaryk University
Kamenice 5/A35/1S081, 62500 Brno, Czech Republic
email: [email protected]
[email protected]
website: https://sites.google.com/site/thomasevangelidishomepage/
------------------------------
Message: 2
Date: Sun, 19 Mar 2017 18:23:07 -0400
From: Andreas Mueller <[email protected]>
To: Scikit-learn user and developer mailing list
<[email protected]>
Subject: Re: [scikit-learn] recommended feature selection method to
train an MLPRegressor
Message-ID: <[email protected]>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
> Which of the following methods would you recommend to select good
> features (<=50) from a set of 534 features in order to train an
> MLPRegressor? Please take into account that the datasets I use for
> training are small.
>
> http://scikit-learn.org/stable/modules/feature_selection.html
>
> And please don't tell me to use a neural network that supports dropout
> or any other algorithm for feature elimination. That is not applicable
> in my case because I want to know the best 50 features in order to
> append them to other types of features that I am confident are
> important.
>
You can always use forward or backward selection as implemented in mlxtend if
you're patient. As your dataset is small, that might work.
However, it might be tricky to get the MLP to run consistently - though
maybe not...
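For illustration, greedy forward selection with an MLP can be sketched as
below. mlxtend's SequentialFeatureSelector is the implementation mentioned
above; to keep the sketch free of extra dependencies it uses scikit-learn's
own SequentialFeatureSelector instead (added in scikit-learn 0.24, well after
this thread), and toy synthetic sizes rather than the 534-feature setting
from the question:

```python
# Sketch of greedy forward feature selection wrapped around an MLP.
# Toy data: 8 candidate features, of which only 0 and 3 carry signal.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.randn(80, 8)                                  # 80 samples, 8 features
y = X[:, 0] + 2 * X[:, 3] + 0.1 * rng.randn(80)       # features 0 and 3 matter

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
sfs = SequentialFeatureSelector(
    mlp, n_features_to_select=3, direction="forward", cv=3
)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask over the 8 candidate features
```

Note the cost: each candidate feature added means refitting the MLP once per
CV fold, which is exactly why this gets expensive at 534 features.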
------------------------------
Message: 3
Date: Sun, 19 Mar 2017 19:32:45 -0400
From: Sebastian Raschka <[email protected]>
To: Scikit-learn user and developer mailing list
<[email protected]>
Subject: Re: [scikit-learn] recommended feature selection method to
train an MLPRegressor
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8
Hm, that's tricky. I think the other methods listed on
http://scikit-learn.org/stable/modules/feature_selection.html could help as a
computationally cheap solution, but the problem is that they probably wouldn't
work that well for an MLP due to their linearity assumption. An exhaustive
search over all subsets would also be impractical/impossible: for the
50-feature subsets alone, you already have
73353053308199416032348518540326808282134507009732998441913227684085760
combinations :P. A greedy solution like forward or backward selection would be
more feasible, but still very expensive in combination with an MLP. On top of
that, you also have to consider that neural networks are generally pretty
sensitive to hyperparameter settings. So even if you fix the architecture, you
probably still want to check that the learning rate etc. is appropriate for
each combination of features (by checking the cost and validation error during
training).
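One cheap filter that sidesteps the linearity assumption is to rank features
by estimated mutual information with the target and keep the top k. A minimal
sketch on synthetic data (toy sizes, not the 534-feature problem):

```python
# Sketch of a cheap, nonlinearity-aware filter: SelectKBest with
# mutual_info_regression. Unlike f_regression, the mutual-information
# score does not assume a linear feature-target relationship.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.RandomState(42)
X = rng.randn(200, 20)
y = np.sin(X[:, 0]) + X[:, 5] ** 2 + 0.1 * rng.randn(200)  # nonlinear signal

selector = SelectKBest(mutual_info_regression, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                       # (200, 5)
top = np.argsort(selector.scores_)[::-1][:5]
print(top)                                   # indices of the highest-MI features
```

In the original setting one would use k=50 and then append the selected
columns to the features already known to be important.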
PS: I wouldn't dismiss dropout, imho. Especially because your training set is
small, it could even be crucial for reducing overfitting. It doesn't remove
features from your dataset; it just discourages the network from relying on
particular combinations of features always being present during training. Your
final network will still process all features, and dropout will effectively
cause your network to "use" more of the features in your ~50-feature subset
compared to no dropout (because otherwise, it may just learn to rely on a
subset of those 50 features).
> On Mar 19, 2017, at 6:23 PM, Andreas Mueller <[email protected]> wrote:
>
> On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
>> Which of the following methods would you recommend to select good features
>> (<=50) from a set of 534 features in order to train an MLPRegressor? Please
>> take into account that the datasets I use for training are small.
>>
>> http://scikit-learn.org/stable/modules/feature_selection.html
>>
>> And please don't tell me to use a neural network that supports dropout
>> or any other algorithm for feature elimination. That is not applicable in
>> my case because I want to know the best 50 features in order to append them
>> to other types of features that I am confident are important.
>>
> You can always use forward or backward selection as implemented in mlxtend if
> you're patient. As your dataset is small, that might work.
> However, it might be tricky to get the MLP to run consistently - though
> maybe not...
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
------------------------------
Message: 4
Date: Mon, 20 Mar 2017 11:35:59 +0530
From: John Doe <[email protected]>
To: [email protected]
Subject: [scikit-learn] Anomaly/Outlier detection based on user access
for a large application
Message-ID:
<CAP=qekf0svrzy+rqjyutp9ax2zlki39h89cnngbj-gewyxp...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi All,
I am trying to solve the problem of finding anomalies/outliers using the
application logs of a large KMS. Please find the details below:
*Problem Statement*: Find anomalies/outliers using application access logs in
an unsupervised learning setting. The basic use case is to find any suspicious
activity by a user/group that deviates from the trend the algorithm has
learned.
*Input Data*: Data would be created from log files that are in the following
format:
"ts, src_ip, decrypt, user_a, group_b, kms_region, key"
Where:
*ts* : time of access, in epoch seconds, e.g. 1489840335
*decrypt* : one of the various possible actions
*user_a*, *group_b* : the user and group that performed the access
*kms_region* : the region in which the key exists
*key* : the key that was accessed
*Train Set*: This is an unsupervised learning problem, so we can't have a
"normal" training set for the model to learn from.
*Example of anomalies*:
1. User A suddenly accessing from a different IP: xx.yy
2. The number of accesses for a given key suddenly going up for a given
(user, key) pair
3. Increased access on a generally quiet long weekend
4. Increased access on a Thursday (compared to previous Thursdays)
5. Unusual sequences of actions for a given user, e.g. read, decrypt,
delete in quick succession for all keys for a given user
------------------------
From our research, we have come up with the below list of algorithms that are
applied to similar problems:
- ARIMA
<https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average>
: This might be good for time-series forecasting, but will it also learn
to flag anomalies like #3 and #4, or sequences of actions (#5)?
- scikit-learn's Novelty and Outlier Detection
<http://scikit-learn.org/stable/modules/outlier_detection.html> : Not
sure if these will address use cases #3, #4 and #5 above.
- Neural Networks
- k-nearest neighbors
- Clustering-based anomaly detection techniques: k-means clustering etc.
- Parametric Techniques
<https://www.vs.inf.ethz.ch/edu/HS2011/CPS/papers/chandola09_anomaly-detection-survey.pdf>
(see Section 7): These might work well on continuous variables, but will
they work on discrete features like is_weekday etc.? And will they cover
cases like #4 and #5 above?
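For what it's worth, cases #3 and #4 can often be handled without heavy
machinery by comparing each day's access count against a per-weekday
baseline. A minimal numpy sketch on made-up counts (in practice the counts
would be aggregated from the "ts" field; the 3-sigma threshold is an
arbitrary assumption):

```python
# Sketch for use cases #3/#4: flag a day whose access count deviates
# strongly from the baseline of previous same-weekdays.
import numpy as np

# synthetic access counts for the last 8 Thursdays; the latest one spikes
thursday_counts = np.array([102, 98, 110, 95, 105, 99, 101, 240])

history, today = thursday_counts[:-1], thursday_counts[-1]
mu, sigma = history.mean(), history.std()
z = (today - mu) / sigma                      # standard score vs. baseline
print(f"z-score = {z:.1f}")
print("anomalous" if abs(z) > 3 else "normal")
```

The same idea extends to "quiet long weekend" baselines by bucketing on a
holiday/weekend flag instead of the weekday.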
Most of the research I found was on problems that had continuous features and
did not consider discrete variables like "is_holiday" or successions of
events.
Any feedback on algorithms/techniques that could be used for the above use
cases would be highly appreciated. Thanks.
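On the discrete-feature concern specifically, one common pattern is to
one-hot encode the categorical log fields and feed them, together with any
numeric features, to an outlier detector such as IsolationForest. A hedged
sketch on entirely synthetic log rows (field values and distributions are
made up):

```python
# Sketch: one-hot encode categorical log fields (user, action) plus a
# binary is_weekday flag and a numeric hour-of-day, then fit
# IsolationForest. predict() returns +1 for inliers, -1 for outliers.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
n = 300
logs = pd.DataFrame({
    "user": rng.choice(["user_a", "user_b"], n, p=[0.9, 0.1]),
    "action": rng.choice(["read", "decrypt", "delete"], n, p=[0.7, 0.25, 0.05]),
    "is_weekday": rng.choice([0, 1], n, p=[0.2, 0.8]),
    "hour": rng.normal(13, 2, n),            # accesses cluster around 1pm
})
X = pd.get_dummies(logs, columns=["user", "action"])

clf = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = clf.predict(X)                      # +1 = normal, -1 = anomalous
print(int((labels == -1).sum()), "of", n, "rows flagged")
```

Sequence anomalies (#5) would still need separate treatment, e.g. features
counting action transitions per user within a time window, since a row-wise
detector like this has no notion of ordering.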
Regards,
John.
------------------------------
Subject: Digest Footer
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
------------------------------
End of scikit-learn Digest, Vol 12, Issue 42
********************************************