-----Original Message-----
From: scikit-learn 
[mailto:scikit-learn-bounces+mark_stratford=optum....@python.org] On Behalf Of 
scikit-learn-requ...@python.org
Sent: Monday, March 20, 2017 6:06 AM
To: scikit-learn@python.org
Subject: scikit-learn Digest, Vol 12, Issue 42

Send scikit-learn mailing list submissions to
        scikit-learn@python.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
        scikit-learn-requ...@python.org

You can reach the person managing the list at
        scikit-learn-ow...@python.org

When replying, please edit your Subject line so it is more specific than "Re: 
Contents of scikit-learn digest..."


Today's Topics:

   1. recommended feature selection method to train an MLPRegressor
      (Thomas Evangelidis)
   2. Re: recommended feature selection method to train an
      MLPRegressor (Andreas Mueller)
   3. Re: recommended feature selection method to train an
      MLPRegressor (Sebastian Raschka)
   4. Anomaly/Outlier detection based on user access for a large
      application (John Doe)


----------------------------------------------------------------------

Message: 1
Date: Sun, 19 Mar 2017 20:47:36 +0100
From: Thomas Evangelidis <teva...@gmail.com>
To: Scikit-learn user and developer mailing list
        <scikit-learn@python.org>
Subject: [scikit-learn] recommended feature selection method to train
        an MLPRegressor
Message-ID:
        <CAACvdx17Ev3jr0ds2bLyJc0RqZkqJH7Rtx=s1zaodmuvckc...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Which of the following methods would you recommend to select good features
(<=50) from a set of 534 features in order to train an MLPRegressor? Please take
into account that the datasets I use for training are small.

http://scikit-learn.org/stable/modules/feature_selection.html

And please don't tell me to use a neural network that supports dropout or
any other algorithm for feature elimination. That is not applicable in my case,
because I want to know the best 50 features in order to append them to other
types of features that I am confident are important.


cheers,
Thomas


-- 

======================================================================

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology Masaryk University Kamenice 
5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

          teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/

------------------------------

Message: 2
Date: Sun, 19 Mar 2017 18:23:07 -0400
From: Andreas Mueller <t3k...@gmail.com>
To: Scikit-learn user and developer mailing list
        <scikit-learn@python.org>
Subject: Re: [scikit-learn] recommended feature selection method to
        train an MLPRegressor
Message-ID: <6b490067-962e-02fc-5157-9a487fc1a...@gmail.com>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"



On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
> Which of the following methods would you recommend to select good 
> features (<=50) from a set of 534 features in order to train an 
> MLPRegressor? Please take into account that the datasets I use for 
> training are small.
>
> http://scikit-learn.org/stable/modules/feature_selection.html
>
> And please don't tell me to use a neural network that supports 
> dropout or any other algorithm for feature elimination. That is not 
> applicable in my case, because I want to know the best 50 features in 
> order to append them to other types of features that I am confident 
> are important.
>
>
You can always use forward or backward selection as implemented in mlxtend if 
you're patient. As your dataset is small, that might work.
However, it might be tricky to get the MLP to run consistently - though 
maybe not...
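For illustration, here is a minimal sketch of such a greedy search, using
scikit-learn's own SequentialFeatureSelector (available in recent releases;
mlxtend's SFS has a very similar interface) on a toy dataset - the data sizes,
network architecture, and number of selected features are all placeholders,
not Thomas's actual setup:

```python
# Sketch: greedy forward selection wrapped around a small MLPRegressor.
# Dataset, network size, and k are invented placeholders; substitute your own.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=120, n_features=10, n_informative=3,
                       random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)

# Forward selection: start from the empty set and greedily add the feature
# that most improves the cross-validated score, stopping at k features.
sfs = SequentialFeatureSelector(mlp, n_features_to_select=3,
                                direction="forward", cv=2)
sfs.fit(X, y)
print(sorted(sfs.get_support(indices=True)))
```

Note that with 534 candidate features this loop refits the MLP thousands of
times, so the patience caveat above applies in full.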

------------------------------

Message: 3
Date: Sun, 19 Mar 2017 19:32:45 -0400
From: Sebastian Raschka <se.rasc...@gmail.com>
To: Scikit-learn user and developer mailing list
        <scikit-learn@python.org>
Subject: Re: [scikit-learn] recommended feature selection method to
        train an MLPRegressor
Message-ID: <f6b32e16-6045-4934-a27d-1407d43dc...@gmail.com>
Content-Type: text/plain; charset=utf-8

Hm, that's tricky. I think the other methods listed on 
http://scikit-learn.org/stable/modules/feature_selection.html could help 
regarding a computationally cheap solution, but the problem is that they 
probably wouldn't work that well for an MLP due to their linearity assumption. 
And an exhaustive search over all subsets would also be impractical/impossible. 
For the 50-feature subsets alone, you already have 
73353053308199416032348518540326808282134507009732998441913227684085760 
combinations :P. A greedy solution like forward or backward selection would be 
more feasible, but still very expensive in combination with an MLP. On top of 
that, you also have to consider that neural networks are generally pretty 
sensitive to hyperparameter settings. So even if you fix the architecture, you 
probably still want to check that the learning rate etc. are appropriate for 
each combination of features (by checking the cost and validation error during 
training).
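For reference, that count is just the binomial coefficient "534 choose 50",
which Python can compute exactly (math.comb requires Python 3.8+):

```python
import math

# Exact number of ways to choose 50 features out of 534.
n_subsets = math.comb(534, 50)
print(n_subsets)            # a 71-digit number, roughly 7.3e70
print(len(str(n_subsets)))  # number of digits
```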

PS: I wouldn't dismiss dropout, imho. Especially because your training set is 
small, it could even be crucial for reducing overfitting. It doesn't remove 
features from your dataset; it just keeps the network from relying on 
particular combinations of features always being present during training. Your 
final network will still process all features, and dropout will effectively 
cause your network to "use" more of the features in your ~50-feature subset 
compared to no dropout (because otherwise it may just learn to rely on a 
subset of those 50 features).

> On Mar 19, 2017, at 6:23 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> 
> 
> 
> On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
>> Which of the following methods would you recommend to select good features 
>> (<=50) from a set of 534 features in order to train an MLPRegressor? Please 
>> take into account that the datasets I use for training are small.
>> 
>> http://scikit-learn.org/stable/modules/feature_selection.html
>> 
>> And please don't tell me to use a neural network that supports dropout 
>> or any other algorithm for feature elimination. That is not applicable in 
>> my case, because I want to know the best 50 features in order to append 
>> them to other types of features that I am confident are important.
>> 
> You can always use forward or backward selection as implemented in mlxtend if 
> you're patient. As your dataset is small, that might work.
> However, it might be tricky to get the MLP to run consistently - though 
> maybe not...
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn



------------------------------

Message: 4
Date: Mon, 20 Mar 2017 11:35:59 +0530
From: John Doe <codera...@gmail.com>
To: scikit-learn@python.org
Subject: [scikit-learn] Anomaly/Outlier detection based on user access
        for a large application
Message-ID:
        <CAP=qekf0svrzy+rqjyutp9ax2zlki39h89cnngbj-gewyxp...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi All,
    I am trying to solve a problem of finding anomalies/outliers using 
application logs of a large KMS. Please find the details below:

*Problem Statement*: Find anomalies/outliers using application access logs in 
an unsupervised learning setting. The basic use case is to find any suspicious 
activity by a user/group that deviates from the trend the algorithm has 
learned.

*Input Data*: Data would be created from log files that are in the following
format:

"ts, src_ip, decrypt, user_a, group_b, kms_region, key"

Where:

*ts* : time of access in epoch seconds, e.g. 1489840335
*decrypt* : one of the various possible actions
*user_a*, *group_b* : the user and group that performed the access
*kms_region* : the region in which the key exists
*key* : the key that was accessed

*Train Set*: This falls under unsupervised learning, so we can't have a 
"normal" training set for the model to learn from.

*Example of anomalies*:

   1. User A suddenly accessing from a different IP: xx.yy
   2. Number of accesses suddenly going up for a given user/key
   pair
   3. Increased access on a generally quiet long weekend
   4. Increased access on a Thursday (compared to previous Thursdays)
   5. Unusual sequences of actions for a given user, e.g. read, decrypt,
   delete in quick succession across all keys

------------------------

From our research, we have come up with the following list of algorithms that
are applied to similar problems:

   - ARIMA
   <https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average>
   : This might be good for time-series forecasting, but will it also learn to
   flag anomalies like #3, #4, and sequences of actions (#5)?
   - scikit-learn's Novelty and Outlier Detection
   <http://scikit-learn.org/stable/modules/outlier_detection.html> : Not
   sure if these will address use cases #3, #4, and #5 above.
   - Neural Networks
   - k-nearest neighbors
   - Clustering-Based Anomaly Detection Techniques: k-Means Clustering etc
   - Parametric Techniques
   
<https://www.vs.inf.ethz.ch/edu/HS2011/CPS/papers/chandola09_anomaly-detection-survey.pdf>
   (See Section 7): This might work well on continuous variables, but will it
   work on discrete features like is_weekday etc.? Also, will it cover cases
   like #4 and #5 above?

Most of the research I found was on problems that had continuous features and 
did not consider discrete variables (like "Holiday_today?") or successions of 
events.
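As a possible starting point for use cases #1 and #2 (though not the
sequence-based ones), one could engineer per-event features and score them
with scikit-learn's IsolationForest. A minimal sketch - the log rows, feature
choices, and contamination rate below are all invented for illustration:

```python
# Sketch: unsupervised outlier scoring of access events with IsolationForest.
# Each row is a hand-engineered feature vector for one log event.
import numpy as np
from sklearn.ensemble import IsolationForest

# Features per event: (hour_of_day, is_weekend, action_code, accesses_last_hour)
events = np.array([
    [10, 0, 0, 3],
    [11, 0, 0, 4],
    [9,  0, 1, 2],
    [10, 0, 0, 3],
    [11, 0, 1, 5],
    [10, 0, 0, 4],
    [3,  1, 2, 40],   # odd hour, weekend, rare action, burst of accesses
])

clf = IsolationForest(n_estimators=100, contamination=0.15, random_state=0)
labels = clf.fit_predict(events)   # -1 = flagged as outlier, 1 = inlier
print(labels)
```

A pointwise detector like this will not capture sequence anomalies (#5);
those would need an explicitly sequential model, e.g. something Markov-like
over action n-grams.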

Any feedback on algorithms/techniques that could be used for the above use 
cases would be highly appreciated. Thanks.

Regards,
John.

------------------------------

Subject: Digest Footer

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


------------------------------

End of scikit-learn Digest, Vol 12, Issue 42
********************************************



