Re: [scikit-learn] combining arrays of features to train an MLP

Sebastian Raschka Tue, 20 Dec 2016 11:05:40 -0800

Hi, Thomas,
I haven’t looked at what RandomizedLasso does exactly, but like you said, it is 
probably not ideal to combine with an MLP. In terms of regularization, I was 
thinking more of L1 and L2 penalties for the hidden layers, or dropout. 
However, given such a small sample size (and the low sample/feature ratio), 
I think there are far too many (hyper)parameters to fit in an MLP to get good 
results. I think you would be better off with a kernel SVM (if linear models 
don’t work well) or ensemble learning.
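
A rough sketch of how the two alternatives named above could be compared under cross-validation. The data are synthetic stand-ins (31 samples, 15 features), and the model settings are placeholder assumptions rather than tuned values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a small dataset: 31 samples, 15 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(31, 15))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=31)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "kernel SVM (RBF)": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    # Out-of-fold predictions: every sample is predicted by a model
    # that was fitted without it.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    scores[name] = np.corrcoef(y, y_pred)[0, 1]  # Pearson's R
    print(f"{name}: Pearson R = {scores[name]:.3f}")
```

Both models have only a handful of hyperparameters, which is the main reason they may be easier to fit than an MLP on a few dozen samples.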


Best,
Sebastian

> On Dec 19, 2016, at 6:51 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
> 
> Thank you, these articles discuss ML applications of the types of 
> fingerprints I am working with! I will read them thoroughly to get some hints.
> 
> In the meantime I tried to eliminate some features using RandomizedLasso, and 
> the performance improved from R=0.067 using all 615 features to R=0.524 
> using only the 15 top-ranked features. Naive question: does it make sense to 
> use RandomizedLasso to select the good features in order to train an MLP? 
> I had the impression that RandomizedLasso uses multivariate linear 
> regression to fit the observed values to the experimental ones and to rank 
> the features.
> 
> Another question: this dataset consists of 31 observations. The Pearson's R 
> values that I reported above were calculated using cross-validation. Could 
> someone claim that they are inaccurate because the number of features used 
> for training the MLP is much larger than the number of observations?
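
One concrete risk behind this question, stated as a general caution rather than a diagnosis of this dataset: if the 15 features were chosen using all 31 observations before cross-validating, the reported R is likely to be optimistic. A minimal sketch on pure-noise data follows; `SelectKBest` stands in for RandomizedLasso and `Ridge` for the MLP, both purely as assumptions to keep the example short:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Pure noise with the thread's dimensions: 31 observations, 615 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(31, 615))
y = rng.normal(size=31)  # unrelated to X by construction

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky protocol: features are picked using ALL samples, then cross-validated.
X_top = SelectKBest(f_regression, k=15).fit_transform(X, y)
pred_leaky = cross_val_predict(Ridge(alpha=1.0), X_top, y, cv=cv)
r_leaky = np.corrcoef(y, pred_leaky)[0, 1]

# Honest protocol: the selection step is refitted inside every training fold.
pipe = make_pipeline(SelectKBest(f_regression, k=15), Ridge(alpha=1.0))
pred_honest = cross_val_predict(pipe, X, y, cv=cv)
r_honest = np.corrcoef(y, pred_honest)[0, 1]

print(f"selection outside CV: R = {r_leaky:.3f}")
print(f"selection inside CV:  R = {r_honest:.3f}")
```

Because `y` is noise here, any clearly positive R from the first protocol is an artifact of selecting on the full dataset; only the second protocol estimates out-of-sample performance.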
>  
> 
> On 19 December 2016 at 23:42, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Oh, sorry, I just noticed that I was in the wrong thread; I meant to answer a 
> different Thomas :P.
> 
> Regarding the fingerprints: scikit-learn’s estimators expect each sample to 
> be a 1D feature vector, so you can’t pass a 3D array. Think of image 
> classification, for example: there you also unroll each n_pixels x m_pixels 
> array into a 1D array.
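
The unrolling described above can be sketched with NumPy as follows; the image sizes are made-up values for illustration:

```python
import numpy as np

# Hypothetical batch of 10 grayscale images, each 8 x 6 pixels.
n_samples, n_pixels, m_pixels = 10, 8, 6
images = np.arange(n_samples * n_pixels * m_pixels, dtype=np.float64).reshape(
    n_samples, n_pixels, m_pixels
)

# Unroll each image into one row: (10, 8, 6) -> (10, 48).
X = images.reshape(n_samples, -1)
print(X.shape)
```

The resulting `X` has one row per sample and one column per pixel, which is the 2D layout that `fit` expects.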
> 
> The low performance can have multiple causes. In case dimensionality is an 
> issue, I’d maybe try stronger regularization first, or feature selection.
> If you are working with molecular structures, and you have enough of them, 
> maybe also consider alternative feature representations, e.g., learning from 
> the graphs directly:
> 
> http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
> http://pubs.acs.org/doi/abs/10.1021/ci400187y
> 
> Best,
> Sebastian
> 
> 
> > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > Does this mean that both are feasible?
> >
> > On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com> 
> > wrote:
> > Thanks, Thomas, that makes sense! Will submit a PR then to update the 
> > docstring.
> >
> > Best,
> > Sebastian
> >
> >
> > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com> 
> > > wrote:
> > >
> > > 
> > > Greetings,
> > >
> > > My dataset consists of objects that are characterised by their 
> > > structural features, which are encoded in a so-called "fingerprint" 
> > > form. There are several different types of fingerprints, each one 
> > > encapsulating a different type of information. I want to combine two 
> > > specific types of fingerprints to train an MLP regressor. The first 
> > > fingerprint consists of a 2048-bit array of the form:
> > >
> > >  FP1 = array([ 1.,  1.,  0., ...,  0.,  0.,  1.], dtype=float32)
> > >
> > > The second is an array of 60 floats of the form:
> > >
> > > FP2 = array([ 2.77494618,  0.98973243,  0.34638652,  2.88303715,  1.31473857,
> > >        -0.56627112,  4.78847547,  2.29587913, -0.6786228 ,  4.63391109,
> > >        ...
> > >         0.        ,  0.        ,  5.89652792,  0.        ,  0.        ])
> > >
> > > At first I tried to fuse them into a single 1D array of 2048+60 columns, 
> > > but the predictions of the MLP were worse than those of the 2 separate MLP 
> > > models, each trained on one of the 2 fingerprint types. My question: is 
> > > there a more effective way to combine the 2 fingerprints in order to 
> > > indicate that they represent different types of information?
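
One thing that may be worth ruling out, as an assumption rather than something diagnosed in this thread: the bits lie in {0, 1} while the floats span a wider range, and MLPs tend to be sensitive to feature scale. The sketch below uses synthetic stand-ins for both fingerprints, concatenates them, and standardizes only the continuous block. The hidden layer size, `alpha`, and `max_iter` are arbitrary placeholders:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: 40 objects, a 2048-bit fingerprint and 60 float descriptors.
rng = np.random.RandomState(0)
fp1 = rng.randint(0, 2, size=(40, 2048)).astype(np.float32)
fp2 = rng.normal(loc=2.0, scale=3.0, size=(40, 60))
y = rng.normal(size=40)

# One row per object: 2048 + 60 = 2108 columns.
X = np.hstack([fp1, fp2])

# Leave the bits untouched; standardize only the 60 continuous columns.
scale_floats = ColumnTransformer(
    [
        ("bits", "passthrough", slice(0, 2048)),
        ("floats", StandardScaler(), slice(2048, 2108)),
    ]
)
model = make_pipeline(
    scale_floats,
    # alpha is the L2 penalty; the value here is an arbitrary placeholder.
    MLPRegressor(hidden_layer_sizes=(32,), alpha=1.0, max_iter=300, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:3]).shape)
```

Keeping the scaler inside the pipeline means it is refitted on each training fold when the whole model is cross-validated.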
> > >
> > > To this end, I tried to create a 2-row array (1st row 2048 elements and 
> > > 2nd row 60 elements) but sklearn complained:
> > >
> > >     mlp.fit(x_train,y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> > >     return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> > >     X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> > >     multi_output=True, y_numeric=True)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> > >     ensure_min_features, warn_on_dtype, estimator)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array
> > >     array = array.astype(np.float64)
> > > ValueError: setting an array element with a sequence.
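
The traceback above ends in `array.astype(np.float64)`. A small sketch of why that call fails on rows of unequal length, using a hand-built object array as a stand-in for one sample of `x_train` (the exact message may differ between NumPy versions):

```python
import numpy as np

# One "sample" holding two rows of different lengths (2048 and 60 values).
# NumPy can only store this as an array of Python objects.
sample = np.empty(2, dtype=object)
sample[0] = np.zeros(2048)
sample[1] = np.zeros(60)

try:
    # The same conversion that check_array attempts in the traceback.
    sample.astype(np.float64)
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print(type(exc).__name__ + ":", exc)
```

Each element of the object array is itself a sequence, so it cannot be converted to a single float.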
> > > 
> > >
> > > Then I tried to create, for each object of the dataset, a 2D array of 
> > > size 2x2048 by adding 1988 zeros to the second row so that both rows are 
> > > of equal size. However, sklearn complained again:
> > >
> > >
> > >     mlp.fit(x_train,y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> > >     return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> > >     X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> > >     multi_output=True, y_numeric=True)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> > >     ensure_min_features, warn_on_dtype, estimator)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array
> > >     % (array.ndim, estimator_name))
> > > ValueError: Found array with dim 3. Estimator expected <= 2.
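
This second error can be reproduced directly with `sklearn.utils.check_array`, the validation helper named in the traceback. A minimal sketch with zero-filled stand-in data for 5 hypothetical objects:

```python
import numpy as np
from sklearn.utils import check_array

# 5 hypothetical objects, each a 2 x 2048 block (second row zero-padded).
X_3d = np.zeros((5, 2, 2048))

try:
    check_array(X_3d)
    rejected = False
except ValueError as exc:
    rejected = True
    print("ValueError:", exc)

# Flattening each object to one row gives the 2-D layout estimators accept.
X_2d = check_array(X_3d.reshape(len(X_3d), -1))
print(X_2d.shape)
```

Note that the flattened version is just the concatenated fingerprints plus constant zero columns, since an MLP sees its input as a flat feature vector and the row layout carries no extra information.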
> > >
> > >
> > > In another case of fingerprints, let's call them FP3 and FP4, I observed 
> > > that the MLP regressor built on FP3 yields better results when 
> > > trained and evaluated using logarithmically transformed experimental 
> > > values (the values in the y_train and y_test 1D arrays), while the MLP 
> > > regressor built on FP4 yields better results using the original 
> > > experimental values. So my second question is: when combining both FP3 
> > > and FP4 into a single array, is there any way to tell the MLP that 
> > > the features that correspond to FP3 must reproduce the logarithmic 
> > > transform of the experimental values, while the features of FP4 must 
> > > reproduce the original untransformed experimental values?
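
A single regressor fits all of its input features to one target, so one possible workaround (not proposed in this thread, and sketched here under assumptions) is to train one model per fingerprint type and combine the predictions on the original scale. The data are synthetic, the MLP settings are placeholders, and plain averaging is an arbitrary choice of combination rule:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for two fingerprint types and strictly positive targets.
rng = np.random.RandomState(0)
fp3 = rng.normal(size=(40, 20))
fp4 = rng.normal(size=(40, 10))
y = np.exp(0.3 * fp3[:, 0] + 0.1 * rng.normal(size=40)) + np.abs(fp4[:, 0])


def make_mlp():
    # Placeholder settings, not tuned.
    return MLPRegressor(hidden_layer_sizes=(16,), alpha=1.0, max_iter=500, random_state=0)


# FP3 model: trained on log(y); predict() maps the output back with exp.
model_fp3 = TransformedTargetRegressor(
    regressor=make_mlp(), func=np.log, inverse_func=np.exp
)
model_fp3.fit(fp3, y)

# FP4 model: trained on the untransformed values.
model_fp4 = make_mlp().fit(fp4, y)

# Both predictions are on the original scale, so they can be averaged.
pred_fp3 = model_fp3.predict(fp3)
pred_fp4 = model_fp4.predict(fp4)
pred_combined = (pred_fp3 + pred_fp4) / 2.0
print(pred_combined[:3])
```

The log transform requires strictly positive target values; a stacking regressor fitted on the two sets of out-of-fold predictions would be a natural alternative to the simple average.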
> > >
> > >
> > > I would greatly appreciate any advice on either of my 2 queries.
> > > Thomas
> > >
> > >
> > > --
> > > ======================================================================
> > > Thomas Evangelidis
> > > Research Specialist
> > > CEITEC - Central European Institute of Technology
> > > Masaryk University
> > > Kamenice 5/A35/1S081,
> > > 62500 Brno, Czech Republic
> > >
> > > email: tev...@pharm.uoa.gr
> > >               teva...@gmail.com
> > >
> > > website: https://sites.google.com/site/thomasevangelidishomepage/
> > >
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
