Re: [scikit-learn] combining arrays of features to train an MLP

Sebastian Raschka Mon, 19 Dec 2016 14:45:12 -0800

Oh, sorry, I just noticed that I was in the wrong thread; I meant to answer a 
different Thomas :P.


Regarding the fingerprints: scikit-learn’s estimators expect each sample to be 
a 1D feature vector, so X must be a 2D array of shape (n_samples, n_features) 
and you can’t pass a 3D array. Think of image classification, for example: 
there you also unroll each n_pixels x m_pixels array into a 1D array.
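
To make the shape requirement concrete, here is a minimal sketch. The random 
arrays are only stand-ins for the FP1 and FP2 fingerprints from the original 
post, and n_samples is a made-up number:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 5  # hypothetical number of objects in the dataset

# Stand-ins for the two fingerprint types (random data, for illustration only)
FP1 = rng.randint(0, 2, size=(n_samples, 2048)).astype(np.float32)  # bit vectors
FP2 = rng.normal(size=(n_samples, 60))                              # float values

# One row per sample, both fingerprints side by side
X = np.hstack([FP1, FP2])
print(X.shape)  # (5, 2108)

# A per-sample 2x2048 layout gives a 3D array; unroll each sample to 1D first
X_3d = np.zeros((n_samples, 2, 2048))
X_flat = X_3d.reshape(n_samples, -1)
print(X_flat.shape)  # (5, 4096)
```

Either X or X_flat can be passed to fit(), since both are 2D.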

The low performance can have multiple causes. In case dimensionality is an 
issue, I’d maybe try stronger regularization first, or feature selection.
If you are working with molecular structures, and you have enough of them, 
maybe also consider alternative feature representations, e.g., learning from 
the graphs directly:

http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
http://pubs.acs.org/doi/abs/10.1021/ci400187y
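
As for the regularization and feature selection ideas mentioned above, a rough 
sketch could look like the following. The data is random, and the alpha value, 
the hidden layer size, and the choice of VarianceThreshold as the selector are 
assumptions for illustration, not tuned recommendations:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Random stand-in for the concatenated 2048 + 60 feature matrix
X = np.hstack([
    rng.randint(0, 2, size=(40, 2048)).astype(np.float64),
    rng.normal(size=(40, 60)),
])
X[:, :100] = 0.0          # pretend the first 100 bits are never set
y = rng.normal(size=40)   # made-up regression targets

model = make_pipeline(
    VarianceThreshold(threshold=0.0),  # drop constant (uninformative) columns
    MLPRegressor(
        hidden_layer_sizes=(50,),
        alpha=1.0,        # L2 penalty, much stronger than the 1e-4 default
        max_iter=50,      # kept small so the sketch runs quickly
        random_state=0,
    ),
)
model.fit(X, y)

# Number of columns that survived the variance filter
n_kept = int(model.named_steps["variancethreshold"].get_support().sum())
print(n_kept)
```

In practice alpha would be chosen by cross-validation, e.g. with GridSearchCV.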

Best,
Sebastian


> On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
> 
> this means that both are feasible?
> 
> On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Thanks, Thomas, that makes sense! Will submit a PR then to update the 
> docstring.
> 
> Best,
> Sebastian
> 
> 
> > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > 
> > Greetings,
> >
> > My dataset consists of objects which are characterised by their structural 
> > features, which are encoded into a so-called "fingerprint" form. There are 
> > several different types of fingerprints, each one encapsulating a different 
> > type of information. I want to combine two specific types of fingerprints 
> > to train an MLP regressor. The first fingerprint consists of a 2048-bit 
> > array of the form:
> >
> >  FP1 = array([ 1.,  1.,  0., ...,  0.,  0.,  1.], dtype=float32)
> >
> > The second is an array of 60 floating-point numbers of the form:
> >
> > FP2 = array([ 2.77494618,  0.98973243,  0.34638652,  2.88303715,  1.31473857,
> >        -0.56627112,  4.78847547,  2.29587913, -0.6786228 ,  4.63391109,
> >        ...
> >         0.        ,  0.        ,  5.89652792,  0.        ,  0.        ])
> >
> > At first I tried to fuse them into a single 1D array of 2048+60 columns, but 
> > the predictions of that MLP were worse than those of the two separate MLP 
> > models, each trained on one of the two fingerprint types individually. My 
> > question: is there a more effective way to combine the two fingerprints in 
> > order to indicate that they represent different types of information?
> >
> > To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd 
> > row 60 elements) but sklearn complained:
> >
> >     mlp.fit(x_train,y_train)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> >     return self._fit(X, y, incremental=False)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> >     X, y = self._validate_input(X, y, incremental)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> >     multi_output=True, y_numeric=True)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> >     ensure_min_features, warn_on_dtype, estimator)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array
> >     array = array.astype(np.float64)
> > ValueError: setting an array element with a sequence.
> > 
> >
> > Then I tried to create, for each object of the dataset, a 2D array of size 
> > 2x2048 by padding the second row with 1988 zeros so that both rows are of 
> > equal size. However, sklearn complained again:
> >
> >
> >     mlp.fit(x_train,y_train)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> >     return self._fit(X, y, incremental=False)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> >     X, y = self._validate_input(X, y, incremental)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> >     multi_output=True, y_numeric=True)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> >     ensure_min_features, warn_on_dtype, estimator)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array
> >     % (array.ndim, estimator_name))
> > ValueError: Found array with dim 3. Estimator expected <= 2.
> >
> >
> > In another case of fingerprints, let's call them FP3 and FP4, I observed 
> > that the MLP regressor created using FP3 yields better results when trained 
> > and evaluated using logarithmically transformed experimental values (the 
> > values in the y_train and y_test 1D arrays), while the MLP regressor created 
> > using FP4 yielded better results using the original experimental values. So 
> > my second question is: when combining both FP3 and FP4 into a single array, 
> > is there any way to tell the MLP that the features that correspond to FP3 
> > must reproduce the logarithmic transform of the experimental values while 
> > the features of FP4 must reproduce the original untransformed values?
> >
> >
> > I would greatly appreciate any advice on either of my two queries.
> > Thomas
> >
> > --
> > ======================================================================
> > Thomas Evangelidis
> > Research Specialist
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/1S081,
> > 62500 Brno, Czech Republic
> >
> > email: tev...@pharm.uoa.gr
> >               teva...@gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
