Hi, Thomas,

I haven’t looked at what RandomizedLasso does exactly, but like you said, it is probably not ideal to combine it with an MLP. In terms of regularization, I was thinking more of L1 and L2 penalties for the hidden layers, or dropout. However, given such a small sample size (and the small sample/feature ratio), I think there are way too many (hyper)parameters to fit in an MLP to get good results. I think you would be better off with a kernel SVM (if linear models don’t work well) or ensemble learning.
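To make that last suggestion a bit more concrete, here is a minimal sketch of how one might compare a kernel SVM and an ensemble model under cross-validation. The data are synthetic stand-ins (31 samples and 615 features, the sizes mentioned in the quoted message below), and the hyperparameters (C=10.0, 200 trees, 5 folds) are arbitrary assumptions rather than tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real data (an assumption for illustration):
# 31 samples, 615 features, only the first 5 features carry signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(31, 615))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=31)

models = {
    # Scaling lives inside the pipeline so it is re-fit on each training fold.
    "rbf_svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # Out-of-fold predictions: each sample is predicted by a model
    # that did not see it during fitting.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    scores[name] = np.corrcoef(y, y_pred)[0, 1]
    print(f"{name}: cross-validated Pearson R = {scores[name]:.3f}")
```

With so few samples the estimated R will vary a lot between different fold splits, so it is probably worth repeating the comparison with several random seeds before drawing conclusions.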
Best,
Sebastian

> On Dec 19, 2016, at 6:51 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
>
> Thank you, these articles discuss ML applications of the types of
> fingerprints I am working with! I will read them thoroughly to get some hints.
>
> In the meantime I tried to eliminate some features using RandomizedLasso, and
> the performance rose from R=0.067 using all 615 features to R=0.524
> using only the 15 top-ranked features. Naive question: does it make sense to
> use RandomizedLasso to select the good features in order to train an MLP?
> I had the impression that RandomizedLasso uses multivariate linear
> regression to fit the observed values to the experimental ones and rank the
> features.
>
> Another question: this dataset consists of 31 observations. The Pearson's R
> values that I reported above were calculated using cross-validation. Could
> someone claim that they are inaccurate because the number of features used
> for training the MLP is much larger than the number of observations?
>
>
> On 19 December 2016 at 23:42, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Oh, sorry, I just noticed that I was in the wrong thread; I meant to answer a
> different Thomas :P.
>
> Regarding the fingerprints: scikit-learn’s estimators expect feature vectors
> as samples, so you can’t have a 3D array … e.g., think of image
> classification: here you also unroll the n_pixels times m_pixels array into
> 1D arrays.
>
> The low performance can have multiple causes. In case dimensionality is an
> issue, I’d maybe try stronger regularization first, or feature selection.
> If you are working with molecular structures, and you have enough of them,
> maybe also consider alternative feature representations, e.g.,
> learning from the graphs directly:
>
> http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
> http://pubs.acs.org/doi/abs/10.1021/ci400187y
>
> Best,
> Sebastian
>
>
> > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > this means that both are feasible?
> >
> > On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com>
> > wrote:
> > Thanks, Thomas, that makes sense! Will submit a PR then to update the
> > docstring.
> >
> > Best,
> > Sebastian
> >
> >
> > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com>
> > > wrote:
> > >
> > >
> > > Greetings,
> > >
> > > My dataset consists of objects which are characterised by their
> > > structural features, which are encoded into a so-called "fingerprint"
> > > form. There are several different types of fingerprints, each one
> > > encapsulating a different type of information. I want to combine two
> > > specific types of fingerprints to train an MLP regressor. The first
> > > fingerprint consists of a 2048-bit array of the form:
> > >
> > > FP1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32)
> > >
> > > The second is an array of 60 floats of the form:
> > >
> > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, 1.31473857,
> > >        -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109,
> > >        ...
> > >        0. , 0. , 5.89652792, 0. , 0. ])
> > >
> > > At first I tried to fuse them into a single 1D array of 2048+60 columns,
> > > but the predictions of the MLP were worse than those of the two separate
> > > MLP models, each trained on one of the two fingerprint types individually.
> > > My question: is there a more effective way to combine the two fingerprints
> > > in order to indicate that they represent different types of information?
> > >
> > > To this end, I tried to create a 2-row array (1st row 2048 elements and
> > > 2nd row 60 elements) but sklearn complained:
> > >
> > >     mlp.fit(x_train,y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> > >     return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> > >     X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> > >     multi_output=True, y_numeric=True)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> > >     ensure_min_features, warn_on_dtype, estimator)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array
> > >     array = array.astype(np.float64)
> > > ValueError: setting an array element with a sequence.
> > >
> > >
> > > Then I tried to create for each object of the dataset a 2D array of
> > > size 2x2048, by adding 1988 zeros to the second row so that both rows
> > > are of equal size.
> > > However, sklearn complained again:
> > >
> > >
> > >     mlp.fit(x_train,y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> > >     return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> > >     X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> > >     multi_output=True, y_numeric=True)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> > >     ensure_min_features, warn_on_dtype, estimator)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array
> > >     % (array.ndim, estimator_name))
> > > ValueError: Found array with dim 3. Estimator expected <= 2.
> > >
> > >
> > > In another case of fingerprints, let's name them FP3 and FP4, I observed
> > > that the MLP regressor created using FP3 yields better results when
> > > trained and evaluated using logarithmically transformed experimental
> > > values (the values in the y_train and y_test 1D arrays), while the MLP
> > > regressor created using FP4 yields better results using the original
> > > experimental values. So my second question is: when combining both FP3
> > > and FP4 into a single array, is there any way to designate to the MLP that
> > > the features that correspond to FP3 must reproduce the logarithmic
> > > transform of the experimental values, while the features of FP4 must
> > > reproduce the original untransformed experimental values?
> > >
> > >
> > > I would greatly appreciate any advice on either of my two queries.
> > > Thomas
> > >
> > > --
> > > ======================================================================
> > > Thomas Evangelidis
> > > Research Specialist
> > > CEITEC - Central European Institute of Technology
> > > Masaryk University
> > > Kamenice 5/A35/1S081,
> > > 62500 Brno, Czech Republic
> > >
> > > email: tev...@pharm.uoa.gr
> > >        teva...@gmail.com
> > >
> > > website: https://sites.google.com/site/thomasevangelidishomepage/
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn