[scikit-learn] data augmentation following the underlying feature values distributions and correlations

Thomas Evangelidis Mon, 18 Dec 2017 06:21:23 -0800

Greetings,

I want to augment my training set while preserving the correlations
between feature values. More specifically, my features are NMR resonances
of the nuclei of a single amino acid. For example, for glutamic acid each
observation has the following feature values:


[CA, HA, CB, HB, CG, HG]

where CA is the resonance of the alpha carbon, HA the resonance of the
alpha proton, and so forth. The complication is that these feature values
are not independent: HA is covalently bonded to CA, CB to CA, and so on.
Therefore, if I sample a random CA value from the distribution of
experimental CA values, I cannot pair it with just any HA value from the
respective experimental distribution, because CA and HA are correlated.
The same applies to CA and CB, CB and HB, CB and CG, and CG and HG. Is
there an algorithm that can generate [CA, HA, CB, HB, CG, HG] feature
vectors that comply with the per-atom distributions and their correlations?
I saw that Gaussian Mixture Models have a method to generate random samples
from the fitted model (sklearn.mixture.GaussianMixture.sample), but it is
not clear to me whether these samples retain the correlations between the
features (nuclei in this case). If there is no such algorithm in
scikit-learn, could you please point me to another Python library that
does this?
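
As a minimal sketch of the GaussianMixture route mentioned above: with
covariance_type="full", each mixture component carries a full covariance
matrix, so samples drawn from the fitted model should reproduce the
correlations the model has learned. The data below are synthetic and only
stand in for real [CA, HA, CB, HB, CG, HG] observations; the covariance,
the sample sizes and n_components=2 are assumptions chosen for
illustration, not values tuned for NMR resonances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Synthetic stand-in for real observations: six correlated features.
# (Assumption: real resonance data would be loaded here instead.)
n_features = 6
A = rng.normal(size=(n_features, n_features))
cov = A @ A.T + np.eye(n_features)  # symmetric positive-definite covariance
X = rng.multivariate_normal(np.zeros(n_features), cov, size=2000)

# "full" lets every component model correlations between features;
# "diag" or "spherical" would treat features as uncorrelated within a component.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

# Draw new feature vectors; sample() returns (samples, component_labels).
X_new, _ = gmm.sample(5000)

# Compare the feature-feature correlation matrices of real and generated data.
corr_real = np.corrcoef(X, rowvar=False)
corr_new = np.corrcoef(X_new, rowvar=False)
max_diff = np.abs(corr_real - corr_new).max()
print("largest difference between correlation matrices:", max_diff)
```

How well this carries over to real data depends on how well a Gaussian
mixture fits the joint distribution; the number of components could be
chosen, for example, by comparing gmm.bic(X) across candidate values.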

Thanks in advance.
Thomas


-- 

======================================================================

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

          teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
