2013/3/22 Albert Kottke <albert.kot...@gmail.com>:
> Here is the data that I would be working with:
>
> No  Thickness    Depth       Vp       Vs
>           (m)      (m)    (m/s)    (m/s)
> 1,      2.00,    2.00,   480.00,  180.00
> 2,      8.00,   10.00,  2320.00,  700.00
> 3,      8.00,   18.00,  2980.00, 1150.00
> 4,     52.00,   70.00,  2980.00, 1720.00
> 5,     -----,   -----,  3120.00, 1870.00
This is not 2-d, this is 4-d, unless there's a relation between some of the variables that I'm missing.

You should fetch all columns except the first (which I assume is just a sequence number?) into an array of dtype=np.float64; let's call that X. Currently, none of our models can handle missing data (NaN), so you should do some imputation to get rid of them. As a baseline approach, you can replace the missing values with their column means.

Quick and dirty demo with IPython and NumPy masked arrays to handle NaN:

In [1]: X = np.array([[2.00, 2.00, 480.00, 180.00],
   ...:               [8.00, 10.00, 2320.00, 700.00],
   ...:               [8.00, 18.00, 2980.00, 1150.00],
   ...:               [52.00, 70.00, 2980.00, 1720.00],
   ...:               [np.nan, np.nan, 3120.00, 1870.00]])

In [2]: means = np.asarray(np.ma.array(X, mask=np.isnan(X)).mean(axis=0))

In [3]: means
Out[3]: array([  17.5,   25. , 2376. , 1124. ])

In [4]: where_nan = np.where(np.isnan(X))

In [5]: X[where_nan] = means[where_nan[1]]

Now you have a dataset X that you can run any kind of clustering on, e.g. sklearn.cluster.KMeans.

(I'm not too sure the trick in command [5] always works, so please test it; I don't regularly handle missing values. Maybe Pandas has a solution for this. We should offer imputation methods in the library, but again, I'm no expert...)

You might want to scale the features so the ones with large ranges don't have a disproportionately large effect:

In [6]: from sklearn.preprocessing import scale

In [7]: Xt = scale(X)

In [8]: Xt
Out[8]:
array([[-0.8635131 , -0.9671039 , -1.91894971, -1.49885838],
       [-0.52924996, -0.63071994, -0.05667784, -0.67321605],
       [-0.52924996, -0.29433597,  0.61131098,  0.04128212],
       [ 1.92201303,  1.89215981,  0.61131098,  0.94631313],
       [ 0.        ,  0.        ,  0.75300558,  1.18447919]])

... or you could just take the logarithm with Xt = np.log(X), which is a common trick for working with non-negative features with large ranges.
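For what it's worth, here is the whole pipeline as a plain script: mean imputation, scaling, then KMeans. The choice of n_clusters=2 is just a placeholder (nothing in the data above tells us the right k), and I'm using np.nanmean as a shortcut for the masked-array mean:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Layer data from the thread: thickness, depth, Vp, Vs (last row has NaNs)
X = np.array([[2.00, 2.00, 480.00, 180.00],
              [8.00, 10.00, 2320.00, 700.00],
              [8.00, 18.00, 2980.00, 1150.00],
              [52.00, 70.00, 2980.00, 1720.00],
              [np.nan, np.nan, 3120.00, 1870.00]])

# Column-wise mean imputation (np.nanmean ignores the NaNs)
means = np.nanmean(X, axis=0)
where_nan = np.where(np.isnan(X))
X[where_nan] = means[where_nan[1]]

# Standardize features, then cluster; k=2 is an assumption, not a recommendation
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scale(X))
print(labels)  # one cluster label per layer
```

You'd want to inspect the inertia (or a silhouette score) over a few values of k before trusting any particular clustering.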
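On the Pandas question: fillna with the column means does do this imputation in one step. A sketch (the column names here are my own, not from the data file):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[2.00, 2.00, 480.00, 180.00],
     [8.00, 10.00, 2320.00, 700.00],
     [8.00, 18.00, 2980.00, 1150.00],
     [52.00, 70.00, 2980.00, 1720.00],
     [np.nan, np.nan, 3120.00, 1870.00]],
    columns=["thickness", "depth", "vp", "vs"])

# df.mean() skips NaN by default, so this fills each hole with its column mean
imputed = df.fillna(df.mean())
print(imputed)
```

The filled values match the masked-array computation above (17.5 for thickness, 25.0 for depth).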
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam