2013/3/22 Albert Kottke <albert.kot...@gmail.com>:
> Here is the data that I would be working with:
>
> No Thickness   Depth    Vp       Vs
>     (m)       (m)    (m/s)    (m/s)
> 1,    2.00,    2.00,  480.00,  180.00
> 2,    8.00,   10.00, 2320.00,  700.00
> 3,    8.00,   18.00, 2980.00, 1150.00
> 4,   52.00,   70.00, 2980.00, 1720.00
> 5,   -----,   -----, 3120.00, 1870.00

This is not 2-d, this is 4-d, unless there's a relation between some
of the variables that I'm missing. You should fetch all columns except
the first (which I assume is just a sequence number?) into an array of
dtype=np.float64; let's call that X. Currently, none of our models can
handle missing data (NaN), so you should impute them first. As a
baseline, you can replace each missing value with its column mean.

Quick and dirty demo with IPython and NumPy masked arrays to handle NaN:


In [1]: X = np.array([[  2.00,    2.00,  480.00,  180.00],
                      [  8.00,   10.00, 2320.00,  700.00],
                      [  8.00,   18.00, 2980.00, 1150.00],
                      [ 52.00,   70.00, 2980.00, 1720.00],
                      [np.nan,  np.nan, 3120.00, 1870.00]])

In [2]: means = np.asarray(np.ma.array(X, mask=np.isnan(X)).mean(axis=0))

In [3]: means
Out[3]: array([   17.5,    25. ,  2376. ,  1124. ])

In [4]: where_nan = np.where(np.isnan(X))

In [5]: X[where_nan] = means[where_nan[1]]


Now you have a dataset X that you can do any kind of clustering on,
e.g. sklearn.cluster.KMeans.
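Putting the imputation and the clustering together, a minimal sketch (the choice of two clusters is arbitrary, just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[  2.00,    2.00,  480.00,  180.00],
              [  8.00,   10.00, 2320.00,  700.00],
              [  8.00,   18.00, 2980.00, 1150.00],
              [ 52.00,   70.00, 2980.00, 1720.00],
              [np.nan,  np.nan, 3120.00, 1870.00]])

# Mean-impute the NaNs, column by column, as above.
means = np.asarray(np.ma.array(X, mask=np.isnan(X)).mean(axis=0))
where_nan = np.where(np.isnan(X))
X[where_nan] = means[where_nan[1]]

# Cluster the five layers; n_clusters=2 is an arbitrary choice here.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The cluster labels come out in km.labels_, one per row of X.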

(I'm not too sure the indexing trick in command [5] always works, so
please test it; I don't regularly handle missing values. Maybe Pandas
has a solution for this. We should offer imputation methods in the
library, but again, I'm no expert...)
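For the record, pandas does make column-mean imputation a one-liner via fillna. A sketch, assuming the four data columns are loaded into a DataFrame (the column names here are my own invention):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"thickness": [  2.0,    8.0,    8.0,   52.0, np.nan],
                   "depth":     [  2.0,   10.0,   18.0,   70.0, np.nan],
                   "vp":        [480.0, 2320.0, 2980.0, 2980.0, 3120.0],
                   "vs":        [180.0,  700.0, 1150.0, 1720.0, 1870.0]})

# Replace each NaN with its column's mean (df.mean() skips NaNs).
df = df.fillna(df.mean())
print(df.values)
```

df.values then gives back the imputed float array to feed into scikit-learn.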

You might want to scale the features so the ones with large ranges
don't have a disproportionately large effect:


In [6]: from sklearn.preprocessing import scale

In [7]: Xt = scale(X)

In [8]: Xt
Out[8]:
array([[-0.8635131 , -0.9671039 , -1.91894971, -1.49885838],
       [-0.52924996, -0.63071994, -0.05667784, -0.67321605],
       [-0.52924996, -0.29433597,  0.61131098,  0.04128212],
       [ 1.92201303,  1.89215981,  0.61131098,  0.94631313],
       [ 0.        ,  0.        ,  0.75300558,  1.18447919]])


... or you could just take the logarithm with Xt = np.log(X), which is
a common trick for working with strictly positive features with large
ranges.
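A quick sketch of the log route, starting from the already-imputed array (all entries are positive after mean imputation, so the log is well defined):

```python
import numpy as np

# The array from above, with the NaNs already replaced by column means.
X = np.array([[  2.0,   2.0,  480.0,  180.0],
              [  8.0,  10.0, 2320.0,  700.0],
              [  8.0,  18.0, 2980.0, 1150.0],
              [ 52.0,  70.0, 2980.0, 1720.0],
              [ 17.5,  25.0, 3120.0, 1870.0]])

# Elementwise log compresses the wide Vp/Vs ranges.
Xt = np.log(X)
print(Xt)
```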


-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
