The NaNs at the base of the profile imply that the velocity in the bottom
layer continues for some unspecified thickness, which I can handle using a
couple of different approaches. I am not too concerned about that.
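For example, one option might be just to cap the bottom layer at an
assumed thickness before building the matrix (the 30 m value below is
only a placeholder):

import numpy as np

# Layer thicknesses from the profile; the bottom layer is undefined (NaN).
thickness = np.array([2.0, 8.0, 8.0, 52.0, np.nan])

# Replace the undefined bottom-layer thickness with an assumed cap.
thickness = np.where(np.isnan(thickness), 30.0, thickness)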

My biggest question is how to form the data into the X matrix of shape
(n_samples, n_features). The approach you describe would cluster on
thickness and velocity without considering the relationship between
adjacent layers. Initially, I want to cluster on the change in Vs with
depth, and for that it is important that the layer sequence is taken into
account. Eventually I might want to bring in other aspects, but I think
Vs and depth alone will give me a good sense of what is possible.
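To make this concrete, here is roughly what I have in mind; the dVs/dz
feature, the assumed 90 m cap on the bottom layer, and the choice of two
clusters are just placeholders for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Depth and Vs per layer from the profile below; 90 m is an assumed cap
# for the bottom layer, not part of the data.
depth = np.array([2.0, 10.0, 18.0, 70.0, 90.0])
vs = np.array([180.0, 700.0, 1150.0, 1720.0, 1870.0])

# Change in Vs with depth relative to the layer above, so each row
# carries information about its neighbour; the top layer gets 0.
dvs_dz = np.zeros_like(vs)
dvs_dz[1:] = np.diff(vs) / np.diff(depth)

# One row per layer: (depth, dVs/dz), scaled so neither feature dominates.
X = np.column_stack([depth, dvs_dz])
labels = KMeans(n_clusters=2).fit(scale(X)).labels_

The idea is that each layer's feature depends on the layer above it, so
some of the adjacent-layer structure enters the clustering even though
each row is treated independently.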

I am having trouble asking the right questions, because I don't really
understand the material.

Albert



On Fri, Mar 22, 2013 at 9:28 AM, Lars Buitinck <l.j.buiti...@uva.nl> wrote:

> 2013/3/22 Albert Kottke <albert.kot...@gmail.com>:
> > Here is the data that I would be working with:
> >
> > No  Thickness    Depth       Vp       Vs
> >          (m)      (m)     (m/s)    (m/s)
> >  1,     2.00,    2.00,   480.00,  180.00
> >  2,     8.00,   10.00,  2320.00,  700.00
> >  3,     8.00,   18.00,  2980.00, 1150.00
> >  4,    52.00,   70.00,  2980.00, 1720.00
> >  5,    -----,   -----,  3120.00, 1870.00
>
> This is not 2-d, this is 4-d, unless there's a relation between some
> of the variables that I'm missing. You should fetch all columns except
> the first (which I assume is just a sequence number?) into an array of
> dtype=np.float64; let's call that X. Currently, none of our models can
> handle missing data (NaN), so you should do some imputation to get rid
> of them. As a baseline approach, you can replace the missing parts
> with their means.
>
> Quick and dirty demo with IPython and NumPy masked arrays to handle NaN:
>
>
> In [1]: X = np.array([[  2.00,    2.00,  480.00,  180.00],
>                       [  8.00,   10.00, 2320.00,  700.00],
>                       [  8.00,   18.00, 2980.00, 1150.00],
>                       [ 52.00,   70.00, 2980.00, 1720.00],
>                       [np.nan,  np.nan, 3120.00, 1870.00]])
>
> In [2]: means = np.asarray(np.ma.array(X, mask=np.isnan(X)).mean(axis=0))
>
> In [3]: means
> Out[3]: array([   17.5,    25. ,  2376. ,  1124. ])
>
> In [4]: where_nan = np.where(np.isnan(X))
>
> In [5]: X[where_nan] = means[where_nan[1]]
>
>
> Now you have a dataset X that you can do any kind of clustering on,
> e.g. sklearn.cluster.KMeans.
>
> (I'm not too sure if the trick in commands [4] and [5] always works, so please
> test this. I don't regularly handle missing values. Maybe Pandas has a
> solution for this. We should offer imputation methods in the library,
> but again, I'm no expert...)
>
> You might want to scale the features so the ones with large ranges
> don't have a disproportionately large effect:
>
>
> In [6]: from sklearn.preprocessing import scale
>
> In [7]: Xt = scale(X)
>
> In [8]: Xt
> Out[8]:
> array([[-0.8635131 , -0.9671039 , -1.91894971, -1.49885838],
>        [-0.52924996, -0.63071994, -0.05667784, -0.67321605],
>        [-0.52924996, -0.29433597,  0.61131098,  0.04128212],
>        [ 1.92201303,  1.89215981,  0.61131098,  0.94631313],
>        [ 0.        ,  0.        ,  0.75300558,  1.18447919]])
>
>
> ... or you could just take the logarithm with Xt = np.log(X), which is
> a common trick for working with non-negative features with large
> ranges.
>
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
>