The NaNs at the base of the profile imply that the velocity in that layer
continues on for some unspecified thickness, which I can handle using a
couple of different approaches. I am not too concerned about that.
My biggest question is how to form the data into the X matrix (n_samples,
n_features). The approach you describe would cluster based on thickness and
velocity without consideration of the relationship between adjacent layers.
Initially, I want to try to cluster based on change in Vs with depth. In
doing so, it is important that layer sequence is considered. Eventually, I
might want to consider other aspects, but I think this (Vs and depth) will
give me a good understanding of what is possible.
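
A rough sketch of one possible encoding, not a settled approach: treat each
interface between adjacent layers as a sample, so that every row of X carries
information about the layer above and the layer below. The feature choice
(interface depth, Vs above, Vs ratio across the interface) and all variable
names are assumptions made for illustration. A side effect is that the
half-space at the base needs no imputation, because the depth to its top is
known.

```python
import numpy as np

# Profile from the quoted message. The last layer is a half-space, so its
# thickness and base depth are unknown (NaN).
depth_base = np.array([2.0, 10.0, 18.0, 70.0, np.nan])   # m
vs = np.array([180.0, 700.0, 1150.0, 1720.0, 1870.0])    # m/s

# Depth to the top of each layer is known for all layers, half-space included.
depth_top = np.concatenate(([0.0], depth_base[:-1]))

# One sample per interface between layer i and layer i + 1 (assumed encoding).
interface_depth = depth_top[1:]   # depth at which Vs changes
vs_above = vs[:-1]                # Vs of the upper layer
vs_ratio = vs[1:] / vs[:-1]       # relative Vs jump across the interface

# X has shape (n_samples, n_features) = (4, 3) and contains no NaN.
X = np.column_stack([interface_depth, vs_above, vs_ratio])
```
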
I am having trouble asking the right questions, because I don't really
understand the material.
Albert
On Fri, Mar 22, 2013 at 9:28 AM, Lars Buitinck <l.j.buiti...@uva.nl> wrote:
> 2013/3/22 Albert Kottke <albert.kot...@gmail.com>:
> > Here is the data that I would be working with:
> >
> > No  Thickness (m)  Depth (m)  Vp (m/s)  Vs (m/s)
> > 1         2.00        2.00     480.00    180.00
> > 2         8.00       10.00    2320.00    700.00
> > 3         8.00       18.00    2980.00   1150.00
> > 4        52.00       70.00    2980.00   1720.00
> > 5        -----       -----    3120.00   1870.00
>
> This is not 2-d, this is 4-d, unless there's a relation between some
> of the variables that I'm missing. You should fetch all columns except
> the first (which I assume is just a sequence number?) into an array of
> dtype=np.float64; let's call that X. Currently, none of our models can
> handle missing data (NaN), so you should do some imputation to get rid
> of them. As a baseline approach, you can replace the missing parts
> with their means.
>
> Quick and dirty demo with IPython and NumPy masked arrays to handle NaN:
>
>
> In [1]: X = np.array([[2.00, 2.00, 480.00, 180.00],
>    ...:               [8.00, 10.00, 2320.00, 700.00],
>    ...:               [8.00, 18.00, 2980.00, 1150.00],
>    ...:               [52.00, 70.00, 2980.00, 1720.00],
>    ...:               [np.nan, np.nan, 3120.00, 1870.00]])
>
> In [2]: means = np.asarray(np.ma.array(X, mask=np.isnan(X)).mean(axis=0))
>
> In [3]: means
> Out[3]: array([ 17.5, 25. , 2376. , 1124. ])
>
> In [4]: where_nan = np.where(np.isnan(X))
>
> In [5]: X[where_nan] = means[where_nan[1]]
>
>
> Now you have a dataset X that you can do any kind of clustering on,
> e.g. sklearn.cluster.KMeans.
>
> (I'm not too sure if the trick in command [5] always works, so please
> test this. I don't regularly handle missing values. Maybe Pandas has a
> solution for this. We should offer imputation methods in the library,
> but again, I'm no expert...)
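
A minimal cross-check of the mean-imputation step, using only NumPy. It
assumes a NumPy version that provides np.nanmean, and it relies on
broadcasting in np.where rather than on the index trick, so the two approaches
can be compared. The name X_imputed is introduced here for illustration.

```python
import numpy as np

X = np.array([[2.00, 2.00, 480.00, 180.00],
              [8.00, 10.00, 2320.00, 700.00],
              [8.00, 18.00, 2980.00, 1150.00],
              [52.00, 70.00, 2980.00, 1720.00],
              [np.nan, np.nan, 3120.00, 1870.00]])

# Column means computed over the non-NaN entries only.
col_means = np.nanmean(X, axis=0)

# col_means has shape (4,) and broadcasts across the rows of X, so each NaN
# is replaced by the mean of its own column; X itself is left unchanged.
X_imputed = np.where(np.isnan(X), col_means, X)
```

For this profile the column means match the values in Out[3]. Later
scikit-learn releases also ship sklearn.impute.SimpleImputer, which performs
the same mean replacement.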
>
> You might want to scale the features so the ones with large ranges
> don't have a disproportionately large effect:
>
>
> In [6]: from sklearn.preprocessing import scale
>
> In [7]: Xt = scale(X)
>
> In [8]: Xt
> Out[8]:
> array([[-0.8635131 , -0.9671039 , -1.91894971, -1.49885838],
>        [-0.52924996, -0.63071994, -0.05667784, -0.67321605],
>        [-0.52924996, -0.29433597,  0.61131098,  0.04128212],
>        [ 1.92201303,  1.89215981,  0.61131098,  0.94631313],
>        [ 0.        ,  0.        ,  0.75300558,  1.18447919]])
>
>
> ... or you could just take the logarithm with Xt = np.log(X), which is
> a common trick for working with non-negative features with large
> ranges.
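
A small sketch of the logarithm option, assuming the NaN entries have already
been replaced by the column means (17.5 and 25.0) as in the session above. It
only compares the spread of each column before and after the transform; the
variable names are illustrative.

```python
import numpy as np

# Imputed profile: thickness (m), depth (m), Vp (m/s), Vs (m/s).
X = np.array([[2.00, 2.00, 480.00, 180.00],
              [8.00, 10.00, 2320.00, 700.00],
              [8.00, 18.00, 2980.00, 1150.00],
              [52.00, 70.00, 2980.00, 1720.00],
              [17.50, 25.00, 3120.00, 1870.00]])

# np.log needs strictly positive inputs; a zero would become -inf.
assert (X > 0).all()
Xt = np.log(X)

# Range (max - min) of each column before and after the transform.
raw_ranges = X.max(axis=0) - X.min(axis=0)
log_ranges = Xt.max(axis=0) - Xt.min(axis=0)

# Ratio between the widest and narrowest column, before and after.
raw_spread = raw_ranges.max() / raw_ranges.min()
log_spread = log_ranges.max() / log_ranges.min()
```

After the transform the column ranges are much closer to one another, which
is the effect the log trick is meant to have. Unlike scale, it does not center
the columns.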
>
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>