Hey Juan,

For scaling data, sklearn provides the preprocessing module.  As far as I
know there is no single class that scales per batch, but you can fit one
scaler object (e.g. StandardScaler) per batch, and that will normalize each
batch independently.
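
A minimal sketch of the per-batch idea, assuming the batch label of each
sample is available as an array (the names scale_per_batch and batch_ids are
made up for illustration; they are not part of sklearn):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_per_batch(X, batch_ids):
    """Standardize each batch separately (hypothetical helper, not sklearn API).

    X         : array-like, shape (n_samples, n_features)
    batch_ids : array-like, shape (n_samples,), one batch label per sample
    """
    X = np.asarray(X, dtype=float)
    batch_ids = np.asarray(batch_ids)
    X_scaled = np.empty_like(X)
    for batch in np.unique(batch_ids):
        mask = batch_ids == batch
        # A fresh scaler per batch, so each batch gets its own mean and std.
        X_scaled[mask] = StandardScaler().fit_transform(X[mask])
    return X_scaled
```

After this, every feature has zero mean and unit variance within each batch,
which removes per-batch offsets and scale differences but also any real
between-batch signal, so it only makes sense if the batches are expected to
be comparable.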

Depending on what the angles represent, you should probably re-encode them
so that they end up on a linear scale.  Naively, maybe something like the
shortest angular distance from 0 to the angle?  Clustering algorithms won't
know your data is periodic: to K-means, -pi and pi are far apart even though
they are the same angle.
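
Two rough ways to do this with NumPy, as a sketch only.  The first is the
naive "distance from 0" idea, which throws away the sign of the angle.  The
second, a (cos, sin) pair, is a different and commonly used encoding that I
am swapping in as an alternative; it keeps the full direction.  Both
function names are made up for illustration:

```python
import numpy as np


def angular_distance_from_zero(theta):
    """Shortest angular distance from 0, in [0, pi]. Discards the sign."""
    theta = np.asarray(theta, dtype=float)
    # Wrap to (-pi, pi] first so inputs outside that range are handled too.
    wrapped = np.angle(np.exp(1j * theta))
    return np.abs(wrapped)


def angle_to_unit_circle(theta):
    """Encode each angle as a point (cos, sin) on the unit circle.

    Angles close together on the circle (e.g. just below pi and just above
    -pi) end up close in Euclidean distance, which is what K-means uses.
    """
    theta = np.asarray(theta, dtype=float)
    return np.column_stack([np.cos(theta), np.sin(theta)])
```

Note that the (cos, sin) encoding turns each angle into two columns, so an
angular feature then carries more weight in the distance than a single
scaled column would; whether that matters depends on the data.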

Federico


On Tue, Feb 18, 2014 at 11:45 AM, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:

> Hi All,
>
> I have a "biggish" dataset (to use Gaël's terminology ;), 45K samples x
> 300 features, that I want to cluster. I have very heterogeneous features --
> some are continuous, others are quasi-continuous (high counts), others are
> discrete (counts of rare events), others are angles (uniformly distributed
> in [-pi, pi])... Is it kosher to use standard scaling and K-means on such a
> dataset? What clustering method would you recommend?
>
> Additionally, there are some confounding factors that I want to account
> for, as samples were processed in batches. What's the best way to deal with
> this? Intuitively I was going to scale each batch independently, but is
> there a function/class within sklearn that will do this for me?
>
> Thanks,
>
> Juan.
>
>
>
> ------------------------------------------------------------------------------
> Managing the Performance of Cloud-Based Applications
> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
> Read the Whitepaper.
>
> http://pubads.g.doubleclick.net/gampad/clk?id=121054471&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>