Hi All,
I have a "biggish" dataset (to use Gaƫl's terminology ;), 45K samples x 300
features, that I want to cluster. I have very heterogeneous features -- some
are continuous, others are quasi-continuous (high counts), others are
discrete (counts of rare events), others are angles (uniformly distributed
in [-pi, pi])... Is it kosher to use standard scaling and K-means on such a
dataset? What clustering method would you recommend?
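For concreteness, this is roughly the pipeline I had in mind (a minimal
sketch -- the array and the number of clusters are just placeholders):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    X = np.random.rand(45000, 300)  # placeholder for my real data

    # z-score every feature, then run plain K-means
    X_scaled = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=10, random_state=0).fit_predict(X_scaled)

My worry is that z-scoring doesn't obviously make sense for the angular and
rare-count features, hence the question above.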
Additionally, there are some confounding factors that I want to account
for, as samples were processed in batches. What's the best way to deal with
this? Intuitively, I was going to scale each batch independently, but is
there a function/class within sklearn that will do this for me?
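Something like this loop is what I was picturing for the per-batch scaling
(rough sketch, assuming a `batches` array with one batch label per sample):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(45000, 300)            # placeholder data
    batches = np.random.randint(0, 5, 45000)  # placeholder batch labels

    # standardize each batch independently
    X_scaled = np.empty_like(X)
    for b in np.unique(batches):
        mask = batches == b
        X_scaled[mask] = StandardScaler().fit_transform(X[mask])

If there's an existing transformer in sklearn that does this in one step, I'd
rather use that.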
Thanks,
Juan.