I built an Aggregator that computes PCA on grouped datasets. I wanted to
use the PCA functions provided by MLlib, but they only work on a full
dataset, and I needed to run PCA separately for each group (for example,
on a RelationalGroupedDataset).

Here’s an example of how it’s called:

    val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

    // For each grouping, compute a PCA matrix/vector
    val pcaModels = inputData
      .groupBy(keys:_*)
      .agg(pcaAggregation.as(pcaOutput))

I used the same algorithms under the hood as
RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
directly on Datasets without converting to an RDD first.
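
As a rough illustration (this is not the actual PCAAggregator code), PCA fits
the Aggregator model because the covariance matrix can be built from partial
sums that merge across partitions. Below is a minimal plain-Scala sketch with
no Spark dependency. CovBuffer and its method names are hypothetical, and the
use of the sample (n - 1) covariance is an assumption:

```scala
// Hypothetical buffer: CovBuffer, add, merge and covariance are illustrative
// names, not taken from the actual PCAAggregator.
final case class CovBuffer(
    n: Long,                       // number of observations seen so far
    sum: Vector[Double],           // element-wise sum of the observations
    outer: Vector[Vector[Double]]  // sum of outer products x * x^T
) {

  // Fold one observation into the running totals (the "reduce" step).
  def add(x: Vector[Double]): CovBuffer =
    if (n == 0L) CovBuffer(1L, x, x.map(a => x.map(b => a * b)))
    else CovBuffer(
      n + 1,
      sum.zip(x).map { case (s, v) => s + v },
      outer.zip(x).map { case (row, a) =>
        row.zip(x).map { case (o, b) => o + a * b }
      }
    )

  // Combine two partial buffers (the "merge" step across partitions).
  def merge(that: CovBuffer): CovBuffer =
    if (n == 0L) that
    else if (that.n == 0L) this
    else CovBuffer(
      n + that.n,
      sum.zip(that.sum).map { case (a, b) => a + b },
      outer.zip(that.outer).map { case (r1, r2) =>
        r1.zip(r2).map { case (a, b) => a + b }
      }
    )

  // Sample covariance: (sum(x x^T) - n * mean * mean^T) / (n - 1).
  def covariance: Vector[Vector[Double]] = {
    require(n > 1, "need at least two observations")
    val mean = sum.map(_ / n)
    outer.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (o, j) =>
        (o - n * mean(i) * mean(j)) / (n - 1)
      }
    }
  }
}

object CovBuffer {
  // Starting value for each group (the "zero" of the aggregation).
  val empty: CovBuffer = CovBuffer(0L, Vector.empty, Vector.empty)
}
```

The principal components and explained variance would then come from an
eigen or singular value decomposition of that covariance matrix, which is
left to a linear algebra library and not shown here.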

I’ve seen others who wanted this ability (for example on Stack Overflow) so
I’d like to contribute it if it would be a benefit to the larger community.

So, is this something worth contributing to MLlib? And if so, what are the
next steps to start the process?

Thanks!
