I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to do it on a grouped dataset (like a RelationalGroupedDataset).
So I built a small Aggregator that does that. Here's an example of how it's called:

    val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

    // For each grouping, compute a PCA matrix/vector
    val pcaModels = inputData
      .groupBy(keys:_*)
      .agg(pcaAggregation.as(pcaOutput))

I used the same algorithms under the hood as RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works directly on Datasets without converting to an RDD first.

I've seen others who wanted this ability (for example on Stack Overflow), so I'd like to contribute it if it would benefit the larger community.

So, is this something worth contributing to MLlib? And if so, what are the next steps to start the process?

Thanks!
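
P.S. For readers unfamiliar with the pattern, here is a rough, illustrative sketch of the kind of mergeable state a per-group PCA aggregation could keep. It is not the PCAAggregator itself, and it deliberately avoids any Spark dependency: CovBuffer and CovSketch are hypothetical names, and the zero/reduce/merge functions only mirror the roles of the methods of the same names on Spark's Aggregator. It stops at the sample covariance matrix; the principal components would then come from an eigendecomposition (or SVD) of that matrix using a linear algebra library, which is not shown.

```scala
// Hypothetical sketch: mergeable covariance accumulation in plain Scala.
// Not the PCAAggregator from this post and not Spark code.

// Per-group state: row count, per-column sums, and sums of outer products x * x^T.
final case class CovBuffer(n: Long, sum: Vector[Double], outer: Vector[Vector[Double]])

object CovSketch {
  // Empty state for d-dimensional inputs (the role of Aggregator.zero).
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Vector.fill(d)(0.0), Vector.fill(d, d)(0.0))

  // Fold one row into the state (the role of Aggregator.reduce).
  def reduce(b: CovBuffer, x: Vector[Double]): CovBuffer =
    CovBuffer(
      b.n + 1,
      b.sum.zip(x).map { case (s, v) => s + v },
      b.outer.zipWithIndex.map { case (row, i) =>
        row.zip(x).map { case (o, v) => o + x(i) * v }
      })

  // Combine two partial states from different partitions (the role of Aggregator.merge).
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.outer.zip(b.outer).map { case (r1, r2) =>
        r1.zip(r2).map { case (p, q) => p + q }
      })

  // Sample covariance from the accumulated sums (one step of a finish method).
  // Note: this one-pass formula can lose precision when means are large
  // relative to the spread of the data.
  def covariance(b: CovBuffer): Vector[Vector[Double]] = {
    require(b.n > 1, "need at least two rows")
    val d = b.sum.length
    Vector.tabulate(d, d) { (i, j) =>
      (b.outer(i)(j) - b.sum(i) * b.sum(j) / b.n) / (b.n - 1)
    }
  }
}
```

Because the state is just sums, partial buffers built on separate partitions can be merged in any order, for example rows.foldLeft(CovSketch.zero(d))(CovSketch.reduce) on each partition followed by CovSketch.merge. That property is what lets this kind of computation run inside a groupBy without collecting each group first.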