Hi Sean, thanks for your feedback. I saw this as a missing feature in the existing PCA implementation in MLlib. I suspect the use case is a common one: you have data from different entities (could be different users, different locations, or different products, for example) and you need to model them separately since they behave differently--perhaps their features run in different ranges, or perhaps they have completely different features.
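To make the per-entity case concrete, here is a minimal sketch of what fitting PCA separately for each key looks like with the existing API. It assumes the `spark.ml` `PCA` estimator and a DataFrame with a string key column and a vector column. The object name `PerKeyPca`, the helper names, the column names, and the sample values are all made up for illustration and are not part of the proposed class.

```scala
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PerKeyPca {

  // Hypothetical toy data: one row per observation, keyed by city.
  // Features are (temperature, humidity, wind speed); the values are invented.
  def sampleWeather(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      ("New York",     Vectors.dense(21.0, 0.60, 12.0)),
      ("New York",     Vectors.dense(18.5, 0.72, 20.0)),
      ("New York",     Vectors.dense(25.0, 0.55,  8.0)),
      ("New York",     Vectors.dense(15.0, 0.80, 25.0)),
      ("Buenos Aires", Vectors.dense(12.0, 0.75, 15.0)),
      ("Buenos Aires", Vectors.dense(14.5, 0.68, 10.0)),
      ("Buenos Aires", Vectors.dense(10.0, 0.82, 22.0)),
      ("Buenos Aires", Vectors.dense(16.0, 0.61,  9.0))
    )).toDF("city", "features")

  // Fit one PCA model per distinct key by filtering the data and fitting in a loop.
  // Every key triggers its own Spark jobs, so the cost grows with the number of keys.
  def fitPerKey(data: DataFrame, keyCol: String, featuresCol: String, k: Int): Map[String, PCAModel] = {
    val keys = data.select(keyCol).distinct().collect().map(_.getString(0))
    val pca = new PCA()
      .setInputCol(featuresCol)
      .setOutputCol("pcaFeatures")
      .setK(k)
    keys.map(key => key -> pca.fit(data.filter(col(keyCol) === key))).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("per-key-pca").getOrCreate()
    val models = fitPerKey(sampleWeather(spark), "city", "features", k = 2)
    models.foreach { case (city, model) =>
      println(s"$city: explained variance = ${model.explainedVariance}")
    }
    spark.stop()
  }
}
```

The loop works for a handful of keys, but it scans the data once per key and runs the fits sequentially on the driver, which is the limitation a grouped aggregation would avoid.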
For example, suppose you were modeling the weather in different parts of the world for a given time period, with features such as temperature, humidity, wind speed, and pressure. With the current PCA/RowMatrix options, you can only calculate PCA on the entire dataset, when you really want to model the weather in New York separately from the weather in Buenos Aires. Today your options are to collect the data from each city and calculate PCA using some other library like Breeze, or to use the PCA implementation from MLlib on only one key at a time.

The reason I thought it would be useful in Spark is that it makes the PCA offering in MLlib useful to more people. As it stands today, I wasn't able to use it for much and I suspect others had the same experience, for example:
https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark

This isn't really big enough to warrant its own library--it's just a single class. But if you think it's better to publish it externally I can certainly do that.

thanks again,
--Matt

On Fri, Oct 19, 2018 at 4:14 PM Sean Owen <sro...@gmail.com> wrote:

> It's OK to open a JIRA though I generally doubt any new functionality will
> be added. This might be viewed as a small worthwhile enhancement, haven't
> looked at it. It's always more compelling if you can sketch the use case
> for it and why it is more meaningful in spark than outside it.
>
> There is spark-packages for recording third party packages but it is not
> required nor even necessarily a comprehensive list. You can just self
> publish like any git or Maven project, if you develop a third party library
>
> On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <m...@saunders.net> wrote:
>
>> Thanks, Eric. I went ahead and created SPARK-25782 for this improvement
>> since it is a feature I and others have looked for in MLlib, but doesn't
>> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
>> noticed that someone added grouping support for PCA to the MADlib project a
>> while back (see MADLIB-947), so there does seem to be a demand for it.
>>
>> thanks!
>> --Matt
>>
>>
>> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <eerla...@redhat.com>
>> wrote:
>>
>>> Hi Matt!
>>>
>>> There are a couple ways to do this. If you want to submit it for
>>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>>> pull request. Another possibility is to publish it as your own 3rd party
>>> library, which I have done for aggregators before.
>>>
>>>
>>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net> wrote:
>>>
>>>> I built an Aggregator that computes PCA on grouped datasets. I wanted
>>>> to use the PCA functions provided by MLlib, but they only work on a full
>>>> dataset, and I needed to do it on a grouped dataset (like a
>>>> RelationalGroupedDataset).
>>>>
>>>> So I built a little Aggregator that can do that, here’s an example of
>>>> how it’s called:
>>>>
>>>> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>>
>>>> // For each grouping, compute a PCA matrix/vector
>>>> val pcaModels = inputData
>>>>   .groupBy(keys:_*)
>>>>   .agg(pcaAggregation.as(pcaOutput))
>>>>
>>>> I used the same algorithms under the hood as
>>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>>> directly on Datasets without converting to RDD first.
>>>>
>>>> I’ve seen others who wanted this ability (for example on Stack
>>>> Overflow) so I’d like to contribute it if it would be a benefit to the
>>>> larger community.
>>>>
>>>> So.. is this something worth contributing to MLlib? And if so, what are
>>>> the next steps to start the process?
>>>>
>>>> thanks!