I think this is great info and context to put in the JIRA.

On Fri, Oct 19, 2018, 6:53 PM Matt Saunders <m...@saunders.net> wrote:

> Hi Sean, thanks for your feedback. I saw this as a missing feature in the
> existing PCA implementation in MLlib. I suspect the use case is a common
> one: you have data from different entities (could be different users,
> different locations, or different products, for example) and you need to
> model them separately since they behave differently--perhaps their features
> run in different ranges, or perhaps they have completely different
> features.
>
> For example, suppose you were modeling the weather in different parts of
> the world for a given time period, and the features were things like
> temperature, humidity, wind speed, pressure, etc. With the current
> PCA/RowMatrix options, you can only calculate PCA on the entire dataset,
> when you really want to model the weather in New York separately from the
> weather in Buenos Aires. Today your options are to collect the data from
> each city and calculate PCA using some other library like Breeze, or use
> the PCA implementation from MLlib but only on one key at a time.
>
> The reason I thought it would be useful in Spark is that it makes the PCA
> offering in MLlib useful to more people. As it stands today, I wasn't able
> to use it for much and I suspect others had the same experience, for
> example:
>
> https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark
>
> This isn't really big enough to warrant its own library--it's just a
> single class. But if you think it's better to publish it externally I can
> certainly do that.
>
> thanks again,
> --Matt
>
>
> On Fri, Oct 19, 2018 at 4:14 PM Sean Owen <sro...@gmail.com> wrote:
>
>> It's OK to open a JIRA though I generally doubt any new functionality
>> will be added. This might be viewed as a small worthwhile enhancement,
>> haven't looked at it. It's always more compelling if you can sketch the use
>> case for it and why it is more meaningful in Spark than outside it.
>>
>> There is spark-packages for recording third party packages but it is not
>> required nor even necessarily a comprehensive list. You can just self
>> publish like any git or Maven project, if you develop a third party library
>>
>> On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <m...@saunders.net> wrote:
>>
>>> Thanks, Erik. I went ahead and created SPARK-25782 for this improvement
>>> since it is a feature I and others have looked for in MLlib, but doesn't
>>> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
>>> noticed that someone added grouping support for PCA to the MADlib project a
>>> while back (see MADLIB-947), so there does seem to be a demand for it.
>>>
>>> thanks!
>>> --Matt
>>>
>>>
>>> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <eerla...@redhat.com>
>>> wrote:
>>>
>>>> Hi Matt!
>>>>
>>>> There are a couple ways to do this. If you want to submit it for
>>>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>>>> pull request. Another possibility is to publish it as your own 3rd party
>>>> library, which I have done for aggregators before.
>>>>
>>>>
>>>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net>
>>>> wrote:
>>>>
>>>>> I built an Aggregator that computes PCA on grouped datasets. I wanted
>>>>> to use the PCA functions provided by MLlib, but they only work on a full
>>>>> dataset, and I needed to do it on a grouped dataset (like a
>>>>> RelationalGroupedDataset).
>>>>>
>>>>> So I built a little Aggregator that can do that, here’s an example of
>>>>> how it’s called:
>>>>>
>>>>> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>>>
>>>>> // For each grouping, compute a PCA matrix/vector
>>>>> val pcaModels = inputData
>>>>>   .groupBy(keys:_*)
>>>>>   .agg(pcaAggregation.as(pcaOutput))
>>>>>
>>>>> I used the same algorithms under the hood as
>>>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this
>>>>> works directly on Datasets without converting to RDD first.
>>>>>
>>>>> I’ve seen others who wanted this ability (for example on Stack
>>>>> Overflow) so I’d like to contribute it if it would be a benefit to the
>>>>> larger community.
>>>>>
>>>>> So.. is this something worth contributing to MLlib? And if so, what
>>>>> are the next steps to start the process?
>>>>>
>>>>> thanks!
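
For readers following the thread, here is a minimal sketch of the grouped-aggregation idea, assuming Spark's typed Aggregator API. It is not the PCAAggregator from the original message, whose source is not shown in this thread. The names Observation, CovBuffer, Covariance, and GroupedCovariance are hypothetical and exist only for illustration. The sketch stops at a per-group sample covariance matrix; the principal components for each group would then come from decomposing that matrix (for example with an SVD or symmetric eigendecomposition from a linear algebra library such as Breeze).

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input row: a grouping key plus a dense feature vector.
case class Observation(city: String, features: Array[Double])

// Running statistics for one group: row count, per-feature sums, and the
// row-major d x d sum of outer products x * x^T.
case class CovBuffer(n: Long, sum: Array[Double], outer: Array[Double])

// Row-major d x d sample covariance matrix for one group.
case class Covariance(dim: Int, values: Array[Double])

object GroupedCovariance extends Aggregator[Observation, CovBuffer, Covariance] {
  // The dimension is unknown until the first row arrives, so start empty.
  def zero: CovBuffer = CovBuffer(0L, Array.empty[Double], Array.empty[Double])

  // Fold one row into the buffer (the buffer's arrays are updated in place).
  def reduce(b: CovBuffer, row: Observation): CovBuffer = {
    val x = row.features
    val d = x.length
    val sum = if (b.n == 0L) new Array[Double](d) else b.sum
    val outer = if (b.n == 0L) new Array[Double](d * d) else b.outer
    var i = 0
    while (i < d) {
      sum(i) += x(i)
      var j = 0
      while (j < d) { outer(i * d + j) += x(i) * x(j); j += 1 }
      i += 1
    }
    CovBuffer(b.n + 1, sum, outer)
  }

  // Combine partial buffers computed on different partitions.
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    if (a.n == 0L) b
    else if (b.n == 0L) a
    else CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.outer.zip(b.outer).map { case (p, q) => p + q })

  // cov(i, j) = (sum of x_i * x_j - n * mean_i * mean_j) / (n - 1)
  def finish(b: CovBuffer): Covariance = {
    val d = b.sum.length
    val mean = b.sum.map(_ / b.n)
    val denom = math.max(b.n - 1, 1L).toDouble
    val cov = Array.tabulate(d * d) { k =>
      val (i, j) = (k / d, k % d)
      (b.outer(k) - b.n * mean(i) * mean(j)) / denom
    }
    Covariance(d, cov)
  }

  def bufferEncoder: Encoder[CovBuffer] = Encoders.product[CovBuffer]
  def outputEncoder: Encoder[Covariance] = Encoders.product[Covariance]
}

// Usage sketch, assuming an existing SparkSession named `spark` and a
// Dataset[Observation] named `observations` (both hypothetical here):
//   import spark.implicits._
//   val perCity = observations
//     .groupByKey(_.city)
//     .agg(GroupedCovariance.toColumn)   // Dataset[(String, Covariance)]
```

The buffer holds only a count, a sum vector, and a d x d matrix per group, so partial results merge by simple addition, which is what lets the computation run as an aggregation. This sketch assumes every row in a group has the same feature dimension and does no validation of that.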