Re: [MLlib] PCA Aggregator

Sean Owen Fri, 19 Oct 2018 17:07:14 -0700

I think this is great info and context to put in the JIRA.

On Fri, Oct 19, 2018, 6:53 PM Matt Saunders <m...@saunders.net> wrote:


> Hi Sean, thanks for your feedback. I saw this as a missing feature in the
> existing PCA implementation in MLlib. I suspect the use case is a common
> one: you have data from different entities (could be different users,
> different locations, or different products, for example) and you need to
> model them separately since they behave differently--perhaps their features
> run in different ranges, or perhaps they have completely different
> features.
>
> For example, suppose you were modeling the weather in different parts of
> the world for a given time period, with features like temperature,
> humidity, wind speed, and pressure. With the current
> PCA/RowMatrix options, you can only calculate PCA on the entire dataset,
> when you really want to model the weather in New York separately from the
> weather in Buenos Aires. Today your options are to collect the data from
> each city and calculate PCA using some other library like Breeze, or use
> the PCA implementation from MLlib but only on one key at a time.
>
> The reason I thought it would be useful in Spark is that it makes the PCA
> offering in MLlib useful to more people. As it stands today, I wasn't able
> to use it for much and I suspect others had the same experience, for
> example:
>
> https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark
>
> This isn't really big enough to warrant its own library--it's just a
> single class. But if you think it's better to publish it externally I can
> certainly do that.
>
> thanks again,
> --Matt
>
>
> On Fri, Oct 19, 2018 at 4:14 PM Sean Owen <sro...@gmail.com> wrote:
>
>> It's OK to open a JIRA, though I generally doubt any new functionality
>> will be added. This might be viewed as a small, worthwhile enhancement; I
>> haven't looked at it. It's always more compelling if you can sketch the
>> use case for it and why it is more meaningful in Spark than outside it.
>>
>> There is spark-packages for recording third-party packages, but it is not
>> required, nor even necessarily a comprehensive list. You can just
>> self-publish like any Git or Maven project if you develop a third-party
>> library.
>>
>> On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <m...@saunders.net> wrote:
>>
>>> Thanks, Erik. I went ahead and created SPARK-25782 for this improvement
>>> since it is a feature I and others have looked for in MLlib, but doesn't
>>> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
>>> noticed that someone added grouping support for PCA to the MADlib project a
>>> while back (see MADLIB-947), so there does seem to be a demand for it.
>>>
>>> thanks!
>>> --Matt
>>>
>>>
>>> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <eerla...@redhat.com>
>>> wrote:
>>>
>>>> Hi Matt!
>>>>
>>>> There are a couple of ways to do this. If you want to submit it for
>>>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>>>> pull request. Another possibility is to publish it as your own third-party
>>>> library, which I have done for aggregators before.
>>>>
>>>>
>>>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net>
>>>> wrote:
>>>>
>>>>> I built an Aggregator that computes PCA on grouped datasets. I wanted
>>>>> to use the PCA functions provided by MLlib, but they only work on a full
>>>>> dataset, and I needed to do it on a grouped dataset (like a
>>>>> RelationalGroupedDataset).
>>>>>
>>>>> So I built a little Aggregator that can do that. Here’s an example of
>>>>> how it’s called:
>>>>>
>>>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>>>
>>>>>     // For each grouping, compute a PCA matrix/vector
>>>>>     val pcaModels = inputData
>>>>>       .groupBy(keys:_*)
>>>>>       .agg(pcaAggregation.as(pcaOutput))
>>>>>
>>>>> I used the same algorithms under the hood as
>>>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this
>>>>> works directly on Datasets without converting to RDD first.
>>>>>
>>>>> I’ve seen others who wanted this ability (for example on Stack
>>>>> Overflow) so I’d like to contribute it if it would be a benefit to the
>>>>> larger community.
>>>>>
>>>>> So, is this something worth contributing to MLlib? And if so, what
>>>>> are the next steps to start the process?
>>>>>
>>>>> thanks!
>>>>>
>>>>
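
For readers of this thread, the "one key at a time" workaround mentioned above can be sketched roughly as follows. This is not the proposed PCAAggregator (its source is not shown in the thread); it only loops over the distinct keys and fits MLlib's existing PCA on each filtered subset. The column names, the toy weather values, and the helpers toyData and fitPcaPerKey are assumptions made for illustration, and the sketch assumes the number of groups is small enough to collect to the driver.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PerKeyPcaSketch {

  // Hypothetical toy data: (city, [temperature, humidity, wind speed]).
  def toyData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      ("New York",     Vectors.dense(12.0, 0.60, 5.0)),
      ("New York",     Vectors.dense(15.0, 0.55, 7.5)),
      ("New York",     Vectors.dense(9.0,  0.70, 3.0)),
      ("Buenos Aires", Vectors.dense(24.0, 0.80, 4.0)),
      ("Buenos Aires", Vectors.dense(27.0, 0.75, 6.0)),
      ("Buenos Aires", Vectors.dense(22.0, 0.85, 2.5))
    )).toDF("city", "features")

  // Hypothetical helper: fit one MLlib PCA model per distinct key.
  // Returns the principal-components matrix (numFeatures x k) for each key.
  def fitPcaPerKey(
      data: DataFrame,
      keyCol: String,
      featuresCol: String,
      k: Int): Map[String, DenseMatrix] = {
    // Collect the distinct keys to the driver (assumes few groups,
    // and a string-typed key column).
    val keys = data.select(keyCol).distinct().collect().map(_.getString(0))

    // Filter the full dataset once per key and fit PCA on that subset.
    keys.map { key =>
      val model = new PCA()
        .setInputCol(featuresCol)
        .setOutputCol("pcaFeatures")
        .setK(k)
        .fit(data.filter(col(keyCol) === key))
      key -> model.pc
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("per-key-pca-sketch")
      .master("local[*]")
      .getOrCreate()

    val models = fitPcaPerKey(toyData(spark), "city", "features", k = 2)
    models.foreach { case (city, pc) =>
      println(s"$city: ${pc.numRows} x ${pc.numCols} principal components")
    }

    spark.stop()
  }
}
```

Each key triggers its own filter and fit over the input, so the cost grows with the number of groups; that is presumably the overhead a single grouped aggregation like the one proposed in this thread is meant to avoid.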
