[ 
https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Saunders updated SPARK-25782:
----------------------------------
    Description: 
I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
the PCA functions provided by MLlib, but they only work on a full dataset, and 
I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 

So I built a little Aggregator that can do that; here's an example of how it's 
called:
{noformat}
val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

// For each grouping, compute a PCA matrix/vector
val pcaModels = inputData
  .groupBy(keys:_*)
  .agg(pcaAggregation.as(pcaOutput)){noformat}
Under the hood it uses the same algorithm as 
RowMatrix.computePrincipalComponentsAndExplainedVariance, but it works 
directly on Datasets without first converting to an RDD.
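
For illustration, here is a rough sketch of how such an Aggregator can be 
structured. The PcaBuffer case class, the k/dim constructor parameters, and the 
Breeze-based eigendecomposition below are assumptions made for the sketch, not 
the proposed API:
{noformat}
import breeze.linalg.{eigSym, DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// Per-group running sums needed to build a covariance matrix:
// row count, sum of the vectors, and sum of the outer products.
case class PcaBuffer(n: Long, sum: Array[Double], gram: Array[Double])

// vectorCol: name of the input vector column; k: number of principal
// components to keep; dim: input dimensionality (taken as a constructor
// argument only to keep the sketch short).
class PCAAggregator(vectorCol: String, k: Int, dim: Int)
    extends Aggregator[Row, PcaBuffer, Seq[Seq[Double]]] {

  override def zero: PcaBuffer =
    PcaBuffer(0L, new Array[Double](dim), new Array[Double](dim * dim))

  override def reduce(b: PcaBuffer, row: Row): PcaBuffer = {
    val v = row.getAs[Vector](vectorCol).toArray
    var i = 0
    while (i < dim) {
      b.sum(i) += v(i)
      var j = 0
      while (j < dim) { b.gram(i * dim + j) += v(i) * v(j); j += 1 }
      i += 1
    }
    PcaBuffer(b.n + 1, b.sum, b.gram)
  }

  override def merge(a: PcaBuffer, b: PcaBuffer): PcaBuffer =
    PcaBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (x, y) => x + y },
      a.gram.zip(b.gram).map { case (x, y) => x + y })

  override def finish(b: PcaBuffer): Seq[Seq[Double]] = {
    // Sample covariance = (sum(x x^T) - n * mean mean^T) / (n - 1),
    // then keep the eigenvectors with the k largest eigenvalues.
    val mean = BDV(b.sum) / b.n.toDouble
    val gram = new BDM(dim, dim, b.gram)
    val cov = (gram - (mean * mean.t) * b.n.toDouble) / (b.n - 1).toDouble
    val eig = eigSym(cov)
    // eigSym returns eigenvalues in ascending order; sort descending, take k.
    val topIdx =
      eig.eigenvalues.toArray.zipWithIndex.sortBy(-_._1).take(k).map(_._2)
    topIdx.toSeq.map(i => eig.eigenvectors(::, i).toArray.toSeq)
  }

  override def bufferEncoder: Encoder[PcaBuffer] = Encoders.product[PcaBuffer]
  override def outputEncoder: Encoder[Seq[Seq[Double]]] =
    ExpressionEncoder[Seq[Seq[Double]]]()
}
{noformat}
The buffer only carries a count, a running sum, and a running Gram matrix, so 
each group is processed in a single pass; a full implementation would 
presumably also expose the explained variance, as 
RowMatrix.computePrincipalComponentsAndExplainedVariance does.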

I saw this as a missing feature in the existing PCA implementation in MLlib. I 
suspect the use case is a common one: you have data from different entities 
(different users, locations, or products, for example) and you need to model 
them separately because they behave differently; perhaps their features span 
different ranges, or perhaps they have completely different features.
 
For example, suppose you were modeling the weather in different parts of the 
world for a given time period, with features such as temperature, humidity, 
wind speed, and pressure. With the current PCA/RowMatrix options you can only 
calculate PCA on the entire dataset, when you really want to model the weather 
in New York separately from the weather in Buenos Aires. Today your options are 
to collect the data for each city and calculate PCA with some other library 
like Breeze, or to use the PCA implementation from MLlib on only one key at a 
time.
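
To make that concrete, here is a rough sketch of the per-key workaround using 
MLlib's existing PCA estimator; the "city" and "features" column names and the 
k parameter are illustrative. It launches a separate Spark job for every key, 
which is exactly the overhead a grouped Aggregator would avoid:
{noformat}
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.sql.DataFrame

// Fit MLlib's existing PCA estimator once per key.
def pcaPerKey(inputData: DataFrame, k: Int): Map[String, PCAModel] = {
  val cities = inputData.select("city").distinct().collect().map(_.getString(0))
  cities.map { city =>
    val subset = inputData.filter(inputData("city") === city)
    val model = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(k)
      .fit(subset)          // one Spark job per key
    city -> model
  }.toMap
}
{noformat}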
 
I hope this will make the PCA offering in MLlib useful to more people. As it 
stands today, I wasn't able to use it for much, and I suspect others have had 
the same experience; for example:
[https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark]

  was:
I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
the PCA functions provided by MLlib, but they only work on a full dataset, and 
I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 

So I built a little Aggregator that can do that; here's an example of how it's 
called:
{noformat}
val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

// For each grouping, compute a PCA matrix/vector
val pcaModels = inputData
  .groupBy(keys:_*)
  .agg(pcaAggregation.as(pcaOutput)){noformat}
Under the hood it uses the same algorithm as 
RowMatrix.computePrincipalComponentsAndExplainedVariance, but it works 
directly on Datasets without first converting to an RDD.

I've seen others who wanted this ability (for example on Stack Overflow), so 
I'd like to contribute it if it would benefit the larger community. If there is 
interest, I will prepare the code for a pull request.


> Add PCA Aggregator to support grouping
> --------------------------------------
>
>                 Key: SPARK-25782
>                 URL: https://issues.apache.org/jira/browse/SPARK-25782
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 2.3.2
>            Reporter: Matt Saunders
>            Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use 
> the PCA functions provided by MLlib, but they only work on a full dataset, 
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 
> So I built a little Aggregator that can do that; here's an example of how 
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> Under the hood it uses the same algorithm as 
> RowMatrix.computePrincipalComponentsAndExplainedVariance, but it works 
> directly on Datasets without first converting to an RDD.
> I saw this as a missing feature in the existing PCA implementation in MLlib. 
> I suspect the use case is a common one: you have data from different entities 
> (different users, locations, or products, for example) and you need to model 
> them separately because they behave differently; perhaps their features span 
> different ranges, or perhaps they have completely different features.
>  
> For example, suppose you were modeling the weather in different parts of the 
> world for a given time period, with features such as temperature, humidity, 
> wind speed, and pressure. With the current PCA/RowMatrix options you can only 
> calculate PCA on the entire dataset, when you really want to model the 
> weather in New York separately from the weather in Buenos Aires. Today your 
> options are to collect the data for each city and calculate PCA with some 
> other library like Breeze, or to use the PCA implementation from MLlib on 
> only one key at a time.
>  
> I hope this will make the PCA offering in MLlib useful to more people. As it 
> stands today, I wasn't able to use it for much, and I suspect others have had 
> the same experience; for example:
> [https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
