It's OK to open a JIRA, though I generally doubt any new functionality will
be added. This might be viewed as a small, worthwhile enhancement; I haven't
looked at it. It's always more compelling if you can sketch the use case
for it and explain why it is more meaningful in Spark than outside it.

There is spark-packages for listing third-party packages, but it is not
required, nor is it necessarily a comprehensive list. If you develop a
third-party library, you can just self-publish it like any git or Maven project.

On Fri, Oct 19, 2018, 2:32 PM Matt Saunders <m...@saunders.net> wrote:

> Thanks, Erik. I went ahead and created SPARK-25782 for this improvement
> since it is a feature I and others have looked for in MLlib, but doesn't
> seem to exist yet. Also, while searching for PCA-related issues in JIRA I
> noticed that someone added grouping support for PCA to the MADlib project a
> while back (see MADLIB-947), so there does seem to be a demand for it.
>
> thanks!
> --Matt
>
>
> On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> Hi Matt!
>>
>> There are a couple of ways to do this. If you want to submit it for
>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>> pull request. Another possibility is to publish it as your own 3rd party
>> library, which I have done for aggregators before.
>>
>>
>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net> wrote:
>>
>>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>>> use the PCA functions provided by MLlib, but they only work on a full
>>> dataset, and I needed to do it on a grouped dataset (like a
>>> RelationalGroupedDataset).
>>>
>>> So I built a little Aggregator that can do that, here’s an example of
>>> how it’s called:
>>>
>>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>
>>>     // For each grouping, compute a PCA matrix/vector
>>>     val pcaModels = inputData
>>>       .groupBy(keys: _*)
>>>       .agg(pcaAggregation.as(pcaOutput))
>>>
>>> I used the same algorithms under the hood as
>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>> directly on Datasets without converting to RDD first.
>>>
>>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>>> so I’d like to contribute it if it would be a benefit to the larger
>>> community.
>>>
>>> So, is this something worth contributing to MLlib? And if so, what are
>>> the next steps to start the process?
>>>
>>> thanks!
>>>
>>
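
For readers following the thread, here is a minimal sketch of what a per-group PCA Aggregator could look like. It is not the PCAAggregator described above, whose source is not shown in this thread. The names GroupedPcaSketch, PcaBuffer and PcaResult are hypothetical, the input type (String, Array[Double]) is an assumption chosen to keep the encoders simple, and it assumes the Breeze library that ships with spark-mllib is on the classpath. Like RowMatrix, it builds a covariance matrix and takes its SVD, but it accumulates the count, sums and Gramian per group in the aggregation buffer.

```scala
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, svd => brzSvd}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer: row count, per-feature sums, row-major Gramian sum(x * x^T).
case class PcaBuffer(n: Long, sums: Array[Double], gram: Array[Double])
// Hypothetical output: top-k components (column-major, dim x k) and variance ratios.
case class PcaResult(components: Array[Double], explainedVariance: Array[Double])

// dim = length of each input vector, k = number of components to keep (k <= dim).
class GroupedPcaSketch(dim: Int, k: Int)
    extends Aggregator[(String, Array[Double]), PcaBuffer, PcaResult] {

  // A fresh, empty buffer for each group.
  def zero: PcaBuffer =
    PcaBuffer(0L, new Array[Double](dim), new Array[Double](dim * dim))

  // Fold one row into the buffer without mutating the incoming buffer.
  def reduce(b: PcaBuffer, in: (String, Array[Double])): PcaBuffer = {
    val x = in._2
    val sums = Array.tabulate(dim)(i => b.sums(i) + x(i))
    val gram = Array.tabulate(dim * dim)(idx => b.gram(idx) + x(idx / dim) * x(idx % dim))
    PcaBuffer(b.n + 1, sums, gram)
  }

  // Combine partial buffers computed on different partitions.
  def merge(a: PcaBuffer, b: PcaBuffer): PcaBuffer =
    PcaBuffer(
      a.n + b.n,
      a.sums.zip(b.sums).map { case (p, q) => p + q },
      a.gram.zip(b.gram).map { case (p, q) => p + q })

  def finish(b: PcaBuffer): PcaResult = {
    // Simplification for this sketch: fail on groups with fewer than two rows.
    require(b.n > 1, "each group needs at least two rows")
    val n = b.n.toDouble
    // Sample covariance: (G - s * s^T / n) / (n - 1)
    val cov = BDM.tabulate(dim, dim) { (i, j) =>
      (b.gram(i * dim + j) - b.sums(i) * b.sums(j) / n) / (n - 1)
    }
    // For a symmetric positive semi-definite matrix the SVD gives the
    // eigenvectors (columns of u) and eigenvalues (s), largest first.
    val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(cov)
    val total = s.toArray.sum
    PcaResult(
      java.util.Arrays.copyOfRange(u.data, 0, dim * k),
      s.toArray.take(k).map(_ / total))
  }

  def bufferEncoder: Encoder[PcaBuffer] = Encoders.product[PcaBuffer]
  def outputEncoder: Encoder[PcaResult] = Encoders.product[PcaResult]
}
```

Assuming ds is a Dataset[(String, Array[Double])] and spark.implicits._ is in scope, it could be applied as ds.groupByKey(_._1).agg(new GroupedPcaSketch(dim = 3, k = 2).toColumn), which yields one PcaResult per key. The Gramian buffer grows as dim squared, so this approach only suits vectors of modest width.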
