[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017192#comment-16017192
 ] 

Cristian Opris commented on SPARK-16365:
----------------------------------------

There's another potential argument for exposing 'local' (non-distributed) 
implementations of the algorithms: sometimes it's useful to apply the algorithm 
on relatively small groupings of data in a very large dataset. In this case 
Spark would only serve to distribute the data and apply the algorithm locally 
on each partition/grouping of data, perhaps through a UDF.
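
A minimal sketch of that pattern in plain Python, with no Spark dependency: rows are grouped by key and a purely local routine runs on each group. The names `apply_per_group` and `local_standardize` are hypothetical, and the standardizer only stands in for whatever local algorithm is being applied. In Spark, the grouping would be done by the engine and the local routine would run inside the UDF.

```python
from collections import defaultdict
from statistics import mean, pstdev


def local_standardize(values):
    """Purely local algorithm: rescale one group's values to zero mean, unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant group has no spread, so every value maps to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def apply_per_group(rows, local_algorithm):
    """Group (key, value) rows by key, then run the local algorithm on each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: local_algorithm(values) for key, values in groups.items()}


# Illustrative data only: two small groups inside a (notionally) large dataset.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 10.0), ("b", 10.0)]
result = apply_per_group(rows, local_standardize)
```

The local routine never sees more than one group, which is why a non-distributed implementation is sufficient here.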

This can currently be achieved with the scikit integration, but it would be 
useful to make it possible to use the Spark implementation of an algorithm 
where that implementation is not inherently distributed. 
CountVectorizer is a good example: nothing in it inherently requires a 
DataFrame.
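
To illustrate the point, here is a rough sketch of a count vectorizer that works on plain in-memory lists of tokens. It is not Spark's implementation or API: the class `LocalCountVectorizer` and its parameters `vocab_size` and `min_df` are hypothetical names, loosely modelled on Spark's `vocabSize` and `minDF`, and the tie-breaking rule for equally frequent terms is an assumption.

```python
from collections import Counter


class LocalCountVectorizer:
    """Hypothetical local count vectorizer; not Spark's CountVectorizer API."""

    def __init__(self, vocab_size=None, min_df=1):
        self.vocab_size = vocab_size  # keep at most this many terms (None = all)
        self.min_df = min_df          # minimum number of documents a term must appear in
        self.vocabulary = []
        self._index = {}

    def fit(self, documents):
        """Build the vocabulary from an iterable of token lists."""
        term_freq = Counter()
        doc_freq = Counter()
        for tokens in documents:
            term_freq.update(tokens)
            doc_freq.update(set(tokens))
        candidates = [t for t in term_freq if doc_freq[t] >= self.min_df]
        # Most frequent terms first; ties broken alphabetically (an assumption).
        candidates.sort(key=lambda t: (-term_freq[t], t))
        self.vocabulary = candidates[: self.vocab_size]
        self._index = {t: i for i, t in enumerate(self.vocabulary)}
        return self

    def transform(self, tokens):
        """Return a sparse {term index: count} mapping for one document."""
        counts = Counter(self._index[t] for t in tokens if t in self._index)
        return dict(sorted(counts.items()))


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
model = LocalCountVectorizer().fit(docs)
vector = model.transform(["a", "b", "b", "d"])  # "d" is out of vocabulary
```

Nothing here depends on a DataFrame, so a routine of this shape could be called once per group from inside a UDF.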

In practice, this should mostly mean just exposing the core implementation of 
the algorithms where possible.

> Ideas for moving "mllib-local" forward
> --------------------------------------
>
>                 Key: SPARK-16365
>                 URL: https://issues.apache.org/jira/browse/SPARK-16365
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)



