[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017192#comment-16017192 ]
Cristian Opris commented on SPARK-16365:
----------------------------------------
There's another potential argument for exposing 'local' (non-distributed)
implementations of the algorithms: sometimes it's useful to apply an algorithm
to relatively small groupings of data within a very large dataset. In this case
Spark would only serve to distribute the data, and the algorithm would run
locally on each partition/grouping of data, perhaps through a UDF.
This can currently be achieved with the scikit integration, but it would be
useful to consider making it possible to use the Spark implementation of the
algorithm, where that implementation is not inherently distributed.
CountVectorizer is a good example: nothing in it inherently requires a
DataFrame.
In practice this should mostly mean just exposing the core implementation of
the algorithms where possible.
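
To make the CountVectorizer point concrete, here is a minimal sketch in plain
Python of what a 'local' core could look like and how it might be applied per
grouping. It is not Spark's implementation: the names fit_vocabulary,
transform and vectorize_per_group are hypothetical, and the vocab_size and
min_df parameters are only loosely modelled on the existing estimator.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def fit_vocabulary(docs, vocab_size=None, min_df=1):
    """Build a term -> index mapping from an in-memory list of token lists.

    Hypothetical 'local' core: no DataFrame or cluster involved.
    """
    term_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()   # number of documents containing each term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Most frequent terms first; ties broken alphabetically for determinism.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return {term: i for i, term in enumerate(kept)}


def transform(vocab, tokens):
    """Turn one token list into a dense count vector over the vocabulary."""
    counts = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # terms outside the vocabulary are ignored
            counts[idx] += 1
    return counts


def vectorize_per_group(rows):
    """Fit and apply a separate vocabulary for each group.

    rows is an iterable of (group_key, tokens) pairs. This stands in for
    what would run locally on each grouping once the data is distributed.
    """
    result = {}
    ordered = sorted(rows, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        docs = [tokens for _, tokens in group]
        vocab = fit_vocabulary(docs)
        result[key] = (vocab, [transform(vocab, d) for d in docs])
    return result
```

In a Spark job, a function like vectorize_per_group would be the part called
from a UDF or a per-partition map, with Spark only responsible for getting
each grouping's rows to one place.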
> Ideas for moving "mllib-local" forward
> --------------------------------------
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
> Issue Type: Brainstorming
> Components: ML
> Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's
> linear algebra", or "investigate how we will implement local models/pipelines
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation
> of linalg into a standalone project turned out to be significantly more
> complex than originally expected. So I vote we devote sufficient discussion
> and time to planning out the next move :)