[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030997#comment-16030997 ]

Nick Pentreath commented on SPARK-16365:
----------------------------------------

[~akrim] thanks for the thoughts and document. Sorry for not responding sooner.

Overall I agree that a "typed local transformer" approach sounds attractive. 
However, the catch is that it only really works for simple "pipelines" that 
effectively operate over one column, such as the {{Text -> CountVectorizer -> 
IDF}} chain you mentioned, or, say, {{Text -> Word2Vec}}. Some transformers 
can also handle multiple input column types, which is a further complexity here.
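To make that one-column constraint concrete, here is a rough sketch (purely 
hypothetical - nothing like this exists in Spark today) of what such a typed 
local transformer could look like:

{code:scala}
// Hypothetical sketch (not existing Spark API): a "typed local transformer"
// maps exactly one input type to one output type, so stages can only be
// chained in a straight line.
trait LocalTransformer[I, O] { self =>
  def transform(input: I): O

  // Composition is plain function composition - which is precisely why this
  // shape cannot express a stage that needs several input columns at once.
  def andThen[O2](next: LocalTransformer[O, O2]): LocalTransformer[I, O2] =
    new LocalTransformer[I, O2] {
      def transform(input: I): O2 = next.transform(self.transform(input))
    }
}

// e.g. a Text -> CountVectorizer -> IDF chain would be
//   tokenizer.andThen(countVectorizerModel).andThen(idfModel)
// where each stage implements LocalTransformer for its element types.
{code}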

However, I don't believe that is sufficient to support the full Spark ML 
{{Pipeline}} API, in particular the (simple) DAG it creates and the multiple 
columns involved. For example, a typical classification pipeline may apply a 
set of {{StringIndexer}}s to a set of categorical columns, a set of 
{{Bucketizer}}s to numerical columns, use {{OneHotEncoder}}s on all the 
resulting "indexed" columns, and finally use {{VectorAssembler}} to put them 
all together for training with, say, {{LogisticRegression}}.
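For illustration, that kind of pipeline looks like this with the existing 
DataFrame-based API (column names are made up):

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Bucketizer, OneHotEncoder, StringIndexer, VectorAssembler}

// Categorical column -> index; numerical column -> buckets.
val indexer = new StringIndexer()
  .setInputCol("city").setOutputCol("cityIndex")
val bucketizer = new Bucketizer()
  .setInputCol("age").setOutputCol("ageBucket")
  .setSplits(Array(Double.NegativeInfinity, 18, 35, 65, Double.PositiveInfinity))

// One-hot encode the indexed column, then assemble everything into a vector.
val encoder = new OneHotEncoder()
  .setInputCol("cityIndex").setOutputCol("cityVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("cityVec", "ageBucket")).setOutputCol("features")

val lr = new LogisticRegression()

val pipeline = new Pipeline()
  .setStages(Array(indexer, bucketizer, encoder, assembler, lr))
{code}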

In order to support the full pipeline, I personally see two potential options 
(not necessarily mutually exclusive):

(1) operate on some form of "local row" object, much like the Spark transformers 
operate over {{Dataset<Row>}}. Each {{localTransform}} appends (and potentially 
deletes) "fields" on the "local row". This is the local analogue of the existing 
DataFrame versions, and the "element" transform function (e.g. {{Vector -> 
prediction}}) would be shared with the DataFrame model versions;

(2) construct a "typed DAG" that represents the data flow between nodes in the 
graph. This would know the types (effectively "containing its own schema 
definition") as well as the "wiring" of input <-> output columns through the 
pipeline.
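For illustration only - these are hypothetical sketches, not proposed APIs - 
option (1) might look something like:

{code:scala}
// Hypothetical sketch of option (1): an untyped "local row" keyed by column
// name - the local analogue of a Row inside a Dataset<Row>.
case class LocalRow(fields: Map[String, Any]) {
  def get[T](name: String): T = fields(name).asInstanceOf[T]
  def withField(name: String, value: Any): LocalRow =
    LocalRow(fields + (name -> value))
}

// Each local stage appends (and possibly drops) fields, mirroring how the
// DataFrame-based transformers append output columns.
trait LocalStage {
  def inputCol: String
  def outputCol: String
  def localTransform(row: LocalRow): LocalRow
}
{code}

and option (2) something like:

{code:scala}
// Hypothetical sketch of option (2): a typed DAG where each node carries the
// type of the value it produces, so the graph "contains its own schema" and
// the wiring of input <-> output columns.
sealed trait Node[O]
// A source column of the input record.
case class Source[O](column: String) extends Node[O]
// A one-input stage: wiring plus a typed element function.
case class Map1[A, O](input: Node[A], f: A => O) extends Node[O]
// A two-input stage (e.g. assembling two columns into one).
case class Map2[A, B, O](left: Node[A], right: Node[B], f: (A, B) => O) extends Node[O]
{code}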

An actual solution may contain elements of both approaches above - perhaps it 
would be completely different from MLeap, or perhaps it would incorporate some 
(or most) of MLeap. MLeap certainly does #1 and some form of #2 (though without 
types, AFAIK). You can see that MLeap, for example, has had to re-create 
lighter-weight "local" versions of Rows, DataFrames and Schemas. This is by 
necessity - I believe full pipeline support does require all of these elements.

That's not to say it cannot be done, and there is obviously a lot of demand for 
a solution, but I think a partial solution is in many ways worse than no solution.

> Ideas for moving "mllib-local" forward
> --------------------------------------
>
>                 Key: SPARK-16365
>                 URL: https://issues.apache.org/jira/browse/SPARK-16365
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstorming and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)


