[ 
https://issues.apache.org/jira/browse/SPARK-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381882#comment-15381882
 ] 

Sean Owen commented on SPARK-16319:
-----------------------------------

I think you're right. I looked at Pipeline.fit and it just forms a sequence of 
transformers and then executes them one after another with foldLeft. Although in 
principle some of these stages could be parallelized, they're not executed that 
way in practice, even though they could be as far as the caller's concerned.
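
To make the sequential behavior concrete, here is a minimal toy model in plain 
Python. It is not Spark's code: the stage functions and run_stages are 
hypothetical stand-ins, and functools.reduce plays the role of foldLeft.

```python
from functools import reduce

# Hypothetical stand-ins for Transformers: each takes a dict of columns
# and returns a new dict with one extra column.
def tokenizer(cols):
    return {**cols, "words": cols["text"].split()}

def counter(cols):
    return {**cols, "n_words": len(cols["words"])}

def run_stages(stages, cols):
    # Left fold: feed the result of each stage into the next one,
    # strictly in the order the stages were listed.
    return reduce(lambda acc, stage: stage(acc), stages, cols)

result = run_stages([tokenizer, counter], {"text": "a b c"})
```

Each stage only ever sees the output of the stage before it, so the list order 
alone determines the execution order.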

You're right in the sense that nothing checks the input/output cols against 
the sequence of stages to verify the DAG property. Of course, if the columns 
don't line up with the stage order, you'll hit an error at runtime. And of 
course, their values do matter in general.
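
A toy sketch of that failure mode, again in plain Python rather than Spark 
(make_stage and run_stages are hypothetical helpers, and the KeyError here only 
stands in for whatever error Spark would actually raise):

```python
def make_stage(input_col, output_col, fn):
    # Build a stage that reads one column and writes another.
    def stage(cols):
        # Raises KeyError if input_col has not been produced yet.
        return {**cols, output_col: fn(cols[input_col])}
    return stage

split = make_stage("text", "words", str.split)
count = make_stage("words", "n_words", len)

def run_stages(stages, cols):
    # No validation up front: just apply the stages in the given order.
    for stage in stages:
        cols = stage(cols)
    return cols

ok = run_stages([split, count], {"text": "a b c"})

try:
    # Same stages, wrong order: "words" does not exist when count runs.
    run_stages([count, split], {"text": "a b c"})
    failed, missing = False, None
except KeyError as err:
    failed, missing = True, err.args[0]
```

Nothing rejects the misordered list ahead of time in this model; the problem 
only surfaces when the second call actually runs.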

I think the text is still correct: a Pipeline may form a DAG. This property 
isn't checked, and it isn't exploited (right now), but it's just saying this is 
allowed and stating the conditions for it to work.
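
For illustration only, this is roughly what such a check could look like if it 
existed. It is a plain-Python sketch, not anything in Spark: check_stage_order 
and the (name, input_cols, output_col) stage tuples are hypothetical. The 
example pipeline is non-linear (two branches read "text" and a third stage 
joins them), yet it runs fine as a list as long as the order is topological.

```python
def check_stage_order(initial_cols, stages):
    """Return (stage_name, missing_cols) for every stage whose inputs
    are not available at the point where it appears in the list."""
    available = set(initial_cols)
    problems = []
    for name, inputs, output in stages:
        missing = [c for c in inputs if c not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems

# A small DAG: "tokenize" and "length" both read "text"; "combine" joins them.
stages = [
    ("tokenize", ["text"], "words"),
    ("length", ["text"], "n_chars"),
    ("combine", ["words", "n_chars"], "features"),
]

in_order = check_stage_order(["text"], stages)

# Same stages with "combine" moved to the front: no longer topological.
out_of_order = check_stage_order(["text"], [stages[2], stages[0], stages[1]])
```

Here in_order is empty, while out_of_order reports that "combine" is missing 
both of its inputs.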

> Non-linear (DAG) pipelines need better explanation
> --------------------------------------------------
>
>                 Key: SPARK-16319
>                 URL: https://issues.apache.org/jira/browse/SPARK-16319
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> There's a 
> [paragraph|http://spark.apache.org/docs/2.0.0-preview/ml-guide.html#details] 
> about non-linear pipelines in the ML docs, but it's not clear how a DAG pipeline 
> differs from a linear pipeline, and in fact, it seems that a "DAG Pipeline" 
> results in behavior identical to that of a regular linear pipeline (the 
> stages are simply applied in the order provided when the pipeline is 
> created). In addition, no checks of input and output columns seem to occur 
> when pipeline.fit() or pipeline.transform() is called.
> It would be better to clarify in the docs and/or remove that paragraph.
> I'd be happy to write it up, but I have no idea what the intention of this 
> concept is at this point.
> [Additional reference on 
> SO|http://stackoverflow.com/questions/37541668/non-linear-dag-ml-pipelines-in-apache-spark]



