[ https://issues.apache.org/jira/browse/SPARK-24597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24597.
----------------------------------
    Resolution: Incomplete

> Spark ML Pipeline Should support non-linear models => DAGPipeline
> -----------------------------------------------------------------
>
>                 Key: SPARK-24597
>                 URL: https://issues.apache.org/jira/browse/SPARK-24597
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Michael Dreibelbis
>            Priority: Minor
>              Labels: bulk-closed
>
> Currently, the Spark ML Pipeline/PipelineModel supports only a single, linear
> dataset transformation, despite the documentation stating otherwise:
> [reference documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details]
> I propose implementing a DAGPipeline that supports multiple datasets as input.
> The code could look something like this:
>
> {code:scala}
> val ds1 = /*dataset 1 creation*/
> val ds2 = /*dataset 2 creation*/
> // nodes take on uid from estimator/transformer
> val i1 = IdentityNode(new IdentityTransformer("i1"))
> val i2 = IdentityNode(new IdentityTransformer("i2"))
> val bi = TransformerNode(new Binarizer("bi"))
> val cv = EstimatorNode(new CountVectorizer("cv"))
> val idf = EstimatorNode(new IDF("idf"))
> val j1 = JoinerNode(new Joiner("j1"))
> // every node referenced by an edge must be listed, including the joiner
> val nodes = Array(i1, i2, bi, cv, idf, j1)
> // each edge is (upstream uid, downstream uid)
> val edges = Array(
>   ("i1", "cv"), ("cv", "idf"), ("idf", "j1"),
>   ("i2", "bi"), ("bi", "j1"))
> val p = new DAGPipeline(nodes, edges)
>   .setIdentity("i1", ds1)
>   .setIdentity("i2", ds2)
> val m = p.fit(spark.emptyDataFrame)
> m.setIdentity("i1", ds1).setIdentity("i2", ds2)
> m.transform(spark.emptyDataFrame)
> {code}
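
The issue does not say how a DAGPipeline would decide the order in which to fit or apply its nodes. One plausible approach, sketched here as an assumption and not as anything from Spark or the reporter, is a topological sort of the `(from, to)` edge list. The `DagOrder` object and its `topoSort` method are hypothetical names; the sketch uses only the Scala standard library and no Spark classes.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: orders node uids so that every
// edge (from, to) has `from` appear before `to` (Kahn's algorithm).
object DagOrder {
  def topoSort(nodes: Seq[String], edges: Seq[(String, String)]): Seq[String] = {
    // Reject edges that mention a uid missing from the node list.
    val unknown = edges.flatMap { case (a, b) => Seq(a, b) }.filterNot(nodes.contains)
    require(unknown.isEmpty,
      s"edges reference undeclared nodes: ${unknown.distinct.mkString(", ")}")

    // Count incoming edges per node.
    val inDegree = mutable.Map(nodes.map(_ -> 0): _*)
    edges.foreach { case (_, to) => inDegree(to) += 1 }

    // Start from nodes with no inputs (the identity nodes in the proposal).
    val ready = mutable.Queue(nodes.filter(inDegree(_) == 0): _*)
    val order = mutable.ArrayBuffer.empty[String]
    while (ready.nonEmpty) {
      val n = ready.dequeue()
      order += n
      // Release each downstream node once all of its inputs are ordered.
      edges.collect { case (`n`, to) => to }.foreach { to =>
        inDegree(to) -= 1
        if (inDegree(to) == 0) ready.enqueue(to)
      }
    }
    require(order.size == nodes.size, "graph contains a cycle")
    order.toSeq
  }
}
```

Called with the uids and edges from the issue, `DagOrder.topoSort(Seq("i1", "i2", "bi", "cv", "idf", "j1"), edges)` yields an order in which `cv` precedes `idf` and both branches precede `j1`. A node list that leaves out `j1` fails the first `require`, since the edges still reference it.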