Michael Dreibelbis created SPARK-24656:
------------------------------------------

             Summary: SparkML Transformers and Estimators with multiple columns
                 Key: SPARK-24656
                 URL: https://issues.apache.org/jira/browse/SPARK-24656
             Project: Spark
          Issue Type: New Feature
          Components: ML, MLlib
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis


Currently, SparkML Transformers and Estimators operate on a single 
input/output column pair. This makes pipelines extremely cumbersome (as well 
as non-performant) when the same transformation needs to be applied to 
multiple columns.

 

I propose implementing ParallelPipelineStage, ParallelTransformer, 
ParallelEstimator and ParallelModel, which would operate on multiple input 
columns in parallel.

 
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// old way: one stage per column and per operation
val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
  new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
  new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
  new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
))

// proposed way: one stage per operation
// (ParallelCountVectorizer and ParallelIDF are the proposed classes; they do not exist yet)
val pipeline2 = new Pipeline().setStages(Array(
  new ParallelCountVectorizer()
    .setInputCols(Array("_1", "_2"))
    .setOutputCols(Array("_1_cv", "_2_cv")),
  new ParallelIDF()
    .setInputCols(Array("_1_cv", "_2_cv"))
    .setOutputCols(Array("_1_idf", "_2_idf"))
))
{code}
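
For comparison, the per-column boilerplate of the old way can already be reduced with a small helper over the existing single-column stages. The following is only a rough sketch: perColumn and MultiColumnSketch are hypothetical names used for illustration and are not part of Spark. The stages it builds are still fitted and applied one after another, so it addresses the verbosity but not the performance concern that the proposed Parallel* stages target.

{code:java}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

object MultiColumnSketch {
  // Hypothetical helper (not a Spark API): builds one single-column stage
  // for each (input column, output column) pair.
  def perColumn(inputCols: Seq[String], outputCols: Seq[String])(
      make: (String, String) => PipelineStage): Seq[PipelineStage] = {
    require(inputCols.length == outputCols.length,
      "inputCols and outputCols must have the same length")
    inputCols.zip(outputCols).map { case (in, out) => make(in, out) }
  }

  val cols: Seq[String] = Seq("_1", "_2")
  val cvCols: Seq[String] = cols.map(_ + "_cv")
  val idfCols: Seq[String] = cols.map(_ + "_idf")

  // Same four stages as the "old way" above, generated from the column lists.
  val stages: Seq[PipelineStage] =
    perColumn(cols, cvCols) { (in, out) =>
      new CountVectorizer().setInputCol(in).setOutputCol(out)
    } ++
    perColumn(cvCols, idfCols) { (in, out) =>
      new IDF().setInputCol(in).setOutputCol(out)
    }

  val pipeline: Pipeline = new Pipeline().setStages(stages.toArray)
}
{code}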



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
