Michael Dreibelbis created SPARK-24656:
------------------------------------------

             Summary: SparkML Transformers and Estimators with multiple columns
                 Key: SPARK-24656
                 URL: https://issues.apache.org/jira/browse/SPARK-24656
             Project: Spark
          Issue Type: New Feature
          Components: ML, MLlib
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis


Currently, SparkML Transformers and Estimators operate on a single input/output column pair. This makes pipelines cumbersome to write, and slow to run, when the same transformation needs to be applied to multiple columns, because every column requires its own stage.

I propose implementing ParallelPipelineStage/Transformer/Estimator/Model classes that would operate on the input columns in parallel.

{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// old way: one stage per column per transformation
val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
  new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
  new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
  new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
))

// proposed way: one stage per transformation, covering all columns
// (ParallelCountVectorizer and ParallelIDF are the proposed classes; they do not exist in Spark yet)
val pipeline2 = new Pipeline().setStages(Array(
  new ParallelCountVectorizer().setInputCols(Array("_1", "_2")).setOutputCols(Array("_1_cv", "_2_cv")),
  new ParallelIDF().setInputCols(Array("_1_cv", "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf"))
))
{code}
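
To make the "in parallel" part of the proposal concrete, here is a minimal sketch of what a ParallelCountVectorizer might do internally when fitted: train one ordinary CountVectorizer per input column, each on its own thread. This is only an illustration under stated assumptions, not the proposed implementation. The helpers fitCountVectorizers and transformAll are hypothetical names that are not part of Spark, and the sketch assumes the DataFrame is cached so that the concurrent fits do not each recompute it.

{code:java}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): fits one CountVectorizer per
// input/output column pair, submitting the fits concurrently.
def fitCountVectorizers(
    df: DataFrame,
    inputCols: Seq[String],
    outputCols: Seq[String]): Seq[CountVectorizerModel] = {
  require(inputCols.length == outputCols.length,
    "inputCols and outputCols must have the same length")

  // Each Future runs one fit; Spark accepts jobs from multiple threads.
  val futures = inputCols.zip(outputCols).map { case (in, out) =>
    Future {
      new CountVectorizer().setInputCol(in).setOutputCol(out).fit(df)
    }
  }

  // Block until every column has been fitted; results keep the input order.
  Await.result(Future.sequence(futures), Duration.Inf)
}

// Hypothetical helper (not a Spark API): applies the fitted models one
// after another, so each model appends its own output column.
def transformAll(df: DataFrame, models: Seq[Transformer]): DataFrame =
  models.foldLeft(df)((acc, model) => model.transform(acc))
{code}

With these helpers, transformAll(df, fitCountVectorizers(df, Seq("_1", "_2"), Seq("_1_cv", "_2_cv"))) would stand in for the two CountVectorizer stages of the old pipeline. Whether running the fits concurrently actually helps depends on how much spare capacity the cluster has; a real implementation might instead compute all columns in a single pass over the data.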