Michael Dreibelbis created SPARK-24656:
------------------------------------------
Summary: SparkML Transformers and Estimators with multiple columns
Key: SPARK-24656
URL: https://issues.apache.org/jira/browse/SPARK-24656
Project: Spark
Issue Type: New Feature
Components: ML, MLlib
Affects Versions: 2.3.1
Reporter: Michael Dreibelbis
Currently, SparkML Transformers and Estimators operate on single input/output
column pairs. This makes pipelines extremely cumbersome (as well as
non-performant) when transformations need to be made on multiple columns.
I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model
that would operate on the input columns in parallel.
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// old way: one stage per column and per transformation
val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
  new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
  new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
  new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
))

// proposed way: one stage per transformation, covering all columns
// (ParallelCountVectorizer and ParallelIDF are proposed and do not exist yet)
val pipeline2 = new Pipeline().setStages(Array(
  new ParallelCountVectorizer()
    .setInputCols(Array("_1", "_2"))
    .setOutputCols(Array("_1_cv", "_2_cv")),
  new ParallelIDF()
    .setInputCols(Array("_1_cv", "_2_cv"))
    .setOutputCols(Array("_1_idf", "_2_idf"))
))
{code}
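For comparison, the per-column stages of the "old way" can be generated from a list of column names with the existing API. This is only a sketch of a possible stopgap, not part of the proposal: the column names and the "_cv"/"_idf" suffixes are taken from the example above, and the value names (inputCols, cvStages, idfStages, generated) are made up for illustration. It shortens the pipeline definition but still yields one stage per column, so it would not address the performance concern.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Illustrative column names, matching the example above.
val inputCols = Seq("_1", "_2")

// One CountVectorizer per input column, writing to "<col>_cv".
val cvStages: Seq[PipelineStage] = inputCols.map { c =>
  new CountVectorizer().setInputCol(c).setOutputCol(s"${c}_cv")
}

// One IDF per CountVectorizer output, writing to "<col>_idf".
val idfStages: Seq[PipelineStage] = inputCols.map { c =>
  new IDF().setInputCol(s"${c}_cv").setOutputCol(s"${c}_idf")
}

// All CountVectorizer stages come first so every IDF finds its input column.
val generated = new Pipeline().setStages((cvStages ++ idfStages).toArray)
```

The resulting pipeline has the same four stages, in the same order, as the hand-written "old way" pipeline.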
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)