[ https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-19846:
------------------------------------

    Assignee: Apache Spark

> Add a flag to disable constraint propagation
> --------------------------------------------
>
>                 Key: SPARK-19846
>                 URL: https://issues.apache.org/jira/browse/SPARK-19846
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Apache Spark
>
> Constraint propagation can be computationally expensive and can block
> driver execution for a long time. For example, the benchmark below needs
> 30 minutes. Compared with other attempts to modify how constraint
> propagation works, this is a much simpler option: add a flag to disable
> constraint propagation.
> {code}
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
> import org.apache.spark.sql.internal.SQLConf
>
> // Turn constraint propagation off using the flag proposed in this issue.
> spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
>
> // Start from a 3-row DataFrame and add 40 string columns copied from x0.
> val df = (1 to 40).foldLeft(
>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
> )((df, i) => df.withColumn(s"x$i", $"x0"))
>
> // One StringIndexer per string column (every column except "id").
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
>
> // One OneHotEncoder per indexed column.
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
>
> val stages: Array[PipelineStage] = indexers ++ encoders
> val pipeline = new Pipeline().setStages(stages)
>
> // Time fitting and applying the pipeline; runningTime is in nanoseconds.
> val startTime = System.nanoTime
> pipeline.fit(df).transform(df).show
> val runningTime = System.nanoTime - startTime
> {code}