We improved this in 1.4. Adding 100 columns took 4s on my laptop.
https://issues.apache.org/jira/browse/SPARK-7276

Still not the fastest, but much faster.

scala> var df = Seq((1, 2)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala>

scala> val start = System.nanoTime
start: Long = 1433274299441224000

scala> for (i <- 1 to 100) {
     |   df = df.withColumn("n" + i, org.apache.spark.sql.functions.lit(0))
     | }

scala> val end = System.nanoTime
end: Long = 1433274303250091000

scala>

scala> println((end - start) / 1000 / 1000 / 1000)
3


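FWIW, if you need to add many columns at once, building them all up front and
doing a single select is usually faster than chaining withColumn (I believe
each withColumn call goes through analysis again). A rough, untested sketch --
the n1..n100 names and lit(0) just mirror the transcript above:

import org.apache.spark.sql.functions.{col, lit}

// build all 100 literal columns, then add them to df in one select
val newCols = (1 to 100).map(i => lit(0).as("n" + i))
val df2 = df.select((col("*") +: newCols): _*)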
On Tue, Jun 2, 2015 at 12:34 PM, zsampson <zsamp...@palantir.com> wrote:

> Hey,
>
> I'm seeing extreme slowness in withColumn when it's used in a loop. I'm
> running this code:
>
> for (int i = 0; i < NUM_ITERATIONS; ++i) {
>     df = df.withColumn("col" + i,
>         new Column(new Literal(i, DataTypes.IntegerType)));
> }
>
> where df is initially a trivial dataframe. Here are the results of running
> with different values of NUM_ITERATIONS:
>
> iterations      time
> 25      3s
> 50      11s
> 75      31s
> 100     76s
> 125     159s
> 150     283s
>
> When I update the DataFrame by manually copying/appending to the column
> array and using DataFrame.select, it runs in about half the time, but this
> is still untenable at any significant number of iterations.
>
> Any insight?
>