Hi folks,

I have compared the performance of Spark SQL 1.6.0 and 1.5.2. In a simple
case, Spark 1.6.0 is noticeably faster than Spark 1.5.2. However, on a more
complex query (in our case an aggregation with grouping sets) Spark SQL
1.6.0 is much slower than Spark SQL 1.5.2. Could any of you kindly suggest
a workaround for this performance regression?

Here is our test scenario:

// Each row carries three random six-digit string keys and two random measures.
case class Toto(
                 a: String = f"${(math.random*1e6).toLong}%06.0f",
                 b: String = f"${(math.random*1e6).toLong}%06.0f",
                 c: String = f"${(math.random*1e6).toLong}%06.0f",
                 n: Int = (math.random*1e3).toInt,
                 m: Double = math.random*1e3)

val data = sc.parallelize(1 to 1e6.toInt).map(_ => Toto())  // one million random rows
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )
val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText = s"$sqlSelect $sqlGroupBy"

val rs1 = sqlContext.sql( sqlText )
rs1.write.parquet( "rs1" )  // saveAsParquetFile is deprecated since 1.4
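
One rewrite we may try as a workaround is to expand the grouping sets by
hand into a UNION ALL of plain GROUP BY queries, which should be
semantically equivalent apart from the GROUPING__ID column. A sketch (the
CAST(NULL AS STRING) placeholders for the missing dimensions are our own
and untested on 1.6.0):

// Manual expansion of GROUPING SETS ((a,b),(a),(b)) into three GROUP BYs.
val sqlUnion = """
  SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3 FROM toto GROUP BY a, b
  UNION ALL
  SELECT a, CAST(NULL AS STRING) AS b, COUNT(1), COUNT(DISTINCT n), SUM(m) FROM toto GROUP BY a
  UNION ALL
  SELECT CAST(NULL AS STRING) AS a, b, COUNT(1), COUNT(DISTINCT n), SUM(m) FROM toto GROUP BY b
"""
val rs2 = sqlContext.sql( sqlUnion )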

The query is executed from a spark-shell in local mode with
--driver-memory=1G. Screenshots from the Spark UI are available at
http://i.stack.imgur.com/VujQY.png (Spark 1.5.2) and
http://i.stack.imgur.com/Hlg95.png (Spark 1.6.0). The DAG on Spark 1.6.0
can be viewed at http://i.stack.imgur.com/u3HrG.png.
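
For anyone reproducing the comparison without the UI, a rough wall-clock
timing of the whole job can be taken like this (a minimal sketch; the
output path "rs1-timed" is a throwaway name of our own):

// Time the full query plus the Parquet write, end to end.
val t0 = System.nanoTime()
sqlContext.sql( sqlText ).write.parquet( "rs1-timed" )
println( f"elapsed: ${(System.nanoTime() - t0) / 1e9}%.1f s" )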

Many thanks and looking forward to hearing from you,
Tien-Dung Le
