Hi all, I'm forwarding a question <http://stackoverflow.com/questions/39125911/performance-of-loading-parquet-files-into-case-classes-in-spark> I recently asked on Stack Overflow about benchmarking Spark performance when working with case classes stored in Parquet files.
I am assessing the performance of different ways of loading Parquet files in Spark, and the differences are staggering.

In our Parquet files, we have nested case classes of the form:

    case class C(/* a dozen attributes */)
    case class B(/* a dozen attributes */, cs: Seq[C])
    case class A(/* a dozen attributes */, bs: Seq[B])

Loading them from Parquet takes a while, so I benchmarked different ways of loading these case classes from Parquet and summing a field, using both Spark 1.6 and 2.0. Here is a summary of the benchmark:

    val df: DataFrame = sqlContext.read.parquet("path/to/file.gz.parquet").persist()
    df.count()

    // Spark 1.6

    // Play Json
    // 63.169s
    df.toJSON.flatMap(s => Try(Json.parse(s).as[A]).toOption)
      .map(_.fieldToSum).sum()

    // Direct access to the field using Spark Row
    // 2.811s
    df.map(row => row.getAs[Long]("fieldToSum")).sum()

    // A small library we developed that accesses fields through Spark Row
    // 10.401s
    df.toRDD[A].map(_.fieldToSum).sum()

    // DataFrame hybrid SQL API
    // 0.239s
    df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)

    // Dataset with RDD-style code
    // 34.223s
    df.as[A].map(_.fieldToSum).reduce(_ + _)

    // Dataset with column selection
    // 0.176s
    df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)

    // Spark 2.0
    // Performance is similar, except for:

    // Direct access to the field using Spark Row
    // 23.168s
    df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)

    // A small library we developed that accesses fields through Spark Row
    // 32.898s
    df.toRDD[A].map(_.fieldToSum).sum()

I understand why the performance of the methods using Spark Row degrades when upgrading to Spark 2.0, since DataFrame is now a mere alias for Dataset[Row]. That's the cost of unifying the interfaces, I guess.

On the other hand, I'm quite disappointed that the promise of Dataset is not kept: performance with RDD-style code (maps and flatMaps) is worse than when the Dataset is used like a DataFrame with the SQL-like DSL. Basically, to get good performance we have to give up type safety.

1. What is the reason for such a difference between a Dataset used as an RDD and a Dataset used as a DataFrame?

2. Is there a way to improve encoding performance in Dataset so that RDD-style code performs as well as SQL-style code? For data engineering, RDD-style code is much cleaner.

3. Working with the SQL-like DSL would also require flattening our data model and not using nested case classes. Am I right that good performance is only achieved with flat data models?

4. Is the performance regression between Spark 1.6 and Spark 2.0 an identified problem? Will it be addressed in future releases, or is the regression specific to my case, meaning I should handle my data differently?

5. Is the performance difference between RDD-style and SQL-style code on Dataset an identified problem? Will it be addressed in future releases? Maybe there is nothing to be done about it, for reasons I can't see with my limited understanding of Spark internals. Or should I migrate to the SQL-style interface and lose type safety?

Best regards,
Julien
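P.S. In case it helps anyone reproduce the Dataset comparison, below is a minimal, self-contained sketch of the kind of measurement involved (Spark 2.x). The flat case class, the column name fieldToSum, and the crude wall-clock timer are illustrative stand-ins, not our actual nested model or benchmark harness.

    import org.apache.spark.sql.SparkSession

    object EncodingBenchmarkSketch {
      // Illustrative stand-in for the real nested model.
      case class A(fieldToSum: Long)

      // Crude wall-clock timer; good enough to show the order-of-magnitude gap.
      def time[T](label: String)(body: => T): T = {
        val start = System.nanoTime()
        val result = body
        println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
        result
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("encoding-benchmark").getOrCreate()
        import spark.implicits._

        val df = spark.read.parquet("path/to/file.gz.parquet").persist()
        df.count() // materialise the cache before timing anything

        val ds = df.as[A]

        // RDD-style: the encoder deserialises every row into a case class.
        time("Dataset map/reduce")(ds.map(_.fieldToSum).reduce(_ + _))

        // Column-style: only the one column is selected, no per-row case class construction.
        time("Dataset typed column")(ds.select($"fieldToSum".as[Long]).reduce(_ + _))

        spark.stop()
      }
    }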