Here are some timings showing the effect of caching the last Binary->String conversion. Query times are reduced significantly, and the much lower variation in timings reflects the reduction in garbage generated.
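The idea can be sketched as follows. This is a minimal illustration, not the actual patch: the class name `CachingStringDecoder` is hypothetical, and a real Parquet reader would more likely compare the backing Binary by reference for dictionary-encoded values rather than by byte content as done here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: cache the last Binary->String conversion so that
// repeated binary values (common with Parquet dictionary/RLE encoding)
// do not each allocate a fresh String.
class CachingStringDecoder {
    private byte[] lastBytes;   // last binary value we decoded
    private String lastString;  // its cached UTF-8 decoding

    String decode(byte[] bytes) {
        // If the same binary value arrives again, reuse the cached String
        // instead of decoding and allocating a new one.
        if (lastBytes != null && Arrays.equals(lastBytes, bytes)) {
            return lastString;
        }
        lastBytes = bytes;
        lastString = new String(bytes, StandardCharsets.UTF_8);
        return lastString;
    }
}
```

When consecutive rows carry the same string value, the second and later calls return the cached String instance, avoiding both the UTF-8 decode and the allocation that would otherwise become garbage.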
Set of sample queries selecting various columns, applying some filtering, and then aggregating.

Spark 1.2.0
  Query 1: mean 8353.3 ms,  std dev 480.92 ms
  Query 2: mean 8677.6 ms,  std dev 3193.35 ms
  Query 3: mean 11302.5 ms, std dev 2989.94 ms
  Query 4: mean 10537.0 ms, std dev 5166.02 ms
  Query 5: mean 9559.9 ms,  std dev 4141.49 ms
  Query 6: mean 12638.1 ms, std dev 3639.45 ms

Spark 1.2.0 with cached last Binary->String conversion
  Query 1: mean 5118.9 ms,  std dev 549.67 ms
  Query 2: mean 3761.3 ms,  std dev 202.58 ms
  Query 3: mean 7358.8 ms,  std dev 242.59 ms
  Query 4: mean 4173.5 ms,  std dev 179.80 ms
  Query 5: mean 3857.0 ms,  std dev 140.72 ms
  Query 6: mean 7512.0 ms,  std dev 198.33 ms

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10193.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.