Hi,

It seems that a reasonably large proportion of Spark SQL query time is spent decoding Parquet Binary objects into Java Strings. Has anyone considered trying to optimize these conversions, since many of them are duplicated?
Details are outlined in the conversation on the user mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html); I have copied a bit of that discussion here.

As Spark processes each row from Parquet, it makes a call to convert the Binary representation of each String column into a Java String. However, in many (probably most) circumstances the underlying Binary instance from Parquet will have come from a dictionary, for example when column cardinality is low. Spark is therefore converting the same byte array into a copy of the same Java String over and over again. This wastes CPU, wastes memory on the duplicate Strings, and probably makes grouping comparisons more expensive.

I tested a simple hack that caches the last Binary->String conversion per column in ParquetConverter, and it gave a 25% performance improvement for the queries I used. Admittedly this was over a data set with lots of runs of the same Strings in the queried columns. A sketch of the hack is in the P.S. below. These costs are quite significant for the type of data I expect will be stored in Parquet, which will often be denormalized tables with lots of fairly low-cardinality string columns.

I think a good way to optimize this would be to change Parquet so that the encoding/decoding of objects to Binary is handled on the Parquet side of the fence. Parquet could deal with objects (Strings) as the client understands them and use encoding/decoding only to store to and read from the underlying storage medium. Doing this, Parquet could ensure that each object is encoded/decoded only once; the second sketch below suggests the converter API may already have a hook for this.

Does anyone have an opinion on this? Has it been considered already?

Cheers,
Mick
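P.S. The hack was essentially the following. This is a minimal sketch rather than the exact patch: the class name CachedStringConverter and the updateString callback are illustrative, and it assumes Parquet does not mutate Binary instances in place (toStringUsingUTF8 is the existing decoding call on Binary).

    import parquet.io.api.{Binary, PrimitiveConverter}

    // Per-column converter that remembers the last Binary it decoded.
    // For dictionary-encoded pages Parquet hands back the same value
    // repeatedly, so the equality check usually hits.
    class CachedStringConverter(updateString: String => Unit)
        extends PrimitiveConverter {

      private[this] var lastBinary: Binary = _
      private[this] var lastString: String = _

      override def addBinary(value: Binary): Unit = {
        // Binary.equals compares the underlying bytes, which is cheaper
        // than allocating and UTF-8-decoding a fresh String every time.
        if (value != lastBinary) {
          lastBinary = value
          lastString = value.toStringUsingUTF8
        }
        updateString(lastString)
      }
    }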
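P.P.S. On the idea of pushing this to the Parquet side of the fence: if I am reading the Parquet code right, PrimitiveConverter already exposes hasDictionarySupport/setDictionary/addValueFromDictionary, so a converter can decode each dictionary entry to a String once and then serve repeated values by id. A rough sketch along the same lines as above (names again illustrative):

    import parquet.column.Dictionary
    import parquet.io.api.{Binary, PrimitiveConverter}

    // Decodes every dictionary entry once, then serves repeated values
    // as array lookups instead of fresh Binary->String conversions.
    class DictionaryStringConverter(updateString: String => Unit)
        extends PrimitiveConverter {

      private[this] var decoded: Array[String] = _

      // Tell Parquet this converter can consume dictionary ids directly.
      override def hasDictionarySupport(): Boolean = true

      override def setDictionary(dictionary: Dictionary): Unit = {
        // One decode per distinct value, not per row.
        decoded = Array.tabulate(dictionary.getMaxId + 1) { id =>
          dictionary.decodeToBinary(id).toStringUsingUTF8
        }
      }

      override def addValueFromDictionary(dictionaryId: Int): Unit =
        updateString(decoded(dictionaryId))

      // Fallback for pages that are not dictionary encoded.
      override def addBinary(value: Binary): Unit =
        updateString(value.toStringUsingUTF8)
    }

With something like this the decoding cost would be proportional to the number of distinct values per page rather than the number of rows.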