Hi,

It seems that a reasonably large proportion of Spark SQL query time is spent decoding Parquet Binary objects into Java Strings. Has anyone considered trying to optimize these conversions, since many of them are duplicated?
Details are outlined in the conversation on the user mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html); I have copied a bit of that discussion here.

As Spark processes each row from Parquet, it makes a call to convert the Binary representation of each String column into a Java String. However, in many (probably most) circumstances the underlying Binary instance from Parquet will have come from a dictionary, for example when column cardinality is low. Spark is therefore converting the same byte array into a copy of the same Java String over and over again. This wastes CPU, wastes memory on the duplicate Strings, and probably makes grouping comparisons more expensive.

I tested a simple hack that caches the last Binary->String conversion per column in ParquetConverter, and it gave a 25% performance improvement for the queries I used. Admittedly this was over a data set with lots of runs of the same Strings in the queried columns. A sketch of the hack is in the P.S. below. These costs are quite significant for the type of data I expect will be stored in Parquet, which will often be denormalized tables with lots of fairly low-cardinality string columns.

I think a good way to optimize this would be to change Parquet so that the encoding/decoding of objects to Binary is handled on the Parquet side of the fence. Parquet could deal with objects (Strings) as the client understands them and use encoding/decoding only to store to and read from the underlying storage medium. Doing this, Parquet could ensure that each object is encoded/decoded only once; the second sketch below suggests the converter API may already have a hook for this.

Does anyone have an opinion on this? Has it been considered already?

Cheers,
Mick
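P.S. The hack was essentially the following. This is a minimal sketch rather than the exact patch: the class name CachedStringConverter and the updateString callback are illustrative, and it assumes Parquet does not mutate Binary instances in place (toStringUsingUTF8 is the existing decoding call on Binary).

    import parquet.io.api.{Binary, PrimitiveConverter}

    // Per-column converter that remembers the last Binary it decoded.
    // For dictionary-encoded pages Parquet hands back the same value
    // repeatedly, so the equality check usually hits.
    class CachedStringConverter(updateString: String => Unit)
        extends PrimitiveConverter {

      private[this] var lastBinary: Binary = _
      private[this] var lastString: String = _

      override def addBinary(value: Binary): Unit = {
        // Binary.equals compares the underlying bytes, which is cheaper
        // than allocating and UTF-8-decoding a fresh String every time.
        if (value != lastBinary) {
          lastBinary = value
          lastString = value.toStringUsingUTF8
        }
        updateString(lastString)
      }
    }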
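P.P.S. On the idea of pushing this to the Parquet side of the fence: if I am reading the Parquet code right, PrimitiveConverter already exposes hasDictionarySupport/setDictionary/addValueFromDictionary, so a converter can decode each dictionary entry to a String once and then serve repeated values by id. A rough sketch along the same lines as above (names again illustrative):

    import parquet.column.Dictionary
    import parquet.io.api.{Binary, PrimitiveConverter}

    // Decodes every dictionary entry once, then serves repeated values
    // as array lookups instead of fresh Binary->String conversions.
    class DictionaryStringConverter(updateString: String => Unit)
        extends PrimitiveConverter {

      private[this] var decoded: Array[String] = _

      // Tell Parquet this converter can consume dictionary ids directly.
      override def hasDictionarySupport(): Boolean = true

      override def setDictionary(dictionary: Dictionary): Unit = {
        // One decode per distinct value, not per row.
        decoded = Array.tabulate(dictionary.getMaxId + 1) { id =>
          dictionary.decodeToBinary(id).toStringUsingUTF8
        }
      }

      override def addValueFromDictionary(dictionaryId: Int): Unit =
        updateString(decoded(dictionaryId))

      // Fallback for pages that are not dictionary encoded.
      override def addBinary(value: Binary): Unit =
        updateString(value.toStringUsingUTF8)
    }

With something like this the decoding cost would be proportional to the number of distinct values per page rather than the number of rows.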