+1 to adding such an optimization to Parquet. The bytes are tagged specially as UTF8 in the Parquet schema, so it seems like it would be possible to add this.
On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies <michael.belldav...@gmail.com> wrote:

> Hi,
>
> It seems that a reasonably large proportion of query time using Spark SQL
> is spent decoding Parquet Binary objects to produce Java Strings. Has
> anyone considered trying to optimize these conversions, as many are
> duplicated?
>
> Details are outlined in the conversation on the user mailing list:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html
> I have copied a bit of that discussion here.
>
> It seems that as Spark processes each row from Parquet it makes a call to
> convert the Binary representation of each String column into a Java
> String. However, in many (probably most) circumstances the underlying
> Binary instance from Parquet will have come from a Dictionary, for example
> when column cardinality is low. Therefore Spark converts the same byte
> array to a copy of the same Java String over and over again. This is bad
> due to extra CPU, extra memory used for these strings, and probably
> results in more expensive grouping comparisons.
>
> I tested a simple hack to cache the last Binary->String conversion per
> column in ParquetConverter, and this led to a 25% performance improvement
> for the queries I used. Admittedly this was over a data set with lots of
> runs of the same Strings in the queried columns.
>
> These costs are quite significant for the type of data that I expect will
> be stored in Parquet, which will often have denormalized tables and
> probably lots of fairly low-cardinality string columns.
>
> I think a good way to optimize this would be if changes could be made to
> Parquet so that the encoding/decoding of Objects to Binary is handled on
> the Parquet side of the fence. Parquet could deal with Objects (Strings)
> as the client understands them and only use encoding/decoding to
> store/read from the underlying storage medium. Doing this, I think Parquet
> could ensure that the encoding/decoding of each Object occurs only once.
>
> Does anyone have an opinion on this? Has it been considered already?
>
> Cheers
> Mick
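
For reference, here is a minimal sketch of the per-column caching hack Mick describes: remember the last Binary decoded and reuse the resulting String when the next value has the same bytes. The class and field names below are illustrative (not the actual ParquetConverter code), and it assumes Parquet's `parquet.io.api.Binary` class with its bytes-based `equals` and `toStringUsingUTF8` method.

```scala
import parquet.io.api.Binary

// Hypothetical per-column decoder that caches the last Binary -> String
// conversion. With dictionary-encoded, low-cardinality columns, runs of
// identical values hit the cache and skip repeated UTF-8 decoding.
class CachedStringDecoder {
  private var lastBinary: Binary = null
  private var lastString: String = null

  def decode(value: Binary): String = {
    // Binary.equals compares the underlying bytes, so an identical
    // dictionary value reuses the previously materialized String.
    if (lastBinary != null && lastBinary == value) {
      lastString
    } else {
      lastBinary = value
      lastString = value.toStringUsingUTF8
      lastString
    }
  }
}
```

A dictionary-aware variant could go further and decode each dictionary entry exactly once per page, but even the one-entry cache above captures most of the win when equal values arrive in runs.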