Re: Optimize encoding/decoding strings when using Parquet
I have put in a PR on Parquet to support dictionaries when filters are pushed down, which should reduce binary conversion overhead when Spark pushes down string predicates on columns that are dictionary encoded: https://github.com/apache/incubator-parquet-mr/pull/117. It's blocked at the moment, as part of my Parquet build fails on my Mac due to issues getting Thrift 0.7 installed. The installation instructions available on Parquet do not seem to work, I think due to this issue: https://issues.apache.org/jira/browse/THRIFT-2229. This is not directly related to Spark, but I wondered if anyone has got Thrift 0.7 working on Mac OS X Yosemite (10.10), or can suggest a workaround.
Re: Optimize encoding/decoding strings when using Parquet
Added PR https://github.com/apache/spark/pull/4139 - I think the tests have been re-arranged, so a merge will be necessary.

Mick

On 19 Jan 2015, at 18:31, Reynold Xin r...@databricks.com wrote:
> Definitely go for a pull request!
Re: Optimize encoding/decoding strings when using Parquet
Added a JIRA to track: https://issues.apache.org/jira/browse/SPARK-5309
Re: Optimize encoding/decoding strings when using Parquet
Here are some timings showing the effect of caching the last Binary-to-String conversion. Query times are reduced significantly, and the variation in timings is much lower, presumably due to the reduction in garbage. The sample queries select various columns, apply some filtering, and then aggregate.

Spark 1.2.0:
Query 1: mean 8353.3 ms, std dev 480.9 ms
Query 2: mean 8677.6 ms, std dev 3193.3 ms
Query 3: mean 11302.5 ms, std dev 2989.9 ms
Query 4: mean 10537.0 ms, std dev 5166.0 ms
Query 5: mean 9559.9 ms, std dev 4141.5 ms
Query 6: mean 12638.1 ms, std dev 3639.5 ms

Spark 1.2.0 with the last Binary-to-String conversion cached:
Query 1: mean 5118.9 ms, std dev 549.7 ms
Query 2: mean 3761.3 ms, std dev 202.6 ms
Query 3: mean 7358.8 ms, std dev 242.6 ms
Query 4: mean 4173.5 ms, std dev 179.8 ms
Query 5: mean 3857.0 ms, std dev 140.7 ms
Query 6: mean 7512.0 ms, std dev 198.3 ms
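A minimal sketch of how mean/standard-deviation figures like these could be gathered; the `run` callback and the iteration count are assumptions, not the harness actually used:

```scala
// Hypothetical harness: run a query repeatedly and report the mean and
// standard deviation of the wall-clock time in milliseconds.
def timeQuery(run: () => Unit, iterations: Int = 10): (Double, Double) = {
  val millis = (1 to iterations).map { _ =>
    val start = System.nanoTime()
    run()
    (System.nanoTime() - start) / 1e6 // nanos -> millis
  }
  val mean = millis.sum / millis.size
  val stdDev = math.sqrt(millis.map(t => (t - mean) * (t - mean)).sum / millis.size)
  (mean, stdDev)
}
```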
Re: Optimize encoding/decoding strings when using Parquet
Looking at the Parquet code, it looks like hooks are already in place to support this. In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think it would be pretty straightforward to add this. Has anyone considered it? Shall I put a pull request together for it?

Mick
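For illustration, roughly what wiring up those hooks might look like — a minimal sketch only, assuming parquet-mr's PrimitiveConverter/Dictionary API; the `update` callback is a made-up stand-in for however the converter hands values to Spark's row:

```scala
import parquet.column.Dictionary
import parquet.io.api.{Binary, PrimitiveConverter}

// Sketch of a dictionary-aware string converter: decode each dictionary
// entry to a String exactly once, then serve row values by dictionary id.
class DictionaryStringConverter(update: String => Unit) extends PrimitiveConverter {
  // Dictionary entries decoded to Strings once, indexed by id.
  private var decoded: Array[String] = _

  override def hasDictionarySupport(): Boolean = true

  override def setDictionary(dictionary: Dictionary): Unit = {
    decoded = Array.tabulate(dictionary.getMaxId + 1) { id =>
      dictionary.decodeToBinary(id).toStringUsingUTF8
    }
  }

  // Called for dictionary-encoded pages: no per-row UTF-8 decoding.
  override def addValueFromDictionary(dictionaryId: Int): Unit =
    update(decoded(dictionaryId))

  // Fallback for pages that are not dictionary encoded.
  override def addBinary(value: Binary): Unit =
    update(value.toStringUsingUTF8)
}
```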
Re: Optimize encoding/decoding strings when using Parquet
Definitely go for a pull request!

On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com wrote:
> Looking at the Parquet code, it looks like hooks are already in place to support this. [...]
Optimize encoding/decoding strings when using Parquet
Hi,

It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions, as many are duplicated? Details are outlined in this conversation on the user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html. I have copied a bit of that discussion here.

As Spark processes each row from Parquet, it makes a call to convert the Binary representation of each String column into a Java String. However, in many (probably most) circumstances the underlying Binary instance from Parquet will have come from a dictionary, for example when column cardinality is low. Spark is therefore converting the same byte array to a copy of the same Java String over and over again. This is bad: it costs extra CPU and extra memory for these Strings, and probably results in more expensive grouping comparisons.

I tested a simple hack to cache the last Binary-to-String conversion per column in ParquetConverter, and this led to a 25% performance improvement for the queries I used (see the sketch below). Admittedly this was over a data set with lots of runs of the same Strings in the queried columns. These costs are quite significant for the type of data that I expect will be stored in Parquet, which will often have denormalized tables and probably lots of fairly low cardinality string columns.

I think a good way to optimize this would be to change Parquet so that the encoding/decoding of objects to Binary is handled on the Parquet side of the fence. Parquet could deal with objects (Strings) as the client understands them and only use encoding/decoding to store/read from the underlying storage medium. Doing this, Parquet could ensure that the encoding/decoding of each object occurs only once.

Does anyone have an opinion on this? Has it been considered already?

Cheers
Mick
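The hack is along these lines — a minimal sketch only, not the actual patch; the `update` callback is a made-up stand-in for the code that writes the value into the current row:

```scala
import parquet.io.api.{Binary, PrimitiveConverter}

// Sketch of the per-column "cache the last Binary-to-String conversion"
// hack described above.
class CachedStringConverter(update: String => Unit) extends PrimitiveConverter {
  private var lastBinary: Binary = null
  private var lastString: String = null

  override def addBinary(value: Binary): Unit = {
    // Only decode when the Binary differs from the previous value; runs of
    // repeated values (low-cardinality columns) skip the UTF-8 decode.
    if (lastBinary == null || lastBinary != value) {
      lastBinary = value
      lastString = value.toStringUsingUTF8
    }
    update(lastString)
  }
}
```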
Re: Optimize encoding/decoding strings when using Parquet
+1 to adding such an optimization to Parquet. The bytes are tagged specially as UTF8 in the Parquet schema, so it seems like it would be possible to add this.

On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies michael.belldav...@gmail.com wrote:
> It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. [...]
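For illustration, a minimal sketch of what that UTF8 tagging looks like in a Parquet schema, parsed here via parquet-mr's MessageTypeParser; the schema contents are made up:

```scala
import parquet.schema.MessageTypeParser

// A binary column annotated as UTF8: the annotation is what tells readers
// these bytes are string data rather than opaque binary.
val schema = MessageTypeParser.parseMessageType(
  """message example {
    |  optional binary name (UTF8);
    |}""".stripMargin)
```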