Re: Optimize encoding/decoding strings when using Parquet

2015-02-13 Thread Mick Davies
I have put in a PR on Parquet to support dictionaries when filters are pushed
down, which should reduce binary conversion overhead when Spark pushes down
string predicates on columns that are dictionary encoded.

https://github.com/apache/incubator-parquet-mr/pull/117
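
As a rough illustration of the idea only (this is not the actual PR code; pageCanMatch
and predicate are made-up names, and the parquet.column package name assumes the
pre-rename parquet-mr of that era), a pushed-down string predicate could be evaluated
once against a page's dictionary rather than against the Binary value of every row:

import parquet.column.Dictionary

// Sketch only: if no dictionary entry satisfies the predicate, the page cannot
// contain a matching row, so per-row Binary-to-String conversion for the filter
// can be skipped entirely.
def pageCanMatch(dictionary: Dictionary, predicate: String => Boolean): Boolean =
  (0 to dictionary.getMaxId).exists { id =>
    predicate(dictionary.decodeToBinary(id).toStringUsingUTF8)
  }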

It's blocked at the moment because part of my Parquet build fails on my Mac due
to an issue getting Thrift 0.7 installed. The installation instructions
available on the Parquet site do not seem to work, I think due to this issue:
https://issues.apache.org/jira/browse/THRIFT-2229

This is not directly related to Spark, but I wondered if anyone has got Thrift
0.7 working on Mac OS X Yosemite (10.10), or can suggest a workaround.






Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
Added PR https://github.com/apache/spark/pull/4139 - I think the tests have
been re-arranged, so a merge is necessary.

Mick


 On 19 Jan 2015, at 18:31, Reynold Xin r...@databricks.com wrote:
 
 Definitely go for a pull request!
 
 
 On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com wrote:
 
 Looking at Parquet code - it looks like hooks are already in place to
 support this.
 
 In particular PrimitiveConverter has methods hasDictionarySupport and
 addValueFromDictionary for this purpose. These are not used by
 CatalystPrimitiveConverter.
 
 I think that it would be pretty straightforward to add this. Has anyone
 considered this? Shall I get a pull request together for it?
 
 Mick
 
 
 



Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Added a JIRA to track
https://issues.apache.org/jira/browse/SPARK-5309






Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Here are some timings showing the effect of caching the last Binary-to-String
conversion. Query times are reduced significantly, and the reduction in timing
variation, due to less garbage, is very significant.

Set of sample queries selecting various columns, applying some filtering and
then aggregating

All times in millis. "Cached" is Spark 1.2.0 with the last Binary-to-String
conversion cached per column; the baseline is stock Spark 1.2.0.

Query   Spark 1.2.0 (mean / std dev)   Cached (mean / std dev)
1        8353.3 /  480.9                5118.9 / 549.7
2        8677.6 / 3193.3                3761.3 / 202.6
3       11302.5 / 2989.9                7358.8 / 242.6
4       10537.0 / 5166.0                4173.5 / 179.8
5        9559.9 / 4141.5                3857.0 / 140.7
6       12638.1 / 3639.5                7512.0 / 198.3
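
For reference, the per-column cache amounts to something like the sketch below.
This is not the actual patch: the parquet.io.api package name assumes the
pre-rename parquet-mr that Spark 1.2 builds against, and updateString is just an
illustrative stand-in for however the row converter consumes the decoded value.

import parquet.io.api.{Binary, PrimitiveConverter}

// Sketch only: remember the last byte pattern seen for this column and reuse
// the decoded String while consecutive rows repeat the same value.
class CachedStringConverter(updateString: String => Unit) extends PrimitiveConverter {
  private var lastBytes: Array[Byte] = null
  private var lastString: String = null

  override def addBinary(value: Binary): Unit = {
    val bytes = value.getBytes
    // Only pay the UTF-8 decode and String allocation when the bytes differ
    // from the previous row's value.
    if (lastBytes == null || !java.util.Arrays.equals(lastBytes, bytes)) {
      lastBytes = bytes.clone() // defensive copy in case the backing array is reused
      lastString = new String(bytes, java.nio.charset.StandardCharsets.UTF_8)
    }
    updateString(lastString)
  }
}

The benefit of this particular design depends on consecutive rows repeating the
same value, which is why the improvement varies by query.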







Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies

Looking at Parquet code - it looks like hooks are already in place to
support this.

In particular PrimitiveConverter has methods hasDictionarySupport and
addValueFromDictionary for this purpose. These are not used by
CatalystPrimitiveConverter.

I think that it would be pretty straightforward to add this. Has anyone
considered this? Shall I get a pull request together for it?

Mick
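
As a rough illustration of those hooks, here is a minimal sketch of a
dictionary-aware string converter. It is not Spark's actual code: the parquet
package names assume the pre-rename parquet-mr that Spark 1.2 builds against,
and updateString is an illustrative stand-in for however the surrounding row
converter consumes the value.

import parquet.column.Dictionary
import parquet.io.api.{Binary, PrimitiveConverter}

// Sketch only: decode each dictionary entry to a String once, then serve the
// cached String by dictionary id for every row in the page.
class DictionaryStringConverter(updateString: String => Unit) extends PrimitiveConverter {
  private var decoded: Array[String] = _

  override def hasDictionarySupport(): Boolean = true

  override def setDictionary(dictionary: Dictionary): Unit = {
    decoded = new Array[String](dictionary.getMaxId + 1)
    var i = 0
    while (i <= dictionary.getMaxId) {
      decoded(i) = dictionary.decodeToBinary(i).toStringUsingUTF8
      i += 1
    }
  }

  override def addValueFromDictionary(dictionaryId: Int): Unit =
    updateString(decoded(dictionaryId))

  // Fallback for pages that are not dictionary encoded.
  override def addBinary(value: Binary): Unit =
    updateString(value.toStringUsingUTF8)
}

This bounds the number of Binary-to-String conversions by the dictionary size
rather than by the row count.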






Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Reynold Xin
Definitely go for a pull request!


On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com
wrote:


 Looking at Parquet code - it looks like hooks are already in place to
 support this.

 In particular PrimitiveConverter has methods hasDictionarySupport and
 addValueFromDictionary for this purpose. These are not used by
 CatalystPrimitiveConverter.

 I think that it would be pretty straightforward to add this. Has anyone
 considered this? Shall I get a pull request together for it?

 Mick







Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Mick Davies
Hi, 

It seems that a reasonably large proportion of query time using Spark SQL
is spent decoding Parquet Binary objects to produce Java Strings.
Has anyone considered trying to optimize these conversions, as many are
duplicated?

Details are outlined in this conversation on the user mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html
I have copied a bit of that discussion here.

It seems that as Spark processes each row from Parquet it makes a call to
convert the Binary representation of each String column into a Java String.
However, in many (probably most) circumstances the underlying Binary instance
from Parquet will have come from a dictionary, for example when column
cardinality is low. Therefore Spark is converting the same byte array to a
copy of the same Java String over and over again. This is bad: it costs extra
CPU and extra memory for these Strings, and probably results in more expensive
grouping comparisons.


I tested a simple hack to cache the last Binary-to-String conversion per
column in ParquetConverter, and this led to a 25% performance improvement for
the queries I used. Admittedly this was over a data set with lots of runs of
the same Strings in the queried columns.

These costs are quite significant for the type of data that I expect will be
stored in Parquet, which will often have denormalized tables and probably
lots of fairly low-cardinality string columns.

I think a good way to optimize this would be to change Parquet so that the
encoding/decoding of Objects to Binary is handled on the Parquet side of the
fence. Parquet could deal with Objects (Strings) as the client understands them
and only use encoding/decoding to store to and read from the underlying storage
medium. Doing this, I think Parquet could ensure that the encoding/decoding of
each Object occurs only once.

Does anyone have an opinion on this? Has it been considered already?

Cheers Mick










Re: Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Michael Armbrust
+1 to adding such an optimization to Parquet. The bytes are tagged specially as
UTF8 in the Parquet schema, so it seems like it would be possible to add this.
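
For illustration, the annotation referred to looks like this in the textual
schema form. The field names here are made up, and MessageTypeParser (pre-rename
package) is used only to parse the example:

import parquet.schema.MessageTypeParser

// Sketch only: the (UTF8) annotation on a binary field is what marks the
// column as carrying String data rather than arbitrary bytes.
val schema = MessageTypeParser.parseMessageType(
  """message example {
    |  required binary name (UTF8);
    |  optional int64 id;
    |}""".stripMargin)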

On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies michael.belldav...@gmail.com
wrote:

 Hi,

 It seems that a reasonably large proportion of query time using Spark SQL
 is spent decoding Parquet Binary objects to produce Java Strings.
 Has anyone considered trying to optimize these conversions, as many are
 duplicated?

 Details are outlined in this conversation on the user mailing list:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html
 I have copied a bit of that discussion here.

 It seems that as Spark processes each row from Parquet it makes a call to
 convert the Binary representation for each String column into a Java
 String.
 However in many (probably most) circumstances the underlying Binary
 instance
 from Parquet will have come from a Dictionary, for example when column
 cardinality is low. Therefore Spark is converting the same byte array to a
 copy of the same Java String over and over again. This is bad due to extra
 cpu, extra memory used for these strings, and probably results in more
 expensive grouping comparisons.


 I tested a simple hack to cache the last Binary-to-String conversion per
 column in ParquetConverter, and this led to a 25% performance improvement for
 the queries I used. Admittedly this was over a data set with lots of runs of
 the same Strings in the queried columns.

 These costs are quite significant for the type of data that I expect will
 be
 stored in Parquet which will often have denormalized tables and probably
 lots of fairly low cardinality string columns

 I think a good way to optimize this would be to change Parquet so that the
 encoding/decoding of Objects to Binary is handled on the Parquet side of the
 fence. Parquet could deal with Objects (Strings) as the client understands
 them and only use encoding/decoding to store to and read from the underlying
 storage medium. Doing this, I think Parquet could ensure that the
 encoding/decoding of each Object occurs only once.

 Does anyone have an opinion on this? Has it been considered already?

 Cheers Mick






