GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/15573
[SPARK-18035] [SQL] Unwrapping java maps in HiveInspectors allocates
unnecessary buffer
## What changes were proposed in this pull request?
Jira: https://issues.apache.org/jira/browse/SPARK-18035
In HiveInspectors, I saw that converting Java map to Spark's
`ArrayBasedMapData` spent quite sometime in buffer copying :
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
The reason being `map.toSeq` allocates a new buffer and copies the map
entries to it:
https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
This copy is not needed as we get rid of it once we extract the key and
value arrays.
Here is the call trace:
```
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
scala.collection.AbstractMap.toSeq(Map.scala:59)
scala.collection.MapLike$class.toSeq(MapLike.scala:323)
scala.collection.AbstractMap.toBuffer(Map.scala:59)
scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
```
Also, earlier code was populating keys and values arrays separately by
iterating twice. The PR avoids double iteration of the map and does it in one
iteration.
## Performance gains
The number is subjective and depends on how many map columns are accessed
in the query and average entries per map. For one the queries that I tried out,
I saw 3% CPU savings (end-to-end) for the query.
## How was this patch tested?
This does not change the end result produced so relying on existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark SPARK-18035_avoid_toSeq
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15573.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15573
----
commit 563232d095c0877fd06da8da2e789c94a289cdb1
Author: Tejas Patil <[email protected]>
Date: 2016-10-18T23:21:33Z
`map.asScala.toSeq` creates a new copy of the collection and copies the map
entries to it:
https://github.com/scala/legacy-svn-scala/blob/a74698486dc4221af0fb590c776998b7339e58d6/src/library/scala/collection/MapLike.scala#L313
This copy is not needed as we get rid of it once we extract the key and
value arrays.
Here is the call trace:
```
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
scala.collection.AbstractMap.toSeq(Map.scala:59)
scala.collection.MapLike$class.toSeq(MapLike.scala:323)
scala.collection.AbstractMap.toBuffer(Map.scala:59)
scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
```
Also, earlier code was populating keys and values arrays separately by
iterating twice. The PR avoids double iteration of the map and does it in one
iteration.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]