[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveI...

tejasapatil Thu, 20 Oct 2016 15:15:06 -0700

GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/15573


    [SPARK-18035] [SQL] Unwrapping java maps in HiveInspectors allocates 
unnecessary buffer

    ## What changes were proposed in this pull request?
    
    Jira: https://issues.apache.org/jira/browse/SPARK-18035
    
    In HiveInspectors, I saw that converting Java map to Spark's 
`ArrayBasedMapData` spent quite sometime in buffer copying : 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
    
    The reason being `map.toSeq` allocates a new buffer and copies the map 
entries to it: 
https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
    
    This copy is not needed as we get rid of it once we extract the key and 
value arrays.
    
    Here is the call trace:
    
    ```
    
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
    scala.collection.AbstractMap.toSeq(Map.scala:59)
    scala.collection.MapLike$class.toSeq(MapLike.scala:323)
    scala.collection.AbstractMap.toBuffer(Map.scala:59)
    scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
    scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
    
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
    scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    scala.collection.Iterator$class.foreach(Iterator.scala:893)
    
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    ```
    
    Also, earlier code was populating keys and values arrays separately by 
iterating twice. The PR avoids double iteration of the map and does it in one 
iteration.
    
    ## Performance gains
    
    The number is subjective and depends on how many map columns are accessed 
in the query and average entries per map. For one the queries that I tried out, 
I saw 3% CPU savings (end-to-end) for the query.
    
    ## How was this patch tested?
    
    This does not change the end result produced so relying on existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-18035_avoid_toSeq

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15573.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15573
    
----
commit 563232d095c0877fd06da8da2e789c94a289cdb1
Author: Tejas Patil <[email protected]>
Date:   2016-10-18T23:21:33Z

    `map.asScala.toSeq` creates a new copy of the collection and copies the map 
entries to it: 
https://github.com/scala/legacy-svn-scala/blob/a74698486dc4221af0fb590c776998b7339e58d6/src/library/scala/collection/MapLike.scala#L313
    
    This copy is not needed as we get rid of it once we extract the key and 
value arrays.
    
    Here is the call trace:
    
    ```
    
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
    scala.collection.AbstractMap.toSeq(Map.scala:59)
    scala.collection.MapLike$class.toSeq(MapLike.scala:323)
    scala.collection.AbstractMap.toBuffer(Map.scala:59)
    scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
    scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
    
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
    scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    scala.collection.Iterator$class.foreach(Iterator.scala:893)
    
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    ```
    
    Also, earlier code was populating keys and values arrays separately by 
iterating twice. The PR avoids double iteration of the map and does it in one 
iteration.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveI...

Reply via email to