GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/10877

    [SPARK-12950][SQL] Cache newly accessed hashtable entry in BytesToBytesMap

    JIRA: https://issues.apache.org/jira/browse/SPARK-12950
    
    As described in the JIRA, when aggregating with grouping keys, profiling 
shows that lookups in BytesToBytesMap take about 90% of the CPU time.
    
    This patch doesn't change how the lookup works. However, every call to the 
`safeLookup` method searches for the key from the beginning. So when the same 
key is looked up again, the same lookup process is repeated (comparing hash 
codes, comparing keys, handling collisions...). This patch caches the most 
recently accessed key and returns its entry early if the requested key is in 
the cache.
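    
    To illustrate the idea, here is a minimal sketch of a one-entry lookup 
cache in front of an open-addressing hash table. It is not the code in this 
patch and not BytesToBytesMap itself: the class `CachedLookupMap`, its 
`findSlot`, `add` and `get` methods, the String keys, and the `fullLookups` 
counter are all hypothetical names chosen for the example. The sketch also 
assumes no deletes and no resizing, so a cached slot never becomes stale.

```java
// Hypothetical sketch only: this is NOT Spark's BytesToBytesMap.
// A fixed-size, linear-probing map from String keys to long counters,
// with a one-entry cache of the most recently accessed key.
class CachedLookupMap {
    private final String[] keys;
    private final long[] values;
    private final int mask;
    private int size = 0;

    // One-entry cache: the last key looked up and the slot it resolved to.
    private String lastKey = null;
    private int lastSlot = -1;

    // Number of full probe sequences performed (cache misses).
    long fullLookups = 0;

    CachedLookupMap(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new String[capacity];
        values = new long[capacity];
        mask = capacity - 1;
    }

    // Returns the slot holding `key`, or the empty slot where it would go.
    private int findSlot(String key) {
        // Early return: same key as last time, so skip hashing and probing.
        if (lastSlot >= 0 && key.equals(lastKey)) {
            return lastSlot;
        }
        fullLookups++;
        int slot = key.hashCode() & mask;
        // Linear probing: compare keys until a match or an empty slot.
        while (keys[slot] != null && !keys[slot].equals(key)) {
            slot = (slot + 1) & mask;
        }
        lastKey = key;
        lastSlot = slot;
        return slot;
    }

    // Adds `delta` to the counter for `key`, inserting the key if absent.
    void add(String key, long delta) {
        int slot = findSlot(key);
        if (keys[slot] == null) {
            // Keep at least one empty slot so probing always terminates.
            if (size + 1 >= keys.length) {
                throw new IllegalStateException("table full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] += delta;
    }

    // Returns the counter for `key`, or 0 if the key is absent.
    long get(String key) {
        int slot = findSlot(key);
        return keys[slot] == null ? 0L : values[slot];
    }

    public static void main(String[] args) {
        CachedLookupMap m = new CachedLookupMap(8);
        for (String k : new String[] {"a", "a", "a", "b", "a"}) {
            m.add(k, 1);
        }
        System.out.println("a=" + m.get("a") + " b=" + m.get("b")
            + " fullLookups=" + m.fullLookups);
    }
}
```

    A cache like this only helps when consecutive lookups tend to hit the 
same key, for example when the input rows are clustered by grouping key. With 
alternating keys every call still falls through to the full probe.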


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 cache-bytes-to-bytes-map

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10877.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10877
    
----
commit 0add15ca27d11d7d315b55d5effa085f7d405e59
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-01-22T08:53:19Z

    Cache newly accessed hashtable entry.

----

