GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/10877
[SPARK-12950][SQL] Cache newly accessed hashtable entry in BytesToBytesMap
JIRA: https://issues.apache.org/jira/browse/SPARK-12950
As described in the JIRA, profiling an aggregation with grouping keys
shows that lookups in BytesToBytesMap take about 90% of the CPU time.
This patch doesn't change how the lookup works. However, every call to the
`safeLookup` method searches for the key from the beginning. So when the same
key is looked up again, the whole lookup process (comparing hash codes,
comparing keys, handling collisions...) is repeated. This patch caches the most
recently accessed key and returns its entry early when the requested key is in
the cache.
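
The idea can be sketched roughly as follows. This is a minimal, hypothetical
illustration and not the code in this patch: `CachedLookupMap`, `findSlot`,
`lastSlot` and `slowPathProbes` are made-up names, the table is a plain
fixed-size open-addressing map with linear probing, and it ignores resizing
and the off-heap memory layout that BytesToBytesMap actually uses.

```java
import java.util.Arrays;

// Hypothetical sketch: a fixed-capacity open-addressing map that remembers
// the most recently accessed slot. Not the BytesToBytesMap implementation.
public class CachedLookupMap {
    private final byte[][] keys;
    private final int[] hashes;
    private final long[] values;
    private final int mask;
    private int size = 0;

    // Index of the most recently accessed occupied slot, or -1 if none yet.
    private int lastSlot = -1;
    // Number of slots examined on the slow path (for illustration only).
    private long slowPathProbes = 0;

    public CachedLookupMap(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new byte[capacity][];
        hashes = new int[capacity];
        values = new long[capacity];
        mask = capacity - 1;
    }

    // Returns the slot holding the key, or the empty slot where it would go.
    private int findSlot(byte[] key, int hash) {
        // Fast path: the same key as the previous access skips the probing loop.
        if (lastSlot >= 0 && hashes[lastSlot] == hash
                && Arrays.equals(keys[lastSlot], key)) {
            return lastSlot;
        }
        // Slow path: regular linear probing from the key's home position.
        int pos = hash & mask;
        while (keys[pos] != null) {
            slowPathProbes++;
            if (hashes[pos] == hash && Arrays.equals(keys[pos], key)) {
                lastSlot = pos;
                return pos;
            }
            pos = (pos + 1) & mask;
        }
        return pos;
    }

    public void put(byte[] key, long value) {
        int hash = Arrays.hashCode(key);
        int slot = findSlot(key, hash);
        if (keys[slot] == null) {
            // Keep at least one slot empty so probing always terminates.
            if (size + 1 > mask) {
                throw new IllegalStateException("map is full");
            }
            keys[slot] = key.clone();
            hashes[slot] = hash;
            size++;
        }
        values[slot] = value;
        lastSlot = slot;
    }

    // Returns the stored value, or null if the key is absent.
    public Long get(byte[] key) {
        int slot = findSlot(key, Arrays.hashCode(key));
        return keys[slot] == null ? null : values[slot];
    }

    public long slowPathProbes() {
        return slowPathProbes;
    }
}
```

In this sketch, consecutive lookups of the same key return from the cached
slot without probing; a lookup of any other key falls through to the normal
path, so the cache only helps when repeated keys arrive back to back.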
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 cache-bytes-to-bytes-map
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10877.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10877
----
commit 0add15ca27d11d7d315b55d5effa085f7d405e59
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-01-22T08:53:19Z
Cache newly accessed hashtable entry.
----