[ https://issues.apache.org/jira/browse/HIVE-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095057#comment-14095057 ]
Sergey Shelukhin commented on HIVE-7617:
----------------------------------------

I've done some unscientific perf testing and some profiling. The join seems to be a much smaller part of the cost than the initial profiling suggested, but it's still not insignificant.
I ran the same simple query with mapjoin on the TPC-DS scale 200 dataset, on the same cluster, with container reuse for each separate configuration. Without the optimized hashtable, 18 runs take 9.63 sec. on average; 8.39 sec. is the average of the last 12 runs. The optimized (for size ;)) hashtable with no int reader (but with the other small perf stuff from this patch) takes 10.35/8.82; with this patch it takes 9.37/8.48.
I can see in profiler dumps (separate from the above runs) that the cost of key serialization is gone from the query.

> optimize bytes mapjoin hash table read path wrt serialization, at least for
> common cases
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-7617
>                 URL: https://issues.apache.org/jira/browse/HIVE-7617
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-7617.01.patch, HIVE-7617.patch,
> HIVE-7617.prelim.patch
>
>
> The BytesBytes hash table stores keys in a byte array for compact
> representation. However, that means the straightforward implementation of
> lookups serializes lookup keys to byte arrays, which is relatively expensive.
> We can either shortcut hashcode and compare for common types on the read path
> (integral types, which would cover most real-world keys), or specialize the
> hashtable and, from BytesBytes..., create LongBytes, StringBytes, or whatever.
> The first option seems simpler for now.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
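
For readers unfamiliar with the first option in the issue description (shortcutting hashcode and compare for integral keys), here is a minimal sketch of the idea. It is not the code from the attached patches. The class name, the 31-based hash, and the 8-byte big-endian key layout are all assumptions made for illustration; Hive's actual serialization format and hash function may differ. The point is only that the hash and the equality check can be computed straight from the long, giving the same answers as the serialize-then-compare path without allocating a byte array per lookup.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; none of these names come from Hive.
final class LongKeyShortcut {

    // Assumed hash over the stored key bytes (simple polynomial hash).
    static int hashBytes(byte[] buf, int offset, int length) {
        int h = 1;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + buf[i];
        }
        return h;
    }

    // Same hash, computed from the long by walking its bytes from most
    // significant to least significant, with no intermediate array.
    static int hashLong(long key) {
        int h = 1;
        for (int shift = 56; shift >= 0; shift -= 8) {
            h = 31 * h + (byte) (key >>> shift);
        }
        return h;
    }

    // Compares the long against 8 stored big-endian bytes without
    // serializing the lookup key first.
    static boolean equalsStored(long key, byte[] buf, int offset) {
        for (int i = 0; i < 8; i++) {
            if (buf[offset + i] != (byte) (key >>> (56 - 8 * i))) {
                return false;
            }
        }
        return true;
    }

    // Baseline path that the shortcut avoids: allocate and fill a byte
    // array for every lookup key (ByteBuffer is big-endian by default).
    static byte[] serialize(long key) {
        return ByteBuffer.allocate(8).putLong(key).array();
    }
}
```

On the read path, a lookup would call hashLong(key) to pick the slot and equalsStored(key, buf, offset) to confirm the match, which only works as long as the shortcut hash agrees with the hash used when the key bytes were written.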