[
https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin updated HIVE-6430:
-----------------------------------
Attachment: HIVE-6430.09.patch
This replaces the Guava murmurhash with an inline one, and adds an (untested)
serialization bypass for serdes. When testing a fast query, hashing and byte
copies in serdes are the most prominent differences in my profiled runs.
Unfortunately, for the latter I've discovered that the keys given to us are
serialized using BinarySortableSerDe, because they come from
ReduceSinkOperator. I will need to sync with Gunther tomorrow on this. The most
likely outcome is that we'll change the Tez hashtable output to the lazy serde,
so we can just copy bytes. An alternative would be to change key serialization
to binarysortable, but that's ugly because values would stay on lazybinary, so
we would have two paths. Plus, a bunch of changes would be required in
binarysortable to avoid byte copies again and to use RandomAccessOutput instead
of its OutputBuffer thing. Yet another alternative is to do the bypass only for
values, not keys.
Regardless, I think we should be committing this patch soon (even if off by
default), and doing additional improvements in separate jiras.
It's growing too big.
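For illustration only, an inline murmur hash over a byte range might look
roughly like the sketch below. It assumes the 32-bit Murmur3 variant; the class
and method names are hypothetical and are not taken from the patch, which may
use a different variant or layout. The point is that hashing directly over
(offset, length) in an existing buffer needs no intermediate objects or copies.

```java
// Hypothetical sketch, not the code from HIVE-6430.09.patch.
// Assumes the 32-bit Murmur3 variant, hashing a byte range in place.
final class InlineMurmur {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    /** Hashes data[offset, offset + length) without allocating or copying. */
    static int hash(byte[] data, int offset, int length, int seed) {
        int h = seed;
        int end = offset + (length & ~3); // end of the last full 4-byte block

        // Body: mix one little-endian 4-byte block at a time.
        for (int i = offset; i < end; i += 4) {
            int k = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            k *= C1;
            k = Integer.rotateLeft(k, 15);
            k *= C2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: mix the remaining 1 to 3 bytes, if any.
        int tail = 0;
        switch (length & 3) {
            case 3:
                tail ^= (data[end + 2] & 0xff) << 16;
                // fall through
            case 2:
                tail ^= (data[end + 1] & 0xff) << 8;
                // fall through
            case 1:
                tail ^= (data[end] & 0xff);
                tail *= C1;
                tail = Integer.rotateLeft(tail, 15);
                tail *= C2;
                h ^= tail;
                break;
            default:
                break;
        }

        // Finalization: fold in the length and avalanche the bits.
        h ^= length;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

A caller that already holds serialized key bytes in a larger buffer could call
InlineMurmur.hash(buffer, keyOffset, keyLength, seed) directly on that buffer.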
> MapJoin hash table has large memory overhead
> --------------------------------------------
>
> Key: HIVE-6430
> URL: https://issues.apache.org/jira/browse/HIVE-6430
> Project: Hive
> Issue Type: Improvement
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch,
> HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch,
> HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.08.patch,
> HIVE-6430.09.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2
> for row) can take several hundred bytes, which is ridiculous. I am reducing
> the size of MJKey and MJRowContainer in other jiras, but in general we don't
> need to have a Java hash table there. We can either use a primitive-friendly
> hashtable like the one from HPPC (Apache-licensed), or some variation, to map
> primitive keys to a single row storage structure without an object per row
> (similar to vectorization).
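As a rough illustration of the idea in the description, the sketch below maps
primitive keys to rows held in one flat array, so no object is allocated per
entry. It is a minimal open-addressing table written for this example: the
class name, the packing of two int key columns into one long, the fixed row
width, and the one-row-per-key simplification are all assumptions. It is
neither the HPPC API nor the structure used in the patch.

```java
import java.util.Arrays;

// Hypothetical sketch: open-addressing table with linear probing that maps a
// primitive long key to a fixed-width int row stored in one flat array.
final class PrimitiveRowTable {
    private final int rowWidth;
    private long[] keys;
    private int[] rowRefs; // 0 = empty slot, otherwise (row index + 1)
    private int[] rows;    // all rows back to back, rowWidth ints each
    private int size;

    PrimitiveRowTable(int rowWidth, int initialCapacity) {
        this.rowWidth = rowWidth;
        // Round the capacity up to a power of two so masking replaces modulo.
        int cap = Integer.highestOneBit(Math.max(4, initialCapacity) - 1) << 1;
        keys = new long[cap];
        rowRefs = new int[cap];
        rows = new int[cap * rowWidth];
    }

    /** Packs two int key columns into a single long key. */
    static long packKey(int a, int b) {
        return ((long) a << 32) | (b & 0xffffffffL);
    }

    private static int slot(long key, int mask) {
        long h = key * 0x9E3779B97F4A7C15L; // multiplicative bit scrambling
        return (int) (h >>> 32) & mask;
    }

    /** Inserts the row for a key, overwriting any existing row for that key. */
    void put(long key, int... row) {
        if (row.length != rowWidth) {
            throw new IllegalArgumentException("row width mismatch");
        }
        if ((size + 1) * 2 > keys.length) {
            resize(); // keep the load factor at or below 0.5
        }
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (rowRefs[i] != 0) {
            if (keys[i] == key) {
                System.arraycopy(row, 0, rows, (rowRefs[i] - 1) * rowWidth, rowWidth);
                return;
            }
            i = (i + 1) & mask;
        }
        int base = size * rowWidth;
        if (base + rowWidth > rows.length) {
            rows = Arrays.copyOf(rows, Math.max(rows.length * 2, base + rowWidth));
        }
        System.arraycopy(row, 0, rows, base, rowWidth);
        keys[i] = key;
        rowRefs[i] = ++size; // stores (row index + 1)
    }

    /** Copies the row for a key into out; returns false if the key is absent. */
    boolean get(long key, int[] out) {
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (rowRefs[i] != 0) {
            if (keys[i] == key) {
                System.arraycopy(rows, (rowRefs[i] - 1) * rowWidth, out, 0, rowWidth);
                return true;
            }
            i = (i + 1) & mask;
        }
        return false;
    }

    int size() {
        return size;
    }

    private void resize() {
        long[] oldKeys = keys;
        int[] oldRefs = rowRefs;
        keys = new long[oldKeys.length * 2];
        rowRefs = new int[oldRefs.length * 2];
        int mask = keys.length - 1;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldRefs[j] == 0) {
                continue;
            }
            int i = slot(oldKeys[j], mask);
            while (rowRefs[i] != 0) {
                i = (i + 1) & mask;
            }
            keys[i] = oldKeys[j];
            rowRefs[i] = oldRefs[j]; // row data does not move on resize
        }
    }
}
```

With this layout, the "4 ints" case costs one long plus one int per slot and
two ints of row data, plus the empty slots implied by the load factor. A real
map-join table would also need multiple rows per key and non-int columns, which
this sketch leaves out.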