[ https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-6430:
-----------------------------------
    Attachment: HIVE-6430.09.patch

This replaces the Guava murmurhash with an inline one, and adds an (untested) serialization bypass for serdes (when testing a fast query, the hash and the byte copies in serdes are the most prominent differences in my profiled runs).

Unfortunately, for the latter I've discovered that the keys given to us are serialized using BinarySortableSerDe, because they come from ReduceSinkOperator. Will need to sync w/Gunther tomorrow on this.

The most likely outcome is that we'll change the Tez hashtable output to lazy serde, so we could just copy bytes.

An alternative would be to change key serialization to binarysortable, but that's ugly because values would stay on lazybinary, so we would have two paths. Plus, a bunch of changes would be required in binarysortable to avoid byte copies again and to use RandomAccessOutput instead of its OutputBuffer thing.

Yet another alternative is to do the bypass only for values, not keys.

Regardless, I think we should commit this patch soon (even if it is off by default) and do additional improvements in separate jiras. It's growing too big.

> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch,
> HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch,
> HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.08.patch,
> HIVE-6430.09.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for key and 2
> for row) can take several hundred bytes, which is ridiculous. I am reducing
> the size of MJKey and MJRowContainer in other jiras, but in general we don't
> need to have a Java hash table there. We can either use a primitive-friendly
> hashtable like the one from HPPC (Apache-licenced), or some variation, to map
> primitive keys to a single row storage structure without an object per row
> (similar to vectorization).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
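
To make the idea in the issue description concrete (primitive keys mapped to a single row storage structure, with no object per row), here is a minimal sketch. It is not the data structure from any HIVE-6430 patch, and it does not use HPPC. The class name PrimitiveKeyRowTable, the packKey helper, the one-row-per-key simplification and the use of Murmur3's 64-bit finalizer as the hash are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch: open-addressing table from long keys to rows in one shared byte[]. */
final class PrimitiveKeyRowTable {
    private static final int EMPTY = -1;

    private long[] keys;                 // primitive keys, no boxing
    private int[] offsets;               // start of the row in 'rows', or EMPTY for a free slot
    private int[] lengths;               // row length in bytes
    private byte[] rows = new byte[64];  // all rows appended back to back
    private int rowsUsed = 0;
    private int size = 0;

    PrimitiveKeyRowTable(int expectedKeys) {
        int cap = 8;
        while (cap < expectedKeys * 2) cap <<= 1; // power of two, load factor at most 0.5
        allocate(cap);
    }

    /** Packs two int key columns into one long (assumed key layout). */
    static long packKey(int a, int b) {
        return ((long) a << 32) | (b & 0xffffffffL);
    }

    private void allocate(int cap) {
        keys = new long[cap];
        offsets = new int[cap];
        lengths = new int[cap];
        Arrays.fill(offsets, EMPTY);
    }

    /** Murmur3 64-bit finalizer, used here only as a bit mixer. */
    private static int mix(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return (int) k;
    }

    /** Linear probing: returns the slot holding 'key', or the first free slot. */
    private int findSlot(long key) {
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (offsets[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    /** Stores one row per key; a repeated key overwrites the slot (old bytes are not reclaimed). */
    void put(long key, byte[] row) {
        if ((size + 1) * 2 > keys.length) resize();
        if (rowsUsed + row.length > rows.length) {
            rows = Arrays.copyOf(rows, Math.max(rows.length * 2, rowsUsed + row.length));
        }
        System.arraycopy(row, 0, rows, rowsUsed, row.length);
        int slot = findSlot(key);
        if (offsets[slot] == EMPTY) size++;
        keys[slot] = key;
        offsets[slot] = rowsUsed;
        lengths[slot] = row.length;
        rowsUsed += row.length;
    }

    /** Returns a copy of the row bytes, or null if the key is absent. */
    byte[] get(long key) {
        int slot = findSlot(key);
        if (offsets[slot] == EMPTY) return null;
        return Arrays.copyOfRange(rows, offsets[slot], offsets[slot] + lengths[slot]);
    }

    int size() {
        return size;
    }

    private void resize() {
        long[] oldKeys = keys;
        int[] oldOffsets = offsets;
        int[] oldLengths = lengths;
        allocate(oldKeys.length * 2);
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldOffsets[i] == EMPTY) continue;
            int slot = findSlot(oldKeys[i]);
            keys[slot] = oldKeys[i];
            offsets[slot] = oldOffsets[i];
            lengths[slot] = oldLengths[i];
        }
    }

    public static void main(String[] args) {
        // The "4 ints" case from the description: 2 ints of key, 2 ints of row.
        PrimitiveKeyRowTable table = new PrimitiveKeyRowTable(16);
        table.put(packKey(1, 2), ByteBuffer.allocate(8).putInt(10).putInt(20).array());
        ByteBuffer row = ByteBuffer.wrap(table.get(packKey(1, 2)));
        System.out.println(row.getInt() + ", " + row.getInt());
    }
}
```

In this sketch each entry costs one long and two ints in the index arrays plus the raw row bytes, with no per-entry object. A real map join would also need multiple rows per key and non-primitive keys, which the sketch leaves out.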