[ https://issues.apache.org/jira/browse/HIVE-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Francke updated HIVE-3179:
-------------------------------

    Attachment: HIVE-3179.1.patch

The attached patch fixes the problem as well as changes a unit test that actually tests this behavior. The unit test fails if our fix to {{LazyHBaseRow}} is not applied.

We're not sure if this is the best way to fix this problem as it circumvents the optimization being done by the fieldsInited field. Ideally, instead of returning null on an empty HBase cell, this would insert some kind of marker, but adding an empty ByteArrayRef is not interpreted as NULL but as an empty value (which makes sense).

In short: This fixes the bug at the cost of some performance for NULL (non-existing) fields in HBase.

> HBase Handler doesn't handle NULLs properly
> -------------------------------------------
>
>                 Key: HIVE-3179
>                 URL: https://issues.apache.org/jira/browse/HIVE-3179
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>            Reporter: Lars Francke
>            Priority: Critical
>         Attachments: HIVE-3179.1.patch
>
>
> We found a quite severe issue in the HBase Handler which actually means that
> Hive potentially returns incorrect data if a column has NULL values in HBase
> (which means the cell doesn't even exist)
> In HBase Shell:
> {noformat}
> create 'hive_hbase_test', 'test'
> put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
> put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
> put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
> put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
> {noformat}
> In Hive:
> {noformat}
> DROP TABLE IF EXISTS hive_hbase_test;
> CREATE EXTERNAL TABLE hive_hbase_test (
>   id int,
>   c1 string,
>   c2 string,
>   c3 string
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> ":key#s,test:c1#s,test:c2#s,test:c3#s")
> TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");
> hive> select * from hive_hbase_test;
> OK
> 1	c1-1	c2-1	c3-1
> 2	c1-2	NULL	NULL
> hive> select c1 from hive_hbase_test;
> c1-1
> c1-2
> hive> select c1, c2 from hive_hbase_test;
> c1-1	c2-1
> c1-2	NULL
> {noformat}
> So far everything is correct but now:
> {noformat}
> hive> select c1, c2, c2 from hive_hbase_test;
> c1-1	c2-1	c2-1
> c1-2	NULL	c2-1
> {noformat}
> Selecting c2 twice works the first time but the second time we
> actually get the value from the previous row.
> {noformat}
> hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
> c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
> c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
> {noformat}
> We've narrowed this down to an early initialization of
> {{fieldsInited\[fieldID] = true}} in {{LazyHBaseRow#uncheckedGetField}} and
> we'll try to provide a patch which surely needs review.
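
To make the failure mode described above easier to follow, here is a minimal, self-contained Java sketch. It is not the Hive source and not the attached patch: the class {{LazyRowSketch}}, its methods and the {{flagBeforeLookup}} switch are hypothetical names invented for illustration. It only assumes what the report states, namely that field objects are reused across rows and that the "initialized" flag is set before the cell lookup can return null. With the flag set early, the second read of a missing cell returns the previous row's value. Setting the flag only after a value was found avoids that, at the cost of repeating the lookup for missing cells.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a lazily deserialized row (not Hive code).
class LazyRowSketch {
    private final String[] columns;
    private final String[] fields;        // field slots are reused across rows
    private final boolean[] fieldsInited; // per-row "already deserialized" flags
    private final boolean flagBeforeLookup;
    private Map<String, String> cells = new HashMap<>();

    LazyRowSketch(String[] columns, boolean flagBeforeLookup) {
        this.columns = columns;
        this.fields = new String[columns.length];
        this.fieldsInited = new boolean[columns.length];
        this.flagBeforeLookup = flagBeforeLookup;
    }

    // Point the row at a new set of cells; old field values are not cleared.
    void init(Map<String, String> newCells) {
        this.cells = newCells;
        Arrays.fill(fieldsInited, false);
    }

    String getField(int id) {
        if (!fieldsInited[id]) {
            if (flagBeforeLookup) {
                fieldsInited[id] = true;  // early flag: the behavior the report blames
            }
            String value = cells.get(columns[id]);
            if (value == null) {
                return null;              // missing cell; fields[id] still holds old data
            }
            fields[id] = value;
            fieldsInited[id] = true;      // flag only once a real value is stored
        }
        return fields[id];
    }

    // Builds a row from alternating column/value arguments.
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        String[] cols = {"c1", "c2", "c3"};
        for (boolean early : new boolean[] {true, false}) {
            LazyRowSketch r = new LazyRowSketch(cols, early);
            r.init(row("c1", "c1-1", "c2", "c2-1", "c3", "c3-1"));
            r.getField(1);                // read c2 of row 1
            r.init(row("c1", "c1-2"));    // row 2 has no c2 cell
            System.out.println("flagBeforeLookup=" + early
                + " first=" + r.getField(1) + " second=" + r.getField(1));
        }
    }
}
```

With {{flagBeforeLookup}} set to true the sketch reproduces the pattern from {{select c1, c2, c2}}: the first read of c2 in row 2 is NULL and the second returns c2-1. With it set to false both reads return NULL, but every read of a missing cell performs the lookup again, which is the performance trade-off mentioned in the comment.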