[ https://issues.apache.org/jira/browse/HBASE-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Appy updated HBASE-15236:
-------------------------
    Description:
If there are two bulk-loaded hfiles in a region with the same seqID, the same timestamps, and duplicate keys*, get and scan may return different values for a key. I am not sure how this would happen, but one of our customers uploaded a dataset with 2 files in a single region, both having the same bulk load timestamp. These files are small, ~50M (I couldn't find any setting for max file size that could lead to 2 files). The key ranges of the two hfiles overlap to some extent, but not fully (so the two files are due to a region merge).

In such a case, depending on the file sizes (because we take them into account when sorting hfiles internally), we may get different values for the same cell (say "r", "cf:50") depending on what we call: get "r" "cf:50" or get "r" "cf:".

I have been able to replicate this issue, and will post the instructions shortly.

---
\*

  was:
If there are two bulk-loaded hfiles in a region with the same seqID and duplicate keys*, get and scan may return different values for a key.

More details:
- One of the rows had 200k+ columns. Say the row is 'r', the column family is 'cf', and the column qualifiers are 1 to 1000.
- The hfiles were split somewhere along that row, but there was a range of columns in both hfiles. For example, something like hfile1: ["", r:cf:70) and hfile2: [r:cf:40, ....).
- Between columns 40 and 70, some (not all) columns were in both files with different values, while others were in only one of the files.

In such a case, depending on the file sizes (because we take them into account when sorting hfiles internally), we may get different values for the same cell (say "r", "cf:50") depending on what we call: get "r" "cf:50" or get "r" "cf:".

I have been able to replicate this issue, and will post the instructions shortly.

---
\* Not sure how this would happen. These files are small, ~50M, and I couldn't find any setting for max file size that could lead to splits. Need to investigate more.
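To make the ambiguity concrete, here is a minimal toy model in Python. It is not HBase code: `ToyFile`, `newest_cell`, and the two tie-break rules are hypothetical names invented for illustration, and the model only assumes that a read picks the cell with the highest timestamp, then the highest seqID, then falls back to some file-ordering rule. If two read paths use different fallback rules, they can disagree exactly when timestamp and seqID are both equal.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Hypothetical stand-in for a bulk-loaded file (not an HBase class)."""
    name: str
    seq_id: int
    size: int
    # (row, qualifier) -> (timestamp, value)
    cells: dict = field(default_factory=dict)


def newest_cell(files, key, tie_break):
    """Pick the winning value for `key` across `files`.

    Ordering: higher timestamp, then higher seq_id, then `tie_break(file)`.
    Returns None if no file holds the key.
    """
    candidates = [
        (f.cells[key][0], f.seq_id, tie_break(f), f.cells[key][1])
        for f in files
        if key in f.cells
    ]
    if not candidates:
        return None
    # Compare only (timestamp, seq_id, tie_break), never the value itself.
    return max(candidates, key=lambda c: c[:3])[3]


key = ("r", "cf:50")
hfile1 = ToyFile("hfile1", seq_id=7, size=50, cells={key: (1000, "v1")})
hfile2 = ToyFile("hfile2", seq_id=7, size=40, cells={key: (1000, "v2")})
files = [hfile1, hfile2]

# Two invented fallback rules standing in for two different read paths.
path_a = newest_cell(files, key, tie_break=lambda f: f.size)   # larger file wins
path_b = newest_cell(files, key, tie_break=lambda f: -f.size)  # smaller file wins

# With distinct seqIDs the fallback rule is never consulted.
hfile2_distinct = ToyFile("hfile2", seq_id=8, size=40, cells={key: (1000, "v2")})
fixed_a = newest_cell([hfile1, hfile2_distinct], key, tie_break=lambda f: f.size)
fixed_b = newest_cell([hfile1, hfile2_distinct], key, tie_break=lambda f: -f.size)
```

In this sketch `path_a` and `path_b` disagree on the value of ("r", "cf:50"), while `fixed_a` and `fixed_b` agree once the seqIDs differ. Which rules the real get and scan paths apply is not established here; the model only shows why a file-size-dependent ordering can surface as inconsistent reads.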
> Inconsistent cell reads over multiple bulk-loaded HFiles
> --------------------------------------------------------
>
> Key: HBASE-15236
> URL: https://issues.apache.org/jira/browse/HBASE-15236
> Project: HBase
> Issue Type: Bug
> Reporter: Appy
> Assignee: Appy
>
> If there are two bulk-loaded hfiles in a region with the same seqID, the same
> timestamps, and duplicate keys*, get and scan may return different values for
> a key. I am not sure how this would happen, but one of our customers uploaded
> a dataset with 2 files in a single region, both having the same bulk load
> timestamp. These files are small, ~50M (I couldn't find any setting for max
> file size that could lead to 2 files). The key ranges of the two hfiles
> overlap to some extent, but not fully (so the two files are due to a
> region merge).
> In such a case, depending on the file sizes (because we take them into account
> when sorting hfiles internally), we may get different values for the same cell
> (say "r", "cf:50") depending on what we call: get "r" "cf:50" or get "r"
> "cf:".
> I have been able to replicate this issue, and will post the instructions
> shortly.
> ---
> \*

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)