[
https://issues.apache.org/jira/browse/HBASE-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Appy updated HBASE-15236:
-------------------------
Description:
If there are two bulkloaded hfiles in a region with same seqID, same timestamps
and duplicate keys*, get and scan may return different values for a key. Not
sure how this would happen, but one of our customer uploaded a dataset with 2
files in a single region and both having same bulk load timestamp. These files
are small ~50M (I couldn't find any setting for max file size that could lead
to 2 files). The range of keys in two hfiles are overlapping to some extent,
but not fully (so the two files are because of region merge).
In such a case, depending on file sizes (because we take it into account when
sorting hfiles internally), we may get different values for the same cell (say
"r", "cf:50") depending on what we call: get "r" "cf:50" or get "r" "cf:".
I have been able to replicate this issue, will post the instructions shortly.
---
\*
was:
If there are two bulkloaded hfiles in a region with same seqID and duplicate
keys*, get and scan may return different values for a key.
More details:
- one of the rows had 200k+ columns. say row is 'r', column family is 'cf' and
column qualifiers are 1 to 1000.
- hfiles were split somewhere along that row, but there were a range of columns
in both hfiles. For eg, something like - hfile1: ["", r:cf:70) and hfile2:
[r:cf:40, ....).
- Between columns 40 to 70, some (not all) columns were in both the files with
different values. Whereas other were only in one of the files.
In such a case, depending on file size (because we take it into account when
sorting hfiles internally), we may get different values for the same cell (say
"r", "cf:50") depending on what we call: get "r" "cf:50" or get "r" "cf:".
I have been able to replicate this issue, will post the instructions shortly.
---
\* not sure how this would happen. These files are small ~50M, nor could i find
any setting for max file size that could lead to splits. Need to investigate
more.
> Inconsistent cell reads over multiple bulk-loaded HFiles
> --------------------------------------------------------
>
> Key: HBASE-15236
> URL: https://issues.apache.org/jira/browse/HBASE-15236
> Project: HBase
> Issue Type: Bug
> Reporter: Appy
> Assignee: Appy
>
> If there are two bulkloaded hfiles in a region with same seqID, same
> timestamps and duplicate keys*, get and scan may return different values for
> a key. Not sure how this would happen, but one of our customer uploaded a
> dataset with 2 files in a single region and both having same bulk load
> timestamp. These files are small ~50M (I couldn't find any setting for max
> file size that could lead to 2 files). The range of keys in two hfiles are
> overlapping to some extent, but not fully (so the two files are because of
> region merge).
> In such a case, depending on file sizes (because we take it into account when
> sorting hfiles internally), we may get different values for the same cell
> (say "r", "cf:50") depending on what we call: get "r" "cf:50" or get "r"
> "cf:".
> I have been able to replicate this issue, will post the instructions shortly.
> ---
> \*
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)