[jira] [Updated] (HBASE-15236) Inconsistent cell reads over multiple bulk-loaded HFiles

Appy (JIRA) Mon, 08 Feb 2016 20:48:55 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Appy updated HBASE-15236:
-------------------------
    Description: 
If a region contains two bulk-loaded HFiles with the same seqID and duplicate 
keys*, get and scan may return different values for the same key.
More details:
- One of the rows had 200k+ columns. Say the row is 'r', the column family is 
'cf', and the column qualifiers are 1 to 1000.
- The HFiles were split somewhere along that row, but a range of columns 
appeared in both HFiles. For example, something like hfile1: ["",  r:cf:70) 
and hfile2: [r:cf:40, ....).
- Between columns 40 and 70, some (not all) columns were in both files with 
different values, whereas others were in only one of the files.

In such a case, depending on file size (because we take it into account when 
sorting HFiles internally), we may get different values for the same cell (say 
"r", "cf:50") depending on which call we make: get "r" "cf:50" or get "r" 
"cf:".
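
To make the failure mode concrete, here is a toy model in Python. It is not 
HBase code. The class ToyHFile, both read functions, and both tie-break rules 
are assumptions made up for illustration; the report only says that file size 
is taken into account when sorting. The sketch shows that once two files share 
a seqID, the value returned for a duplicated key is decided by whatever 
secondary ordering a read path happens to use, so two paths can disagree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyHFile:
    """Hypothetical stand-in for a bulk-loaded HFile (not an HBase class)."""
    name: str
    seq_id: int
    size_bytes: int
    cells: Dict[str, str]  # "row:family:qualifier" -> value


def read_tiebreak_by_size(files: List[ToyHFile], key: str) -> Optional[str]:
    """Assumed read path 1: highest seq_id wins; ties go to the larger file."""
    ordered = sorted(files, key=lambda f: (-f.seq_id, -f.size_bytes))
    for f in ordered:
        if key in f.cells:
            return f.cells[key]
    return None


def read_tiebreak_by_position(files: List[ToyHFile], key: str) -> Optional[str]:
    """Assumed read path 2: highest seq_id wins; ties keep the input order
    (Python's sorted() is stable)."""
    ordered = sorted(files, key=lambda f: -f.seq_id)
    for f in ordered:
        if key in f.cells:
            return f.cells[key]
    return None


# Two files with the SAME seq_id and an overlapping key range.
hfile1 = ToyHFile("hfile1", seq_id=5, size_bytes=10_000_000,
                  cells={"r:cf:45": "only-in-1", "r:cf:50": "v1"})
hfile2 = ToyHFile("hfile2", seq_id=5, size_bytes=50_000_000,
                  cells={"r:cf:50": "v2", "r:cf:60": "only-in-2"})
files = [hfile1, hfile2]

# The duplicated key resolves differently on the two paths.
print(read_tiebreak_by_size(files, "r:cf:50"))      # v2
print(read_tiebreak_by_position(files, "r:cf:50"))  # v1

# Keys present in only one file are unaffected.
print(read_tiebreak_by_size(files, "r:cf:45"))      # only-in-1
print(read_tiebreak_by_position(files, "r:cf:45"))  # only-in-1
```

In this model, giving the two files distinct seqIDs makes both paths agree, 
because the tie-break is then never consulted.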

I have been able to replicate this issue and will post the instructions 
shortly.

---
\* Not sure how this would happen. These files are small (~50M), and I could 
not find any setting for max file size that could lead to splits. Need to 
investigate more.

  was:
If there are two bulkloaded hfiles in a region with same seqID and duplicate 
keys*, get and scan may return different values for a key.
More details:
- one of the rows had 200k+ columns. say row is 'r', column family is 'cf' and 
column qualifiers are 1 to 1000.
- hfiles were split somewhere along that row, but there were a range of columns 
in both hfiles. For eg, something like - hfile1: ["",  r:cf:70) and hfile2: 
[r:cf:40, ....).
- Between columns 40 to 70, some (not all) columns were in both the files with 
different values. Whereas other were only in one of the files.

In such a case, depending on file size (because we take it into account when 
sorting hfiles internally), we may get different values for the same cell (say 
"r", "cf:50") depending on what we call: get "r" "cf:50" or get "r" "cf:".

I have been able to replicate this issue, will post the instructions shortly.

* not sure how this would happen. These files are small ~50M, nor could i find 
any setting for max file size that could lead to splits. Need to investigate 
more.


> Inconsistent cell reads over multiple bulk-loaded HFiles
> --------------------------------------------------------
>
>                 Key: HBASE-15236
>                 URL: https://issues.apache.org/jira/browse/HBASE-15236
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
