[ https://issues.apache.org/jira/browse/HBASE-29716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kodey Converse updated HBASE-29716:
-----------------------------------
    Description: 
When an incremental backup is taken, WAL files are re-written as HFiles using 
WALPlayer. These HFiles are formatted only for bulkloads (which is their 
primary purpose), and the sequence IDs for cells (which are required for 
correctness) are ignored by the RegionScanner when the files are read through 
the ClientSideRegionScanner.

This is a follow up to HBASE-27649; that fix plumbed sequence IDs from the WAL 
to the HFiles generated by WALPlayer. However, the HFiles generated by 
WALPlayer are marked to be bulk loaded [by metadata on the 
HFile|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L461],
 and RegionScanner [will reset cell-level sequence 
IDs|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStoreFile.java#L427-L450]
 for HFiles with this metadata, instead relying on the sequence ID generated at 
time of bulkload (i.e. during a backup restore). If the HFiles are read before 
that bulkload, e.g. via the ClientSideRegionScanner, the scan can return 
incorrect results.

The result is that reads of cells that have been overwritten (and therefore rely 
on sequence IDs for correctness) will return an incorrect value when performed by 
tooling such as the ClientSideRegionScanner. Instead, I believe the cell value 
that is returned will be decided based on [sorting the HFiles by their 
size|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39].
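
To make the failure mode concrete, here is a minimal, self-contained Java 
sketch. It is a toy model and not HBase code: the Cell and StoreFile classes, 
the read method, and the file names and sizes are all hypothetical, and the 
tie-break (the smaller file wins when sequence IDs are equal) is an assumption 
standing in for the size-based ordering linked above. It only illustrates how 
resetting cell-level sequence IDs can let an overwritten value win.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these types are real HBase classes. */
public class SeqIdResetSketch {

    /** One version of a cell, tagged with the sequence ID that wrote it. */
    public static final class Cell {
        final String row;
        final long seqId;
        final String value;

        Cell(String row, long seqId, String value) {
            this.row = row;
            this.seqId = seqId;
            this.value = value;
        }
    }

    /** A simplified store file: a name, a size, and the cells it holds. */
    public static final class StoreFile {
        final String name;
        final long sizeBytes;
        final List<Cell> cells;

        StoreFile(String name, long sizeBytes, List<Cell> cells) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.cells = cells;
        }
    }

    /**
     * Returns the value a reader would see for the given row.
     * If resetSeqIds is true, every cell is treated as having sequence ID 0,
     * mimicking a file that carries the bulk load marker but has not been
     * bulk loaded yet. ASSUMPTION: ties are then broken by file size, with
     * the smaller file winning; the real ordering in HBase may differ.
     */
    public static String read(String row, List<StoreFile> files, boolean resetSeqIds) {
        String best = null;
        long bestSeq = -1L;
        long bestSize = Long.MAX_VALUE;
        for (StoreFile f : files) {
            for (Cell c : f.cells) {
                if (!c.row.equals(row)) {
                    continue;
                }
                long seq = resetSeqIds ? 0L : c.seqId;
                boolean wins = seq > bestSeq || (seq == bestSeq && f.sizeBytes < bestSize);
                if (wins) {
                    best = c.value;
                    bestSeq = seq;
                    bestSize = f.sizeBytes;
                }
            }
        }
        return best;
    }

    /** Two hypothetical files: the newer write happens to sit in the larger file. */
    public static List<StoreFile> sampleFiles() {
        StoreFile small = new StoreFile("hfile-a", 100L,
            Arrays.asList(new Cell("r1", 5L, "old")));
        StoreFile large = new StoreFile("hfile-b", 500L,
            Arrays.asList(new Cell("r1", 9L, "new")));
        return Arrays.asList(small, large);
    }

    public static void main(String[] args) {
        System.out.println("seqIds honored: " + read("r1", sampleFiles(), false));
        System.out.println("seqIds reset: " + read("r1", sampleFiles(), true));
    }
}
```

In this toy model, the read returns "new" when sequence IDs are honored and 
"old" when they are reset, because the write order is no longer visible and 
only the assumed size-based tie-break decides the winner.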

  was:
When an incremental backup is taken, WAL files are re-written as HFiles using 
the WAL player. These HFiles are not formatted properly, and the sequence IDs 
for cells (which are required for correctness) are ignored by the RegionScanner.

This is a follow up to HBASE-27649; that fix plumbed sequence IDs from the WAL 
to the HFiles generated by WALPlayer. However, the HFiles generated by 
WALPlayer are marked to be bulk loaded [by metadata on the 
HFile|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L461],
 and RegionScanner [will reset cell-level sequence 
IDs|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStoreFile.java#L427-L450]
 for HFiles with this metadata, instead relying on the sequence ID generated at 
time of bulkload (which won't ever happen for these HFiles intended for 
incremental backups).

The result is that cell versions that have been overwritten (and therefore rely 
on sequence IDs for correctness) will return an incorrect value when read by 
HBase or by tooling such as the ClientSideRegionScanner. Instead, I believe the 
cell value that is returned will be decided based on [sorting the HFiles by 
their 
size|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39].

        Summary: Incremental backup HFiles do not contain a sequence ID  (was: 
Incremental backup does not properly preserve sequence IDs)

> Incremental backup HFiles do not contain a sequence ID
> ------------------------------------------------------
>
>                 Key: HBASE-29716
>                 URL: https://issues.apache.org/jira/browse/HBASE-29716
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>    Affects Versions: 3.0.0, 2.5.13, 2.6.5
>            Reporter: Kodey Converse
>            Priority: Minor
>              Labels: pull-request-available
>
> When an incremental backup is taken, WAL files are re-written as HFiles using 
> WALPlayer. These HFiles are formatted only for bulkloads (which is their 
> primary purpose), and the sequence IDs for cells (which are required for 
> correctness) are ignored by the RegionScanner when the files are read through 
> the ClientSideRegionScanner.
>
> This is a follow up to HBASE-27649; that fix plumbed sequence IDs from the 
> WAL to the HFiles generated by WALPlayer. However, the HFiles generated by 
> WALPlayer are marked to be bulk loaded [by metadata on the 
> HFile|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L461],
>  and RegionScanner [will reset cell-level sequence 
> IDs|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStoreFile.java#L427-L450]
>  for HFiles with this metadata, instead relying on the sequence ID generated 
> at time of bulkload (i.e. during a backup restore). If the HFiles are read 
> before that bulkload, e.g. via the ClientSideRegionScanner, the scan can 
> return incorrect results.
>
> The result is that reads of cells that have been overwritten (and therefore 
> rely on sequence IDs for correctness) will return an incorrect value when 
> performed by tooling such as the ClientSideRegionScanner. Instead, I believe the 
> cell value that is returned will be decided based on [sorting the HFiles by 
> their 
> size|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
