[ 
https://issues.apache.org/jira/browse/HBASE-23160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

junfei liang updated HBASE-23160:
---------------------------------
    Description: 
In our online cluster, we found that a daughter region's reference file can
point to a nonexistent HFile. When such a region is balanced, the region open
fails with a FileNotFoundException and a lot of errors are thrown.

 

How the problem happens:

1.   Region R1 is on server S1 and has a compaction: storefile sf1 is
compacted into another file at time t1.

2.   S1 has a long full GC (about 470s in our case) starting at t2 (t1 + 300s).

3.   R1 goes offline from S1 after t2 + 180s: the RS's ZK session expired, so
the master considered the RS dead and reassigned R1 to S2.

4.   S2 found that R1 was too large, so it made a split request and R1 was
split into R2 + R3, both of which hold a reference to sf1.

5.   S1 finishes the full GC at t2 + 470s and, before it reports to the master,
CompactedHFilesDischarger removes the compacted file sf1 from R1 (R1 is still
online on server S1).

6.   So R2 and R3 hold a reference to a storefile that no longer exists, which
leads to the error we came across.
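
The interleaving in steps 1-6 can be replayed with a small toy model. This is
only a sketch: the class StaleDischargerRace, the in-memory set standing in for
the shared filesystem, and the choice of t1 = 0 are assumptions made for
illustration, not HBase code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StaleDischargerRace {
    // Shared filesystem, modeled as a set of store file names (toy model).
    private final Set<String> sharedFs = new HashSet<>();
    // Reference files written by the split: daughter region -> parent store file.
    private final Map<String, String> references = new LinkedHashMap<>();

    /** Daughter regions whose reference points at a file missing from the shared FS. */
    public List<String> danglingReferences() {
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, String> e : references.entrySet()) {
            if (!sharedFs.contains(e.getValue())) {
                dangling.add(e.getKey());
            }
        }
        return dangling;
    }

    /** Replays steps 1-6 in order and returns the daughters left with a dangling reference. */
    public static List<String> replay() {
        StaleDischargerRace model = new StaleDischargerRace();
        long t1 = 0;          // assumed origin of the timeline, in seconds
        long t2 = t1 + 300;   // step 2: full GC starts on S1

        // Step 1 (t1): sf1 is compacted on S1 but stays on disk until the discharger runs.
        model.sharedFs.add("sf1");

        // Step 3 (t2 + 180): ZK session expires, master reassigns R1 to S2.
        long sessionExpired = t2 + 180;

        // Step 4: S2 splits R1; both daughters hold a reference to sf1.
        model.references.put("R2", "sf1");
        model.references.put("R3", "sf1");

        // Step 5 (t2 + 470): S1 wakes up, still believes it hosts R1, and removes sf1.
        long gcFinished = t2 + 470;
        if (gcFinished > sessionExpired) {
            model.sharedFs.remove("sf1");
        }

        // Step 6: the references now point at a missing file.
        return model.danglingReferences();
    }

    public static void main(String[] args) {
        // prints "dangling references: [R2, R3]"
        System.out.println("dangling references: " + replay());
    }
}
```

The point of the model is that nothing stops the stale S1 from deleting sf1
once its pause ends, because the delete happens before S1 learns it was
declared dead.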

 

Possible solutions:

 

1.   Write a WAL marker before removing the HFile from the store.

      In SSH (ServerShutdownHandler), the dead RS's log dir is deleted, so
writing the WAL marker will fail.

      But this is not absolutely reliable, because the RS can go into a full GC
after writing the marker. There is no way to do these two actions atomically.

      It's not 100% reliable, but it's simple.
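
A minimal sketch of option 1, assuming a local directory stands in for the RS
log dir and a plain file append stands in for the WAL marker. The class
WalMarkerGuard and its method names are hypothetical and are not HBase APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WalMarkerGuard {
    /**
     * Removes hfile only if a marker can first be appended to a file inside walDir.
     * Returns true if the file was removed, false if the marker write failed.
     */
    public static boolean removeWithMarker(Path walDir, Path hfile) {
        byte[] marker = ("REMOVE " + hfile.getFileName() + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            // Throws an IOException if walDir has already been deleted.
            Files.write(walDir.resolve("wal"), marker,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            return false; // treated as "this RS was declared dead": keep the file
        }
        // Not atomic with the marker write: a long pause right here, followed by
        // the log dir being deleted and the region being split, reopens the race.
        try {
            Files.deleteIfExists(hfile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    /** Demo: the log dir is deleted first, so the removal is refused and sf1 survives. */
    public static boolean hfileSurvivesAfterLogDirDeleted() {
        try {
            Path root = Files.createTempDirectory("walmarker");
            Path walDir = Files.createDirectory(root.resolve("wals"));
            Path hfile = Files.createFile(root.resolve("sf1"));
            Files.delete(walDir); // stand-in for SSH deleting the dead RS log dir
            boolean removed = removeWithMarker(walDir, hfile);
            return !removed && Files.exists(hfile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sf1 survives: " + hfileSurvivesAfterLogDirDeleted());
    }
}
```

The comment between the two try blocks marks the gap described above: the
marker write and the delete are separate operations.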

2.   A possibly more reliable solution.

      When removing an HFile from the store dir, first move it to a special
RS-level dir, and then move it to the archive dir.

      We delete that dir in SSH, so removing the compacted files will fail at
the first step. It's reliable but complicated.
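
A minimal sketch of option 2 on a local filesystem, assuming that a rename into
a deleted directory fails. The class StagedArchiver, the directory names, and
the use of java.nio.file in place of the HBase filesystem layer are all
assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StagedArchiver {
    /**
     * Step 1: move hfile into the RS-level staging dir.
     * Step 2: move it on to the archive dir.
     * Returns false if step 1 fails, in which case hfile stays in the store dir.
     */
    public static boolean archive(Path hfile, Path stagingDir, Path archiveDir) {
        Path staged = stagingDir.resolve(hfile.getFileName());
        try {
            // Fails if stagingDir was deleted (e.g. by the recovery handler).
            Files.move(hfile, staged, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            return false;
        }
        try {
            Files.move(staged, archiveDir.resolve(hfile.getFileName()),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    /** Runs one scenario in a temp dir and reports where sf1 ended up. */
    public static String demo(boolean stagingDeleted) {
        try {
            Path root = Files.createTempDirectory("staged");
            Path storeDir = Files.createDirectory(root.resolve("store"));
            Path stagingDir = Files.createDirectory(root.resolve("staging"));
            Path archiveDir = Files.createDirectory(root.resolve("archive"));
            Path hfile = Files.createFile(storeDir.resolve("sf1"));
            if (stagingDeleted) {
                Files.delete(stagingDir); // stand-in for SSH deleting the RS-level dir
            }
            archive(hfile, stagingDir, archiveDir);
            if (Files.exists(hfile)) {
                return "store";
            }
            return Files.exists(archiveDir.resolve("sf1")) ? "archive" : "missing";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("staging deleted -> sf1 in " + demo(true));
        System.out.println("staging present -> sf1 in " + demo(false));
    }
}
```

Unlike the marker approach, the check and the removal here are the same
operation: the first move either succeeds or leaves sf1 in the store dir.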

      

 

 

 


> The dead RS may remove compacted files after recovering from full GC
> --------------------------------------------------------------------
>
>                 Key: HBASE-23160
>                 URL: https://issues.apache.org/jira/browse/HBASE-23160
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: junfei liang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
