[ 
https://issues.apache.org/jira/browse/HBASE-23160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell resolved HBASE-23160.
-----------------------------------------
    Resolution: Invalid

> The Dead RS May Remove compacted  files after recover from full gc
> ------------------------------------------------------------------
>
>                 Key: HBASE-23160
>                 URL: https://issues.apache.org/jira/browse/HBASE-23160
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: junfei liang
>            Priority: Major
>
> In our online cluster, we found that a daughter region's reference file can
> point to a nonexistent HFile. When such a region is balanced, the region open
> fails with a FileNotFoundException and a lot of errors are thrown.
>  
> How the problem happens:
> 1. Region R1 is on server S1 and goes through a compaction: storefile sf1 is
>    compacted into another file at time t1.
> 2. S1 enters a long full GC (about 470s in our case) at t2 (t1 + 300s).
> 3. After t2 + 180s the RS's ZK session expires, so the master considers the
>    RS dead, takes R1 offline from S1, and reassigns R1 to S2.
> 4. S2 finds that R1 is too large and requests a split. R1 is split into
>    R2 + R3, and both hold a reference to sf1.
> 5. S1 finishes the full GC at t2 + 470s and, before it reports to the
>    master, CompactedHFilesDischarger removes the compacted file sf1 from R1
>    (R1 is still online on server S1).
> 6. R2 and R3 now hold a reference to a storefile that no longer exists,
>    which leads to the error we came across.
>  
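Steps 4 to 6 above can be reproduced in miniature. The following is only a rough simulation under stated assumptions: local directories stand in for HDFS, and a "reference file" is modelled as a text file holding the parent file's path, which is not HBase's real reference format. The class and method names (DanglingReferenceDemo, writeReference, referenceResolves) are hypothetical and do not come from HBase.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy model of the race; not HBase code. Requires Java 11+. */
public class DanglingReferenceDemo {

    /** Writes a toy reference: a text file containing the parent store file's path. */
    public static Path writeReference(Path daughterDir, Path parentFile) throws IOException {
        Files.createDirectories(daughterDir);
        Path ref = daughterDir.resolve(parentFile.getFileName() + ".ref");
        return Files.writeString(ref, parentFile.toString());
    }

    /** Returns true if the file named inside the toy reference still exists. */
    public static boolean referenceResolves(Path reference) throws IOException {
        return Files.exists(Path.of(Files.readString(reference)));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hbase-23160-race");
        Path store = Files.createDirectories(root.resolve("R1/cf"));
        Path archive = Files.createDirectories(root.resolve("archive"));
        Path sf1 = Files.createFile(store.resolve("sf1"));

        // Step 4: S2 splits R1; both daughters reference sf1.
        Path r2Ref = writeReference(root.resolve("R2/cf"), sf1);
        Path r3Ref = writeReference(root.resolve("R3/cf"), sf1);

        // Step 5: S1 wakes up from the full GC and discharges sf1 from R1.
        Files.move(sf1, archive.resolve("sf1"));

        // Step 6: neither daughter can resolve its reference any more.
        System.out.println("R2 reference resolves: " + referenceResolves(r2Ref));
        System.out.println("R3 reference resolves: " + referenceResolves(r3Ref));
    }
}
```

In this model the discharge by S1 is an unconditional move, so nothing stops a server that the master has already declared dead from removing a file that the new daughter regions depend on.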
> Possible solutions:
>  
> 1. Write a WAL marker before removing the HFile from the store.
>    In SSH the dead RS's log dir is deleted, so writing the WAL marker will
>    fail. This is not absolutely reliable, because the RS can go into a full
>    GC after writing the marker, and there is no way to do these two actions
>    atomically. It is not 100% reliable, but it is simple.
> 2. A possibly reliable solution.
>    When removing an HFile from the store dir, first move it to an RS-level
>    special dir, and then move it to the archive dir. We delete that dir in
>    SSH, so removing compacted files will fail at the first step. It is
>    reliable but complicated.
>       
>  
>  
>  
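The second proposal can be sketched as a two-step move. This is a minimal illustration, assuming local directories in place of HDFS and a per-server staging directory that the master's dead-server handling (SSH in the description) deletes. StagedHFileRemoval, archiveCompactedFile, and the directory layout are hypothetical names for illustration, not HBase APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the staged removal idea; not HBase code. */
public class StagedHFileRemoval {

    /**
     * Moves a compacted file store dir -> per-RS staging dir -> archive dir.
     * Returns false and leaves the store file in place when the first move
     * fails, for example because the staging dir has already been deleted.
     */
    public static boolean archiveCompactedFile(Path storeFile, Path rsStagingDir, Path archiveDir) {
        Path staged = rsStagingDir.resolve(storeFile.getFileName());
        try {
            // Step 1: fails if the staging dir of this server no longer exists.
            Files.move(storeFile, staged);
        } catch (IOException e) {
            return false; // fenced off: the store file is untouched
        }
        try {
            // Step 2: the file has left the store; hand it over to the archive.
            Files.move(staged, archiveDir.resolve(staged.getFileName()));
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hbase-23160-staged");
        Path store = Files.createDirectories(root.resolve("R1/cf"));
        Path archive = Files.createDirectories(root.resolve("archive"));
        Path staging = Files.createDirectories(root.resolve("staging/S1"));
        Path sf1 = Files.createFile(store.resolve("sf1"));

        // Dead-server handling for S1 runs first and deletes its staging dir.
        Files.delete(staging);

        // S1 comes back from the full GC and tries to discharge sf1.
        boolean removed = archiveCompactedFile(sf1, staging, archive);
        System.out.println("removed=" + removed + ", sf1 still in store=" + Files.exists(sf1));
    }
}
```

The sketch relies on a move into a missing directory failing, which holds for java.nio.file on a local filesystem. Whether the equivalent rename on HDFS gives the same guarantee under concurrent deletion would need to be checked separately.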



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
