[
https://issues.apache.org/jira/browse/HBASE-23160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell resolved HBASE-23160.
-----------------------------------------
Resolution: Invalid
> The Dead RS May Remove compacted files after recover from full gc
> ------------------------------------------------------------------
>
> Key: HBASE-23160
> URL: https://issues.apache.org/jira/browse/HBASE-23160
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: junfei liang
> Priority: Major
>
> In our production cluster, we found that a daughter region's reference file
> can point to a nonexistent HFile. When such a region is moved by the
> balancer, the region open fails with a FileNotFoundException and a lot of
> errors are thrown.
>
> How the problem happens:
> 1. Region R1 is on server S1 and is compacted: storefile sf1 is compacted
> into another file at time t1.
> 2. S1 enters a long full GC (about 470s in our case) at t2 (t1 + 300s).
> 3. At t2 + 180s the RS's ZK session expires and R1 goes offline from S1, so
> the master considers the RS dead and reassigns R1 to S2.
> 4. S2 finds that R1 is too large and requests a split. R1 is split into
> R2 + R3, both of which hold a reference to sf1.
> 5. S1 finishes the full GC at t2 + 470s and, before it reports to the
> master, CompactedHFilesDischarger removes the compacted file sf1 from R1
> (S1 still considers R1 online).
> 6. R2 and R3 now hold a reference to a storefile that no longer exists,
> which leads to the error we came across.
>
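To make the ordering in the steps above concrete, here is a minimal Python sketch of the timeline. It is a toy model, not HBase code: the offsets (300s, 180s, 470s) come from the report, while setting t1 = 0 and the names `Event`, `timeline` and `unsafe_window` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    at: int    # seconds since t1 (t1 = 0 is an arbitrary choice)
    what: str


T1 = 0                       # sf1 is compacted on S1
T2 = T1 + 300                # S1 enters the long full GC
SESSION_EXPIRED = T2 + 180   # ZK session expires; master reassigns R1 to S2
GC_END = T2 + 470            # S1 resumes and still believes it hosts R1


def timeline():
    """Return the reported events sorted by time."""
    events = [
        Event(GC_END, "S1 resumes; CompactedHFilesDischarger may remove sf1"),
        Event(T1, "sf1 compacted on S1"),
        Event(SESSION_EXPIRED, "ZK session expired; R1 reassigned to S2"),
        Event(T2, "full GC starts on S1"),
    ]
    return sorted(events, key=lambda e: e.at)


def unsafe_window():
    """Seconds during which S1 is paused while R1 is already owned elsewhere."""
    return GC_END - SESSION_EXPIRED


if __name__ == "__main__":
    for e in timeline():
        print(f"t1 + {e.at:>3}s: {e.what}")
    print(f"unsafe window: {unsafe_window()}s")
```

With the numbers from the report, the split on S2 (step 4) falls somewhere inside that window, before S1 wakes up and runs the discharger.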
> Possible solutions:
>
> 1. Write a WAL marker before removing the HFile from the store.
> Because SSH deletes the dead RS's log dir, writing the WAL marker will
> fail.
> This is not absolutely reliable, because the RS can enter a full GC after
> writing the marker; there is no way to do these two actions atomically.
> It's not 100% reliable, but it's simple.
> 2. A possibly reliable solution:
> When removing an HFile from the store dir, first move it to a special
> RS-level dir, and then move it to the archive dir.
> We delete that dir in SSH, so removing the compacted files fails at the
> first step. It's reliable but complicated.
>
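The second proposal can be sketched as a two-step move. The Python below is only an illustration under stated assumptions: the local filesystem stands in for HDFS, `discharge`, `demo` and the directory names are hypothetical, and it relies on a rename into a missing directory failing (true for `os.rename` locally; the equivalent behaviour on HDFS is assumed here, not verified).

```python
import os
import shutil
import tempfile
from pathlib import Path


def discharge(hfile, staging_dir, archive_dir):
    """Archive hfile via a per-server staging dir.

    Returns False if the first move fails, which in this sketch happens
    when the staging dir has been deleted (or the hfile itself is gone).
    """
    staged = Path(staging_dir) / Path(hfile).name
    try:
        # Step 1: move into the RS-level dir; fails if that dir was deleted.
        os.rename(hfile, staged)
    except FileNotFoundError:
        return False
    # Step 2: move from the RS-level dir into the archive dir.
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    os.rename(staged, Path(archive_dir) / Path(hfile).name)
    return True


def demo():
    """Simulate a dead server whose staging dir was removed by the master."""
    root = Path(tempfile.mkdtemp())
    try:
        store = root / "store"
        staging = root / "staging-S1"
        archive = root / "archive"
        store.mkdir()
        staging.mkdir()
        sf1 = store / "sf1"
        sf1.write_text("hfile bytes")

        shutil.rmtree(staging)  # master-side cleanup for the dead S1

        fenced = not discharge(sf1, staging, archive)
        return fenced, sf1.exists()
    finally:
        shutil.rmtree(root)
```

In this sketch `demo()` reports that the stale server's removal was rejected and that sf1 is still in the store dir, so references to it stay valid.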
--
This message was sent by Atlassian Jira
(v8.20.7#820007)