[ 
https://issues.apache.org/jira/browse/HBASE-27231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763825#comment-17763825
 ] 

Andrew Kyle Purtell edited comment on HBASE-27231 at 9/11/23 5:03 PM:
----------------------------------------------------------------------

I think we can. I cherry-picked the master commit for this JIRA onto our 
internal fork of 2.5.5 with a minor conflict. The conflict resolved cleanly, 
but because this code is critical, it remains to be seen, with the additional 
confidence that tests such as cluster chaos tests provide, whether the change 
is OK. Only one WAL unit test is now failing, and I think that is because the 
test itself is no longer valid. I will report back when the internal change is 
all green. 


was (Author: apurtell):
I think we can. I cherry picked the master commit for this JIRA from master 
branch to our internal fork of 2.5.5 with a minor conflict (this remains to be 
seen) and only one WAL unit test is now not passing, and I think it is because 
the test itself is no longer valid. I will report back when the internal change 
is all green. 

> FSHLog should retry writing WAL entries when syncs to HDFS failed.
> ------------------------------------------------------------------
>
>                 Key: HBASE-27231
>                 URL: https://issues.apache.org/jira/browse/HBASE-27231
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>    Affects Versions: 3.0.0-alpha-4
>            Reporter: chenglei
>            Assignee: chenglei
>            Priority: Major
>             Fix For: 3.0.0-beta-1
>
>
> As HBASE-27223 said, if a {{WAL}} write to HDFS fails, we do not know 
> whether the data has been persisted. {{AsyncFSWAL}} handles this by opening 
> a new writer and writing the WAL entries again, with logic added to WAL 
> split and replay to deal with the resulting duplicate entries. {{FSHLog}} 
> has no such logic: when {{ProtobufLogWriter.append}} or 
> {{ProtobufLogWriter.sync}} fails, {{FSHLog.sync}} immediately rethrows the 
> exception. We should implement the same retry logic as {{AsyncFSWAL}}, so 
> that {{WAL.sync}} can only throw {{TimeoutIOException}} and we can uniformly 
> abort the RegionServer when {{WAL.sync}} fails.
> The basic idea is that, because both {{FSHLog.RingBufferEventHandler}} and 
> {{AsyncFSWAL.consumeExecutor}} are single-threaded, we can reuse the logic 
> in {{AsyncFSWAL}}: move most of its code up to {{AbstractFSWAL}} and adapt 
> the {{SyncRunner}} in {{FSHLog}} to the logic in {{AsyncWriter.sync}}. Once 
> that is done, most of the logic in {{AsyncFSWAL}} and {{FSHLog}} is unified; 
> only how the {{writer}} is synced differs.
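
For illustration only, the retry idea in the description above can be sketched 
roughly as follows. This is not the HBase code: {{RetryingWalSketch}}, its 
{{Writer}} interface, and {{InMemoryWriter}} are hypothetical names made up for 
this example, a bounded attempt count stands in for the sync timeout, and a 
plain {{IOException}} stands in for {{TimeoutIOException}}. Entries are 
buffered until {{sync}} here, which is a simplification.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: on a failed sync, roll to a new writer and rewrite unacked entries. */
final class RetryingWalSketch {

  /** Hypothetical stand-in for a WAL writer; not the real HBase writer interface. */
  interface Writer {
    void append(String entry) throws IOException;
    void sync() throws IOException;
  }

  /** Test double: records every append in a shared sink and optionally fails sync. */
  static final class InMemoryWriter implements Writer {
    private final List<String> sink;
    private final boolean failSync;

    InMemoryWriter(List<String> sink, boolean failSync) {
      this.sink = sink;
      this.failSync = failSync;
    }

    @Override
    public void append(String entry) {
      sink.add(entry);
    }

    @Override
    public void sync() throws IOException {
      if (failSync) {
        throw new IOException("simulated sync failure");
      }
    }
  }

  private final Supplier<Writer> writerFactory;
  private final int maxAttempts;
  // Entries that have not yet been covered by a successful sync.
  private final Deque<String> unacked = new ArrayDeque<>();
  private Writer writer;

  RetryingWalSketch(Supplier<Writer> writerFactory, int maxAttempts) {
    this.writerFactory = writerFactory;
    this.maxAttempts = maxAttempts;
    this.writer = writerFactory.get();
  }

  /** Buffers an entry; it is written out on the next sync. */
  void append(String entry) {
    unacked.addLast(entry);
  }

  /** Writes and syncs all unacked entries, rolling the writer after each failure. */
  void sync() throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        for (String entry : unacked) {
          writer.append(entry);
        }
        writer.sync();
        unacked.clear();
        return;
      } catch (IOException e) {
        // We cannot tell whether the failed writer persisted anything, so the
        // same entries are rewritten to a fresh writer. This may duplicate data.
        last = e;
        writer = writerFactory.get();
      }
    }
    throw new IOException("WAL sync failed after " + maxAttempts + " attempts", last);
  }
}
```

Because the failed writer may already have persisted some entries, the same 
entry can appear more than once across writers, which is why the description 
mentions duplicate handling in WAL split and replay.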



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
