[
https://issues.apache.org/jira/browse/HBASE-22917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964006#comment-16964006
]
Duo Zhang commented on HBASE-22917:
-----------------------------------
I do not think it is fine to merge this back.
This is the feedback on the PR
{quote}
IIRC two master can't write proc-WAL together, do we have such corner scenario?
However here we are trying to cleanup when rollWriter fails when write header
throws IOE.
{quote}
Actually it is our magic on the file id that prevents two masters from writing
the proc-WAL together, so we should not use this as an assumption when
implementing our file id logic; that is totally wrong.
Now we just increase the file id by one if we fail to delete the old file,
but this is an RPC call, right? It could happen that on the NN side the file
has been deleted successfully but on the client side we get an error, and then
we increase the file id by 1, and then there will be a hole. What if another
master tries to write a new file id but just fills in the hole? Then we have
two 'live' masters which could both write the proc WAL (at least there will be
a small overlap due to the async behavior of ZK session expiry processing).
This will lead to inconsistency and mess up everything.
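To make the race concrete, here is a toy Java model of it. This is not HBase
code: FakeWalDir, deleteButLoseResponse and the ids 5 and 6 are hypothetical
names and values chosen only for illustration, and the "directory" is just an
in-memory set. It only shows that a delete which succeeds on the server but is
reported as failed to the client leaves a free id behind the one the first
master moves on to.

```java
import java.util.TreeSet;

/** Toy model of the file id hole; all names here are hypothetical. */
class FileIdHoleDemo {
    /** Stand-in for the NameNode's view of the proc-WAL directory. */
    static final class FakeWalDir {
        private final TreeSet<Long> ids = new TreeSet<>();

        /** Exclusive create: returns false if the id already exists. */
        boolean create(long id) {
            return ids.add(id);
        }

        /** The delete is applied, but the caller is told that it failed. */
        boolean deleteButLoseResponse(long id) {
            ids.remove(id);
            return false;
        }

        boolean exists(long id) {
            return ids.contains(id);
        }
    }

    /** Returns true if a second master can create the id the first one skipped. */
    static boolean secondMasterFillsHole() {
        FakeWalDir dir = new FakeWalDir();
        long flushLogId = 5;
        dir.create(flushLogId);                 // master A creates 5, header write fails
        boolean deleted = dir.deleteButLoseResponse(flushLogId);
        if (!deleted) {
            flushLogId++;                       // A believes 5 still exists, moves to 6
        }
        dir.create(flushLogId);                 // A is now writing file 6
        // Another master attempts an exclusive create on id 5 and succeeds,
        // even though A is still alive and writing file 6.
        return dir.create(5) && dir.exists(6);
    }
}
```

In this model secondMasterFillsHole() returns true, i.e. the exclusive create
on the skipped id no longer fences the second writer.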
So my suggestion is that, unless we have a clear explanation of why the above
scenario cannot happen, the safest way is to just abort the HMaster if we
fail to roll the writer. And maybe it is safe to just increase the file id
without deleting the broken proc WAL file (this is a typical solution in WAL
based systems), but anyway, deleting a WAL file is usually not a good idea...
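The two options above could look roughly like the following. This is only a
sketch under stated assumptions, not the actual WALProcedureStore code:
RollPolicySketch, HeaderWriter, Result and maxAttempts are hypothetical names,
and writeHeader is assumed to be called only after the file for that id has
already been created exclusively.

```java
import java.io.IOException;

/** Hypothetical sketch of the two suggested policies; not HBase code. */
class RollPolicySketch {
    /** Assumed hook: writes the header into an already created file. */
    interface HeaderWriter {
        void writeHeader(long logId) throws IOException;
    }

    enum Result { ROLLED, ABORT_MASTER }

    private long flushLogId;

    RollPolicySketch(long initialLogId) {
        this.flushLogId = initialLogId;
    }

    long currentLogId() {
        return flushLogId;
    }

    /** Option 1: any failure while rolling is fatal; nothing is deleted. */
    Result rollOrAbort(HeaderWriter writer) {
        long nextId = flushLogId + 1;
        try {
            writer.writeHeader(nextId);
        } catch (IOException e) {
            return Result.ABORT_MASTER;     // the caller would abort the master
        }
        flushLogId = nextId;
        return Result.ROLLED;
    }

    /** Option 2: keep the broken file and move on to the next id. */
    Result rollSkippingBrokenFiles(HeaderWriter writer, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            long nextId = ++flushLogId;     // the id stays consumed on failure
            try {
                writer.writeHeader(nextId);
                return Result.ROLLED;
            } catch (IOException e) {
                // The broken file is left in place, so no hole is created.
            }
        }
        return Result.ABORT_MASTER;
    }
}
```

In neither variant is a file deleted, so the set of used ids never gets a gap
that another writer could claim.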
Thanks.
> Proc-WAL roll fails always saying someone else has already created log
> ----------------------------------------------------------------------
>
> Key: HBASE-22917
> URL: https://issues.apache.org/jira/browse/HBASE-22917
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, wal
> Reporter: Pankaj Kumar
> Assignee: Pankaj Kumar
> Priority: Critical
> Fix For: 3.0.0, 2.3.0, 2.2.3
>
>
> Recently we hit a weird scenario where the Procedure WAL roll fails because
> the log has already been created by someone else.
> Later, while going through the logs and code, we observed that during the
> Proc-WAL roll it failed to write the header. On failure, the file stream is
> just closed:
> {code}
> try {
>   ProcedureWALFormat.writeHeader(newStream, header);
>   startPos = newStream.getPos();
> } catch (IOException ioe) {
>   LOG.warn("Encountered exception writing header", ioe);
>   newStream.close();
>   return false;
> }
> {code}
> Since we don't delete the corrupted file or increment the *flushLogId*, each
> retry tries to create the same *flushLogId* file. An HMaster failover will
> resolve this issue, but we should handle it.