[jira] [Commented] (HBASE-22917) Proc-WAL roll fails always saying someone else has already created log

Duo Zhang (Jira) Thu, 31 Oct 2019 06:30:24 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-22917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964006#comment-16964006
 ]


Duo Zhang commented on HBASE-22917:
-----------------------------------

I do not think it is fine to merge this back.

This is the feedback on the PR:

{quote}
IIRC two master can't write proc-WAL together, do we have such corner scenario?
However here we are trying to cleanup when rollWriter fails when write header 
throws IOE.
{quote}

Actually, it is our magic on the file id which prevents two masters from 
writing the proc-WAL together, so we should not use this as an assumption 
when implementing the file id logic itself. That is totally wrong.

Now we just increase the file id by one if we fail to delete the old file, 
but this is an RPC call, right? It could happen that on the NN side the file 
has been deleted successfully while on the client side we get an error. Then 
we increase the file id by 1, and there will be a hole. What if another 
master tries to write a new file id and just fills in the hole? Then we have 
two 'live' masters which could both write the proc-WAL (there will be at 
least a small overlap due to the async behavior of zk session expiry 
processing). This will lead to inconsistency and mess up everything.
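
To make the hole scenario concrete, here is a minimal sketch in Java. It is a 
toy model, not the actual proc-WAL store code: FakeNameNode, createExclusive 
and the file ids 9/10/11 are made up for illustration, and it assumes the 
second master chose its next id from a listing taken before the first master 
moved on.

```java
import java.util.Set;
import java.util.TreeSet;

public class FileIdHoleSketch {

    /** Hypothetical stand-in for the NN namespace: the set of existing proc-WAL file ids. */
    static final class FakeNameNode {
        private final Set<Long> files = new TreeSet<>();

        /** Exclusive create: returns false if a file with this id already exists. */
        synchronized boolean createExclusive(long id) {
            return files.add(id);
        }

        /** The delete is applied on the server; the client may still see an RPC error. */
        synchronized void delete(long id) {
            files.remove(id);
        }
    }

    /** Returns true if both masters end up holding a log that each believes is exclusive. */
    static boolean bothMastersHoldLog() {
        FakeNameNode nn = new FakeNameNode();
        nn.createExclusive(9L); // last good log

        // Master B listed the logs earlier, saw max id 9, and plans to create 10.
        long bNext = 10L;

        // Master A rolls to 10, but writing the header fails.
        boolean aCreated10 = nn.createExclusive(10L);
        // A deletes the broken file: the server applies the delete, yet the
        // client sees an error (assumed here), so A cannot tell it succeeded.
        nn.delete(10L);
        boolean clientSawDeleteError = true;

        // A's fallback: bump the id by one, leaving a hole at 10.
        long aNext = clientSawDeleteError ? 11L : 10L;
        boolean aHolds = nn.createExclusive(aNext);

        // B now fills the hole; the exclusive create no longer fences it out.
        boolean bHolds = nn.createExclusive(bNext);

        return aCreated10 && aHolds && bHolds;
    }

    public static void main(String[] args) {
        System.out.println("both masters hold a log: " + bothMastersHoldLog());
    }
}
```

In this model the exclusive create on id 10 is the only thing that would have 
stopped master B, and the delete removes exactly that fence.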

So my suggestion is that, unless we have a clear explanation of why the above 
scenario cannot happen, the safest way is to just abort the HMaster if we 
fail to roll the writer. And maybe it is safe to just increase the file id 
without deleting the broken proc-WAL file (this is a typical solution in 
WAL-based systems), but anyway, deleting a WAL file is usually not a good 
idea...
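
A rough sketch of the two options, again in Java. HeaderWriter, Aborter, 
rollOrAbort and rollSkippingBroken are hypothetical names for illustration 
only; this is not the existing roll code, just the shape of the two policies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class RollPolicySketch {

    /** Hypothetical hook: creates the log with the given id and writes its header. */
    interface HeaderWriter {
        void writeHeader(long logId) throws IOException;
    }

    /** Hypothetical hook: stands in for aborting the HMaster. */
    interface Aborter {
        void abort(String why, Throwable cause);
    }

    /** Option 1: if the roll fails, abort instead of retrying or deleting anything. */
    static boolean rollOrAbort(long nextLogId, HeaderWriter writer, Aborter aborter) {
        try {
            writer.writeHeader(nextLogId);
            return true;
        } catch (IOException ioe) {
            aborter.abort("Failed to roll proc-WAL to log id " + nextLogId, ioe);
            return false;
        }
    }

    /**
     * Option 2: leave the broken file in place and move on to the next id.
     * Nothing is deleted, so no id ever becomes free again.
     * Returns the id of the log that was rolled successfully.
     */
    static long rollSkippingBroken(long nextLogId, HeaderWriter writer, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            long id = nextLogId + i;
            try {
                writer.writeHeader(id);
                return id;
            } catch (IOException ioe) {
                last = ioe; // keep the broken file; try the next id
            }
        }
        throw new UncheckedIOException("Giving up after " + maxAttempts + " attempts", last);
    }
}
```

Option 2 only makes sense if the reader of the logs tolerates a file with a 
missing or partial header, which is an assumption here.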

Thanks.

> Proc-WAL roll fails always saying someone else has already created log
> ----------------------------------------------------------------------
>
>                 Key: HBASE-22917
>                 URL: https://issues.apache.org/jira/browse/HBASE-22917
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, wal
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 3.0.0, 2.3.0, 2.2.3
>
>
> Recently we met a weird scenario where the Procedure WAL roll fails because 
> the log has already been created by someone else.
> Later, while going through the logs and code, we observed that the Proc-WAL 
> roll failed to write the header. On failure the file stream is just closed:
> {code}
> try {
>   ProcedureWALFormat.writeHeader(newStream, header);
>   startPos = newStream.getPos();
> } catch (IOException ioe) {
>   LOG.warn("Encountered exception writing header", ioe);
>   newStream.close();
>   return false;
> }
> {code}
> Since we don't delete the corrupted file or increment the *flushLogId*, each 
> retry tries to create the file with the same *flushLogId*. An HMaster 
> failover resolves the issue, but we should handle it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
