[jira] [Deleted] (HDFS-10901) QJM should not consider stale/failed txn available in any one of JNs.

Vinayakumar B (JIRA) Mon, 26 Sep 2016 01:31:09 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinayakumar B deleted HDFS-10901:
---------------------------------


> QJM should not consider stale/failed txn available in any one of JNs.
> ---------------------------------------------------------------------
>
>                 Key: HDFS-10901
>                 URL: https://issues.apache.org/jira/browse/HDFS-10901
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>            Priority: Critical
>
> In one of our clusters we faced an issue where a NameNode restart failed due 
> to a stale/failed txn that was available in one JN but not in the others. 
> The scenario is:
> 1. Full cluster restart.
> 2. The startLogSegment txn (195222) was synced to only one JN and failed on 
> the others because they were shutting down. On the others, only the editlog 
> file was created and the txn was not synced, so after restart those files 
> were marked as empty.
> 3. Cluster restarted. During failover, this new log segment missed recovery 
> because this JN was slow to respond to the recovery call.
> 4. Recovery on the other JNs was successful, as there were no in-progress 
> files.
> 5. editlog.openForWrite() detected that txn 195222 was already available and 
> failed the failover.
> The same steps repeated until the stale editlog in that JN was manually 
> deleted.
> Since QJM is a quorum of JNs, a txn is considered successful if it is written 
> to a minimum quorum of JNs. Otherwise it is treated as failed.
> The same rule should be applied when selecting streams for reading.
> Stale/failed txns available on fewer than a quorum of JNs should not be 
> considered for reading.
> HDFS-10519 does similar work to consider 'durable' txns based on 
> 'committedTxnId'. But updating 'committedTxnId' on every flush with one more 
> RPC seems to be problematic for performance.
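
The quorum rule proposed in the description for read-side stream selection can be sketched roughly as follows. This is only an illustration, not the QJM implementation or any patch attached to this issue: the class `QuorumTxnFilter` and the method `highestQuorumTxId` are hypothetical names, and the sketch assumes each JN holds a contiguous run of txns up to the highest txid it reports.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of Hadoop: illustrates quorum-based read selection. */
final class QuorumTxnFilter {
    private QuorumTxnFilter() {}

    /**
     * Returns the highest txid held by at least a majority of the configured JNs,
     * or -1 if fewer than a majority of JNs responded.
     *
     * Assumption: each JN holds every txn up to the highest txid it reports.
     *
     * @param highestTxIdPerJn highest txid reported by each responding JN
     * @param totalJns         number of JNs configured for the quorum
     */
    static long highestQuorumTxId(List<Long> highestTxIdPerJn, int totalJns) {
        int majority = totalJns / 2 + 1;
        if (highestTxIdPerJn.size() < majority) {
            return -1L; // not enough responses to form a quorum
        }
        // Sort descending; the value at position (majority - 1) is the
        // largest txid that at least `majority` JNs have reached.
        List<Long> sorted = new ArrayList<>(highestTxIdPerJn);
        sorted.sort(Collections.reverseOrder());
        return sorted.get(majority - 1);
    }
}
```

With three JNs where one reports 195222 and the other two report 195221, the method returns 195221, so the txn that exists on only one JN would be left out of the readable range.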



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
