[ https://issues.apache.org/jira/browse/HBASE-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236993#comment-16236993 ]

Ted Yu commented on HBASE-12125:
--------------------------------

From the QA run:
{code}
[ERROR] testFixMissingReplicationWAL(org.apache.hadoop.hbase.util.TestHBaseFsckReplication)  Time elapsed: 54.85 s  <<< ERROR!
java.lang.NullPointerException
        at org.apache.hadoop.hbase.util.TestHBaseFsckReplication.testFixMissingReplicationWAL(TestHBaseFsckReplication.java:184)
{code}
which was almost identical to the error I reported yesterday.

> Add Hbck option to check and fix WAL's from replication queue
> -------------------------------------------------------------
>
>                 Key: HBASE-12125
>                 URL: https://issues.apache.org/jira/browse/HBASE-12125
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 3.0.0
>            Reporter: Virag Kothari
>            Assignee: Vincent Poon
>            Priority: Major
>         Attachments: HBASE-12125.v1.master.patch, HBASE-12125.v2.master.patch, HBASE-12125.v3.master.patch
>
>
> The replication source discards the WAL file in many cases when it
> encounters an exception while reading it. This can cause data loss,
> and the underlying reason for the failed read remains hidden. The
> replication source should dump the current WAL and move on to the next
> one only in certain scenarios.
> This JIRA aims to add an hbck option that checks the WAL files in the
> replication queues for inconsistencies, along with an option to fix them.
> The fix can be to remove the file from the replication queue in ZooKeeper
> and from the memory of the replication source manager and the replication
> sources. A region server endpoint call from the hbck client to the region
> server can be used to achieve this.
> Hbck can be configured with the following options:
> -softCheckReplicationWAL: Tries to open only the oldest WAL in the replication queue (the WAL currently being read by the replication source). If a position is associated with it, also seeks to that position and reads an entry from there.
> -hardCheckReplicationWAL: Checks all WAL paths in the replication queues by reading them completely to make sure they are OK.
> -fixMissingReplicationWAL: Removes WALs that are not present on HDFS from the replication queues.
> -fixCorruptedReplicationWAL: Removes WALs that are corrupted (based on the findings from softCheck/hardCheck) from the replication queues. The WALs are also moved to a quarantine dir.
> -rollAndFixCorruptedReplicationWAL: If the current WAL is corrupted, it is first rolled over and then handled in the same way as with -fixCorruptedReplicationWAL.
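The check and fix options described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch attached to this issue and not the HBase API: each WAL is modeled in memory as a list of entries (a null entry stands for an unreadable record), a path absent from the `fs` map stands for a WAL missing on HDFS, and the recorded position is an entry index rather than a byte offset. The names `ReplicationWalCheckSketch`, `QueuedWal`, `softCheck`, `hardCheck` and `fix` are hypothetical. It needs Java 16+ for `record`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of the proposed hbck checks; NOT the HBase API.
// A WAL is a list of entries; a null entry models an unreadable (corrupted) record.
// A path absent from the "fs" map models a WAL that is missing on HDFS.
final class ReplicationWalCheckSketch {

  enum Status { OK, MISSING, CORRUPTED }

  // One replication queue element: WAL path plus an optional read position.
  // Simplification: the position is an entry index, not a byte offset.
  record QueuedWal(String path, Integer position) {}

  // Reads entries [from, to) and reports CORRUPTED on the first null entry.
  private static Status readRange(List<String> wal, int from, int to) {
    for (int i = from; i < to; i++) {
      if (wal.get(i) == null) return Status.CORRUPTED;
    }
    return Status.OK;
  }

  // Soft check: only the oldest WAL; if a position is recorded, seek there and read one entry.
  static Status softCheck(List<QueuedWal> queue, Map<String, List<String>> fs) {
    if (queue.isEmpty()) return Status.OK;
    QueuedWal oldest = queue.get(0);
    List<String> wal = fs.get(oldest.path());
    if (wal == null) return Status.MISSING;
    Integer pos = oldest.position();
    if (pos == null || pos == wal.size()) return Status.OK; // nothing to read at pos
    if (pos < 0 || pos > wal.size()) return Status.CORRUPTED; // cannot seek there
    return readRange(wal, pos, pos + 1);
  }

  // Hard check: every WAL in the queue, read completely.
  static Map<String, Status> hardCheck(List<QueuedWal> queue, Map<String, List<String>> fs) {
    Map<String, Status> result = new LinkedHashMap<>();
    for (QueuedWal q : queue) {
      List<String> wal = fs.get(q.path());
      result.put(q.path(), wal == null ? Status.MISSING : readRange(wal, 0, wal.size()));
    }
    return result;
  }

  // Fix: drop queue entries whose finding equals toRemove; corrupted WALs are
  // also recorded in the quarantine set (standing in for the quarantine dir).
  static List<QueuedWal> fix(List<QueuedWal> queue, Map<String, Status> findings,
                             Status toRemove, Set<String> quarantine) {
    List<QueuedWal> kept = new ArrayList<>();
    for (QueuedWal q : queue) {
      if (findings.get(q.path()) == toRemove) {
        if (toRemove == Status.CORRUPTED) quarantine.add(q.path());
      } else {
        kept.add(q);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, List<String>> fs = new LinkedHashMap<>();
    fs.put("wal.1", Arrays.asList("e0", null, "e2")); // entry 1 is unreadable
    fs.put("wal.3", List.of("e0", "e1"));             // wal.2 is missing on "HDFS"
    List<QueuedWal> queue = List.of(new QueuedWal("wal.1", 2),
        new QueuedWal("wal.2", null), new QueuedWal("wal.3", null));

    System.out.println(softCheck(queue, fs)); // OK
    Map<String, Status> hard = hardCheck(queue, fs);
    System.out.println(hard); // {wal.1=CORRUPTED, wal.2=MISSING, wal.3=OK}

    Set<String> quarantine = new TreeSet<>();
    List<QueuedWal> afterMissing = fix(queue, hard, Status.MISSING, quarantine);
    List<QueuedWal> afterCorrupted = fix(afterMissing, hard, Status.CORRUPTED, quarantine);
    System.out.println(afterCorrupted.size() + " " + quarantine); // 1 [wal.1]
  }
}
```

In the sample run the soft check passes because it reads only the entry at the recorded position of the oldest WAL, while the hard check reads every entry and finds the corrupted record; that is the cost/coverage trade-off between -softCheckReplicationWAL and -hardCheckReplicationWAL. Removing entries from the ZooKeeper queue and from the in-memory state of the replication sources, and rolling the current WAL, are not modeled here.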



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
