[jira] [Commented] (HBASE-14004) [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin

stack (JIRA) Sat, 05 Dec 2015 18:26:50 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15043657#comment-15043657
 ]


stack commented on HBASE-14004:
-------------------------------

bq. ReplicationSource should only read WAL that is hsynced, to prevent the 
slave cluster from having data that the master loses.

This will require a big change in how replication works, but for the better: 
replication will be less resource-intensive because of fewer NN ops (on crash, 
do we ask the NN for the file length rather than ZK? If so, this is a task we 
have needed to do for a long time, i.e. stop keeping the replication position 
in ZK).
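
A rough sketch of the "only read what is hsynced" idea, as a toy model in 
Python rather than HBase code. The class SyncBoundedLog and its methods are 
made up for illustration; the only point is that the replication reader is 
capped at the last synced offset, however much has been appended.

```python
from typing import Tuple


class SyncBoundedLog:
    """Toy WAL: bytes can be appended, but only synced bytes are readable."""

    def __init__(self) -> None:
        self._data = bytearray()
        self.synced_length = 0  # highest offset known to be durable

    def append(self, payload: bytes) -> None:
        # Appended bytes are not yet visible to the replication reader.
        self._data += payload

    def sync(self) -> None:
        # Assumption: the sync succeeded, so everything appended is durable.
        self.synced_length = len(self._data)

    def read_for_replication(self, position: int) -> Tuple[bytes, int]:
        """Return (bytes, new_position) without reading past synced_length."""
        end = max(position, self.synced_length)
        return bytes(self._data[position:end]), end


log = SyncBoundedLog()
log.append(b"aaa")
log.sync()
log.append(b"bb")  # appended but not synced
first_read, pos = log.read_for_replication(0)
log.sync()
second_read, pos = log.read_for_replication(pos)
```

Here the first read sees only the three synced bytes; the two unsynced bytes 
show up only after the second sync.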

bq.  WAL reader can handle duplicate entries, in other words, make WAL logging 
idempotent. 

Might have to add some code to the reader to skip an entry it has seen before 
(this may be there already -- need to check).

bq. Fixing HBase writing path that we should retry logging WAL in a new file 
rather than rollback MemStore.

This is new but has been done before.

I'd be up for helping w/ the WAL changes, stuff like keeping appends around 
until the sync for them comes in (I've messed w/ this before), and would be 
interested in helping out on the replication log-length accounting, changing 
it so it no longer relies on a reopen after it gets EOF and on keeping the 
length in zk.

You fellas are fixing a few fundamental issues here. Sweet.

bq. we will still rollback MemStore since we can confirm that the WAL entries 
have not been written out. Right?

We could try rejiggering the order in which the memstore gets updated, putting 
it off till after the sync. The order we have now came about a long time ago, 
when the WAL was very different. We might be able to change the order, simplify 
the write pipeline, and not lose too much perf (or, perhaps, get more perf 
because we are doing healthier group commits).
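
To make the reordering concrete, here is a toy Python sketch (not HBase code; 
ToyWal, SyncFailed and put_sync_first are hypothetical names). The memstore is 
modelled as a plain dict and is only updated once the sync has returned, so a 
failed sync leaves nothing to roll back.

```python
class SyncFailed(Exception):
    """Raised by the toy WAL when a sync does not complete."""


class ToyWal:
    def __init__(self, fail_sync=False):
        self.entries = []          # edits appended to the log
        self.fail_sync = fail_sync

    def append(self, edit):
        self.entries.append(edit)

    def sync(self):
        if self.fail_sync:
            raise SyncFailed("sync did not complete")


def put_sync_first(wal, memstore, key, value):
    """Append and sync the WAL before touching the memstore."""
    wal.append((key, value))
    wal.sync()             # raises before the memstore is modified
    memstore[key] = value  # only reached after a successful sync


ok_memstore = {}
put_sync_first(ToyWal(), ok_memstore, "row1", "v1")

failed_memstore = {}
failing_wal = ToyWal(fail_sync=True)
try:
    put_sync_first(failing_wal, failed_memstore, "row1", "v1")
except SyncFailed:
    pass  # the memstore was never updated, so there is no rollback step
```

Note that in the failing case the edit is still sitting in the toy WAL, so 
the reordering alone does not stop a reader from seeing an unsynced entry.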

bq.  Maybe we could get current total write out bytes first(not acked length) 
and then call hsync, the acked length after calling hsync must be larger than 
this value so it is safe to use this value as "acked length". 

It would be good if hbase could calculate the written length itself. We could 
try it. What happens if we want to compress the WAL, and what about the crc 
tax? (I suppose the latter would be a constant, and for the former, maybe we 
could figure the length even with compression, if it is done per edit or per 
batch.)
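
A minimal sketch of the quoted "snapshot the written bytes, then hsync" 
suggestion, in Python as a toy model. LengthTrackingWriter is a made-up name, 
and the hsync is passed in as a callable that is assumed to raise on failure. 
It counts raw payload bytes only, so compression and crc overhead are ignored.

```python
class LengthTrackingWriter:
    def __init__(self):
        self.written = 0  # bytes handed to the output stream so far
        self.acked = 0    # bytes known to be covered by a successful sync

    def write(self, payload: bytes) -> None:
        self.written += len(payload)

    def sync(self, do_hsync) -> int:
        """Snapshot the written length, sync, then publish the snapshot."""
        candidate = self.written  # taken BEFORE the sync call
        do_hsync()                # assumed to raise if the sync fails
        # Everything up to candidate was written before the sync started,
        # so a successful sync covers at least that many bytes.
        self.acked = max(self.acked, candidate)
        return self.acked


writer = LengthTrackingWriter()
writer.write(b"abcd")
after_first_sync = writer.sync(lambda: None)


def hsync_with_concurrent_write():
    # Simulates another append landing while the sync is in flight.
    writer.write(b"xy")


after_second_sync = writer.sync(hsync_with_concurrent_write)
```

The second sync still reports 4, not 6: the bytes that arrived during the 
sync are not counted as acked until a later sync covers them, which is the 
conservative behaviour the suggestion relies on.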

bq.  I don't know if there is a sequence increment unique id for each wal log. 

There is such a sequenceid but it is by-region, not global.  Could we keep 
sequence id accounting by region? (We already do this elsewhere.)
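
One way the by-region accounting could be used to skip entries the reader has 
already seen, sketched in Python as a toy model. DuplicateSkippingReader and 
filter_entries are hypothetical, and the sketch assumes sequence ids only 
increase within a region and that a region's entries arrive in order.

```python
class DuplicateSkippingReader:
    def __init__(self):
        # region name -> highest sequence id already passed downstream
        self._last_seen = {}

    def accept(self, region, sequence_id):
        """Return True for a new entry, False for one seen before."""
        if sequence_id <= self._last_seen.get(region, -1):
            return False
        self._last_seen[region] = sequence_id
        return True


def filter_entries(entries):
    """Drop (region, sequence_id, payload) tuples that repeat an earlier id."""
    reader = DuplicateSkippingReader()
    return [e for e in entries if reader.accept(e[0], e[1])]


entries = [
    ("regionA", 1, "put x"),
    ("regionB", 1, "put y"),  # same id, different region: not a duplicate
    ("regionA", 2, "put z"),
    ("regionA", 2, "put z"),  # entry re-logged after a retry
    ("regionB", 1, "put y"),  # replayed entry
]
unique_entries = filter_entries(entries)
```

Because the ids are per region, the map has to be keyed by region; a single 
global high-water mark would wrongly drop the first regionB entry above.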



> [Replication] Inconsistency between Memstore and WAL may result in data in 
> remote cluster that is not in the origin
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14004
>                 URL: https://issues.apache.org/jira/browse/HBASE-14004
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: He Liangliang
>            Priority: Critical
>              Labels: replication, wal
>
> Looks like the current write path can cause inconsistency between 
> memstore/hfile and WAL, which causes the slave cluster to have more data 
> than the master cluster.
> The simplified write path looks like:
> 1. insert record into Memstore
> 2. write record to WAL
> 3. sync WAL
> 4. rollback Memstore if 3 fails
> It's possible that the HDFS sync RPC call fails, but the data is already 
> (perhaps partially) transported to the DNs, where it finally gets persisted. 
> As a result, the handler will roll back the Memstore and the later flushed 
> HFile will also skip this record.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
