[
https://issues.apache.org/jira/browse/HBASE-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15043657#comment-15043657
]
stack commented on HBASE-14004:
-------------------------------
bq. ReplicationSource should only read WAL that is hsynced, to prevent the slave
cluster having data that the master loses.
This will require a big change in how replication works, but for the better:
replication will be less resource-intensive because there are fewer NN ops (on
crash, do we ask the NN for the file length rather than ZK? If so, this is a
task we have needed to do for a long time, i.e. stop keeping the replication
position in zk).
bq. WAL reader can handle duplicate entries, in other words, make WAL logging
idempotent.
Might have to add some code to reader to skip an entry it has seen before (this
may be there already -- need to check).
bq. Fixing HBase writing path that we should retry logging WAL in a new file
rather than rollback MemStore.
This is new but has been done before.
I'd be up for helping w/ WAL changes, stuff like keeping around appends until
the sync for them comes in (I've messed w/ this before). I would also be
interested in helping out on replication log length accounting, changing it so
it no longer relies on reopening the file after it gets EOF and on keeping the
length in zk.
You fellas are fixing a few fundamental issues here. Sweet.
bq. we will still rollback MemStore since we can confirm that the WAL entries
have not been written out. Right?
We could try rejiggering the order in which memstore gets updated, putting it
off till after the sync. The order we have now came about a long time ago when
WAL was very different. We might be able to change the order, simplify the
write pipeline, and not lose too much perf (or, perhaps, get more perf because
we are doing healthier group commits).
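
A rough sketch of that reordering, to make the idea concrete. Wal and
SyncFirstWriter are made-up stand-ins for illustration, not HBase classes, and
the memstore here is just a list. It only shows the ordering: the memstore is
touched after the sync returns, so there is no rollback path. It does not by
itself deal with an edit whose sync failed but which still reached the DNs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the WAL; not an HBase interface.
interface Wal {
    void append(String edit);
    void sync() throws IOException;
}

// Hypothetical writer that updates the memstore only after a successful sync.
class SyncFirstWriter {
    private final Wal wal;
    private final List<String> memstore = new ArrayList<>();

    SyncFirstWriter(Wal wal) {
        this.wal = wal;
    }

    void write(String edit) throws IOException {
        wal.append(edit);
        wal.sync();         // throws on failure, leaving the memstore untouched
        memstore.add(edit); // reached only when the sync succeeded
    }

    List<String> memstore() {
        return memstore;
    }
}
```

A failed sync surfaces to the caller as an IOException; what to do next (for
example retrying the append in a new WAL file) is outside this sketch.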
bq. Maybe we could get current total written bytes first (not acked length)
and then call hsync; the acked length after calling hsync must be larger than
this value, so it is safe to use this value as "acked length".
It would be good if hbase could calculate the written length itself. We could
try it. What happens if we want to compress WAL, or what about the crc tax? (I
suppose the latter would be a constant. For the former, maybe we could still
figure the length, even with compression, if it is done per edit or per
batch.)
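
Something like the following is what I have in mind for the quoted idea:
snapshot the written-byte count, call hsync, and publish the snapshot only if
the hsync returns. SyncableOut and LengthTrackingWal are hypothetical names for
illustration, not HBase or HDFS classes. The count is of bytes as handed to the
stream, so it assumes no compression and ignores any crc overhead.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the underlying output stream.
interface SyncableOut {
    void write(byte[] b) throws IOException;
    void hsync() throws IOException;
}

// Hypothetical WAL wrapper that tracks a length known to be covered by an hsync.
class LengthTrackingWal {
    private final SyncableOut out;
    private long written; // bytes handed to the stream so far
    private final AtomicLong safeLength = new AtomicLong(); // lower bound on synced length

    LengthTrackingWal(SyncableOut out) {
        this.out = out;
    }

    synchronized void append(byte[] edit) throws IOException {
        out.write(edit);
        written += edit.length;
    }

    synchronized void sync() throws IOException {
        long before = written;  // snapshot taken before the hsync call
        out.hsync();            // throws on failure, so safeLength is not advanced
        safeLength.set(before); // everything up to the snapshot is now synced
    }

    // A reader (e.g. replication) would read no further than this.
    long safeLength() {
        return safeLength.get();
    }
}
```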
bq. I don't know if there is a sequence increment unique id for each wal log.
There is such a sequenceid but it is by-region, not global. Could we keep
sequence id accounts by region? (We already do this elsewhere).
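
For illustration, per-region accounting that lets a reader skip an entry it has
already seen could look roughly like this. DuplicateFilter is a made-up class,
not existing HBase code, and it assumes sequence ids only increase within a
region in the order the reader sees them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the highest sequence id seen per region.
class DuplicateFilter {
    private final Map<String, Long> lastSeen = new HashMap<>();

    // Returns true if the entry is new for its region and should be applied,
    // false if an entry with the same or a higher sequence id was already seen.
    boolean accept(String region, long sequenceId) {
        Long prev = lastSeen.get(region);
        if (prev != null && sequenceId <= prev) {
            return false;
        }
        lastSeen.put(region, sequenceId);
        return true;
    }
}
```

Because the ids are by-region, the same id showing up under two different
regions is not treated as a duplicate.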
> [Replication] Inconsistency between Memstore and WAL may result in data in
> remote cluster that is not in the origin
> -------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-14004
> URL: https://issues.apache.org/jira/browse/HBASE-14004
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: He Liangliang
> Priority: Critical
> Labels: replication, wal
>
> Looks like the current write path can cause inconsistency between
> memstore/hfile and WAL, which causes the slave cluster to have more data than
> the master cluster.
> The simplified write path looks like:
> 1. insert record into Memstore
> 2. write record to WAL
> 3. sync WAL
> 4. rollback Memstore if 3 fails
> It's possible that the HDFS sync RPC call fails, but the data has already
> been (maybe partially) transported to the DNs, where it finally gets
> persisted. As a result, the handler will roll back the Memstore and the later
> flushed HFile will also skip this record.