[
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678538#comment-13678538
]
Sergey Shelukhin commented on HBASE-8701:
-----------------------------------------
We also had a discussion here about both these and other schemes. We are
in consensus with the above wrt "streaming recovered edits", so let me provide
a short brain dump of the other approach :)
On worker servers, we could rewrite HLogs into HFiles (in the simplest case,
each worker writes one separate HFile per region, but bear with me... a solution
for too many HFile-s is described below). There's little new code involved -
just use the WAL reader, memstore, and file writer. We might want to make slight
changes to the memstore for this: given that it will be write-only and
single-threaded, ConcurrentSkipList, for example, is not the best data structure.
That should be trivial, as long as the new data structure supports the same
interface...
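To make the "little new code" claim concrete, here's a rough sketch of the
per-worker replay loop against 0.95-era reader APIs (names approximate;
fs, hlogPath, conf are assumed to be in scope - a sketch, not a patch):
{code:java}
// one recovery memstore per region seen in this worker's HLog(s)
Map<byte[], TreeMap<KeyValue, KeyValue>> memstores =
    new TreeMap<byte[], TreeMap<KeyValue, KeyValue>>(Bytes.BYTES_COMPARATOR);

HLog.Reader reader = HLogFactory.createReader(fs, hlogPath, conf);
HLog.Entry entry;
while ((entry = reader.next()) != null) {
  byte[] region = entry.getKey().getEncodedRegionName();
  TreeMap<KeyValue, KeyValue> memstore = memstores.get(region);
  if (memstore == null) {
    // write-only and single-threaded, so a plain TreeMap can replace the
    // ConcurrentSkipListMap the regular memstore uses
    memstore = new TreeMap<KeyValue, KeyValue>(KeyValue.COMPARATOR);
    memstores.put(region, memstore);
  }
  for (KeyValue kv : entry.getEdit().getKeyValues()) {
    // within one HLog (or contiguous HLogs) a later edit of the same
    // key + version simply overwrites the earlier one
    memstore.put(kv, kv);
  }
}
reader.close();
{code}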
Each worker separately flushes all these memstores and commits the new HFile-s
into the target regions (seqnums don't require any special handling because they
come from one HLog, or several contiguous ones - see below). It can then nuke the
processed HLogs; i.e. the recovery is incremental, which is good.
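The flush/commit side could be equally simple. Again a sketch: recoveryDir,
cacheConf, and a per-region maxSeqIds map tracked from HLogKey.getLogSeqNum()
during replay are assumed, and the writer API is approximate:
{code:java}
for (Map.Entry<byte[], TreeMap<KeyValue, KeyValue>> e : memstores.entrySet()) {
  byte[] region = e.getKey();
  Path hfilePath = new Path(recoveryDir, Bytes.toStringBinary(region));
  HFile.Writer writer = HFile.getWriterFactory(conf, cacheConf)
      .withPath(fs, hfilePath).create();
  for (KeyValue kv : e.getValue().values()) {
    writer.append(kv);  // already sorted by KeyValue.COMPARATOR
  }
  // stamp the file with the max original seqnum so seqnum accounting works
  // exactly as it does for a normal flush
  writer.appendFileInfo(StoreFile.MAX_SEQ_ID_KEY, Bytes.toBytes(maxSeqIds.get(region)));
  writer.close();
  // commit = move the HFile into the target region's store dir; once all
  // HLogs assigned to this worker are committed, they can be deleted
}
{code}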
In the simple case, each worker processes one HLog file; if there are more HLogs
than servers doing the recovery, assign several contiguous HLog-s per server
(e.g. 30 HLogs replayed on 9 servers => 3-4 per server).
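The contiguous assignment itself is trivial, something like the following
(assign() and the workers/hlogs lists are hypothetical):
{code:java}
// 30 HLogs on 9 servers => 4,4,4,3,3,3,3,3,3 contiguous logs per worker;
// contiguity keeps each worker's seqnums monotonic, so the committed
// HFiles need no special seqnum handling
int perWorker = hlogs.size() / workers.size();
int extra = hlogs.size() % workers.size();
int start = 0;
for (int w = 0; w < workers.size(); w++) {
  int count = perWorker + (w < extra ? 1 : 0);
  assign(workers.get(w), hlogs.subList(start, start + count));
  start += count;
}
{code}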
After all workers complete, the region is ready to accept reads and writes before
any merge; the merge of the HFile-s will be done via compaction of all these
small files. We can also open the region for writing during recovery with very
little special handling: as soon as all HFiles are added, the store will just
have to load them. We may have to disallow compactions during recovery, but that
is true for every scheme; btw, we should handle that in this JIRA anyway.
Moreover, as we rewrite the HLog as HFile(s), we can put one replica of the new
HFile onto the target server, so that the "replay to other server" RPC cost is
baked into the normal HDFS replication cost.
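HDFS already lets the writer hint where replicas should land. Roughly (the
favored-nodes create() overload is from Hadoop 2.1-era DistributedFileSystem,
the exact signature may differ; targetRsHost/targetDnPort assumed known):
{code:java}
InetSocketAddress[] favoredNodes = new InetSocketAddress[] {
    new InetSocketAddress(targetRsHost, targetDnPort)  // DN co-located with target RS
};
FSDataOutputStream out = ((DistributedFileSystem) fs).create(
    hfilePath,
    FsPermission.getFileDefault(),
    true,                                     // overwrite
    conf.getInt("io.file.buffer.size", 4096),
    fs.getDefaultReplication(hfilePath),
    fs.getDefaultBlockSize(hfilePath),
    null,                                     // no progress callback
    favoredNodes);
// writing the HFile through "out" puts one replica on the target server,
// so the "replay to other server" cost rides on normal HDFS replication
{code}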
Now, the biggest problem with this scheme is how many HFile-s it can generate -
up to (# regions) * min(# servers, # of WALs). To solve that, it is possible to
create an indexed MultiStoreFile, expanding on the existing HalfStoreFile, and
share it between multiple regions. That way, the /total/ number of HFile-s
produced is bounded by min(# servers, # of WALs). This is slightly more involved,
but again, the region can be immediately opened for normal operation and the
files split out via compaction, just like for region splits.
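None of this exists today, but structurally the index could be as simple as
the following (purely illustrative; no such class is in HBase):
{code:java}
// Hypothetical MultiStoreFile index: one shared HFile holds sorted edits for
// many regions; each region opens only its slice, the way a daughter region
// reads half of the parent file through HalfStoreFile/Reference after a split.
class MultiStoreFileIndex {
  // encoded region name -> [firstKey, lastKey] range of that region's
  // edits inside the shared HFile
  NavigableMap<byte[], Pair<byte[], byte[]>> regionSlices =
      new TreeMap<byte[], Pair<byte[], byte[]>>(Bytes.BYTES_COMPARATOR);
}
{code}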
Locality is lost for some regions if we do that; however, if the worker
rewriting the HLog to HFiles knows some metadata about the WAL, its size, etc.,
it can make a decision that trades off locality vs. the number of files (e.g.
produce one MultiStoreFile per multiple regions on 3 servers and place replicas
there, or whatever).
I like this scheme in particular because it requires very little special new
logic for correctness, and little new code/new file formats/etc. Enis only likes
this scheme with one MultiStoreFile per WAL, as far as I understand (correct me
if I'm wrong ;))
> distributedLogReplay needs to apply wal edits in the receiving order of those
> edits
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8701
> URL: https://issues.apache.org/jira/browse/HBASE-8701
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.95.2
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts
> of the same key + version (timestamp). After replay, the value of the key is
> nondeterministic
> h5. The original concern raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new HFile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situations, e.g. Put(key, t=T) in WAL1 and
> Put(key, t=T) in WAL2
> h5. Below is the option I'd like to use:
> a) During replay, we pass the wal file name hash in each replay batch and the
> original wal sequence id of each edit to the receiving RS
> b) Once a wal is recovered, the replaying RS sends a signal to the receiving
> RS so the receiving RS can flush
> c) In the receiving RS, edits from different WAL files of a region go to
> different memstores. (We can visualize this at a high level as sending changes
> to a new region object named (origin region name + wal name hash) and using
> the original sequence ids.)
> d) Writes from normal traffic (writes are allowed during recovery) go into
> normal memstores as they do today and flush normally with new sequence ids.
> h5. The other alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass the original wal sequence ids
> c) hold flushes till all wals of a recovering region are replayed. The
> memstore should be able to hold the edits because we only recover unflushed
> wal edits. For edits with the same key + version, whichever has the larger
> sequence id wins.
> Option two
> a) during replay, we pass the original wal sequence ids
> b) for each wal edit, we store the edit's original sequence id along with
> its key
> c) during scanning, we use the original sequence id if it's present, otherwise
> the store file sequence id
> d) compaction can just keep the put with the max sequence id
> Please let me know if you have better ideas.
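For illustration, a minimal sketch of option (c) above: the receiving RS keys
recovery memstores by (region, wal name hash), so each source WAL replays into
its own memstore with the original sequence ids, while live writes go to the
normal memstore (all names hypothetical, not actual HBase classes):
{code:java}
Map<String, SortedMap<KeyValue, KeyValue>> recoveryMemstores =
    new HashMap<String, SortedMap<KeyValue, KeyValue>>();

void replayBatch(String encodedRegion, int walNameHash, List<KeyValue> edits) {
  // one "new region object" per (region, wal) pair, per option (c)
  String key = encodedRegion + "#" + walNameHash;
  SortedMap<KeyValue, KeyValue> ms = recoveryMemstores.get(key);
  if (ms == null) {
    ms = new TreeMap<KeyValue, KeyValue>(KeyValue.COMPARATOR);
    recoveryMemstores.put(key, ms);
  }
  for (KeyValue kv : edits) {
    ms.put(kv, kv);  // edits carry their original wal sequence ids
  }
}

void onWalRecovered(String encodedRegion, int walNameHash) {
  // per option (b): the "wal recovered" signal triggers a flush of just that
  // WAL's memstore; its HFile is stamped with that WAL's max original seqid
  flushToHFile(recoveryMemstores.remove(encodedRegion + "#" + walNameHash));
}
{code}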