[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677582#comment-13677582
]
stack commented on HBASE-7006:
------------------------------
bq. We have to use skip wal option here.
I was hoping to avoid our doing skip-WAL for the reasons argued above: replaying
edits w/ skip-WAL enabled introduces more states and will complicate replay. But
old edits coming into the new server and getting new seqids will itself make for
some new interesting states (if the server we are replaying into crashes before
all is flushed, it will have in its WALs edits where the seqid for 'B' is > that
for 'C', so on its recovery, 'B' will come out when we want 'C', the last edit
inserted at that coordinate).
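The shadowing hazard above can be sketched in a few lines. This is a toy model
with made-up names, not HBase classes; it only mimics the rule that, at one
coordinate, the cell with the highest seqid wins a read:

```java
import java.util.*;

// Minimal sketch (hypothetical names, not HBase classes) of why replaying
// old edits under fresh seqids can shadow newer data: at a given coordinate,
// the cell with the highest seqid wins the read.
public class SeqIdShadowing {

    static final class Cell {
        final String value;
        final long seqId;
        Cell(String value, long seqId) { this.value = value; this.seqId = seqId; }
    }

    // Resolve a read the way a memstore would: highest seqid wins.
    static String read(List<Cell> cells) {
        Cell best = null;
        for (Cell c : cells) {
            if (best == null || c.seqId > best.seqId) best = c;
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        // Original history: 'B' written at seqid 5, then 'C' at seqid 9.
        System.out.println(read(List.of(new Cell("B", 5), new Cell("C", 9)))); // C

        // Replay with fresh seqids: the old 'B' edit is re-inserted at the
        // receiving server's current seqid (say 12). If that server crashes
        // and recovers from its own WAL, 'B' now shadows the newer 'C'.
        System.out.println(read(List.of(new Cell("C", 9), new Cell("B", 12)))); // B
    }
}
```

Keeping the original seqids on replayed edits would avoid the inversion, which
is what the flush discussion below turns on.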
So, if no WAL, what happens when we need to flush a memstore or a background
replay memstore (the one-memstore-per-region we discuss above)? What seqid
would we write out into the hfile if we have to flush memory? I suppose if this
replay backing memstore kept the old WAL seqids, it would be legit to use
those. The flushed file would sort properly with an old seqid (but then this
would be a different kind of flush, one where you dictate the file's seqid
rather than take what is current in the server -- that will be intrusive to
change). We'd have to use the old ids in case we had to flush midway through a
WAL (I suppose we said this already above).
But thinking more on the per-WAL replay memstore, there are kinks to figure out
(apart from the one above where we want to do a flush w/ a seqid that is not
the server's current max seqid). Hfiles contain sorted kvs, but the edits in
the old WAL are not in sort order, so if we sort the edits in order to flush
the hfile, the seqids inside will be out of order. Do we take the highest seqid
in the hfile as the hfile's seqid? This would be different from how we usually
write hfiles. There could be issues in here.
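A small sketch of that flush, again with made-up names rather than HBase
classes: buffered WAL edits arrive in commit order, the "hfile" needs key
order, and the file gets stamped with the max seqid of what it contains --
leaving the per-edit seqids out of order inside the file:

```java
import java.util.*;

// Minimal sketch (hypothetical names, not HBase classes) of flushing a
// per-WAL replay memstore: edits are buffered in WAL commit order, the hfile
// needs key order, and the file's seqid is the max seqid of its edits.
public class ReplayMemstoreFlush {

    static final class Edit {
        final String key;
        final long seqId;
        Edit(String key, long seqId) { this.key = key; this.seqId = seqId; }
    }

    static final class HFileSketch {
        final List<Edit> sortedEdits; // key-sorted; seqids inside may be out of order
        final long maxSeqId;          // the seqid stamped on the whole file
        HFileSketch(List<Edit> sortedEdits, long maxSeqId) {
            this.sortedEdits = sortedEdits;
            this.maxSeqId = maxSeqId;
        }
    }

    // "Flush": sort the buffered WAL edits by key and stamp the file with
    // the highest seqid seen among them.
    static HFileSketch flush(List<Edit> walOrder) {
        List<Edit> sorted = new ArrayList<>(walOrder);
        sorted.sort(Comparator.comparing(e -> e.key));
        long max = -1;
        for (Edit e : walOrder) max = Math.max(max, e.seqId);
        return new HFileSketch(sorted, max);
    }

    public static void main(String[] args) {
        // Edits in the order they appear in the old WAL (commit order).
        List<Edit> wal = List.of(new Edit("row-c", 3), new Edit("row-a", 4), new Edit("row-b", 5));
        HFileSketch f = flush(wal);
        for (Edit e : f.sortedEdits) System.out.println(e.key + " seq=" + e.seqId);
        System.out.println("hfile seqid=" + f.maxSeqId);
        // row-a carries seq=4, row-b seq=5, row-c seq=3 -- not in seqid order,
        // while the file as a whole sorts against other hfiles at seqid 5.
    }
}
```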
bq. Another question is, initially we had one recovered.edits file per WAL; now
we planning one HFile per WAL.
This would happen only if we had to flush. We'd keep a per-WAL replay memstore;
only if we had to flush would the file be written out -- that would be an
extreme case.
bq. I'm planning to use a config to control the new behavior because the issue
we're trying to address isn't a common usage scenario.
bq. I'd vote we instead have a config that would disallow writes during
recovery
+1 on disabling writes during recovery for now. It is this that is adding the
complication. If we disable writes during recovery, we can turn on distributed
log replay now as the default and enjoy the speedup it brings over the current
log splitting. We can work on taking writes during recovery later, over in the
new issue.
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch,
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.