[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675513#comment-13675513
]
Elliott Clark edited comment on HBASE-7006 at 6/5/13 1:24 AM:
--------------------------------------------------------------
So I was talking about this patch with Himanshu, and I have a few concerns.
h3. Concern One
I'm pretty sure there is an issue with opening a region for edits before all
logs are finished replaying. To illustrate:
Say there's a table with a cf that has VERSIONS = 2.
All of the edits share the same rowkey.
# There's a log with: [ A (ts = 0), B (ts = 0) ]
# Replay the first half of the log.
# A user puts in C (ts = 0)
# Memstore has to flush
# A new HFile will be created with [ C, A ] and MaxSequenceId = C's seqid.
# Replay the rest of the Log.
# Flush
You'll get C, A when a Get is issued.
C, B is the expected result.
We have promised that edits will be ordered by timestamp, then sequence id.
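To make the ordering problem concrete, here's a toy simulation of the steps above (plain Python, not HBase code). It assumes that replay skips any edit whose seqid is at or below the MaxSequenceId of an already-flushed HFile — one plausible mechanism by which B gets silently dropped:

```python
# Toy model: cells are (value, ts, seqid) tuples. ASSUMPTION for this
# sketch: replay drops any edit whose seqid <= MaxSequenceId of an
# existing HFile, treating it as already persisted.
VERSIONS = 2

def flush(memstore, hfiles):
    # Newest first: timestamp desc, then seqid desc.
    cells = sorted(memstore, key=lambda c: (-c[1], -c[2]))
    hfiles.append((cells, max(c[2] for c in memstore)))
    memstore.clear()

def replay(edit, memstore, hfiles):
    max_flushed = max((seq for _, seq in hfiles), default=-1)
    if edit[2] > max_flushed:        # else: assumed already flushed
        memstore.append(edit)

def get(memstore, hfiles):
    cells = list(memstore)
    for hf_cells, _ in hfiles:
        cells.extend(hf_cells)
    cells.sort(key=lambda c: (-c[1], -c[2]))
    return [c[0] for c in cells[:VERSIONS]]

memstore, hfiles = [], []
A, B, C = ('A', 0, 1), ('B', 0, 2), ('C', 0, 3)

replay(A, memstore, hfiles)   # step 2: first half of the log
memstore.append(C)            # step 3: live put lands mid-replay
flush(memstore, hfiles)       # steps 4-5: HFile [C, A], MaxSequenceId = 3
replay(B, memstore, hfiles)   # step 6: B (seqid 2 <= 3) silently dropped
if memstore:
    flush(memstore, hfiles)   # step 7

print(get(memstore, hfiles))  # ['C', 'A'] -- expected ['C', 'B']
```

Had B made it in, the (ts desc, seqid desc) ordering would have returned C, B.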
h3. Concern Two
I think there's an issue with duplicating edits if there is a failure while
replaying. To illustrate:
Say there's a table with a column family that has VERSIONS = 3
# There's a log with edits whose timestamps are [ 10, 11, 12 ]
# Assign the region for replay
# Start replaying
# Fail after [ 10, 11 ]
# Now there are two logs: [ 10, 11, 12 ] and [ 10, 11 ]
# The master sees that replaying failed and that the RS hosting the region also went down.
# It will replay both logs.
# You will now have [ 12, 11, 11 ]
Any Get to that table will return [ 12, 11, 11 ].
[ 12, 11, 10 ] is expected.
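The double-replay arithmetic can be checked in a few lines (again a toy model, not HBase code — each edit is reduced to just its timestamp):

```python
# Toy model of Concern Two: a Get returns the newest VERSIONS cells.
VERSIONS = 3

def top_versions(cells):
    return sorted(cells, reverse=True)[:VERSIONS]

original_log = [10, 11, 12]   # the log being replayed
new_wal = original_log[:2]    # replayed edits re-appended to the new
                              # RS's WAL before it dies after [10, 11]
# The master replays BOTH logs into the region, with no dedup.
cells = original_log + new_wal
print(top_versions(cells))    # [12, 11, 11] -- expected [12, 11, 10]
```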
This is fixable if we:
# Don't replay WAL edits with isReplay = true
# and only remove old logs after all the memstores that the log was replayed
into have fully flushed.
This is hard since the memstores are all over and hard to keep track of.
or:
# Don't append the replayed edits to the WAL.
# While replaying, if the memstore needs to flush, flush the HFiles out to a
temp location.
# Move the HFiles in after all the edits are recovered.
This is hard as we'll have to meddle with how we flush the memstore.
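As a sketch of that second option (toy Python, hypothetical helper names — nothing from the actual patch): flushes during replay land in a temp dir, and the HFiles are only moved into the region once every edit is recovered, so a crash mid-replay leaves nothing behind that a full re-replay could duplicate:

```python
import os
import shutil
import tempfile

def recover_region(edits, region_dir, flush_size=2):
    """Replay edits, flushing to a temp dir; publish HFiles only at the end."""
    os.makedirs(region_dir, exist_ok=True)
    tmp = tempfile.mkdtemp(prefix="recovering-")
    memstore, file_no = [], 0

    def flush():
        nonlocal file_no
        if not memstore:
            return
        file_no += 1
        path = os.path.join(tmp, "hfile-%d" % file_no)
        with open(path, "w") as f:
            f.write("\n".join(str(ts) for ts in sorted(memstore)))
        memstore.clear()

    for ts in edits:
        memstore.append(ts)            # no WAL append for replayed edits
        if len(memstore) >= flush_size:
            flush()                    # flush lands in the temp location
    flush()

    # A crash before this point leaves region_dir untouched, so a full
    # re-replay of the original logs cannot duplicate any edits.
    for name in os.listdir(tmp):
        shutil.move(os.path.join(tmp, name), os.path.join(region_dir, name))
    shutil.rmtree(tmp)

region = tempfile.mkdtemp(prefix="region-")
recover_region([10, 11, 12], region)
print(sorted(os.listdir(region)))      # ['hfile-1', 'hfile-2']
```

The cost this glosses over is exactly the one noted above: the flush path has to learn about the temp location and the deferred move.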
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch,
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster, in that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.