[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647084#comment-13647084
]
Jeffrey Zhong commented on HBASE-7006:
--------------------------------------
[[email protected]] Good comments! Please see my responses below, in reverse
order of your feedback:
{quote}
Would there be any advantage NOT writing the WAL on replay and only when done,
then flush
{quote}
This is a very good question. Actually I was thinking of evaluating this as a
possible optimization after this feature is in. Currently the receiving RS does
a WAL sync for each replay batch. In the optimization scenario, we could replay
mutations with SKIP_WAL durability and flush at the end. The gain mostly depends
on the "sequential" write performance of WAL syncs. I think it's worth a try
here.
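To make the trade-off concrete, here is a rough toy model in Python. It is not HBase code: `ReplayTarget`, `replay_batch`, and `replay` are hypothetical names, and a "sync" or "flush" is only a counter. It assumes one WAL sync per replay batch in the current scheme, and no syncs plus a single flush at the end in the SKIP_WAL scheme.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayTarget:
    """Toy stand-in for a region on the receiving RS (hypothetical, not an HBase class)."""
    memstore: list = field(default_factory=list)
    wal_syncs: int = 0
    flushes: int = 0

    def replay_batch(self, batch, skip_wal):
        # Edits always land in the in-memory store.
        self.memstore.extend(batch)
        if not skip_wal:
            # Current behaviour: one WAL sync per replay batch.
            self.wal_syncs += 1

    def flush(self):
        # Persist everything buffered so far and empty the in-memory store.
        self.flushes += 1
        self.memstore.clear()


def replay(batches, skip_wal):
    target = ReplayTarget()
    for batch in batches:
        target.replay_batch(batch, skip_wal)
    if skip_wal:
        # Nothing was written to the WAL, so flush once when replay is done.
        target.flush()
    return target


batches = [[f"edit-{b}-{i}" for i in range(3)] for b in range(4)]
current = replay(batches, skip_wal=False)
optimized = replay(batches, skip_wal=True)
print(current.wal_syncs, current.flushes)      # 4 0
print(optimized.wal_syncs, optimized.flushes)  # 0 1
```

The model only counts operations; whether one flush is actually cheaper than many sequential WAL syncs is exactly what would need to be measured.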
{quote}
The two sequenceids are never related right? They are only applied to the logs
of the server who passed the particular sequenceid to the master?
{quote}
No, sequenceIds from different RSs are totally unrelated. Yes. Currently we
use the up-to-date flushed sequence id when we open the region, by looking at
all the store files as we do today.
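A minimal sketch of that idea in Python, under stated assumptions: each store file records the largest sequence id it contains, the flushed sequence id of a store is the maximum over its files, and edits at or below that id are skipped on replay. The function names and data layout are hypothetical and only illustrate the comparison; they are not the HBase implementation.

```python
def flushed_seq_ids(store_files):
    """Map each store to the largest sequence id found in its store files.

    store_files: {store_name: [max sequence id of each store file]}
    A store with no files gets -1, meaning nothing has been flushed yet.
    """
    return {store: max(ids, default=-1) for store, ids in store_files.items()}


def edits_to_replay(edits, flushed):
    """Keep only edits newer than what the target store has already flushed.

    edits: iterable of (sequence_id, store_name, payload) tuples
    """
    return [e for e in edits if e[0] > flushed.get(e[1], -1)]


store_files = {"cf1": [10, 25], "cf2": [18]}
edits = [(17, "cf1", "a"), (20, "cf2", "b"), (26, "cf1", "c")]

flushed = flushed_seq_ids(store_files)
pending = edits_to_replay(edits, flushed)
print(flushed)  # {'cf1': 25, 'cf2': 18}
print(pending)  # [(20, 'cf2', 'b'), (26, 'cf1', 'c')]
```

All sequence ids compared here belong to the same region; ids handed out by different RSs are never compared with each other.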
{quote}
+ "...check if all WALs of a failed region server have been successfully
replayed." How is this done?
{quote}
We rely on the fact that when log splitting for a failed RS is done, all of its
WAL files have been recovered, so we don't really do the check.
{quote}
+ How will a crashed regionserver "...... and appending itself into the list
of...": i.e. append itself to list of crashed servers (am I reading this wrong)?
{quote}
The master's SSH (ServerShutdownHandler) does the work, not the dead RS.
{quote}
+ Is your assumption about out-of-order replay of edits new to this feature?
{quote}
Yes. I'll amend the design doc based on your other comments. Thanks.
> [MTTR] Study distributed log splitting to see how we can make it faster
> -----------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira