Repository: hbase Updated Branches: refs/heads/master 1324a3cb1 -> 1b9242259
HBASE-11280 Document distributed log replay and distributed log splitting (Misty Stanley-Jones) Project: http://git-wip-us.apache.org/repos/asf/hbase/repo Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/1b924225 Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/1b924225 Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/1b924225 Branch: refs/heads/master Commit: 1b9242259704d5c6a653a7f56cb1121b981ae3da Parents: 1324a3c Author: Michael Stack <[email protected]> Authored: Tue Jun 17 18:18:00 2014 -0500 Committer: Michael Stack <[email protected]> Committed: Tue Jun 17 18:18:00 2014 -0500 ---------------------------------------------------------------------- src/main/docbkx/book.xml | 384 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 351 insertions(+), 33 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/hbase/blob/1b924225/src/main/docbkx/book.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml index 3e22dfd..1af243c 100644 --- a/src/main/docbkx/book.xml +++ b/src/main/docbkx/book.xml @@ -2200,16 +2200,21 @@ rs.close(); <section xml:id="purpose.wal"> <title>Purpose</title> - - <para>Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, - and then to the <xref - linkend="store.memstore" /> for the affected <xref - linkend="store" />. This ensures that HBase has durable writes. Without WAL, there is - the possibility of data loss in the case of a RegionServer failure before each MemStore - is flushed and new StoreFiles are written. <link + <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all changes to data in + HBase, to file-based storage. Under normal operations, the WAL is not needed because + data changes move from the MemStore to StoreFiles. 
However, if a RegionServer crashes or + becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to + the data can be replayed. If writing to the WAL fails, the entire operation to modify the + data fails.</para> + <para>HBase uses an implementation of the <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/wal/HLog.html">HLog</link> - is the HBase WAL implementation, and there is one HLog instance per RegionServer. </para> - <para>The WAL is in HDFS in <filename>/hbase/.logs/</filename> with subdirectories per + interface for the WAL. + Usually, there is only one instance of a WAL per RegionServer. The RegionServer records Puts and Deletes to + it, before recording them to the <xref + linkend="store.memstore" /> for the affected <xref + linkend="store" />.</para> + <para>The WAL resides in HDFS in the <filename>/hbase/WALs/</filename> directory (prior to + HBase 0.94, they were stored in <filename>/hbase/.logs/</filename>), with subdirectories per region.</para> <para> For more general information about the concept of write ahead logs, see the Wikipedia <link @@ -2226,39 +2231,352 @@ rs.close(); xml:id="wal_splitting"> <title>WAL Splitting</title> + <para>A RegionServer serves many regions. All of the regions in a region server share the + same active WAL file. Each edit in the WAL file includes information about which region + it belongs to. When a region is opened, the edits in the WAL file which belong to that + region need to be replayed. Therefore, edits in the WAL file must be grouped by region + so that particular sets can be replayed to regenerate the data in a particular region. + The process of grouping the WAL edits by region is called <firstterm>log + splitting</firstterm>. It is a critical process for recovering data if a region server + fails.</para> + <para>Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler + as a region server shuts down. 
To guarantee consistency, all WAL edits for a region must be recovered and
+ replayed before the region can become available again. As a result, regions affected by
+ log splitting are unavailable until the process completes.</para>
+ <procedure xml:id="log.splitting.step.by.step">
+ <title>Log Splitting, Step by Step</title>
+ <step>
+ <title>The <filename>/hbase/WALs/<host>,<port>,<startcode></filename> directory is renamed.</title>
+ <para>Renaming the directory is important because a RegionServer may still be up and
+ accepting requests even if the HMaster thinks it is down. If the RegionServer does
+ not respond immediately and does not heartbeat its ZooKeeper session, the HMaster
+ may interpret this as a RegionServer failure. Renaming the logs directory ensures
+ that existing, valid WAL files which are still in use by an active but busy
+ RegionServer are not written to by accident.</para>
+ <para>The new directory is named according to the following pattern:</para>
+ <screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
+ <para>An example of such a renamed directory might look like the following:</para>
+ <screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
+ </step>
+ <step>
+ <title>Each log file is split, one at a time.</title>
+ <para>The log splitter reads the log file one edit entry at a time and puts each edit
+ entry into the buffer corresponding to the edit's region. At the same time, the
+ splitter starts several writer threads. Writer threads pick up a corresponding
+ buffer and write the edit entries in the buffer to a temporary recovered edit
+ file. The temporary edit file is stored to disk with the following naming pattern:</para>
+ <screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
+ <para>This file is used to store all the edits in the WAL log for this region.
After + log splitting completes, the <filename>.temp</filename> file is renamed to the + sequence ID of the first log written to the file.</para> + <para>To determine whether all edits have been written, the sequence ID is compared to + the sequence of the last edit that was written to the HFile. If the sequence of the + last edit is greater than or equal to the sequence ID included in the file name, it + is clear that all writes from the edit file have been completed.</para> + </step> + <step> + <title>After log splitting is complete, each affected region is assigned to a + RegionServer.</title> + <para> When the region is opened, the <filename>recovered.edits</filename> folder is checked for recovered + edits files. If any such files are present, they are replayed by reading the edits + and saving them to the MemStore. After all edit files are replayed, the contents of + the MemStore are written to disk (HFile) and the edit files are deleted.</para> + </step> + </procedure> + <section> - <title>How edits are recovered from a crashed RegionServer</title> - <para>When a RegionServer crashes, it will lose its ephemeral lease in - ZooKeeper...TODO</para> - </section> - <section> - <title><varname>hbase.hlog.split.skip.errors</varname></title> + <title>Handling of Errors During Log Splitting</title> - <para>When set to <constant>true</constant>, any error encountered splitting will be - logged, the problematic WAL will be moved into the <filename>.corrupt</filename> - directory under the hbase <varname>rootdir</varname>, and processing will continue. 
If
- set to <constant>false</constant>, the default, the exception will be propagated and
- the split logged as failed.<footnote>
+ <para>If you set the <varname>hbase.hlog.split.skip.errors</varname> option to
+ <constant>true</constant>, errors are treated as follows:</para>
+ <itemizedlist>
+ <listitem>
+ <para>Any error encountered during splitting will be logged.</para>
+ </listitem>
+ <listitem>
+ <para>The problematic WAL log will be moved into the <filename>.corrupt</filename>
+ directory under the hbase <varname>rootdir</varname>.</para>
+ </listitem>
+ <listitem>
+ <para>Processing of the WAL will continue.</para>
+ </listitem>
+ </itemizedlist>
+ <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
+ <literal>false</literal>, the default, the exception will be propagated and the
+ split will be logged as failed.<footnote>
<para>See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-2958">HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it</link>. We need to do more than just fail split if this flag is set.</para>
</footnote></para>
+
+ <section>
+ <title>How EOFExceptions are treated when splitting a crashed RegionServer's
+ WALs</title>
+
+ <para>If an EOFException occurs while splitting logs, the split proceeds even when
+ <varname>hbase.hlog.split.skip.errors</varname> is set to
+ <literal>false</literal>. An EOFException while reading the last log in the set of
+ files to split is expected, because the RegionServer was likely in the process of
+ writing a record at the time of the crash.
<footnote> + <para>For background, see <link + xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643 + Figure how to deal with eof splitting logs</link></para> + </footnote></para> + </section> </section> - + <section> - <title>How EOFExceptions are treated when splitting a crashed RegionServers' - WALs</title> - - <para>If we get an EOF while splitting logs, we proceed with the split even when - <varname>hbase.hlog.split.skip.errors</varname> == <constant>false</constant>. An - EOF while reading the last log in the set of files to split is near-guaranteed since - the RegionServer likely crashed mid-write of a record. But we'll continue even if we - got an EOF reading other than the last file in the set.<footnote> - <para>For background, see <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643 Figure - how to deal with eof splitting logs</link></para> - </footnote></para> + <title>Performance Improvements during Log Splitting</title> + <para> + WAL log splitting and recovery can be resource intensive and take a long time, + depending on the number of RegionServers involved in the crash and the size of the + regions. <xref linkend="distributed.log.splitting" /> and <xref + linkend="distributed.log.replay" /> were developed to improve + performance during log splitting. + </para> + <section xml:id="distributed.log.splitting"> + <title>Distributed Log Splitting</title> + <para><firstterm>Distributed Log Splitting</firstterm> was added in HBase version 0.92 + (<link xlink:href="https://issues.apache.org/jira/browse/HBASE-1364">HBASE-1364</link>) + by Prakash Khemani from Facebook. It reduces the time to complete log splitting + dramatically, improving the availability of regions and tables. 
For
+ example, recovering a crashed cluster took around 9 hours with single-threaded log
+ splitting, but only about 6 minutes with distributed log splitting.</para>
+ <para>The information in this section is sourced from Jimmy Xiang's blog post at <link
+ xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/" />.</para>
+
+ <formalpara>
+ <title>Enabling or Disabling Distributed Log Splitting</title>
+ <para>Distributed log processing is enabled by default since HBase 0.92. The setting
+ is controlled by the <property>hbase.master.distributed.log.splitting</property>
+ property, which can be set to <literal>true</literal> or <literal>false</literal>,
+ but defaults to <literal>true</literal>.</para>
+ </formalpara>
+ <procedure>
+ <title>Distributed Log Splitting, Step by Step</title>
+ <para>After configuring distributed log splitting, the HMaster controls the process.
+ The HMaster enrolls each RegionServer in the log splitting process, and the actual
+ work of splitting the logs is done by the RegionServers. The general process for
+ log splitting, as described in <xref
+ linkend="log.splitting.step.by.step" />, still applies here.</para>
+ <step>
+ <para>If distributed log processing is enabled, the HMaster creates a
+ <firstterm>split log manager</firstterm> instance when the cluster is started.
+ The split log manager manages all log files which need
+ to be scanned and split. It places all the logs into the
+ ZooKeeper splitlog node (<filename>/hbase/splitlog</filename>) as tasks. You can
+ view the contents of the splitlog by issuing the following
+ <command>zkcli</command> command.
Example output is shown.</para>
+ <screen>ls /hbase/splitlog
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
+ </screen>
+ <para>The output contains some URL-encoded characters. When decoded, it is much
+ more readable:</para>
+ <screen>
+[hdfs://host2.sample.com:56020/hbase/.logs
+/host8.sample.com,57020,1340474893275-splitting
+/host8.sample.com%3A57020.1340474893900,
+hdfs://host2.sample.com:56020/hbase/.logs
+/host3.sample.com,57020,1340474893299-splitting
+/host3.sample.com%3A57020.1340474893931,
+hdfs://host2.sample.com:56020/hbase/.logs
+/host4.sample.com,57020,1340474893287-splitting
+/host4.sample.com%3A57020.1340474893946]
+ </screen>
+ <para>The listing represents WAL file names to be scanned and split, which is a
+ list of log splitting tasks.</para>
+ </step>
+ <step>
+ <title>The split log manager monitors the log-splitting tasks and workers.</title>
+ <para>The split log manager is responsible for the following ongoing tasks:</para>
+ <itemizedlist>
+ <listitem>
+ <para>Once the split log manager publishes all the tasks to the splitlog
+ znode, it monitors these task nodes and waits for them to be
+ processed.</para>
+ </listitem>
+ <listitem>
+ <para>Checks to see if there are any dead split log
+ workers queued up. If it finds tasks claimed by unresponsive workers, it
+ will resubmit those tasks. If the resubmit fails due to some ZooKeeper
+ exception, the dead worker is queued up again for retry.</para>
+ </listitem>
+ <listitem>
+ <para>Checks to see if there are any unassigned
+ tasks.
If it finds any, it creates an ephemeral rescan node so that each
+ split log worker is notified to re-scan unassigned tasks via the
+ <code>nodeChildrenChanged</code> ZooKeeper event.</para>
+ </listitem>
+ <listitem>
+ <para>Checks for tasks which are assigned but expired. If any are found, they
+ are moved back to <code>TASK_UNASSIGNED</code> state so that they can
+ be retried. It is possible that these tasks are assigned to slow workers, or
+ they may already be finished. This is not a problem, because log splitting
+ tasks have the property of idempotence. In other words, the same log
+ splitting task can be processed many times without causing any
+ problem.</para>
+ </listitem>
+ <listitem>
+ <para>The split log manager watches the HBase split log znodes constantly. If
+ any split log task node data is changed, the split log manager retrieves the
+ node data. The
+ node data contains the current state of the task. You can use the
+ <command>zkcli</command> <command>get</command> command to retrieve the
+ current state of a task.
In the example output below, the first line of the
+ output shows that the task is currently unassigned.</para>
+ <screen>
+<userinput>get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
+</userinput>
+<computeroutput>unassigned host2.sample.com:57000
+cZxid = 0x7115
+ctime = Sat Jun 23 11:13:40 PDT 2012
+...</computeroutput>
+ </screen>
+ <para>Based on the state of the task whose data is changed, the split log
+ manager does one of the following:</para>
+
+ <itemizedlist>
+ <listitem>
+ <para>Resubmit the task if it is unassigned</para>
+ </listitem>
+ <listitem>
+ <para>Heartbeat the task if it is assigned</para>
+ </listitem>
+ <listitem>
+ <para>Resubmit or fail the task if it is resigned (see <xref
+ linkend="distributed.log.replay.failure.reasons" />)</para>
+ </listitem>
+ <listitem>
+ <para>Resubmit or fail the task if it is completed with errors (see <xref
+ linkend="distributed.log.replay.failure.reasons" />)</para>
+ </listitem>
+ <listitem>
+ <para>Resubmit or fail the task if it could not complete due to
+ errors (see <xref
+ linkend="distributed.log.replay.failure.reasons" />)</para>
+ </listitem>
+ <listitem>
+ <para>Delete the task if it is successfully completed or failed</para>
+ </listitem>
+ </itemizedlist>
+ <itemizedlist xml:id="distributed.log.replay.failure.reasons">
+ <title>Reasons a Task Will Fail</title>
+ <listitem><para>The task has been deleted.</para></listitem>
+ <listitem><para>The node no longer exists.</para></listitem>
+ <listitem><para>The log status manager failed to move the state of the task
+ to TASK_UNASSIGNED.</para></listitem>
+ <listitem><para>The number of resubmits is over the resubmit
+ threshold.</para></listitem>
+ </itemizedlist>
+ </listitem>
+ </itemizedlist>
+ </step>
+ <step>
+ <title>Each RegionServer's split log worker performs the log-splitting tasks.</title>
+ <para>Each RegionServer runs
a daemon thread called the <firstterm>split log
+ worker</firstterm>, which does the work to split the logs. The daemon thread
+ starts when the RegionServer starts, and registers itself to watch HBase znodes.
+ If any splitlog znode children change, it notifies a sleeping worker thread to
+ wake up and grab more tasks. If a worker's current task's node data is
+ changed, the worker checks to see if the task has been taken by another worker.
+ If so, the worker thread stops work on the current task.</para>
+ <para>The worker monitors
+ the splitlog znode constantly. When a new task appears, the split log worker
+ retrieves the task paths and checks each one until it finds an unclaimed task,
+ which it attempts to claim. If the claim was successful, it attempts to perform
+ the task and updates the task's <property>state</property> property based on the
+ splitting outcome. At this point, the split log worker scans for another
+ unclaimed task.</para>
+ <itemizedlist>
+ <title>How the Split Log Worker Approaches a Task</title>
+
+ <listitem>
+ <para>It queries the task state and only takes action if the task is in
+ <literal>TASK_UNASSIGNED</literal> state.</para>
+ </listitem>
+ <listitem>
+ <para>If the task is in <literal>TASK_UNASSIGNED</literal> state, the
+ worker attempts to set the state to <literal>TASK_OWNED</literal> by itself.
+ If it fails to set the state, another worker will try to grab it. The split
+ log manager will also ask all workers to rescan later if the task remains
+ unassigned.</para>
+ </listitem>
+ <listitem>
+ <para>If the worker succeeds in taking ownership of the task, it tries to get
+ the task state again to make sure it really got it, because the
+ ownership change is asynchronous.
In the
+ meantime, it starts a split task executor to do the actual work:</para>
+ <itemizedlist>
+ <listitem>
+ <para>Get the HBase root folder, create a temp folder under the root, and
+ split the log file to the temp folder.</para>
+ </listitem>
+ <listitem>
+ <para>If the split was successful, the task executor sets the task to
+ state <literal>TASK_DONE</literal>.</para>
+ </listitem>
+ <listitem>
+ <para>If the worker catches an unexpected IOException, the task is set to
+ state <literal>TASK_ERR</literal>.</para>
+ </listitem>
+ <listitem>
+ <para>If the worker is shutting down, it sets the task to state
+ <literal>TASK_RESIGNED</literal>.</para>
+ </listitem>
+ <listitem>
+ <para>If the task is taken by another worker, just log it.</para>
+ </listitem>
+ </itemizedlist>
+ </listitem>
+ </itemizedlist>
+ </step>
+ <step>
+ <title>The split log manager monitors for uncompleted tasks.</title>
+ <para>The split log manager returns when all tasks are completed successfully. If
+ all tasks are completed with some failures, the split log manager throws an
+ exception so that the log splitting can be retried. Due to an asynchronous
+ implementation, in very rare cases, the split log manager loses track of some
+ completed tasks. For that reason, it periodically checks for remaining
+ uncompleted tasks in its task map or ZooKeeper. If none are found, it throws an
+ exception so that the log splitting can be retried right away instead of hanging
+ there waiting for something that won't happen.</para>
+ </step>
+ </procedure>
+ </section>
+ <section xml:id="distributed.log.replay">
+ <title>Distributed Log Replay</title>
+ <para>After a RegionServer fails, its failed region is assigned to another
+ RegionServer, and is marked as "recovering" in ZooKeeper. A split log worker directly
+ replays edits from the WAL of the failed RegionServer to the region at its new
+ location.
When a region is in "recovering" state, it can accept writes, but it cannot
+ accept reads (including Append and Increment), region splits, or merges.</para>
+ <para>Distributed Log Replay extends the <xref linkend="distributed.log.splitting" /> framework. It works by
+ directly replaying WAL edits to another RegionServer instead of creating
+ <filename>recovered.edits</filename> files. It provides the following advantages
+ over distributed log splitting alone:</para>
+ <itemizedlist>
+ <listitem><para>It eliminates the overhead of writing and reading a large number of
+ <filename>recovered.edits</filename> files. It is not unusual for thousands of
+ <filename>recovered.edits</filename> files to be created and written concurrently
+ during a RegionServer recovery. Many small random writes can degrade overall
+ system performance.</para></listitem>
+ <listitem><para>It allows writes even when a region is in recovering state. It only takes
+ seconds for a recovering region to accept writes again.</para></listitem>
+ </itemizedlist>
+ <formalpara>
+ <title>Enabling Distributed Log Replay</title>
+ <para>To enable distributed log replay, set <varname>hbase.master.distributed.log.replay</varname> to
+ true. This will be the default for HBase 0.99 (<link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-10888">HBASE-10888</link>).</para>
+ </formalpara>
+ <para>You must also enable HFile version 3 (which is the default HFile format starting
+ in HBase 0.99. See <link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-10855">HBASE-10855</link>).
+ Distributed log replay is unsafe for rolling upgrades.</para>
+ </section>
</section>
</section>
</section>
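
The enablement steps described in the Distributed Log Replay section above can be collected into an `hbase-site.xml` fragment. This is a minimal sketch, not part of the committed documentation: `hbase.master.distributed.log.replay` is named in the text, while `hfile.format.version` is an assumed property name for selecting HFile version 3.

```xml
<!-- hbase-site.xml (sketch): enable distributed log replay.
     hbase.master.distributed.log.replay is described in the section above.
     hfile.format.version=3 is assumed to be how HFile version 3 is selected,
     which the text says distributed log replay requires. -->
<configuration>
  <property>
    <name>hbase.master.distributed.log.replay</name>
    <value>true</value>
  </property>
  <property>
    <name>hfile.format.version</name>
    <value>3</value>
  </property>
</configuration>
```

As the text notes, distributed log replay is unsafe for rolling upgrades, so a change like this would be applied with a full cluster restart.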
