Repository: hbase
Updated Branches:
  refs/heads/master 1324a3cb1 -> 1b9242259


HBASE-11280 Document distributed log replay and distributed log splitting 
(Misty Stanley-Jones)


Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/1b924225
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/1b924225
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/1b924225

Branch: refs/heads/master
Commit: 1b9242259704d5c6a653a7f56cb1121b981ae3da
Parents: 1324a3c
Author: Michael Stack <[email protected]>
Authored: Tue Jun 17 18:18:00 2014 -0500
Committer: Michael Stack <[email protected]>
Committed: Tue Jun 17 18:18:00 2014 -0500

----------------------------------------------------------------------
 src/main/docbkx/book.xml | 384 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 351 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hbase/blob/1b924225/src/main/docbkx/book.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index 3e22dfd..1af243c 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -2200,16 +2200,21 @@ rs.close();
         <section
           xml:id="purpose.wal">
           <title>Purpose</title>
-
-          <para>Each RegionServer adds updates (Puts, Deletes) to its 
write-ahead log (WAL) first,
-            and then to the <xref
-              linkend="store.memstore" /> for the affected <xref
-              linkend="store" />. This ensures that HBase has durable writes. 
Without WAL, there is
-            the possibility of data loss in the case of a RegionServer failure 
before each MemStore
-            is flushed and new StoreFiles are written. <link
+          <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all changes to data in
+            HBase to file-based storage. Under normal operation, the WAL is not needed because
+            data changes are eventually flushed from the MemStore to StoreFiles. However, if a 
RegionServer crashes or
+          becomes unavailable before the MemStore is flushed, the WAL ensures 
that the changes to
+          the data can be replayed. If writing to the WAL fails, the entire 
operation to modify the
+          data fails.</para>
+          <para>HBase uses an implementation of the <link
               
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/wal/HLog.html";>HLog</link>
-            is the HBase WAL implementation, and there is one HLog instance 
per RegionServer. </para>
-          <para>The WAL is in HDFS in <filename>/hbase/.logs/</filename> with 
subdirectories per
+            interface for the WAL.
+            Usually, there is only one instance of a WAL per RegionServer. The 
RegionServer records Puts and Deletes to
+            it, before recording them to the <xref
+              linkend="store.memstore" /> for the affected <xref
+              linkend="store" />.</para>
+          <para>The WAL resides in HDFS in the 
<filename>/hbase/WALs/</filename> directory (prior to
+            HBase 0.94, WAL files were stored in <filename>/hbase/.logs/</filename>), with subdirectories per
             region.</para>
           <para> For more general information about the concept of write ahead 
logs, see the
             Wikipedia <link
@@ -2226,39 +2231,352 @@ rs.close();
           xml:id="wal_splitting">
           <title>WAL Splitting</title>
 
+          <para>A RegionServer serves many regions. All of the regions in a 
region server share the
+            same active WAL file. Each edit in the WAL file includes 
information about which region
+            it belongs to. When a region is opened, the edits in the WAL file 
which belong to that
+            region need to be replayed. Therefore, edits in the WAL file must 
be grouped by region
+            so that particular sets can be replayed to regenerate the data in 
a particular region.
+            The process of grouping the WAL edits by region is called 
<firstterm>log
+              splitting</firstterm>. It is a critical process for recovering 
data if a region server
+            fails.</para>
+          <para>Log splitting is done by the HMaster during cluster start-up or by the
+            ServerShutdownHandler as a region server shuts down. To guarantee consistency,
+            affected regions are unavailable until all of their WAL edits have been recovered
+            and replayed; as a result, regions affected by log splitting remain unavailable
+            until the process completes.</para>
+          <procedure xml:id="log.splitting.step.by.step">
+            <title>Log Splitting, Step by Step</title>
+            <step>
+              <title>The 
<filename>/hbase/WALs/&lt;host>,&lt;port>,&lt;startcode></filename> directory 
is renamed.</title>
+              <para>Renaming the directory is important because a RegionServer 
may still be up and
+                accepting requests even if the HMaster thinks it is down. If 
the RegionServer does
+                not respond immediately and does not heartbeat its ZooKeeper 
session, the HMaster
+                may interpret this as a RegionServer failure. Renaming the 
logs directory ensures
+                that existing, valid WAL files which are still in use by an 
active but busy
+                RegionServer are not written to by accident.</para>
+              <para>The new directory is named according to the following 
pattern:</para>
+              
<screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
+              <para>An example of such a renamed directory might look like the 
following:</para>
+              
<screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
+            </step>
+            <step>
+              <title>Each log file is split, one at a time.</title>
+              <para>The log splitter reads the log file one edit entry at a 
time and puts each edit
+                entry into the buffer corresponding to the edit’s region. At 
the same time, the
+                splitter starts several writer threads. Writer threads pick up 
a corresponding
+                buffer and write the edit entries in the buffer to a temporary 
recovered edit
+                file. The temporary edit file is stored to disk with the 
following naming pattern:</para>
+              
<screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
+              <para>This file is used to store all the edits in the WAL log 
for this region. After
+                log splitting completes, the <filename>.temp</filename> file 
is renamed to the
+                sequence ID of the first log written to the file.</para>
+              <para>To determine whether all edits have been written, the 
sequence ID is compared to
+                the sequence of the last edit that was written to the HFile. 
If the sequence of the
+                last edit is greater than or equal to the sequence ID included 
in the file name, it
+                is clear that all writes from the edit file have been 
completed.</para>
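+              <para>As a hypothetical illustration (the sequence ID shown below is made up),
+                the rename described above might look like the following:</para>
+              <screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp
+  -> /hbase/<table_name>/<region_id>/recovered.edits/0000000000000001234]]></screen>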
+            </step>
+            <step>
+              <title>After log splitting is complete, each affected region is 
assigned to a
+                RegionServer.</title>
+              <para> When the region is opened, the 
<filename>recovered.edits</filename> folder is checked for recovered
+                edits files. If any such files are present, they are replayed 
by reading the edits
+                and saving them to the MemStore. After all edit files are 
replayed, the contents of
+                the MemStore are written to disk (HFile) and the edit files 
are deleted.</para>
+            </step>
+          </procedure>
+  
           <section>
-            <title>How edits are recovered from a crashed RegionServer</title>
-            <para>When a RegionServer crashes, it will lose its ephemeral 
lease in
-              ZooKeeper...TODO</para>
-          </section>
-          <section>
-            <title><varname>hbase.hlog.split.skip.errors</varname></title>
+            <title>Handling of Errors During Log Splitting</title>
 
-            <para>When set to <constant>true</constant>, any error encountered 
splitting will be
-              logged, the problematic WAL will be moved into the 
<filename>.corrupt</filename>
-              directory under the hbase <varname>rootdir</varname>, and 
processing will continue. If
-              set to <constant>false</constant>, the default, the exception 
will be propagated and
-              the split logged as failed.<footnote>
+            <para>If you set the 
<varname>hbase.hlog.split.skip.errors</varname> option to
+                <constant>true</constant>, errors are treated as 
follows:</para>
+            <itemizedlist>
+              <listitem>
+                <para>Any error encountered during splitting will be 
logged.</para>
+              </listitem>
+              <listitem>
+                <para>The problematic WAL will be moved into the <filename>.corrupt</filename>
+                  directory under the hbase <varname>rootdir</varname>.</para>
+              </listitem>
+              <listitem>
+                <para>Processing of the WAL will continue.</para>
+              </listitem>
+            </itemizedlist>
+            <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
+                <literal>false</literal>, the default, the exception will be propagated and the
+              split will be logged as failed.<footnote>
                 <para>See <link
                     
xlink:href="https://issues.apache.org/jira/browse/HBASE-2958";>HBASE-2958 When
                     hbase.hlog.split.skip.errors is set to false, we fail the 
split but thats
                     it</link>. We need to do more than just fail split if this 
flag is set.</para>
               </footnote></para>
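+            <para>The following <filename>hbase-site.xml</filename> snippet is a minimal
+              sketch of how you might enable the skip-errors behavior described above; adapt
+              it to your own deployment:</para>
+            <programlisting><![CDATA[<property>
+  <name>hbase.hlog.split.skip.errors</name>
+  <value>true</value>
+</property>]]></programlisting>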
+            
+            <section>
+              <title>How EOFExceptions are treated when splitting a crashed RegionServer's
+                WALs</title>
+
+              <para>If an EOFException occurs while splitting logs, the split 
proceeds even when
+                  <varname>hbase.hlog.split.skip.errors</varname> is set to
+                <literal>false</literal>. An EOFException while reading the 
last log in the set of
+                files to split is likely, because the RegionServer is likely 
to be in the process of
+                writing a record at the time of a crash. <footnote>
+                  <para>For background, see <link
+                      
xlink:href="https://issues.apache.org/jira/browse/HBASE-2643";>HBASE-2643
+                      Figure how to deal with eof splitting logs</link></para>
+                </footnote></para>
+            </section>
           </section>
-
+          
           <section>
-            <title>How EOFExceptions are treated when splitting a crashed 
RegionServers'
-              WALs</title>
-
-            <para>If we get an EOF while splitting logs, we proceed with the 
split even when
-                <varname>hbase.hlog.split.skip.errors</varname> == 
<constant>false</constant>. An
-              EOF while reading the last log in the set of files to split is 
near-guaranteed since
-              the RegionServer likely crashed mid-write of a record. But we'll 
continue even if we
-              got an EOF reading other than the last file in the set.<footnote>
-                <para>For background, see <link
-                    
xlink:href="https://issues.apache.org/jira/browse/HBASE-2643";>HBASE-2643 Figure
-                    how to deal with eof splitting logs</link></para>
-              </footnote></para>
+            <title>Performance Improvements during Log Splitting</title>
+            <para>
+              WAL log splitting and recovery can be resource intensive and 
take a long time,
+              depending on the number of RegionServers involved in the crash 
and the size of the
+              regions. <xref linkend="distributed.log.splitting" /> and <xref
+                linkend="distributed.log.replay" /> were developed to improve
+              performance during log splitting.
+            </para>
+            <section xml:id="distributed.log.splitting">
+              <title>Distributed Log Splitting</title>
+              <para><firstterm>Distributed Log Splitting</firstterm> was added 
in HBase version 0.92
+                (<link 
xlink:href="https://issues.apache.org/jira/browse/HBASE-1364";>HBASE-1364</link>)
 
+                by Prakash Khemani from Facebook. It reduces the time to 
complete log splitting
+                dramatically, improving the availability of regions and 
tables. For
+                example, recovering a crashed cluster took around 9 hours with 
single-threaded log
+                splitting, but only about 6 minutes with distributed log 
splitting.</para>
+              <para>The information in this section is sourced from Jimmy 
Xiang's blog post at <link
+              
xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/"; 
/>.</para>
+              
+              <formalpara>
+                <title>Enabling or Disabling Distributed Log Splitting</title>
+                <para>Distributed log processing is enabled by default since 
HBase 0.92. The setting
+                  is controlled by the 
<property>hbase.master.distributed.log.splitting</property>
+                  property, which can be set to <literal>true</literal> or 
<literal>false</literal>,
+                  but defaults to <literal>true</literal>. </para>
+              </formalpara>
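+              <para>For example, to explicitly disable distributed log splitting, you might
+                add the following to <filename>hbase-site.xml</filename>. This is only a
+                sketch; because the property defaults to <literal>true</literal>, no
+                configuration is needed to enable it.</para>
+              <programlisting><![CDATA[<property>
+  <name>hbase.master.distributed.log.splitting</name>
+  <value>false</value>
+</property>]]></programlisting>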
+              <procedure>
+                <title>Distributed Log Splitting, Step by Step</title>
+                <para>After configuring distributed log splitting, the HMaster 
controls the process.
+                  The HMaster enrolls each RegionServer in the log splitting 
process, and the actual
+                  work of splitting the logs is done by the RegionServers. The 
general process for
+                  log splitting, as described in <xref
+                    linkend="log.splitting.step.by.step" /> still applies 
here.</para>
+                <step>
+                  <para>If distributed log processing is enabled, the HMaster 
creates a
+                    <firstterm>split log manager</firstterm> instance when the 
cluster is started.
+                    The split log manager manages all log files which need
+                    to be scanned and split. The split log manager places all 
the logs into the
+                    ZooKeeper splitlog node 
(<filename>/hbase/splitlog</filename>) as tasks. You can
+                  view the contents of the splitlog by issuing the following
+                    <command>zkcli</command> command. Example output is 
shown.</para>
+                  <screen>ls /hbase/splitlog
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
 
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
 
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
                  
+                  </screen>
+                  <para>The output contains URL-encoded file paths. When decoded, it is much
+                    simpler:</para>
+                  <screen>
+[hdfs://host2.sample.com:56020/hbase/.logs
+/host8.sample.com,57020,1340474893275-splitting
+/host8.sample.com%3A57020.1340474893900, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host3.sample.com,57020,1340474893299-splitting
+/host3.sample.com%3A57020.1340474893931, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host4.sample.com,57020,1340474893287-splitting
+/host4.sample.com%3A57020.1340474893946]                    
+                  </screen>
+                  <para>The listing represents WAL file names to be scanned 
and split, which is a
+                    list of log splitting tasks.</para>
+                </step>
+                <step>
+                  <title>The split log manager monitors the log-splitting 
tasks and workers.</title>
+                  <para>The split log manager is responsible for the following 
ongoing tasks:</para>
+                  <itemizedlist>
+                    <listitem>
+                      <para>Once the split log manager publishes all the tasks 
to the splitlog
+                        znode, it monitors these task nodes and waits for them 
to be
+                        processed.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any dead split log
+                        workers queued up. If it finds tasks claimed by 
unresponsive workers, it
+                        will resubmit those tasks. If the resubmit fails due 
to some ZooKeeper
+                        exception, the dead worker is queued up again for 
retry.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any unassigned
+                        tasks. If it finds any, it creates an ephemeral rescan node so that each
+                        split log worker is notified to re-scan unassigned 
tasks via the
+                          <code>nodeChildrenChanged</code> ZooKeeper 
event.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks for tasks which are assigned but expired. 
If any are found, they
+                        are moved back to <code>TASK_UNASSIGNED</code> state 
again so that they can
+                        be retried. It is possible that these tasks are 
assigned to slow workers, or
+                        they may already be finished. This is not a problem, 
because log splitting
+                        tasks have the property of idempotence. In other 
words, the same log
+                        splitting task can be processed many times without 
causing any
+                        problem.</para>
+                    </listitem>
+                    <listitem>
+                      <para>The split log manager watches the HBase split log 
znodes constantly. If
+                        any split log task node data is changed, the split log 
manager retrieves the
+                        node data. The
+                        node data contains the current state of the task. You 
can use the
+                        <command>zkcli</command> <command>get</command> 
command to retrieve the
+                        current state of a task. In the example output below, 
the first line of the
+                        output shows that the task is currently 
unassigned.</para>
+                      <screen>
+<userinput>get 
/hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
+</userinput> 
+<computeroutput>unassigned host2.sample.com:57000
+cZxid = 0x7115
+ctime = Sat Jun 23 11:13:40 PDT 2012
+...</computeroutput>  
+                      </screen>
+                      <para>Based on the state of the task whose data is 
changed, the split log
+                        manager does one of the following:</para>
+
+                      <itemizedlist>
+                        <listitem>
+                          <para>Resubmit the task if it is unassigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Heartbeat the task if it is assigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is resigned 
(see <xref
+                            linkend="distributed.log.replay.failure.reasons" 
/>)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is completed 
with errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" 
/>)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it could not 
complete due to
+                            errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" 
/>)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Delete the task if it is successfully 
completed or failed</para>
+                        </listitem>
+                      </itemizedlist>
+                      <itemizedlist 
xml:id="distributed.log.replay.failure.reasons">
+                        <title>Reasons a Task Will Fail</title>
+                        <listitem><para>The task has been 
deleted.</para></listitem>
+                        <listitem><para>The node no longer 
exists.</para></listitem>
+                        <listitem><para>The split log manager failed to move the state of the task
+                          to <code>TASK_UNASSIGNED</code>.</para></listitem>
+                        <listitem><para>The number of resubmits is over the 
resubmit
+                          threshold.</para></listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
+                <step>
+                  <title>Each RegionServer's split log worker performs the 
log-splitting tasks.</title>
+                  <para>Each RegionServer runs a daemon thread called the 
<firstterm>split log
+                      worker</firstterm>, which does the work to split the 
logs. The daemon thread
+                    starts when the RegionServer starts, and registers itself 
to watch HBase znodes.
+                    If any splitlog znode children change, it notifies a 
sleeping worker thread to
+                    wake up and grab more tasks. If a worker's current task's node data is
+                    changed, the worker checks to see if the task has been 
taken by another worker.
+                    If so, the worker thread stops work on the current 
task.</para>
+                  <para>The worker monitors
+                    the splitlog znode constantly. When a new task appears, 
the split log worker
+                    retrieves  the task paths and checks each one until it 
finds an unclaimed task,
+                    which it attempts to claim. If the claim was successful, 
it attempts to perform
+                    the task and updates the task's <property>state</property> 
property based on the
+                    splitting outcome. At this point, the split log worker 
scans for another
+                    unclaimed task.</para>
+                  <itemizedlist>
+                    <title>How the Split Log Worker Approaches a Task</title>
+
+                    <listitem>
+                      <para>It queries the task state and only takes action if 
the task is in
+                          <literal>TASK_UNASSIGNED</literal> state.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the task is in <literal>TASK_UNASSIGNED</literal> state, the
+                        worker attempts to set the state to 
<literal>TASK_OWNED</literal> by itself.
+                        If it fails to set the state, another worker will try 
to grab it. The split
+                        log manager will also ask all workers to rescan later 
if the task remains
+                        unassigned.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the worker succeeds in taking ownership of the task, it
+                        asynchronously queries the task state again to make sure it really got
+                        the task. In the meantime, it starts a split task executor to do the
+                        actual work:</para>
+                      <itemizedlist>
+                        <listitem>
+                          <para>Get the HBase root folder, create a temp 
folder under the root, and
+                            split the log file to the temp folder.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the split was successful, the task executor 
sets the task to
+                            state <literal>TASK_DONE</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker catches an unexpected 
IOException, the task is set to
+                            state <literal>TASK_ERR</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker is shutting down, the task is set to state
+                              <literal>TASK_RESIGNED</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the task is taken by another worker, just 
log it.</para>
+                        </listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
+                <step>
+                  <title>The split log manager monitors for uncompleted 
tasks.</title>
+                  <para>The split log manager returns when all tasks are 
completed successfully. If
+                    all tasks are completed with some failures, the split log 
manager throws an
+                    exception so that the log splitting can be retried. Due to 
an asynchronous
+                    implementation, in very rare cases, the split log manager 
loses track of some
+                    completed tasks. For that reason, it periodically checks 
for remaining
+                    uncompleted tasks in its task map or ZooKeeper. If none are 
found, it throws an
+                    exception so that the log splitting can be retried right 
away instead of hanging
+                    there waiting for something that won’t happen.</para>
+                </step>
+              </procedure>
+            </section>
+            <section xml:id="distributed.log.replay">
+              <title>Distributed Log Replay</title>
+              <para>After a RegionServer fails, its regions are assigned to other
+                RegionServers and are marked as "recovering" in ZooKeeper. A split log worker
+                directly replays edits from the WAL of the failed RegionServer to the regions
+                at their new locations. While a region is in "recovering" state, it can accept
+                writes but not reads (including Append and Increment), region splits, or
+                merges.</para>
+              <para>Distributed Log Replay extends the <xref 
linkend="distributed.log.splitting" /> framework. It works by
+                directly replaying WAL edits to another RegionServer instead 
of creating
+                  <filename>recovered.edits</filename> files. It provides the 
following advantages
+                over distributed log splitting alone:</para>
+              <itemizedlist>
+                <listitem><para>It eliminates the overhead of writing and 
reading a large number of
+                  <filename>recovered.edits</filename> files. It is not 
unusual for thousands of
+                  <filename>recovered.edits</filename> files to be created and 
written concurrently
+                  during a RegionServer recovery. Many small random writes can 
degrade overall
+                  system performance.</para></listitem>
+                <listitem><para>It allows writes even when a region is in 
recovering state. It only takes seconds for a recovering region to accept 
writes again. 
+</para></listitem>
+              </itemizedlist>
+              <formalpara>
+                <title>Enabling Distributed Log Replay</title>
+                <para>To enable distributed log replay, set 
<varname>hbase.master.distributed.log.replay</varname> to
+                  true. This will be the default for HBase 0.99 (<link
+                    
xlink:href="https://issues.apache.org/jira/browse/HBASE-10888";>HBASE-10888</link>).</para>
+              </formalpara>
+              <para>You must also enable HFile version 3 (the default HFile format starting
+                in HBase 0.99; see <link
+                  xlink:href="https://issues.apache.org/jira/browse/HBASE-10855">HBASE-10855</link>).
+                Distributed log replay is unsafe for rolling upgrades.</para>
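+              <para>A minimal <filename>hbase-site.xml</filename> sketch that enables
+                distributed log replay might look like the following. The
+                <varname>hfile.format.version</varname> property is included on the assumption
+                that it controls the HFile format version; verify the property name and value
+                against your HBase release.</para>
+              <programlisting><![CDATA[<!-- Enable distributed log replay -->
+<property>
+  <name>hbase.master.distributed.log.replay</name>
+  <value>true</value>
+</property>
+<!-- Assumed property for enabling HFile version 3; confirm for your release -->
+<property>
+  <name>hfile.format.version</name>
+  <value>3</value>
+</property>]]></programlisting>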
+            </section>
           </section>
         </section>
       </section>
