http://git-wip-us.apache.org/repos/asf/hbase/blob/cb77a925/src/main/docbkx/case_studies.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/case_studies.xml b/src/main/docbkx/case_studies.xml deleted file mode 100644 index 332caf8..0000000 --- a/src/main/docbkx/case_studies.xml +++ /dev/null @@ -1,239 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<chapter - version="5.0" - xml:id="casestudies" - xmlns="http://docbook.org/ns/docbook" - xmlns:xlink="http://www.w3.org/1999/xlink" - xmlns:xi="http://www.w3.org/2001/XInclude" - xmlns:svg="http://www.w3.org/2000/svg" - xmlns:m="http://www.w3.org/1998/Math/MathML" - xmlns:html="http://www.w3.org/1999/xhtml" - xmlns:db="http://docbook.org/ns/docbook"> - <!-- -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ ---> - <title>Apache HBase Case Studies</title> - <section - xml:id="casestudies.overview"> - <title>Overview</title> - <para> This chapter will describe a variety of performance and troubleshooting case studies that - can provide a useful blueprint on diagnosing Apache HBase cluster issues. 
</para> - <para> For more information on Performance and Troubleshooting, see <xref - linkend="performance" /> and <xref - linkend="trouble" />. </para> - </section> - - <section - xml:id="casestudies.schema"> - <title>Schema Design</title> - <para>See the schema design case studies here: <xref - linkend="schema.casestudies" /> - </para> - - </section> - <!-- schema design --> - - <section - xml:id="casestudies.perftroub"> - <title>Performance/Troubleshooting</title> - - <section - xml:id="casestudies.slownode"> - <title>Case Study #1 (Performance Issue On A Single Node)</title> - <section> - <title>Scenario</title> - <para> Following a scheduled reboot, one data node began exhibiting unusual behavior. - Routine MapReduce jobs run against HBase tables which regularly completed in five or six - minutes began taking 30 or 40 minutes to finish. These jobs were consistently found to be - waiting on map and reduce tasks assigned to the troubled data node (e.g., the slow map - tasks all had the same Input Split). The situation came to a head during a distributed - copy, when the copy was severely prolonged by the lagging node. 
</para> - </section> - <section> - <title>Hardware</title> - <itemizedlist> - <title>Datanodes:</title> - <listitem> - <para>Two 12-core processors</para> - </listitem> - <listitem> - <para>Six Enterprise SATA disks</para> - </listitem> - <listitem> - <para>24GB of RAM</para> - </listitem> - <listitem> - <para>Two bonded gigabit NICs</para> - </listitem> - </itemizedlist> - <itemizedlist> - <title>Network:</title> - <listitem> - <para>10 Gigabit top-of-rack switches</para> - </listitem> - <listitem> - <para>20 Gigabit bonded interconnects between racks.</para> - </listitem> - </itemizedlist> - </section> - <section> - <title>Hypotheses</title> - <section> - <title>HBase "Hot Spot" Region</title> - <para> We hypothesized that we were experiencing a familiar point of pain: a "hot spot" - region in an HBase table, where uneven key-space distribution can funnel a huge number - of requests to a single HBase region, bombarding the RegionServer process and causing slow - response time. Examination of the HBase Master status page showed that the number of - HBase requests to the troubled node was almost zero. Further, examination of the HBase - logs showed that there were no region splits, compactions, or other region transitions - in progress. This effectively ruled out a "hot spot" as the root cause of the observed - slowness. </para> - </section> - <section> - <title>HBase Region With Non-Local Data</title> - <para> Our next hypothesis was that one of the MapReduce tasks was requesting data from - HBase that was not local to the datanode, thus forcing HDFS to request data blocks from - other servers over the network. Examination of the datanode logs showed that there were - very few blocks being requested over the network, indicating that the HBase region was - correctly assigned, and that the majority of the necessary data was located on the node. - This ruled out the possibility of non-local data causing a slowdown. 
</para> - </section> - <section> - <title>Excessive I/O Wait Due To Swapping Or An Over-Worked Or Failing Hard Disk</title> - <para> After concluding that Hadoop and HBase were not likely to be the culprits, we - moved on to troubleshooting the datanode's hardware. Java, by design, will periodically - scan its entire memory space to do garbage collection. If system memory is heavily - overcommitted, the Linux kernel may enter a vicious cycle, using up all of its resources - swapping Java heap back and forth from disk to RAM as Java tries to run garbage - collection. Further, a failing hard disk will often retry reads and/or writes many times - before giving up and returning an error. This can manifest as high iowait, as running - processes wait for reads and writes to complete. Finally, a disk nearing the upper edge - of its performance envelope will begin to cause iowait as it informs the kernel that it - cannot accept any more data, and the kernel queues incoming data into the dirty write - pool in memory. However, using <code>vmstat(1)</code> and <code>free(1)</code>, we could - see that no swap was being used, and the amount of disk IO was only a few kilobytes per - second. </para> - </section> - <section> - <title>Slowness Due To High Processor Usage</title> - <para> Next, we checked to see whether the system was performing slowly simply due to very - high computational load. <code>top(1)</code> showed that the system load was higher than - normal, but <code>vmstat(1)</code> and <code>mpstat(1)</code> showed that the amount of - processor being used for actual computation was low. </para> - </section> - <section> - <title>Network Saturation (The Winner)</title> - <para> Since neither the disks nor the processors were being utilized heavily, we moved on - to the performance of the network interfaces. The datanode had two gigabit ethernet - adapters, bonded to form an active-standby interface. 
<code>ifconfig(8)</code> showed - some unusual anomalies, namely interface errors, overruns, and framing errors. While not - unheard of, these kinds of errors are exceedingly rare on modern hardware which is - operating as it should: </para> - <screen language="bourne"> -$ /sbin/ifconfig bond0 -bond0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 -inet addr:10.x.x.x Bcast:10.x.x.255 Mask:255.255.255.0 -UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 -RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6 <--- Look Here! Errors! -TX packets:3443518196 errors:0 dropped:0 overruns:0 carrier:0 -collisions:0 txqueuelen:0 -RX bytes:2416328868676 (2.4 TB) TX bytes:3464991094001 (3.4 TB) - </screen> - <para> These errors immediately led us to suspect that one or more of the ethernet - interfaces might have negotiated the wrong line speed. This was confirmed both by - running an ICMP ping from an external host and observing round-trip-time in excess of - 700ms, and by running <code>ethtool(8)</code> on the members of the bond interface and - discovering that the active interface was operating at 100Mb/s, full duplex. </para> - <screen language="bourne"> -$ sudo ethtool eth0 -Settings for eth0: -Supported ports: [ TP ] -Supported link modes: 10baseT/Half 10baseT/Full - 100baseT/Half 100baseT/Full - 1000baseT/Full -Supports auto-negotiation: Yes -Advertised link modes: 10baseT/Half 10baseT/Full - 100baseT/Half 100baseT/Full - 1000baseT/Full -Advertised pause frame use: No -Advertised auto-negotiation: Yes -Link partner advertised link modes: Not reported -Link partner advertised pause frame use: No -Link partner advertised auto-negotiation: No -Speed: 100Mb/s <--- Look Here! Should say 1000Mb/s! 
-Duplex: Full -Port: Twisted Pair -PHYAD: 1 -Transceiver: internal -Auto-negotiation: on -MDI-X: Unknown -Supports Wake-on: umbg -Wake-on: g -Current message level: 0x00000003 (3) -Link detected: yes - </screen> - <para> In normal operation, the ICMP ping round trip time should be around 20ms, and the - interface speed and duplex should read, "1000Mb/s", and, "Full", respectively. </para> - </section> - </section> - <section> - <title>Resolution</title> - <para> After determining that the active ethernet adapter was at the incorrect speed, we - used the <code>ifenslave(8)</code> command to make the standby interface the active - interface, which yielded an immediate improvement in MapReduce performance, and a 10 times - improvement in network throughput. </para> - <para> On the next trip to the datacenter, we determined that the line speed issue was - ultimately caused by a bad network cable, which was replaced. </para> - </section> - </section> - <!-- case study --> - <section - xml:id="casestudies.perf.1"> - <title>Case Study #2 (Performance Research 2012)</title> - <para> Investigation results of a self-described "we're not sure what's wrong, but it seems - slow" problem. <link - xlink:href="http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html">http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html</link> - </para> - </section> - - <section - xml:id="casestudies.perf.2"> - <title>Case Study #3 (Performance Research 2010)</title> - <para> Investigation results of general cluster performance from 2010. Although this research - is on an older version of the codebase, this writeup is still very useful in terms of - approach. 
<link - xlink:href="http://hstack.org/hbase-performance-testing/">http://hstack.org/hbase-performance-testing/</link> - </para> - </section> - - <section - xml:id="casestudies.max.transfer.threads"> - <title>Case Study #4 (max.transfer.threads Config)</title> - <para> Case study of configuring <code>max.transfer.threads</code> (previously known as - <code>xcievers</code>) and diagnosing errors from misconfigurations. <link - xlink:href="http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html">http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html</link> - </para> - <para> See also <xref - linkend="dfs.datanode.max.transfer.threads" />. </para> - </section> - - </section> - <!-- performance/troubleshooting --> - -</chapter>
http://git-wip-us.apache.org/repos/asf/hbase/blob/cb77a925/src/main/docbkx/community.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/community.xml b/src/main/docbkx/community.xml deleted file mode 100644 index 813f356..0000000 --- a/src/main/docbkx/community.xml +++ /dev/null @@ -1,149 +0,0 @@ -<?xml version="1.0"?> -<chapter - xml:id="community" - version="5.0" - xmlns="http://docbook.org/ns/docbook" - xmlns:xlink="http://www.w3.org/1999/xlink" - xmlns:xi="http://www.w3.org/2001/XInclude" - xmlns:svg="http://www.w3.org/2000/svg" - xmlns:m="http://www.w3.org/1998/Math/MathML" - xmlns:html="http://www.w3.org/1999/xhtml" - xmlns:db="http://docbook.org/ns/docbook"> - <!-- -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ ---> - <title>Community</title> - <section - xml:id="decisions"> - <title>Decisions</title> - <section - xml:id="feature_branches"> - <title>Feature Branches</title> - <para>Feature Branches are easy to make. You do not have to be a committer to make one. Just - request the name of your branch be added to JIRA up on the developer's mailing list and a - committer will add it for you. 
Thereafter you can file issues against your feature branch in - Apache HBase JIRA. You keep your code elsewhere -- it should be public so it can be observed - -- and you can update the dev mailing list on progress. When the feature is ready for commit, 3 - +1s from committers will get your feature merged. See <link - xlink:href="http://search-hadoop.com/m/asM982C5FkS1">HBase, mail # dev - Thoughts - about large feature dev branches</link></para> - </section> - <section - xml:id="patchplusonepolicy"> - <title>Patch +1 Policy</title> - <para> The below policy is something we put in place 09/2012. It is a suggested policy rather - than a hard requirement. We want to try it first to see if it works before we cast it in - stone. </para> - <para> Apache HBase is made of <link - xlink:href="https://issues.apache.org/jira/browse/HBASE#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel">components</link>. - Components have one or more <xref - linkend="OWNER" />s. See the 'Description' field on the <link - xlink:href="https://issues.apache.org/jira/browse/HBASE#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel">components</link> - JIRA page for who the current owners are by component. </para> - <para> Patches that fit within the scope of a single Apache HBase component require, at least, - a +1 by one of the component's owners before commit. If owners are absent -- busy or - otherwise -- two +1s by non-owners will suffice. </para> - <para> Patches that span components need at least two +1s before they can be committed, - preferably +1s by owners of components touched by the x-component patch (TODO: This needs - tightening up but I think fine for first pass). </para> - <para> Any -1 on a patch by anyone vetoes a patch; it cannot be committed until the - justification for the -1 is addressed. 
</para> - </section> - <section - xml:id="hbase.fix.version.in.JIRA"> - <title>How to set fix version in JIRA on issue resolve</title> - <para>Here is how <link - xlink:href="http://search-hadoop.com/m/azemIi5RCJ1">we agreed</link> to set versions in - JIRA when we resolve an issue. If trunk is going to be 0.98.0 then: </para> - <itemizedlist> - <listitem> - <para> Commit only to trunk: Mark with 0.98 </para> - </listitem> - <listitem> - <para> Commit to 0.95 and trunk: Mark with 0.98, and 0.95.x </para> - </listitem> - <listitem> - <para> Commit to 0.94.x and 0.95, and trunk: Mark with 0.98, 0.95.x, and 0.94.x </para> - </listitem> - <listitem> - <para> Commit to 89-fb: Mark with 89-fb. </para> - </listitem> - <listitem> - <para> Commit site fixes: no version </para> - </listitem> - </itemizedlist> - </section> - <section - xml:id="hbase.when.to.close.JIRA"> - <title>Policy on when to set a RESOLVED JIRA as CLOSED</title> - <para>We <link - xlink:href="http://search-hadoop.com/m/4cIKs1iwXMS1">agreed</link> that for issues that - list multiple releases in their <emphasis>Fix Version/s</emphasis> field, CLOSE the issue on - the release of any of the versions listed; subsequent changes to the issue must happen in a - new JIRA. </para> - </section> - <section - xml:id="no.permanent.state.in.zk"> - <title>Only transient state in ZooKeeper!</title> - <para> You should be able to kill the data in zookeeper and hbase should ride over it - recreating the zk content as it goes. This is an old adage around these parts. We just made - note of it now. We also are currently in violation of this basic tenet -- replication at - least keeps permanent state in zk -- but we are working to undo this breaking of a golden - rule. 
</para> - </section> - </section> - <section - xml:id="community.roles"> - <title>Community Roles</title> - <section - xml:id="OWNER"> - <title>Component Owner/Lieutenant</title> - <para> Component owners are listed in the description field on this Apache HBase JIRA <link - xlink:href="https://issues.apache.org/jira/browse/HBASE#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel">components</link> - page. The owners are listed in the 'Description' field rather than in the 'Component Lead' - field because the latter only allows us to list one individual whereas it is encouraged that - components have multiple owners. </para> - <para> Owners or component lieutenants are volunteers who are (usually, but not necessarily) - expert in their component domain and may have an agenda on how they think their Apache HBase - component should evolve. </para> - <orderedlist> - <title>Component Owner Duties</title> - <listitem> - <para> Owners will try and review patches that land within their component's scope. - </para> - </listitem> - <listitem> - <para> If applicable, if an owner has an agenda, they will publish their goals or the - design toward which they are driving their component. </para> - </listitem> - </orderedlist> - <para> If you would like to volunteer as a component owner, just write the dev list and - we'll sign you up. Owners do not need to be committers. </para> - </section> - </section> - <section - xml:id="hbase.commit.msg.format"> - <title>Commit Message format</title> - <para>We <link - xlink:href="http://search-hadoop.com/m/Gwxwl10cFHa1">agreed</link> to the following SVN - commit message format: - <programlisting>HBASE-xxxxx <title>. (<contributor>)</programlisting> If the person - making the commit is the contributor, leave off the '(<contributor>)' element. </para> - </section> -</chapter>
