= Upgrade Guide for Hadoop-0.14 =

This page describes upgrade information that is specific to Hadoop-0.14. The usual upgrade procedure described on the [:Hadoop_Upgrade: Hadoop Upgrade page] still applies to Hadoop-0.14.

== Brief Upgrade Procedure ==

In most cases, an upgrade to Hadoop-0.14 completes without any problems. In these cases, administrators do not need to be familiar with the rest of the sections in this document. The simple upgrade steps are the same as those listed in [:Hadoop_Upgrade:Hadoop Upgrade]:

 1. If you are running Hadoop-0.13.x, make sure the cluster is finalized.
 1. Stop map-reduce cluster(s) and all client applications running on the DFS cluster.
 1. Stop the DFS cluster.
 1. Install the new version of the Hadoop software.
 1. Start the DFS cluster with the {{{-upgrade}}} option.
 1. Wait for the cluster upgrade to complete.
 1. Start the map-reduce cluster.
 1. Verify that the components run properly, and finalize the upgrade when convinced.
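In terms of concrete commands, the steps above roughly correspond to the following sketch. This assumes the standard scripts shipped in {{{bin/}}} and should be adapted to your installation (in particular, how you distribute the new software to the nodes):

{{{
# On a Hadoop-0.13.x cluster, finalize the previous upgrade first:
bin/hadoop dfsadmin -finalizeUpgrade

# Stop map-reduce (and client applications), then stop DFS:
bin/stop-mapred.sh
bin/stop-dfs.sh

# Install the new Hadoop software on all nodes, then start DFS
# with the upgrade option and watch the upgrade progress:
bin/start-dfs.sh -upgrade
bin/hadoop dfsadmin -upgradeProgress status

# Once the upgrade completes, start map-reduce again:
bin/start-mapred.sh

# After verifying that everything runs properly, finalize:
bin/hadoop dfsadmin -finalizeUpgrade
}}}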
The rest of the document describes what happens once the cluster is started with the {{{-upgrade}}} option.

Depending on the number of blocks and the number of files in HDFS, the upgrade can take anywhere from a few minutes to a few hours. There are three stages in this upgrade:

 1. '''Safe Mode''' : Similar to a normal restart of the cluster, the namenode waits for the datanodes in the cluster to report their blocks. The cluster may wait in this state for a long time if some of the datanodes do not report their blocks.
 1. '''Datanode Upgrade''' : Once most of the blocks are reported, the namenode asks the registered datanodes to start their local upgrade. The namenode waits for ''all'' the datanodes to complete their upgrade.
 1. '''Deleting {{{.crc}}} files''' : The namenode deletes the {{{.crc}}} files that were previously used for storing checksums.

== Monitoring the Upgrade ==

The cluster stays in ''safe mode'' until the upgrade is complete. The HDFS webui is a good place to check whether safe mode is on or off. As always, log files from the ''namenode'' and ''datanodes'' are useful when nothing else helps.

Once the cluster is started with the {{{-upgrade}}} option, the simplest way to monitor the upgrade is with the '{{{dfsadmin -upgradeProgress status}}}' command.

=== First Stage : Safe Mode ===

The actual Block CRC upgrade starts after all or most of the datanodes have reported their blocks.
{{{
$ bin/hadoop dfsadmin -upgradeProgress status
Distributed upgrade for version -6 is in progress. Status = 0%

Upgrade has not been started yet.
Last Block Level Stats updated at : Thu Jan 01 00:00:00 UTC 1970
....
}}}
The message {{{Upgrade has not been started yet}}} indicates that the namenode is in the first stage. When ''status'' is at 0%, it is usually in this stage. If some datanodes do not start, check the HDFS webui to see which datanodes are listed in the ''Dead Nodes'' table.

=== Second Stage : Datanode Upgrade ===

During this stage a typical output from the {{{upgradeProgress}}} command looks like this:
{{{
$ bin/hadoop dfsadmin -upgradeProgress status
Distributed upgrade for version -6 is in progress. Status = 78%
....
}}}

Some of the fields in this report:

 * {{{Un-upgraded}}} : blocks with zero upgraded replicas.
 * {{{Brief Datanode Status}}} : Each datanode reports its progress to the namenode during the upgrade. This shows the average percent completion across all the datanodes. It also shows how many datanodes have completed their upgrade. For the upgrade to proceed to the next stage, all the datanodes should report completion of their local upgrade.

Note that in some cases a few blocks might be ''over-replicated''; in such cases the upgrade might proceed to the next stage even if some of the datanodes do not complete their upgrade. If {{{Fully Upgraded}}} is calculated to be 100%, the namenode will proceed to the next stage.

==== Potential Problems during the Second Stage ====

 * ''The upgrade might seem to be stuck'' : Each datanode reports its progress once every minute. If the percent completion does not change even after a few minutes, some datanodes might have unexpected problems. Use the {{{details}}} option with the {{{-upgradeProgress}}} command to check which datanodes seem stagnant:
{{{
$ bin/hadoop dfsadmin -upgradeProgress details
Distributed upgrade for version -6 is in progress. Status = 72%

Last Block Level Stats updated at : Thu Jan 01 00:00:00 UTC 1970
Last Block Level Stats : Total Blocks : 0
Fully Upgragraded : 0.00%
Minimally Upgraded : 0.00%
Under Upgraded : 0.00% (includes Un-upgraded blocks)
Un-upgraded : 0.00%
Errors : 0
Brief Datanode Status : Avg completion of all Datanodes: 81.90% with 0 errors.
352 out of 893 nodes are not done.

Datanode Stats (total: 893): pct Completion(%) blocks upgraded (u) blocks remaining (r) errors (e)

192.168.0.31:50010 : 54 % 2136 u 1804 r 0 e
192.168.0.136:50010 : 73 % 3074 u 1085 r 0 e
192.168.0.24:50010 : 50 % 2044 u 1999 r 0 e
192.168.0.214:50010 : 100 % 4678 u 0 r 0 e
...
}}}
You can run this command through '{{{grep -v "100 %"}}}' to find the nodes that have not completed their upgrade.
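For example, the following pipeline prints only the per-datanode lines that have not reached 100%; the {{{":50010"}}} filter assumes the datanodes report on the default port 50010, as in the sample output above:

{{{
bin/hadoop dfsadmin -upgradeProgress details | grep ":50010" | grep -v "100 %"
}}}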
If the problem nodes cannot be corrected, as a last resort you can check the ''Block Level Stats'' to see if the upgrade can be ''forced'' to the next stage. E.g. if 98% of the blocks are fully upgraded and 2% are minimally upgraded, then you can be reasonably sure that at least one copy of each block is upgraded. You can force the next stage with the {{{force}}} option:
{{{
$ bin/hadoop dfsadmin -upgradeProgress force
Distributed upgrade for version -6 is in progress. Status = 90%

Force Proceed is ON
Last Block Level Stats updated at : Mon Aug 13 22:43:31 UTC 2007
Last Block Level Stats : Total Blocks : 1054713
Fully Upgragraded : 99.40%
Minimally Upgraded : 0.60%
Under Upgraded : 0.00% (includes Un-upgraded blocks)
Un-upgraded : 0.00%
Errors : 0
Brief Datanode Status : Avg completion of all Datanodes: 99.89% with 0 errors.
1 out of 893 nodes are not done.
NOTE: Upgrade at the Datanodes has finished. Deleteing ".crc" files
can take longer than status implies.
}}}
Note {{{Force Proceed is ON}}} in the status message.

=== Third Stage : Deleting {{{.crc}}} files ===

Once the second stage is complete, the namenode reports 90% completion. It does not have a good way of estimating the time required for deleting the files, so the ''status'' reports 90% completion all through this stage. Later tests with a larger number of files indicate that it takes about one hour to delete 2 million files on a rack server. The upgrade status report looks like the following:
{{{
$ bin/hadoop dfsadmin -upgradeProgress status
Distributed upgrade for version -6 is in progress. Status = 90%

Last Block Level Stats updated at : Mon Aug 20 20:24:56 UTC 2007
Last Block Level Stats : Total Blocks : 11604180
Fully Upgragraded : 100.00%
Minimally Upgraded : 0.00%
Under Upgraded : 0.00% (includes Un-upgraded blocks)
Un-upgraded : 0.00%
Errors : 0
Brief Datanode Status : Avg completion of all Datanodes: 100.00% with 0 errors.
NOTE: Upgrade at the Datanodes has finished. Deleteing ".crc" files
can take longer than status implies.
}}}
Note the last two lines, which indicate that the namenode is currently deleting the {{{.crc}}} files.

=== Upgrade is Finally Complete ===

Once the upgrade is complete, ''safe mode'' is turned off and HDFS runs normally. There is no need to restart the cluster. Now enjoy the new and shiny Hadoop with a leaner namenode.
{{{
$ bin/hadoop dfsadmin -upgradeProgress status
There are no distributed upgrades in progress.
}}}

=== Memory requirements ===

HDFS nodes do not require more memory during the upgrade than for normal operation before the upgrade. We observed that the namenode might use 5-10% more memory (or incur more GC in the JVM) during the upgrade. If the namenode is operating at the edge of its memory limits, it could run into problems during the upgrade.

=== Restarting a cluster ===

The cluster can be restarted during any stage of the upgrade, and it will resume the upgrade.

=== Analyzing Log Files ===

As a last resort while diagnosing problems, administrators can examine the logs on the namenode and the datanodes. Listing all the relevant log messages here would be information overload. Of course, developers would most appreciate it if the relevant logs, along with the output from the {{{-upgradeProgress}}} command, are attached when reporting problems with the upgrade.

Some of the warnings in the log files are expected during the upgrade. For example, during the upgrade datanodes fetch checksum data located on their peers. These data transfers use the new protocols in Hadoop-0.14, which require checksum data to be present along with the block data. Since the checksum data is not yet located next to the block, you will see the following warning in the datanode logs:
{{{
2007-08-18 07:17:38,698 WARN org.apache.hadoop.dfs.DataNode: Could not find metadata file for blk_2214836660875523305
}}}
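To check how widespread this expected warning is, or to look for other warnings, you can grep the datanode logs. The file name pattern below assumes the default {{{logs/}}} directory and log naming scheme; adjust it for your installation:

{{{
# Count the expected "missing metadata" warnings:
grep -c "Could not find metadata file" logs/hadoop-*-datanode-*.log

# List any other warnings seen during the upgrade:
grep WARN logs/hadoop-*-datanode-*.log | grep -v "Could not find metadata file"
}}}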