The problem is that fsimage and edits are no longer being updated, so…if I restart, how could it replay those?
-jeremy

On Sep 7, 2011, at 8:48 AM, Ravi Prakash wrote:

> Actually, I take that back. Restarting the NN might not result in loss of
> data. It will probably just take longer to start up, because it would read
> the fsimage and then apply the edits itself (rather than the SNN doing it).
>
> On Wed, Sep 7, 2011 at 10:46 AM, Ravi Prakash <[email protected]> wrote:
>
>> Hi Jeremy,
>>
>> A couple of questions:
>>
>> 1. Which version of Hadoop are you using?
>> 2. If you write something into HDFS, can you subsequently read it back?
>> 3. Are you sure your secondary namenode configuration is correct? It
>> seems like your SNN is telling your NN to roll the edit log (move the
>> journaling directory from current to .new), but when it tries to download
>> the image file, it's not finding it.
>> 4. I wish I could say I haven't ever seen that stack trace in the logs.
>> I was seeing something similar (not the same, quite far from it actually):
>> https://issues.apache.org/jira/browse/HDFS-2011
>>
>> If I were you, and I felt exceptionally brave (mind you, I've worked with
>> only test systems, no production sys-admin guts for me ;-) ), I would do
>> everything I could to get the secondary namenode started and
>> checkpointing properly.
>>
>> Methinks restarting the namenode will most likely result in loss of data.
>>
>> Hope this helps,
>> Ravi
>>
>> On Tue, Sep 6, 2011 at 7:26 PM, Jeremy Hansen <[email protected]> wrote:
>>
>>> I happened to notice this today, and being fairly new to administering
>>> Hadoop, I'm not exactly sure how to pull out of this situation without
>>> data loss.
>>>
>>> The checkpoint hasn't happened since Sept 2nd.
>>> -rw-r--r-- 1 hdfs hdfs      8889 Sep  2 14:09 edits
>>> -rw-r--r-- 1 hdfs hdfs 195968056 Sep  2 14:09 fsimage
>>> -rw-r--r-- 1 hdfs hdfs 195979439 Sep  2 14:09 fsimage.ckpt
>>> -rw-r--r-- 1 hdfs hdfs         8 Sep  2 14:09 fstime
>>> -rw-r--r-- 1 hdfs hdfs       100 Sep  2 14:09 VERSION
>>>
>>> /mnt/data0/dfs/nn/image
>>> -rw-r--r-- 1 hdfs hdfs 157 Sep  2 14:09 fsimage
>>>
>>> I'm also seeing this in the NN logs:
>>>
>>> 2011-09-06 16:48:23,738 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.10.10.11
>>> 2011-09-06 16:48:23,740 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException
>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.getImageFile(FSImage.java:219)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.getFsImageName(FSImage.java:1584)
>>>     at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:75)
>>>     at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:70)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>     at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:70)
>>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:824)
>>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>>>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>>>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>>>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>>     at org.mortbay.jetty.Server.handle(Server.java:326)
>>>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>>>     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>>>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>>>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>>>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>>
>>> On the secondary name node:
>>>
>>> 2011-09-06 16:51:53,538 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.FileNotFoundException: http://ftrr-nam6000.chestermcgee.com:50070/getimage?getimage=1
>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>     at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1360)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1354)
>>>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1008)
>>>     at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:183)
>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$3.run(SecondaryNameNode.java:348)
>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$3.run(SecondaryNameNode.java:337)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:337)
>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:422)
>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:313)
>>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:276)
>>>     at java.lang.Thread.run(Thread.java:619)
>>> Caused by: java.io.FileNotFoundException: http://ftrr-nam6000.las1.fanops.net:50070/getimage?getimage=1
>>>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1303)
>>>     at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2165)
>>>     at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:175)
>>>     ... 10 more
>>>
>>> Any help would be very much appreciated. I'm scared to shut down the NN.
>>> I've tried restarting the 2NN.
>>>
>>> Thank You
>>> -jeremy
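[Editor's note] Before touching the NN at all, a common precaution (not suggested in the thread itself, added here as a hedge) is to copy the NameNode metadata directory aside while the NN is still healthy, and to run Ravi's question-2 sanity check that HDFS still serves writes and reads. A minimal sketch follows; the default `NAME_DIR` is taken from Jeremy's listing (`/mnt/data0/dfs/nn`), and the HDFS test path `/tmp/hdfs-write-test` is an arbitrary choice — substitute your own `dfs.name.dir` and paths.

```shell
#!/bin/sh
# Sketch only: snapshot the NameNode metadata directory (fsimage, edits,
# fstime, VERSION) so a botched restart can be rolled back.
# NAME_DIR is an assumption -- override it with your dfs.name.dir value.
NAME_DIR="${NAME_DIR:-/mnt/data0/dfs/nn}"
BACKUP="/tmp/nn-meta-backup-$(date +%Y%m%d%H%M%S).tar.gz"

if [ -d "$NAME_DIR" ]; then
    # -C archives a relative path so the backup can be restored anywhere
    tar -czf "$BACKUP" -C "$(dirname "$NAME_DIR")" "$(basename "$NAME_DIR")"
    echo "backed up $NAME_DIR to $BACKUP"
else
    echo "NAME_DIR $NAME_DIR not found; set NAME_DIR to your dfs.name.dir"
fi

# Ravi's question 2 -- can you still write to and read from HDFS?
# Left commented out because it needs a live cluster:
#   hadoop fs -put /etc/hosts /tmp/hdfs-write-test
#   hadoop fs -cat /tmp/hdfs-write-test
#   hadoop fs -rm /tmp/hdfs-write-test
```

If the write/read round trip still works, the in-memory namespace and the edit log are likely intact, which makes the "fix the SNN first, restart nothing" path Ravi recommends much less nerve-wracking.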
