If your HDFS is still working, the fsimage file won't be getting updated (only a checkpoint rewrites it), but the edits file still should be. That's why I asked question 2: whether you can read back something you write into HDFS.
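
One rough way to check this is to write a small test file into HDFS and then look at the modification time of the edits file on the NN. Below is a minimal sketch, not anything that ships with Hadoop: the function names are made up, and the directory /mnt/data0/dfs/nn/current is only my guess based on the image path in your listing. If your version journals to a differently named file after a roll, pass that name in `names` instead.

```python
import os
import time


def file_ages(name_dir, names=("edits", "fsimage"), now=None):
    """Return {file name: seconds since last modification} for files in name_dir.

    Files that do not exist map to None. The default names are taken from the
    directory listing in this thread; adjust them for your layout.
    """
    now = time.time() if now is None else now
    ages = {}
    for name in names:
        path = os.path.join(name_dir, name)
        try:
            ages[name] = now - os.path.getmtime(path)
        except FileNotFoundError:
            ages[name] = None
    return ages


def edits_still_updating(name_dir, max_age_seconds=600, now=None):
    """True if the edits file exists and was modified within max_age_seconds."""
    age = file_ages(name_dir, names=("edits",), now=now)["edits"]
    return age is not None and age <= max_age_seconds
```

After writing the test file, something like `edits_still_updating("/mnt/data0/dfs/nn/current", max_age_seconds=60)` should come back True if the NN is still journaling there. A stale fsimage age alongside a fresh edits age would be consistent with the checkpoint being the only thing that is stuck.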

On Wed, Sep 7, 2011 at 11:39 AM, Jeremy Hansen <[email protected]> wrote:

> The problem is that fsimage and edits are no longer being updated, so…if I
> restart, how could it replay those?
>
> -jeremy
>
>
> On Sep 7, 2011, at 8:48 AM, Ravi Prakash wrote:
>
> > Actually I take that back. Restarting the NN might not result in loss of
> > data. It will probably just take longer to start up because it would read
> > the fsimage, then apply the fsedits (rather than the SNN doing it).
> >
> > On Wed, Sep 7, 2011 at 10:46 AM, Ravi Prakash <[email protected]> wrote:
> >
> >> Hi Jeremy,
> >>
> >> Couple of questions:
> >>
> >> 1. Which version of Hadoop are you using?
> >> 2. If you write something into HDFS, can you subsequently read it?
> >> 3. Are you sure your secondarynamenode configuration is correct? It seems
> >> like your SNN is telling your NN to roll the edit log (move the journaling
> >> directory from current to .new), but when it tries to download the image
> >> file, its not finding it.
> >> 3. I wish I could say I haven't ever seen that stack trace in the logs. I
> >> was seeing something similar (not the same, quite far from it actually) (
> >> https://issues.apache.org/jira/browse/HDFS-2011 ).
> >>
> >> If I were you, and I felt exceptionally brave (mind you I've worked with
> >> only test systems, no production sys-admin guts for me ;-) ) I would
> >> probably do everything I can, to get the secondarynamenode started properly
> >> and make it checkpoint properly.
> >>
> >> Me thinks restarting the namenode will most likely result in loss of data.
> >>
> >> Hope this helps
> >> Ravi.
> >>
> >>
> >> On Tue, Sep 6, 2011 at 7:26 PM, Jeremy Hansen <[email protected]> wrote:
> >>
> >>>
> >>> I happened to notice this today and being fairly new to administering
> >>> hadoop, I'm not exactly sure how to pull out of this situation without data
> >>> loss.
> >>>
> >>> The checkpoint hasn't happened since Sept 2nd.
> >>>
> >>> -rw-r--r-- 1 hdfs hdfs 8889 Sep 2 14:09 edits
> >>> -rw-r--r-- 1 hdfs hdfs 195968056 Sep 2 14:09 fsimage
> >>> -rw-r--r-- 1 hdfs hdfs 195979439 Sep 2 14:09 fsimage.ckpt
> >>> -rw-r--r-- 1 hdfs hdfs 8 Sep 2 14:09 fstime
> >>> -rw-r--r-- 1 hdfs hdfs 100 Sep 2 14:09 VERSION
> >>>
> >>> /mnt/data0/dfs/nn/image
> >>> -rw-r--r-- 1 hdfs hdfs 157 Sep 2 14:09 fsimage
> >>>
> >>> I'm also seeing this in the NN logs:
> >>>
> >>> 2011-09-06 16:48:23,738 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.10.10.11
> >>> 2011-09-06 16:48:23,740 WARN org.mortbay.log: /getimage:
> >>> java.io.IOException: GetImage failed. java.lang.NullPointerException
> >>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.getImageFile(FSImage.java:219)
> >>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.getFsImageName(FSImage.java:1584)
> >>>     at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:75)
> >>>     at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:70)
> >>>     at java.security.AccessController.doPrivileged(Native Method)
> >>>     at javax.security.auth.Subject.doAs(Subject.java:396)
> >>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> >>>     at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:70)
> >>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> >>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> >>>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> >>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> >>>     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:824)
> >>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >>>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >>>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >>>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >>>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >>>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >>>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >>>     at org.mortbay.jetty.Server.handle(Server.java:326)
> >>>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >>>     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> >>>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> >>>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> >>>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >>>
> >>> On the secondary name node:
> >>>
> >>> 2011-09-06 16:51:53,538 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.FileNotFoundException: http://ftrr-nam6000.chestermcgee.com:50070/getimage?getimage=1
> >>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>>     at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1360)
> >>>     at java.security.AccessController.doPrivileged(Native Method)
> >>>     at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1354)
> >>>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1008)
> >>>     at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:183)
> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$3.run(SecondaryNameNode.java:348)
> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$3.run(SecondaryNameNode.java:337)
> >>>     at java.security.AccessController.doPrivileged(Native Method)
> >>>     at javax.security.auth.Subject.doAs(Subject.java:396)
> >>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:337)
> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:422)
> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:313)
> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:276)
> >>>     at java.lang.Thread.run(Thread.java:619)
> >>> Caused by: java.io.FileNotFoundException: http://ftrr-nam6000.las1.fanops.net:50070/getimage?getimage=1
> >>>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1303)
> >>>     at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2165)
> >>>     at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:175)
> >>>     ... 10 more
> >>>
> >>> Any help would be very much appreciated. I'm scared to shut down the NN.
> >>> I've tried restarting the 2NN.
> >>>
> >>> Thank You
> >>> -jeremy
