[
https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865918#action_12865918
]
Bill Graham commented on CHUKWA-487:
------------------------------------
Here's what I saw in the logs when I had to restart my NN. It took a little
while to exit safe mode. I had to restore from the secondary name node, so there
might have been some data loss upon restore.
2010-05-06 17:32:19,515 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=318716 dataRate=10622
2010-05-06 17:32:49,518 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=196741 dataRate=6557
2010-05-06 17:33:06,367 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:129,numberchunks:217
2010-05-06 17:33:19,521 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-05-06 17:33:49,523 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-05-06 17:34:01,142 WARN org.apache.hadoop.dfs.dfsclient$leasechec...@36b60b93 DFSClient - Problem renewing lease for DFSClient_-1088933168: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot renew lease for DFSClient_-1088933168.
Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
    at org.apache.hadoop.dfs.FSNamesystem.renewLease(FSNamesystem.java:1823)
    at org.apache.hadoop.dfs.NameNode.renewLease(NameNode.java:458)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
    at org.apache.hadoop.ipc.Client.call(Client.java:716)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:781)
    at java.lang.Thread.run(Thread.java:619)
2010-05-06 17:34:01,608 WARN Timer-2094 SeqFileWriter - Got an exception in rotate
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot complete file /chukwa/logs/201006172737418_xxxxxxxxxcom_71ea99261284ab9f0566faa.chukwa. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
    at org.apache.hadoop.dfs.FSNamesystem.completeFileInternal(FSNamesystem.java:1209)
    at org.apache.hadoop.dfs.FSNamesystem.completeFile(FSNamesystem.java:1200)
    at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:351)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
    at org.apache.hadoop.ipc.Client.call(Client.java:716)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.dfs.$Proxy0.complete(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy0.complete(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2736)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2657)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
    at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter.rotate(SeqFileWriter.java:194)
    at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter$1.run(SeqFileWriter.java:235)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2010-05-06 17:34:01,647 FATAL Timer-2094 SeqFileWriter - IO Exception in rotate. Exiting!
2010-05-06 17:34:01,661 FATAL btpool0-6248 SeqFileWriter - IOException when trying to write a chunk, Collector is going to exit!
java.io.IOException: Stream closed.
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.isClosed(DFSClient.java:2245)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2481)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
    at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
    at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter.add(SeqFileWriter.java:281)
    at org.apache.hadoop.chukwa.datacollection.collector.servlet.ServletCollector.accept(ServletCollector.java:152)
    at org.apache.hadoop.chukwa.datacollection.collector.servlet.ServletCollector.doPost(ServletCollector.java:190)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:362)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:843)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:647)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:450)
2010-05-06 17:34:06,370 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:28,numberchunks:0
2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
...
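
One way to avoid this half-dead state would be the option mentioned in the issue description: have the collector mirror the agents and keep retrying HDFS operations while the NN is in safe mode, then shut down completely only after a bounded number of attempts. The helper below is only a hypothetical sketch of that pattern; `RetryingWriter`, `withRetries`, and the backoff parameters are illustrative and not part of the Chukwa API.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/**
 * Hypothetical bounded-retry helper, sketching the "resemble the agents
 * and keep trying" option. Not actual Chukwa code.
 */
public class RetryingWriter {

    /**
     * Runs op, retrying on IOException (e.g. a SafeModeException surfacing
     * as a RemoteException) with linear backoff. After maxAttempts failures
     * the last IOException is rethrown, at which point the caller should
     * shut the collector down completely rather than limp along.
     */
    public static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                Thread.sleep(backoffMs * attempt); // linear backoff between attempts
            }
        }
        throw last; // give up cleanly instead of leaving a half-dead collector
    }
}
```

A caller would wrap each rotate/append in `withRetries(...)` and treat the rethrown exception as a signal to exit the JVM outright, so the pid file and the process can never disagree about whether the collector is alive.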
> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>
> Key: CHUKWA-487
> URL: https://issues.apache.org/jira/browse/CHUKWA-487
> Project: Hadoop Chukwa
> Issue Type: Bug
> Components: data collection
> Affects Versions: 0.4.0
> Reporter: Bill Graham
>
> When the name node returns errors to the collector, at some point the
> collector dies halfway. This behavior should be changed to either resemble
> the agents and keep trying, or to shut down completely. Instead, what I'm
> seeing is that the collector logs that it's shutting down and the
> var/pidDir/Collector.pid file gets removed, but the collector continues to
> run, albeit without handling new data. Meanwhile, this log entry is repeated
> ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.