[ 
https://issues.apache.org/jira/browse/HADOOP-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700048#comment-13700048
 ] 

Hua xu commented on HADOOP-9684:
--------------------------------

It may be better to change it as follows:
{code}
 private synchronized void setupIOstreams() throws InterruptedException {
      
-     if (socket != null || shouldCloseConnection.get()) {
-       return;
-     }
+     if (this.out != null || shouldCloseConnection.get()) {
+       return;
+     }
      
      short ioFailures = 0;
      short timeoutFailures = 0;
      try {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Connecting to "+server);
        }
        while (true) {
          try {
            this.socket = socketFactory.createSocket();
            this.socket.setTcpNoDelay(tcpNoDelay);
            // connection time out is 20s
            NetUtils.connect(this.socket, remoteId.getAddress(), 20000);
            this.socket.setSoTimeout(pingInterval);
            break;
          } catch (SocketTimeoutException toe) {
            /* The max number of retries is 45,
             * which amounts to 20s*45 = 15 minutes retries.
             */
            handleConnectionFailure(timeoutFailures++, 45, toe);
          } catch (IOException ie) {
            handleConnectionFailure(ioFailures++, maxRetries, ie);
          }
        }
        InputStream inStream = NetUtils.getInputStream(socket);
        OutputStream outStream = NetUtils.getOutputStream(socket);
        writeRpcHeader(outStream);
        if (useSasl) {
          final InputStream in2 = inStream;
          final OutputStream out2 = outStream;
          UserGroupInformation ticket = remoteId.getTicket();
          if (authMethod == AuthMethod.KERBEROS) {
            if (ticket.getRealUser() != null) {
              ticket = ticket.getRealUser();
            }
          }
          if (ticket.doAs(new PrivilegedExceptionAction<Boolean>() {
            @Override
            public Boolean run() throws IOException {
              return setupSaslConnection(in2, out2);
            }
          })) {
            // Sasl connect is successful. Let's set up Sasl i/o streams.
            inStream = saslRpcClient.getInputStream(inStream);
            outStream = saslRpcClient.getOutputStream(outStream);
          } else {
            // fall back to simple auth because server told us so.
            authMethod = AuthMethod.SIMPLE;
            header = new ConnectionHeader(header.getProtocol(),
                header.getUgi(), authMethod);
            useSasl = false;
          }
        }
        if (doPing) {
          this.in = new DataInputStream(new BufferedInputStream
            (new PingInputStream(inStream)));
        } else {
          this.in = new DataInputStream(new BufferedInputStream
            (inStream));
        }

        this.out = new DataOutputStream(new BufferedOutputStream(outStream));

        writeHeader();

        // update last activity time
        touch();

        // start the receiver thread after the socket connection has been set up
        start();

      } 
-     catch (IOException e) {
-       markClosed(e);
-       close();
-     }
+     catch (Throwable t) {
+       // An Error such as OutOfMemoryError can abort setup half-way
+       // through; mark the connection closed before propagating it.
+       IOException ioe = (t instanceof IOException) ? (IOException) t
+           : new IOException("Failed to set up IO streams", t);
+       markClosed(ioe);
+       close();
+       if (t instanceof Error) {
+         throw (Error) t;
+       }
+     }
    }

{code}
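The guard change matters because setupIOstreams() assigns {{socket}} before {{out}}: if an Error kills the method in between, a guard on {{socket}} makes every later caller treat the half-initialized Connection as ready. A minimal standalone sketch (hypothetical class and field names, not the actual Hadoop code) of why guarding on the *last*-assigned field allows recovery:

```java
// Sketch of the hazard: setup assigns `socket` first and `out` last.
// If setup dies with an Error in between, a guard on `socket` would make
// every later caller skip setup and use a half-initialized object, while
// a guard on `out` lets a later caller retry the whole setup.
class FlakyConnection {
    Object socket;           // assigned early in setup
    Object out;              // assigned last in setup
    private boolean failNext;

    FlakyConnection(boolean failFirstAttempt) {
        this.failNext = failFirstAttempt;
    }

    // Mirrors the patched setupIOstreams(): bail out only when `out`
    // (the field assigned last) is already set.
    synchronized void setup() {
        if (out != null) {
            return;                       // fully initialized, nothing to do
        }
        socket = new Object();            // partial state, visible on failure
        if (failNext) {
            failNext = false;
            throw new OutOfMemoryError("simulated");  // setup dies mid-way
        }
        out = new Object();               // only reached on full success
    }
}

public class GuardDemo {
    public static void main(String[] args) {
        FlakyConnection c = new FlakyConnection(true);
        try {
            c.setup();                    // first caller dies with the Error
        } catch (OutOfMemoryError expected) {
            // socket is set but out is still null: half-initialized
        }
        c.setup();                        // second caller retries and succeeds
        System.out.println(c.out != null);
    }
}
```

Here the first caller dies mid-setup; because the guard tests {{out}} (assigned last) rather than {{socket}} (assigned first), the second caller re-runs setup instead of reusing the broken state, and the program prints {{true}}.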
                
> The initialization may be missed for org.apache.ipc.Client$Connection
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-9684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9684
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 1.0.3, 0.21.0
>            Reporter: Hua xu
>
> Today we found that a TaskTracker kept throwing the same exception in our
> production environment:
> 2013-07-01 18:41:40,023 INFO org.apache.hadoop.mapred.TaskTracker: 
> addFreeSlot : current free slots : 7
> 2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: 
> LaunchTaskAction (registerTask): attempt_201208241212_27521_m_000002_3 task's 
> state:UNASSIGNED
> 2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
> launch : attempt_201208241212_27521_m_000002_3 which needs 1 slots
> 2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: In 
> TaskLauncher, current free slots : 7 and trying to launch 
> attempt_201208241212_27521_m_000002_3 which needs 1 slots
> 2013-07-01 18:41:43,026 INFO 
> org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories 
> for the user sds are already initialized on this TT. Not doing anything.
> 2013-07-01 18:41:43,029 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> initializing attempt_201208241212_27521_m_000002_3:
> java.lang.NullPointerException
> 2013-07-01 18:41:43,029 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to 
> set finish time for task attempt_201208241212_27521_m_000002_3 when no start 
> time is set, stackTrace is : java.lang.Exception
>       at 
> org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:195)
>       at 
> org.apache.hadoop.mapred.MapTaskStatus.setFinishTime(MapTaskStatus.java:51)
>       at 
> org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2937)
>       at 
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2255)
>       at 
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
>   We then examined the TaskTracker's log files and found that it had thrown
> several "OutOfMemoryError: Java heap space" errors about ten days earlier.
> Ever since then, the TaskTracker has kept throwing this exception:
> 2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: 
> LaunchTaskAction (registerTask): attempt_201208241212_26088_m_000043_1 task's 
> state:UNASSIGNED
> 2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
> launch : attempt_201208241212_26088_m_000043_1 which needs 1 slots
> 2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: In 
> TaskLauncher, current free slots : 7 and trying to launch 
> attempt_201208241212_26088_m_000043_1 which needs 1 slots
> 2013-06-22 12:39:42,296 INFO 
> org.apache.hadoop.mapreduce.server.tasktracker.Localizer: Initializing user 
> sds on this TT.
> 2013-06-22 12:39:42,300 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> initializing attempt_201208241212_26088_m_000043_1:
> java.lang.NullPointerException
>       at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:630)
>       at org.apache.hadoop.ipc.Client.call(Client.java:886)
>       at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
>       at $Proxy5.getFileInfo(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>       at $Proxy5.getFileInfo(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:850)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:620)
>       at 
> org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:3984)
>       at 
> org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:1036)
>       at 
> org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:977)
>       at 
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2247)
>       at 
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
> 2013-06-22 12:39:42,300 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to 
> set finish time for task attempt_201208241212_26088_m_000043_1 when no start 
> time is set, stackTrace is : java.lang.Exception
>       at 
> org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:195)
>       at 
> org.apache.hadoop.mapred.MapTaskStatus.setFinishTime(MapTaskStatus.java:51)
>       at 
> org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2937)
>       at 
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2255)
>       at 
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
>   
>   Since then, the TaskTracker has not completed a single task. From the
> exception above we can see that the Connection's "out" field is null, which
> was caused by a failure in setupIOstreams() in
> org.apache.hadoop.ipc.Client$Connection. Nevertheless, the Connection
> instance itself is not null and remains cached by org.apache.hadoop.ipc.Client:
> {code}
> private Connection getConnection(InetSocketAddress addr,
>                                    Class<?> protocol,
>                                    UserGroupInformation ticket,
>                                    Call call)
>                                    throws IOException, InterruptedException {
>     if (!running.get()) {
>       // the client is stopped
>       throw new IOException("The client is stopped");
>     }
>     Connection connection;
>     /* we could avoid this allocation for each RPC by having a  
>      * connectionsId object and with set() method. We need to manage the
>      * refs for keys in HashMap properly. For now its ok.
>      */
>     ConnectionId remoteId = new ConnectionId(addr, protocol, ticket);
>     do {
>       synchronized (connections) {
>         connection = connections.get(remoteId);
>         if (connection == null) {
>           connection = new Connection(remoteId);
>           connections.put(remoteId, connection);
>         }
>       }
>     } while (!connection.addCall(call));
>     
>     //we don't invoke the method below inside "synchronized (connections)"
>     //block above. The reason for that is if the server happens to be slow,
>     //it will take longer to establish a connection and that will slow the
>     //entire system down.
>     connection.setupIOstreams();
>     
>     
>     return connection;
>   }
> {code}
>   We suspect that an OutOfMemoryError was thrown while one thread was running
> setupIOstreams() for an RPC call, so some fields of the Connection were left
> null, and other threads that later fetched the same Connection from the cache
> hit NullPointerExceptions. The correct behavior is to make sure a Connection
> instance can only be accessed after it has been initialized successfully.
>   We also reproduced this scenario. First, one thread creates the Connection
> instance and an OutOfMemoryError is thrown while it calls
> connection.setupIOstreams(). After that, another thread issues RPC calls
> through the cached instance and keeps hitting the same exceptions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
