Hua xu created HADOOP-9684:
------------------------------

             Summary: the initialization or missed may be for 
org.apache.ipc.Client$Connection
                 Key: HADOOP-9684
                 URL: https://issues.apache.org/jira/browse/HADOOP-9684
             Project: Hadoop Common
          Issue Type: Bug
          Components: ipc
    Affects Versions: 0.21.0, 1.0.3
            Reporter: Hua xu


Today, we see that a TaskTracer has keeped throwing the same exception in our 
production environment.it is that:

2013-07-01 18:41:40,023 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot 
: current free slots : 7
2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction (registerTask): attempt_201208241212_27521_m_000002_3 task's 
state:UNASSIGNED
2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
launch : attempt_201208241212_27521_m_000002_3 which needs 1 slots
2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: In 
TaskLauncher, current free slots : 7 and trying to launch 
attempt_201208241212_27521_m_000002_3 which needs 1 slots
2013-07-01 18:41:43,026 INFO 
org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories for 
the user sds are already initialized on this TT. Not doing anything.
2013-07-01 18:41:43,029 WARN org.apache.hadoop.mapred.TaskTracker: Error 
initializing attempt_201208241212_27521_m_000002_3:
java.lang.NullPointerException

2013-07-01 18:41:43,029 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to 
set finish time for task attempt_201208241212_27521_m_000002_3 when no start 
time is set, stackTrace is : java.lang.Exception
        at 
org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:195)
        at 
org.apache.hadoop.mapred.MapTaskStatus.setFinishTime(MapTaskStatus.java:51)
        at 
org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2937)
        at 
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2255)
        at 
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)

  Then, we view the log files of the TaskTracker,and find that the TaskTracker 
throwed Several OutOfMemoryError: Java heap space about ten days ago. after 
that, the TaskTracker has still throws the exception:


2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction (registerTask): attempt_201208241212_26088_m_000043_1 task's 
state:UNASSIGNED
2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
launch : attempt_201208241212_26088_m_000043_1 which needs 1 slots
2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: In 
TaskLauncher, current free slots : 7 and trying to launch 
attempt_201208241212_26088_m_000043_1 which needs 1 slots
2013-06-22 12:39:42,296 INFO 
org.apache.hadoop.mapreduce.server.tasktracker.Localizer: Initializing user sds 
on this TT.
2013-06-22 12:39:42,300 WARN org.apache.hadoop.mapred.TaskTracker: Error 
initializing attempt_201208241212_26088_m_000043_1:
java.lang.NullPointerException
        at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:630)
        at org.apache.hadoop.ipc.Client.call(Client.java:886)
        at 
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
        at $Proxy5.getFileInfo(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:850)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:620)
        at 
org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:3984)
        at 
org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:1036)
        at 
org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:977)
        at 
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2247)
        at 
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)

2013-06-22 12:39:42,300 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to 
set finish time for task attempt_201208241212_26088_m_000043_1 when no start 
time is set, stackTrace is : java.lang.Exception
        at 
org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:195)
        at 
org.apache.hadoop.mapred.MapTaskStatus.setFinishTime(MapTaskStatus.java:51)
        at 
org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2937)
        at 
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2255)
        at 
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
  
  Since then, the TaskTracker has not completed any task.we can find the 
property of Connection -out is null from the above exeption,and it is caused by 
the failure of method setupIOstream() in the class 
org.apache.ipc.Client$Connection.anyway, the instance of Connection is not null 
and cached by the org.apache.ipc.Client.We guess that it throwed a 
OutOfMemoryError when a thread called the setupIOstream() because of RPC, so 
that some properties of the Connection are null, and throw NullPointerException 
when the Connection are accessed by other threads,which is fron the cache.It 
must be correct to make sure that the instance of Connection could only be 
cached after initialized successfully. We fixed the Method getConnection() to 
guarantee the semantic:

private Connection getConnection(InetSocketAddress addr,
                                   Class<?> protocol,
                                   UserGroupInformation ticket,
                                   Call call)
                                   throws IOException, InterruptedException {
    if (!running.get()) {
      // the client is stopped
      throw new IOException("The client is stopped");
    }
    Connection connection;
    /* we could avoid this allocation for each RPC by having a  
     * connectionsId object and with set() method. We need to manage the
     * refs for keys in HashMap properly. For now its ok.
     */
    ConnectionId remoteId = new ConnectionId(addr, protocol, ticket);
    do {
      synchronized (connections) {
        connection = connections.get(remoteId);
        if (connection == null) {
          connection = new Connection(remoteId);
          
 +         connection.setupIOstreams();
          
          connections.put(remoteId, connection);
        }
      }
    } while (!connection.addCall(call));
    
    //we don't invoke the method below inside "synchronized (connections)"
    //block above. The reason for that is if the server happens to be slow,
    //it will take longer to establish a connection and that will slow the
    //entire system down.
-    connection.setupIOstreams();
    
    return connection;
  }

  On the other hand, we also simulate this scenario.Firstly, one thread create 
the Connection instance and throw OutOfMemoryError when it call 
connection.setupIOstreams().after that, the other thread start call RPC through 
by the instance and it keeps throwing the same exceptions


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to