[
https://issues.apache.org/jira/browse/HADOOP-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700048#comment-13700048
]
Hua xu commented on HADOOP-9684:
--------------------------------
It may be better to change it as follows:
{code}
 private synchronized void setupIOstreams() throws InterruptedException {
-  if (socket != null || shouldCloseConnection.get()) {
-    return;
-  }
+  if (this.out != null || shouldCloseConnection.get()) {
+    return;
+  }
   short ioFailures = 0;
   short timeoutFailures = 0;
   try {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Connecting to " + server);
     }
     while (true) {
       try {
         this.socket = socketFactory.createSocket();
         this.socket.setTcpNoDelay(tcpNoDelay);
         // connection timeout is 20s
         NetUtils.connect(this.socket, remoteId.getAddress(), 20000);
         this.socket.setSoTimeout(pingInterval);
         break;
       } catch (SocketTimeoutException toe) {
         /* The max number of retries is 45,
          * which amounts to 20s * 45 = 15 minutes of retries.
          */
         handleConnectionFailure(timeoutFailures++, 45, toe);
       } catch (IOException ie) {
         handleConnectionFailure(ioFailures++, maxRetries, ie);
       }
     }
     InputStream inStream = NetUtils.getInputStream(socket);
     OutputStream outStream = NetUtils.getOutputStream(socket);
     writeRpcHeader(outStream);
     if (useSasl) {
       final InputStream in2 = inStream;
       final OutputStream out2 = outStream;
       UserGroupInformation ticket = remoteId.getTicket();
       if (authMethod == AuthMethod.KERBEROS) {
         if (ticket.getRealUser() != null) {
           ticket = ticket.getRealUser();
         }
       }
       if (ticket.doAs(new PrivilegedExceptionAction<Boolean>() {
         @Override
         public Boolean run() throws IOException {
           return setupSaslConnection(in2, out2);
         }
       })) {
         // SASL connect is successful. Let's set up SASL i/o streams.
         inStream = saslRpcClient.getInputStream(inStream);
         outStream = saslRpcClient.getOutputStream(outStream);
       } else {
         // fall back to simple auth because the server told us so.
         authMethod = AuthMethod.SIMPLE;
         header = new ConnectionHeader(header.getProtocol(),
             header.getUgi(), authMethod);
         useSasl = false;
       }
     }
     if (doPing) {
       this.in = new DataInputStream(new BufferedInputStream(
           new PingInputStream(inStream)));
     } else {
       this.in = new DataInputStream(new BufferedInputStream(inStream));
     }
     //byte[] data = new byte[1024*1024*5];
     this.out = new DataOutputStream(new BufferedOutputStream(outStream));
     writeHeader();
     // update last activity time
     touch();
     // start the receiver thread after the socket connection has been set up
     start();
   }
-  catch (IOException e) {
-    markClosed(e);
-    close();
-  }
+  catch (Throwable tb) {
+    IOException ex = null;
+    if (tb instanceof IOException) {
+      ex = (IOException) tb;
+    }
+    markClosed(ex);
+    close();
+    // rethrow fatal errors so the caller sees them too
+    if (tb instanceof Error) {
+      throw (Error) tb;
+    }
+  }
 }
{code}
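The idea behind the two changes can be sketched outside Hadoop. If setup dies with an Error (such as OutOfMemoryError) after the first field (the socket) is assigned but before the last one (the output stream), then a guard on the first field makes every later caller skip re-initialization and hit a NullPointerException on the unset field. Guarding on the last-assigned field, and catching Throwable so a failed instance is marked closed, avoids that. The following is a minimal, self-contained simulation; the class and field names are hypothetical and not Hadoop's actual code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Minimal simulation of a connection whose setup can die half-way. */
class FlakyConnection {
    Object socket;                     // assigned first during setup
    Object out;                        // assigned last during setup
    final AtomicBoolean closed = new AtomicBoolean(false);

    /** Buggy guard: checks the FIRST field assigned during setup. */
    void setupGuardedBySocket(boolean failHalfWay) {
        if (socket != null || closed.get()) {
            return;                    // silently skips re-setup even though 'out' is null
        }
        socket = new Object();
        if (failHalfWay) {
            throw new OutOfMemoryError("simulated");
        }
        out = new Object();
    }

    /** Fixed guard: checks the LAST field assigned, and marks the
     *  connection closed on any Throwable, including Errors. */
    void setupGuardedByOut(boolean failHalfWay) {
        if (out != null || closed.get()) {
            return;
        }
        try {
            socket = new Object();
            if (failHalfWay) {
                throw new OutOfMemoryError("simulated");
            }
            out = new Object();
        } catch (Throwable t) {
            closed.set(true);          // later callers see a dead connection
            if (t instanceof Error) {
                throw (Error) t;
            }
        }
    }

    void send() {
        if (closed.get()) {
            throw new IllegalStateException("connection is closed");
        }
        out.toString();                // NPE here if setup never completed
    }
}

public class StaleConnectionDemo {
    public static void main(String[] args) {
        // Buggy guard: setup dies mid-way, the retry is a silent no-op
        // (socket != null already), and send() NPEs forever after.
        FlakyConnection buggy = new FlakyConnection();
        try { buggy.setupGuardedBySocket(true); } catch (Error ignored) { }
        buggy.setupGuardedBySocket(false);
        try {
            buggy.send();
        } catch (NullPointerException e) {
            System.out.println("buggy guard: NullPointerException");
        }

        // Fixed guard: the failed setup marks the connection closed, so the
        // caller gets a clear error instead of an NPE from a stale instance.
        FlakyConnection fixed = new FlakyConnection();
        try { fixed.setupGuardedByOut(true); } catch (Error ignored) { }
        try {
            fixed.send();
        } catch (IllegalStateException e) {
            System.out.println("fixed guard: " + e.getMessage());
        }
    }
}
```

In the real client, a Connection marked closed also rejects new calls, so Client.getConnection() creates a fresh instance instead of looping on the broken cached one.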
> The initialization may be missed for org.apache.ipc.Client$Connection
> ---------------------------------------------------------------------
>
> Key: HADOOP-9684
> URL: https://issues.apache.org/jira/browse/HADOOP-9684
> Project: Hadoop Common
> Issue Type: Bug
> Components: ipc
> Affects Versions: 1.0.3, 0.21.0
> Reporter: Hua xu
>
> Recently, we found that a TaskTracker in our production environment kept
> throwing the same exception:
> 2013-07-01 18:41:40,023 INFO org.apache.hadoop.mapred.TaskTracker:
> addFreeSlot : current free slots : 7
> 2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201208241212_27521_m_000002_3 task's
> state:UNASSIGNED
> 2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
> launch : attempt_201208241212_27521_m_000002_3 which needs 1 slots
> 2013-07-01 18:41:43,026 INFO org.apache.hadoop.mapred.TaskTracker: In
> TaskLauncher, current free slots : 7 and trying to launch
> attempt_201208241212_27521_m_000002_3 which needs 1 slots
> 2013-07-01 18:41:43,026 INFO
> org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories
> for the user sds are already initialized on this TT. Not doing anything.
> 2013-07-01 18:41:43,029 WARN org.apache.hadoop.mapred.TaskTracker: Error
> initializing attempt_201208241212_27521_m_000002_3:
> java.lang.NullPointerException
> 2013-07-01 18:41:43,029 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to
> set finish time for task attempt_201208241212_27521_m_000002_3 when no start
> time is set, stackTrace is : java.lang.Exception
> at
> org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:195)
> at
> org.apache.hadoop.mapred.MapTaskStatus.setFinishTime(MapTaskStatus.java:51)
> at
> org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2937)
> at
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2255)
> at
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
> Then we checked the TaskTracker's log files and found that it had thrown
> several OutOfMemoryError: Java heap space errors about ten days earlier.
> Ever since then, the TaskTracker has kept throwing this exception:
> 2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201208241212_26088_m_000043_1 task's
> state:UNASSIGNED
> 2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
> launch : attempt_201208241212_26088_m_000043_1 which needs 1 slots
> 2013-06-22 12:39:42,296 INFO org.apache.hadoop.mapred.TaskTracker: In
> TaskLauncher, current free slots : 7 and trying to launch
> attempt_201208241212_26088_m_000043_1 which needs 1 slots
> 2013-06-22 12:39:42,296 INFO
> org.apache.hadoop.mapreduce.server.tasktracker.Localizer: Initializing user
> sds on this TT.
> 2013-06-22 12:39:42,300 WARN org.apache.hadoop.mapred.TaskTracker: Error
> initializing attempt_201208241212_26088_m_000043_1:
> java.lang.NullPointerException
> at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:630)
> at org.apache.hadoop.ipc.Client.call(Client.java:886)
> at
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
> at $Proxy5.getFileInfo(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> at $Proxy5.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:850)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:620)
> at
> org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:3984)
> at
> org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:1036)
> at
> org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:977)
> at
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2247)
> at
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
> 2013-06-22 12:39:42,300 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to
> set finish time for task attempt_201208241212_26088_m_000043_1 when no start
> time is set, stackTrace is : java.lang.Exception
> at
> org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:195)
> at
> org.apache.hadoop.mapred.MapTaskStatus.setFinishTime(MapTaskStatus.java:51)
> at
> org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2937)
> at
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2255)
> at
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2212)
>
> Since then, the TaskTracker has not completed any task. From the above
> exception we can see that the Connection's "out" field is null, which is
> caused by a failure in the setupIOstreams() method of
> org.apache.hadoop.ipc.Client$Connection. Nevertheless, the Connection
> instance itself is not null and remains cached by org.apache.hadoop.ipc.Client:
> {code}
>   private Connection getConnection(InetSocketAddress addr,
>                                    Class<?> protocol,
>                                    UserGroupInformation ticket,
>                                    Call call)
>       throws IOException, InterruptedException {
>     if (!running.get()) {
>       // the client is stopped
>       throw new IOException("The client is stopped");
>     }
>     Connection connection;
>     /* we could avoid this allocation for each RPC by having a
>      * connectionsId object and with set() method. We need to manage the
>      * refs for keys in HashMap properly. For now its ok.
>      */
>     ConnectionId remoteId = new ConnectionId(addr, protocol, ticket);
>     do {
>       synchronized (connections) {
>         connection = connections.get(remoteId);
>         if (connection == null) {
>           connection = new Connection(remoteId);
>           connections.put(remoteId, connection);
>         }
>       }
>     } while (!connection.addCall(call));
>
>     // we don't invoke the method below inside "synchronized (connections)"
>     // block above. The reason for that is if the server happens to be slow,
>     // it will take longer to establish a connection and that will slow the
>     // entire system down.
>     connection.setupIOstreams();
>
>     return connection;
>   }
> {code}
> We guess that a thread threw an OutOfMemoryError while calling
> setupIOstreams() for an RPC, so some fields of the Connection were left
> null, and a NullPointerException is thrown whenever other threads access
> that Connection, which they retrieve from the cache. To be correct, a
> Connection instance should only be accessible after it has been
> initialized successfully.
> We have also reproduced this scenario. First, one thread creates the
> Connection instance and throws an OutOfMemoryError when it calls
> connection.setupIOstreams(). After that, other threads issuing RPCs
> through that cached instance keep throwing the same exceptions.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira