I have experienced the same condition.
Here is the exception I was seeing:
2006-09-17 22:29:08,798 INFO org.apache.hadoop.mapred.TaskTracker:
task_0001_r_000032_3 0.7986655% reduce > reduce
2006-09-17 22:29:27,112 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 06/09/17 22:29:27 WARN mapred.TaskTracker: Error
running child
2006-09-17 22:29:12,854 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/user/stack/nla/2005-outputs/segments/20060915202116-nla2005/crawl_fetch/part-00032/data
2006-09-17 22:29:12,854 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:454)
2006-09-17 22:29:12,854 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at
org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:228)
2006-09-17 22:29:12,854 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at
sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
2006-09-17 22:29:12,854 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
2006-09-17 22:29:12,855 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at java.lang.reflect.Method.invoke(Method.java:585)
2006-09-17 22:29:12,855 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:332)
2006-09-17 22:29:12,855 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:468)
2006-09-17 22:29:12,855 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3
2006-09-17 22:29:12,855 INFO org.apache.hadoop.mapred.TaskRunner:
task_0001_r_000032_3 at
org.apache.hadoop.ipc.Client$Connection.run(Client.java:159)
2006-09-17 22:29:13,673 INFO org.apache.hadoop.ipc.Server: Server
listener on port 50050: disconnecting client 207.241.227.163. Number of
active connections: 0
After reading this mail from the archive,
http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200607.mbox/[EMAIL PROTECTED],
I made the following patch for my Hadoop instances:
Index: src/java/org/apache/hadoop/dfs/DFSClient.java
===================================================================
--- src/java/org/apache/hadoop/dfs/DFSClient.java (revision 409788)
+++ src/java/org/apache/hadoop/dfs/DFSClient.java (working copy)
@@ -403,18 +434,23 @@
     public void run() {
       long lastRenewed = 0;
       while (running) {
-        if (System.currentTimeMillis() - lastRenewed > (LEASE_PERIOD / 2)) {
+        // Divide by 3 instead of by 2 so we start renewing earlier
+        // and set down "ipc.client.timeout" from its 60 to 20 or 30.
+        // See this note for why:
+        // http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200607.mbox/[EMAIL PROTECTED]
+        if (System.currentTimeMillis() - lastRenewed > (LEASE_PERIOD / 3)) {
Timing-wise, the patch and the config change make it so we should get in
at least one renewal attempt before the lease times out. This got rid of
the lease expiration exceptions for me, but the above is less than optimal.
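For what it's worth, here is the arithmetic behind that claim as a minimal
sketch. The 20-second timeout and every name below other than LEASE_PERIOD
and ipc.client.timeout are assumptions for illustration, not actual Hadoop
identifiers:

// Timing sketch for the patch above: with LEASE_PERIOD / 3 and a lowered
// ipc.client.timeout (assumed 20s here), a renewal attempt that hangs for
// the full client timeout can still be retried before the 60s lease expires.
public class LeaseRenewTimingSketch {
    static final long LEASE_PERIOD = 60 * 1000L;           // lease expires 60s after the last renewal
    static final long RENEW_THRESHOLD = LEASE_PERIOD / 3;  // patched renewal point: 20s
    static final long IPC_TIMEOUT = 20 * 1000L;            // assumed lowered ipc.client.timeout

    public static void main(String[] args) {
        long firstAttempt = RENEW_THRESHOLD;                // 20s after the last renewal
        long retryAfterHang = firstAttempt + IPC_TIMEOUT;   // 40s if the first RPC hangs until timeout
        System.out.println("first renewal attempt at " + firstAttempt / 1000 + "s");
        System.out.println("retry after a hung RPC at " + retryAfterHang / 1000 + "s");
        System.out.println("lease expiry at " + LEASE_PERIOD / 1000 + "s; retry fits: "
            + (retryAfterHang < LEASE_PERIOD));
    }
}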
St.Ack
Dennis Kubes wrote:
This is going to be a long email. I am seeing a race condition
between the renewal of leases on the DFS and the RPC client timeout
values, and I wanted to get some feedback on it.
We keep seeing leases failing to be renewed, which forces a task to start
over. Usually the task will complete fine, but it still bugs me that
the leases were failing to be renewed. After doing some research, here
is what I found. When the lease is not renewed we see something like
this in the log:
06/10/31 13:23:57 INFO fs.FSNamesystem: Removing lease [Lease.
Holder: DFSClient_task_0007_r_000001_0, heldlocks: 0, pendingcreates:
0], leases remaining: 10
06/10/31 13:24:43 INFO ipc.Server: Server handler 5 on 9000 call
error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/user/phoenix/crawl/engine/unit2cat/part-00001
06/10/31 13:25:13 INFO ipc.Server: Server handler 2 on 9000 call
error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/user/phoenix/crawl/engine/unit2cat/part-00001
06/10/31 13:25:13 INFO ipc.Server: Server handler 7 on 9000 call
error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/user/phoenix/crawl/engine/unit2cat/part-00001
06/10/31 13:25:13 INFO ipc.Server: Server handler 5 on 9000 call
error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/user/phoenix/crawl/engine/unit2cat/part-00001
06/10/31 13:25:13 INFO ipc.Server: Server handler 6 on 9000 call
error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on
/user/phoenix/crawl/engine/unit2cat/.part-00001.crc
06/10/31 13:25:13 INFO fs.FSNamesystem: Attempt to release a lock from
an unknown lease holder DFSClient_task_0007_r_000001_0 for
/user/phoenix/crawl/engine/unit2cat/.part-00001.crc
Tracking this back through the code, the only way it can get into this
position is if the lease is expired by the FSNamesystem.LeaseMonitor
thread. As far as I can tell, the only ways for the lease to become
expired are (1) the system shuts down and the shutdown hook fires the
DFSClient.ClientFinalizer thread to release locks, or (2) the
DFSClient.LeaseChecker thread doesn't renew the lease in time. The
system is not shutting down, so it has to be something in the
LeaseChecker.
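For anyone following along, the renew/expire interaction itself boils down
to something like the following. This is only a minimal sketch of the idea
with made-up names; it is not the actual FSNamesystem or DFSClient code:

// Sketch of the client-renew / server-expire interaction described above.
class LeaseSketch {
    private final long leasePeriodMs;     // the LEASE_PERIOD equivalent
    private volatile long lastRenewedMs;  // updated only when a renew RPC succeeds

    LeaseSketch(long leasePeriodMs) {
        this.leasePeriodMs = leasePeriodMs;
        this.lastRenewedMs = System.currentTimeMillis();
    }

    // What the LeaseChecker-style client thread accomplishes with a successful renew RPC.
    void renew() {
        lastRenewedMs = System.currentTimeMillis();
    }

    // What the LeaseMonitor-style server thread checks periodically; once this
    // returns true the lease is removed and later writes under it hit
    // LeaseExpiredException.
    boolean expired() {
        return System.currentTimeMillis() - lastRenewedMs > leasePeriodMs;
    }
}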
The LeaseChecker runs continuously, once a second, and makes RPC calls
to renew the leases. The leases have to be renewed every 60 seconds
(LEASE_PERIOD) or they expire and are removed by the LeaseMonitor. The
LeaseChecker renews leases once their renew period is half over, so at
30 seconds and beyond. So how is it that the leases aren't getting
renewed? Well, the LeaseChecker actually is making an RPC call to the
NameNode to renew the lease. This RPC call uses an ipc Client object
which has a default socket timeout of 10 seconds but a configured value
of 60 seconds in hadoop-default.xml. The Client class uses a Connection
object with a thread that waits for input from the SocketInputStream.
So on the RPC method invocation we write the request through the
Connection's sendParam and wait for the return Writable value through
the input stream in the Connection thread. This input stream will
continually wait for input, and the RPC call will block while it is
waiting.

Now let's suppose that for some reason the RPC call goes bad and
blocks. It is going to wait 60 seconds before it can try again. If the
lease hasn't just been renewed, a race condition can occur where the
lease expires and is potentially removed by the LeaseMonitor before the
RPC call times out and can be retried.
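To make the worst case concrete, here is the timeline with the stock
numbers (renewal at the half-way point, 60-second lease, 60-second
configured ipc.client.timeout). The class and variable names are
illustrative only:

// Worst-case timeline when the renew RPC blocks until the client timeout fires.
public class RenewRaceSketch {
    public static void main(String[] args) {
        long leasePeriodSec = 60;              // lease expires 60s after the last renewal
        long renewAtSec = leasePeriodSec / 2;  // LeaseChecker first renews at 30s
        long ipcTimeoutSec = 60;               // ipc.client.timeout from hadoop-default.xml

        long retryAtSec = renewAtSec + ipcTimeoutSec;  // 90s: earliest retry if the call blocks
        System.out.println("renewal attempted at " + renewAtSec + "s");
        System.out.println("blocked RPC times out at " + retryAtSec + "s");
        System.out.println("lease already expired at " + leasePeriodSec + "s: "
            + (retryAtSec > leasePeriodSec));
    }
}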
I think a fix for this would be either to change the default timeout
value in hadoop-default.xml, which could have other unintended
consequences, or to increase LEASE_PERIOD to something around 2
minutes. I am looking for some feedback on what everyone thinks.
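For comparison, here is the same worst-case arithmetic under each option,
as a rough sketch that assumes nothing else about the renewal loop changes
and uses 20 seconds as the hypothetical lowered timeout:

// Rough comparison of the two proposals under the worst case where the first
// renew RPC hangs for the full client timeout before it can be retried.
public class FixComparisonSketch {
    static long worstCaseRetrySec(long leasePeriodSec, long ipcTimeoutSec) {
        long firstAttemptSec = leasePeriodSec / 2;  // unpatched LeaseChecker renews at the half-way point
        return firstAttemptSec + ipcTimeoutSec;     // earliest retry after a hung call
    }

    public static void main(String[] args) {
        // Option 1: lower ipc.client.timeout (assumed 20s), keep the 60s LEASE_PERIOD.
        System.out.println("lower timeout: retry at " + worstCaseRetrySec(60, 20)
            + "s, lease expires at 60s");
        // Option 2: raise LEASE_PERIOD to 120s, keep the configured 60s timeout.
        System.out.println("longer lease:  retry at " + worstCaseRetrySec(120, 60)
            + "s, lease expires at 120s");
    }
}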
Dennis Kubes