Hi all, I noticed a wrong timestamp in the status report from "./hadoop dfsadmin -upgradeProgress details", although the time setting on the server is correct. Will this matter?
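For reference, the "Last Block Level Stats updated at" value in the report is exactly the Unix epoch (0 ms after 1970-01-01 00:00:00 UTC) shown in a GMT+08:00 time zone. One plausible reading, not confirmed in this thread, is that the block-level stats have simply never been updated (the same report lists 0 total blocks and no known datanodes), rather than that the server clock is wrong. The minimal Java sketch below reproduces the wall-clock part of that string; the class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class EpochTimestamp {
    // Format a millisecond timestamp with the same pattern java.util.Date.toString()
    // uses, but in an explicitly chosen time zone instead of the JVM default.
    static String format(long millis, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        // 0 ms is 1970-01-01 00:00:00 UTC; in a GMT+08:00 zone that is 08:00 on the
        // same day, the wall-clock time shown in the report's "updated at" field.
        System.out.println(format(0L, "GMT+08:00"));
    }
}
```

If the field were fed a real update time, it would show a 2007 date; a 1970 date in the local zone is the usual symptom of a timestamp that is still 0.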
<-------------------------------------------------------------------------------------------------------------------------
Distributed upgrade for version -6 is in progress. Status = 0%

        Last Block Level Stats updated at : Thu Jan 01 08:00:00 GMT+08:00 1970
        Last Block Level Stats : Total Blocks : 0
                                 Fully Upgragraded : 0.00%
                                 Minimally Upgraded : 0.00%
                                 Under Upgraded : 0.00% (includes Un-upgraded blocks)
                                 Un-upgraded : 0.00%
                                 Errors : 0
        Brief Datanode Status : Avg completion of all Datanodes: 0.00% with 0 errors.

        Datanode Stats (total: 0): pct Completion(%) blocks upgraded (u) blocks remaining (r) errors (e)

        There are no known Datanodes
------------------------------------------------------------------------------------------------------------------------->

Here is the tcpdump output I captured with "tcpdump host 192.168.2.101 and 192.168.2.1" on one of the datanodes, from cluster startup until the connection was lost. 192.168.2.101 is the datanode (shown as TE-DN-001.local.TEST in the trace) and 192.168.2.1 is the namenode.

<-------------------------------------------------------------------------------------------------------------------------
03:21:01.082055 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: S 3124566345:3124566345(0) win 5840 <mss 1460,sackOK,timestamp 12085778 0,nop,wscale 7>
03:21:01.084143 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: S 635998938:635998938(0) ack 3124566346 win 5792 <mss 1460,sackOK,timestamp 211599828 12085778,nop,wscale 7>
03:21:01.082120 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 1 win 46 <nop,nop,timestamp 12085778 211599828>
03:21:01.090313 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 1:22(21) ack 1 win 46 <nop,nop,timestamp 211599830 12085778>
03:21:01.095758 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 22 win 46 <nop,nop,timestamp 12085781 211599830>
03:21:01.095876 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 1:21(20) ack 22 win 46 <nop,nop,timestamp 12085781 211599830>
03:21:01.095903 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: . ack 21 win 46 <nop,nop,timestamp 211599832 12085781>
03:21:01.096282 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 21:773(752) ack 22 win 46 <nop,nop,timestamp 12085782 211599832>
03:21:01.096304 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: . ack 773 win 57 <nop,nop,timestamp 211599832 12085782>
03:21:01.097154 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 22:766(744) ack 773 win 57 <nop,nop,timestamp 211599832 12085782>
03:21:01.097795 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 773:797(24) ack 766 win 58 <nop,nop,timestamp 12085782 211599832>
03:21:01.100199 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 766:918(152) ack 797 win 57 <nop,nop,timestamp 211599833 12085782>
03:21:01.106536 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 797:941(144) ack 918 win 69 <nop,nop,timestamp 12085784 211599833>
03:21:01.108781 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 918:1382(464) ack 941 win 69 <nop,nop,timestamp 211599835 12085784>
03:21:01.113305 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 941:957(16) ack 1382 win 81 <nop,nop,timestamp 12085786 211599835>
03:21:01.155108 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: . ack 957 win 69 <nop,nop,timestamp 211599846 12085786>
03:21:01.155199 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 957:1005(48) ack 1382 win 81 <nop,nop,timestamp 12085796 211599846>
03:21:01.155217 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: . ack 1005 win 69 <nop,nop,timestamp 211599846 12085796>
03:21:01.155273 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 1382:1430(48) ack 1005 win 69 <nop,nop,timestamp 211599847 12085796>
03:21:01.155453 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 1005:1069(64) ack 1430 win 81 <nop,nop,timestamp 12085796 211599847>
03:21:01.199106 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: . ack 1069 win 69 <nop,nop,timestamp 211599857 12085796>
03:21:01.214178 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 1430:1494(64) ack 1069 win 69 <nop,nop,timestamp 211599861 12085796>
03:21:01.214481 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 1069:1597(528) ack 1494 win 81 <nop,nop,timestamp 12085811 211599861>
03:21:01.214518 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: . ack 1597 win 81 <nop,nop,timestamp 211599861 12085811>
03:21:01.218638 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 1494:1974(480) ack 1597 win 81 <nop,nop,timestamp 211599862 12085811>
03:21:01.222363 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 1597:2173(576) ack 1974 win 93 <nop,nop,timestamp 12085813 211599862>
03:21:01.224255 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 1974:2006(32) ack 2173 win 93 <nop,nop,timestamp 211599864 12085813>
03:21:01.224521 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 2173:2237(64) ack 2006 win 93 <nop,nop,timestamp 12085814 211599864>
03:21:01.227368 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 2006:2054(48) ack 2237 win 93 <nop,nop,timestamp 211599865 12085814>
03:21:01.227689 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 2237:2493(256) ack 2054 win 93 <nop,nop,timestamp 12085814 211599865>
03:21:01.228913 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 2054:2102(48) ack 2493 win 104 <nop,nop,timestamp 211599865 12085814>
03:21:01.268981 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 2102 win 93 <nop,nop,timestamp 12085825 211599865>
03:21:01.344551 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 2102:2246(144) ack 2493 win 104 <nop,nop,timestamp 211599894 12085825>
03:21:01.344689 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 2246 win 104 <nop,nop,timestamp 12085844 211599894>
03:21:02.037296 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: S 638840154:638840154(0) win 5840 <mss 1460,sackOK,timestamp 211600067 0,nop,wscale 7>
03:21:02.037414 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: S 3130232567:3130232567(0) ack 638840155 win 5792 <mss 1460,sackOK,timestamp 12086017 211600067,nop,wscale 7>
03:21:02.037473 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: . ack 1 win 46 <nop,nop,timestamp 211600067 12086017>
03:21:02.049490 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: P 1:110(109) ack 1 win 46 <nop,nop,timestamp 211600070 12086017>
03:21:02.049626 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: . ack 110 win 46 <nop,nop,timestamp 12086020 211600070>
03:21:02.357928 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 2246:2278(32) ack 2493 win 104 <nop,nop,timestamp 211600147 12085844>
03:21:02.358048 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 2278 win 104 <nop,nop,timestamp 12086097 211600147>
03:21:02.358089 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: P 2278:2374(96) ack 2493 win 104 <nop,nop,timestamp 211600147 12086097>
03:21:02.358178 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 2374 win 104 <nop,nop,timestamp 12086097 211600147>
03:21:02.358316 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: P 2493:2525(32) ack 2374 win 104 <nop,nop,timestamp 12086097 211600147>
03:21:02.358356 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: F 2525:2525(0) ack 2374 win 104 <nop,nop,timestamp 12086097 211600147>
03:21:02.359169 IP TE-DN-001.local.TEST.ssh > 192.168.2.1.43129: F 2374:2374(0) ack 2526 win 104 <nop,nop,timestamp 211600147 12086097>
03:21:02.359254 IP 192.168.2.1.43129 > TE-DN-001.local.TEST.ssh: . ack 2375 win 104 <nop,nop,timestamp 12086097 211600147>
03:22:03.064540 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: P 110:214(104) ack 1 win 46 <nop,nop,timestamp 211615323 12086020>
03:22:03.064664 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: . ack 214 win 46 <nop,nop,timestamp 12101272 211615323>
03:22:08.065775 arp who-has TE-DN-001.local.TEST tell 192.168.2.1
03:22:08.065791 arp reply TE-DN-001.local.TEST is-at 00:18:37:02:74:76 (oui Unknown)
03:22:54.349567 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: P 1:20(19) ack 214 win 46 <nop,nop,timestamp 12114091 211615323>
03:22:54.349624 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: . ack 20 win 46 <nop,nop,timestamp 211628143 12114091>
03:22:54.349708 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: P 20:39(19) ack 214 win 46 <nop,nop,timestamp 12114091 211628143>
03:22:54.349718 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: . ack 39 win 46 <nop,nop,timestamp 211628143 12114091>
03:22:54.385237 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: P 214:242(28) ack 39 win 46 <nop,nop,timestamp 211628152 12114091>
03:22:54.385342 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: . ack 242 win 46 <nop,nop,timestamp 12114100 211628152>
03:22:54.391417 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: P 39:146(107) ack 242 win 46 <nop,nop,timestamp 12114101 211628152>
03:22:54.433048 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: . ack 146 win 46 <nop,nop,timestamp 211628164 12114101>
03:22:55.525390 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: F 242:242(0) ack 146 win 46 <nop,nop,timestamp 211628437 12114101>
03:22:55.525719 IP 192.168.2.1.cslistener > TE-DN-001.local.TEST.28260: F 146:146(0) ack 243 win 46 <nop,nop,timestamp 12114385 211628437>
03:22:55.525746 IP TE-DN-001.local.TEST.28260 > 192.168.2.1.cslistener: . ack 147 win 46 <nop,nop,timestamp 211628437 12114385>
------------------------------------------------------------------------------------------------------------------------->

On 9/13/07, Raghu Angadi <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Datanode should be able to connect to Namenode for any progress on
> upgrade. Do you see any other errors reported in datanode log? You need
> to fix the connection problem first.
>
> Are you comfortable taking tcpdump for Namenode port on the client? I
> think client should be trying to reconnect.
>
> Note that it is safe to restart the cluster or just the datanodes before
> the upgrade completes.
>
> Raghu.
> Open Study wrote:
> > Also, I checked the namenode log and found the following exception:
> >
> > 2007-09-13 02:17:25,324 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000: starting
> > 2007-09-13 02:17:25,324 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000: starting
> > 2007-09-13 02:17:25,324 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9000: starting
> > 2007-09-13 02:17:25,325 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000: starting
> > 2007-09-13 02:17:25,400 INFO org.apache.hadoop.dfs.BlockCrcUpgradeNamenode: Block CRC Upgrade is still running.
> > Avg completion of all Datanodes: 0.00% with 0 errors.
> > 2007-09-13 02:17:25,406 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9000, call getProtocolVersion(org.apache.hadoop.dfs.ClientProtocol, 14) from 192.168.2.1:53211: output error
> > java.nio.channels.ClosedChannelException
> >     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
> >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
> >     at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
> >     at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
> >     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> >     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:585)
> > 2007-09-13 02:18:24,921 INFO org.apache.hadoop.dfs.BlockCrcUpgradeNamenode: Block CRC Upgrade is still running.
> > Avg completion of all Datanodes: 0.00% with 0 errors.
> >
> > It seems something was going wrong on the datanode side. However, the log of one
> > of the datanodes shows that it was started, and it is still running according to
> > the process list, but it somehow lost its connection with the namenode.
> >
> > ************************************************************/
> > 2007-09-12 22:23:35,319 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> > /************************************************************
> > STARTUP_MSG: Starting DataNode
> > STARTUP_MSG:   host = TE-DN-002/192.168.2.102
> > STARTUP_MSG:   args = []
> > ************************************************************/
> > 2007-09-12 22:23:35,533 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
> > 2007-09-12 22:24:35,619 INFO org.apache.hadoop.ipc.RPC: Problem connecting to server: /192.168.2.1:9000
> > 2007-09-12 22:25:34,878 INFO org.apache.hadoop.dfs.Storage: Recovering storage directory /home/textd/data/fs/data from previous upgrade.
> > 2007-09-12 22:25:49,586 INFO org.apache.hadoop.dfs.DataNode: Distributed upgrade for DataNode version -6 to current LV -7 is initialized.
> > 2007-09-12 22:25:49,586 INFO org.apache.hadoop.dfs.Storage: Upgrading storage directory /home/textd/data/fs/data.
> >    old LV = -4; old CTime = 0.
> >    new LV = -7; new CTime = 1189616555276
> >
> > The hardware configuration was:
> > Namenode: P4D, 3G RAM
> > 3 Datanodes: AMD 64 4000x2, 1G RAM
> > They worked with Hadoop 0.13.1.
> >
> > Any idea or suggestion?
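Raghu's first point is that the datanode-to-namenode connection has to work before the upgrade can progress. In the trace, "cslistener" is tcpdump's service name for port 9000, the namenode IPC port that also appears in both logs. A minimal reachability check, sketched in Java under the assumption that it runs on a datanode and targets the namenode address and port from this thread, only tests whether a TCP connection to that port can be opened within a timeout. It does not speak Hadoop's RPC protocol, so a successful result does not rule out the ClosedChannelException seen in the namenode log. The class name NamenodeProbe is hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class NamenodeProbe {
    // Returns true if a TCP connection to host:port opens within timeoutMs.
    // Only the listening socket is checked; no Hadoop IPC handshake is attempted.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions taken from this thread: namenode 192.168.2.1, port 9000.
        String host = args.length > 0 ? args[0] : "192.168.2.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9000;
        boolean ok = canConnect(host, port, 5000);
        System.out.println(host + ":" + port + (ok ? " reachable" : " NOT reachable"));
    }
}
```

Running it on each datanode (for example "java NamenodeProbe 192.168.2.1 9000") separates plain network or firewall problems from failures inside the RPC exchange itself.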