Hey Bryan,

Any chance you can get a tshark trace on the 0.19 namenode? Maybe:

  tshark -s 100000 -w nndump.pcap port 7276
Also, are the clocks synced on the two machines? The failure of your distcp is at 23:32:39, but the namenode log message you posted was at 23:29:09. Did those messages actually pop out at the same time?

Thanks
-Todd

On Wed, Apr 8, 2009 at 11:39 PM, Bryan Duxbury <[email protected]> wrote:
> Hey all,
>
> I was trying to copy some data from our cluster on 0.19.2 to a new cluster
> on 0.18.3 by using distcp and the hftp:// filesystem. Everything seemed to
> be going fine for a few hours, but then a few tasks failed because a few
> files got 500 errors when being read from the 0.19 cluster. As a result,
> the job died. Now that I'm trying to restart it, I get this error:
>
> [rapl...@ds-nn2 ~]$ hadoop distcp hftp://ds-nn1:7276/ hdfs://ds-nn2:7276/cluster-a
> 09/04/08 23:32:39 INFO tools.DistCp: srcPaths=[hftp://ds-nn1:7276/]
> 09/04/08 23:32:39 INFO tools.DistCp: destPath=hdfs://ds-nn2:7276/cluster-a
> With failures, global counters are inaccurate; consider running with -i
> Copy failed: java.net.SocketException: Unexpected end of file from server
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:766)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1000)
>         at org.apache.hadoop.dfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:183)
>         at org.apache.hadoop.dfs.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:193)
>         at org.apache.hadoop.dfs.HftpFileSystem.getFileStatus(HftpFileSystem.java:222)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>         at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:588)
>         at org.apache.hadoop.tools.DistCp.copy(DistCp.java:609)
>         at org.apache.hadoop.tools.DistCp.run(DistCp.java:768)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>         at org.apache.hadoop.tools.DistCp.main(DistCp.java:788)
>
> I changed nothing at all between the first attempt and the subsequent
> failed attempts. The only clue in the namenode log for the 0.19 cluster is:
>
> 2009-04-08 23:29:09,786 WARN org.apache.hadoop.ipc.Server: Incorrect header
> or version mismatch from 10.100.50.252:47733 got version 47 expected
> version 2
>
> Anyone have any ideas?
>
> -Bryan
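[Editor's aside] One hedged reading of that "got version 47" warning, not confirmed by the thread: 47 is the ASCII code for '/', so the namenode's IPC server may be receiving a plain HTTP request (such as the hftp client's GET) on its RPC port 7276 rather than on the namenode's HTTP port. A minimal Python sketch of that interpretation; the framing assumed here (a 4-byte magic followed by a 1-byte protocol version, with 2 expected) is an assumption about the 0.19 IPC server, not something stated in the thread:

```python
# Sketch: why an HTTP request hitting a Hadoop IPC port could be logged
# as "got version 47 expected version 2". Assumes the server reads a
# 4-byte magic header, then a single version byte (an assumption about
# the 0.19 wire format, made for illustration only).
EXPECTED_VERSION = 2

def ipc_version_of(first_bytes: bytes) -> int:
    """Return the byte such a server would interpret as the IPC version."""
    # Bytes 0-3 are consumed as the magic header; byte 4 is the version.
    return first_bytes[4]

# A hypothetical hftp-style request line arriving on the RPC port:
http_request = b"GET /listPaths/ HTTP/1.1\r\n"

seen = ipc_version_of(http_request)
print(seen)                       # 47 -- byte 4 of "GET /..." is '/'
print(seen == ord("/"))           # True
print(seen == EXPECTED_VERSION)   # False -> the version-mismatch warning
```

If this reading is right, it would suggest hftp:// should point at the namenode's HTTP address rather than at the RPC port, though the trace Todd asked for would settle it.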
