All of the data nodes are there via bin/hadoop dfs -report. One
interesting thing: I shut down via stop-all.sh again, restarted via
start-all.sh, and everything seems to be working. I reran fsck and
everything is now reporting healthy. I have not tried another fetch yet,
but a generate was successful, as were an updatedb and a readdb. I am
seeing a lot of the errors below in the log, but I think those are fixed
by some of the recent patches.
2006-06-07 17:44:19,916 INFO org.apache.hadoop.dfs.DataNode: Lost connection to namenode. Retrying...
2006-06-07 17:44:24,920 INFO org.apache.hadoop.dfs.DataNode: Exception: java.lang.IllegalThreadStateException
2006-06-07 17:44:24,921 INFO org.apache.hadoop.dfs.DataNode: Lost connection to namenode. Retrying...
2006-06-07 17:44:29,925 INFO org.apache.hadoop.dfs.DataNode: Exception: java.lang.IllegalThreadStateException
2006-06-07 17:44:29,925 INFO org.apache.hadoop.dfs.DataNode: Lost connection to namenode. Retrying...
2006-06-07 17:44:34,929 INFO org.apache.hadoop.dfs.DataNode: Exception: java.lang.IllegalThreadStateException
2006-06-07 17:44:34,929 INFO org.apache.hadoop.dfs.DataNode: Lost connection to namenode. Retrying...
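For reference, the restart-and-verify sequence above was just the following (the fsck invocation is from memory and may differ slightly in your build):

  bin/stop-all.sh          # stop namenode, datanodes, jobtracker, tasktrackers
  bin/start-all.sh         # bring the whole cluster back up
  bin/hadoop dfs -report   # confirm every data node has reconnected
  bin/hadoop fsck /        # re-check filesystem health (exact fsck syntax may vary by version)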
Dennis
Konstantin Shvachko wrote:
That might be the same problem.
Related changes to hadoop were committed just an hour before your
initial email, so they are probably not in nutch yet.
Although "exactly one block missing in each file" looks suspicious.
Try
bin/hadoop dfs -report
to see how many data nodes you have now.
If all of them are reported then this is different.
--Konstantin
Dennis Kubes wrote:
Another interesting thing is that every single file is corrupt and
missing exactly one block.
Dennis Kubes wrote:
I don't know if this is the same problem or not, but here is what I
am experiencing.
I have an 11-node cluster and deployed a fresh nutch install with 3.1.
Startup completed fine. Filesystem healthy. Performed 1st inject,
generate, fetch for 1000 urls. Filesystem intact. Performed 2nd
inject, generate, fetch for 1000 urls. Filesystem healthy. Merged
crawldbs. Filesystem healthy. Merged segments. Filesystem
healthy. Inverted links. Healthy. Indexed. Healthy. Performed
searches. Healthy. Now here is where it gets interesting. Shut down
all servers via stop-all.sh. Started all servers via start-all.sh.
Filesystem reports healthy. Performed inject and generate of 1000
urls. Filesystem reports healthy. Performed fetch of the new
segments and got the errors below and a fully corrupted filesystem
(both new segments and old data).
java.io.IOException: Could not obtain block: blk_6625125900957460239 file=/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006 offset=0
   at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:529)
   at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:638)
   at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:84)
   at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:159)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
   at java.io.DataInputStream.readFully(DataInputStream.java:176)
   at java.io.DataInputStream.readFully(DataInputStream.java:152)
   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:263)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:247)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:237)
   at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:36)
   at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:53)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:105)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:847)
Hope this helps in tracking down the problem if it is even the same
problem.
Dennis
Konstantin Shvachko wrote:
Thanks Stefan.
I spent some time investigating the problem.
There are actually 3 of them.
1) At startup, data nodes now register with the name node. If
registration fails because the name node is busy at the moment, which
could easily be the case if it is loading a two-week-long log, then
the data node just fails and won't start at all.
See HADOOP-282.
2) When the cluster is running and the name node gets busy, and the
data node as a result fails to connect to it, the data node falls
into an infinite loop doing nothing but throwing an exception. So as
far as the name node is concerned, that data node is dead, since it
is not sending any heartbeats.
See HADOOP-285.
3) People say they have seen loss of recent data while the old data
is still present, and this happens when the cluster is brought down
(for the upgrade) and restarted.
We know from HADOOP-227 that the edits log accumulates as long as the
cluster is running, so if it was up for 2 weeks then the edits file
is most probably huge. If it is corrupted, then the data is lost.
I could not reproduce that; I just don't have any 2-week-old edits
files yet.
I thoroughly examined one cluster and found missing blocks on the
nodes that pretended to be up, as in (2) above. I didn't see any data
loss at all. I think large edits files should be investigated further.
There are patches fixing HADOOP-282 and HADOOP-285. We do not have a
patch for HADOOP-227 yet, so people need to restart the name node
(just the name node) depending on the activity on the cluster, namely
on the size of the edits file.
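A rough sketch of that workaround (assuming the hadoop-daemon.sh helper that start-all.sh/stop-all.sh already use, and an edits file under your configured dfs.name.dir; the path below is a placeholder):

  # see how large the edits log has grown (location depends on dfs.name.dir in your config)
  ls -lh /path/to/dfs/name/edits

  # if it is getting big, bounce only the name node so it folds the edits back into the image
  bin/hadoop-daemon.sh stop namenode
  bin/hadoop-daemon.sh start namenode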
Stefan Groschupf wrote:
Hi Konstantin,
Could you give some more information about what happened to you?
- what is your cluster size
9 datanodes, 1 namenode.
- amount of data
Total raw bytes: 6023680622592 (5609.98 Gb)
Used raw bytes: 2357053984804 (2195.17 Gb)
- how long did dfs run without restarting the name node before
upgrading
I would say 2 weeks.
I would love to figure out what my problem was today. :)
We discussed the three kinds of data loss: hardware, software, or
human errors.
Looks like you are not alone :-(
Too bad that the others didn't report it earlier. :)
Everything was happening at the same time.
+ updated from hadoop .2.1 to .4.
+ problems getting all datanodes started
What was the problem with the datanodes?
Scenario:
I don't think there was a real problem. I noticed that the
datanodes were not able to connect to the namenode.
Later on I just added a "sleep 5" into the dfs starting script
after starting the name node, and that solved the problem.
That is right, we did the same.
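For reference, the change amounts to roughly this in the start script (a sketch only, assuming the stock start-all.sh that starts the namenode and then the datanodes via the hadoop-daemon helpers):

  "$bin"/hadoop-daemon.sh start namenode
  sleep 5   # give the namenode a moment to come up before the datanodes try to register
  "$bin"/hadoop-daemons.sh start datanode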
However, at the time I updated I noticed that problem, thought
"ok, not working yet, let's wait another week", and downgraded.
+ downgraded to hadoop .3.1
+ error message about an incompatible dfs (I guess the upgraded
version had already started to write to the log)
What is the message?
Sorry, I cannot find the exception in the logs anymore. :-(
Something like "version conflict -1 vs -2" :-o Sorry, I don't
remember exactly.
Yes. You are running the old version (-1) code that would not
accept the "future" version (-2) images.
The image was converted to v. -2 when you tried to run the upgraded
hadoop.
Regards,
Konstantin