Hi all,

We had a test crawl, using Nutch 0.8 pre-Hadoop. This consisted of a master server and 10 slaves, where NameNode/JobTracker ran on the master, and DataNodes/TaskTrackers ran on each of the slaves.

As a test, we removed two of the slaves from the cluster. We did this one at a time: shutting down Nutch, pulling out a slave, starting Nutch, waiting for all block replication to finish, and then repeating the same process once more.

After we did this, we seemed to have a valid cluster - running mapreduce tasks worked, etc.

Then we moved the servers to a new colo facility, and also changed the network setup such that each server got a new name.

Sometime after that, we tried running another mapreduce task, and it failed with "missing block" errors.

Using Andrzej's DFS checker tool, we determined that 20% of the blocks were missing.

We also saw various log entries that looked like:

060321 125017 Obsoleting block blk_955860084498163084
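
To see how many distinct blocks were obsoleted and when, entries like the one above can be pulled out of the logs with a short script. This is a minimal sketch, assuming every entry has exactly the format of the single sample shown here (date, time, message, block name); the function name and the allowance for a negative block id are assumptions, not something taken from the NDFS code.

```python
import re

# Assumed line format, based only on the one sample entry:
#   <yymmdd> <hhmmss> Obsoleting block blk_<id>
# The optional minus sign on the id is an assumption.
OBSOLETE_RE = re.compile(r"^(\d{6}) (\d{6}) Obsoleting block (blk_-?\d+)$")


def obsoleted_blocks(lines):
    """Map each obsoleted block name to the (date, time) of its first log entry."""
    found = {}
    for line in lines:
        match = OBSOLETE_RE.match(line.strip())
        if match:
            date, time, block = match.groups()
            # Keep only the earliest entry seen for each block.
            found.setdefault(block, (date, time))
    return found


sample = [
    "060321 125017 Obsoleting block blk_955860084498163084",
    "060321 125018 some unrelated log line",
]
print(obsoleted_blocks(sample))
```

Comparing the resulting block names against the list of missing blocks would show whether the obsoleted blocks and the missing blocks are the same set.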

One additional note - we'd been using a minimal (2x) replication factor, due to disk space constraints.

So first question is whether anybody has run into a similar problem.

Second, it doesn't seem like the Nutch (remember, this is pre-Hadoop) DFS code saves the server name or IP address with blocks, but I haven't dug into this. If it does, that would obviously cause problems for us.

Third, it's still suspicious that we removed 20% of the DataNode servers, and now we're missing 20% of the blocks. Makes me wonder if the auto-replicated blocks (the copies created when we took the two servers out of the cluster) wound up in a state where they were flagged as unused when the network changed.
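
As a rough sanity check on why 20% looks suspicious, here is a sketch of how many blocks would be lost from node removal alone. It assumes, without having verified this against the NDFS code, that with 2x replication each block's two replicas sit on a uniformly random pair of distinct DataNodes, and it takes the worst case where both slaves are pulled at once with no re-replication in between. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def fraction_lost(num_nodes, removed, replication=2):
    """Fraction of blocks whose replicas all sit on the removed nodes.

    Assumes every set of `replication` distinct nodes is an equally
    likely placement for a block (an assumption, not NDFS behavior).
    """
    placements = list(combinations(range(num_nodes), replication))
    removed_nodes = set(range(removed))
    lost = sum(1 for nodes in placements if set(nodes) <= removed_nodes)
    return Fraction(lost, len(placements))


print(fraction_lost(10, 2))  # 1/45
print(fraction_lost(10, 1))  # 0
```

Under these assumptions, losing two of ten nodes at once would cost about 2% of the blocks, and losing them one at a time with full re-replication would cost none, so a 20% loss does not look like simple replica loss.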

Anyway, curious if anybody has insights here. We've done a fair amount of poking around, to no avail. I don't think there's any way to get the blocks back, as they definitely seem to be gone, and file recovery on Linux seems pretty iffy. I'm mostly interested in figuring out if this is a known issue ("Of course you can't change the server names and expect it to work"), or whether it's a symptom of lurking NDFS bugs.

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
