Re: "lost" NDFS blocks following network reorg

Stefan Groschupf Sun, 26 Mar 2006 16:49:18 -0800

Hi hadoop developers,

I moved this discussion to the Hadoop developer list since it is probably more relevant to this problem than the Nutch users mailing list.


I spent some time reading the code and found some interesting things.

The local name of the data node is machineName + ":" + tmpPort, so it can change if the port is blocked or the machine name changes. Maybe we should create the datanode name only once and write it to the data folder, to be able to read it later on. (?)
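A minimal sketch of this idea, assuming a small file in the data folder holds the name. The class PersistentNodeName, the method loadOrCreate, and the file name "datanode.name" are hypothetical and do not exist in the current code; the sketch uses plain java.nio calls rather than any NDFS API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the NDFS code base.
class PersistentNodeName {
    // Assumed file name inside the data folder.
    static final String NAME_FILE = "datanode.name";

    /**
     * Returns the name stored in dataDir if one exists. Otherwise derives
     * machineName + ":" + port, stores it in dataDir, and returns it.
     */
    static String loadOrCreate(Path dataDir, String machineName, int port) throws IOException {
        Path nameFile = dataDir.resolve(NAME_FILE);
        if (Files.exists(nameFile)) {
            // Reuse the stored identity, even if host name or port changed since.
            return new String(Files.readAllBytes(nameFile), StandardCharsets.UTF_8).trim();
        }
        String name = machineName + ":" + port;
        Files.createDirectories(dataDir);
        Files.write(nameFile, name.getBytes(StandardCharsets.UTF_8));
        return name;
    }
}
```

With this, a second start on the same data folder would return the first name even when machineName or the port differ, so the name node would keep seeing one datanode.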

This local name is used to send block reports to the name node. FSNamesystem#processReport(Block newReport[], UTF8 dataNodeLocalName) processes this report. In the first line of this method the DatanodeInfo is loaded by the dataNode's localName. The datanode is already in this map, since a heartbeat is sent before a block report.

So:

DatanodeInfo node = (DatanodeInfo) datanodeMap.get(name); // no problem, but just an 'empty' container

...

Block oldReport[] = node.getBlocks(); // will return null since no Blocks are yet associated with this node.

Since oldReport is null, all code is skipped until line 901. But this only adds the blocks to the node container.
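To illustrate the lookup by local name, here is a toy model, not the real FSNamesystem code. ToyDatanodeMap and its methods are made up for this sketch; it only shows that a name-keyed map hands back an empty container (and a null old report) whenever the same machine shows up under a new name.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the name node's datanode map; all names are hypothetical.
class ToyDatanodeMap {
    // local name -> block ids last reported under that name (null until a report arrives)
    private final Map<String, long[]> blocksByName = new HashMap<>();

    // A heartbeat registers the name with an empty container.
    void heartbeat(String localName) {
        if (!blocksByName.containsKey(localName)) {
            blocksByName.put(localName, null);
        }
    }

    // Stores the new report and returns the previous one, or null if there was none.
    long[] report(String localName, long[] blocks) {
        return blocksByName.put(localName, blocks);
    }

    // Number of distinct names seen so far.
    int knownNodes() {
        return blocksByName.size();
    }
}
```

In this model, a machine that reported as "hostA:50010" and later sends a heartbeat as "hostB:50010" counts as two nodes, and its first report under the new name has no old report to compare against.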

In line 924 begins a section of code that collects all obsolete blocks. First of all, I am wondering why we iterate through all blocks here; this could be expensive, and it would be enough to iterate over all blocks that are reported by this datanode, wouldn't it?

Whether a block is still valid is tested by FSDirectory#isValidBlock, which checks if the block is in activeBlocks.

The problem I see now is that the only method that adds Blocks to activeBlocks is unprotectedAddFile(UTF8 name, Block blocks[]). But here, too, the name node local name, which may have changed, is involved. This method is also used to load the state of a stopped or crashed name node. So in case you stop the dfs and change host names, a set of blocks will be marked as obsolete and deleted.
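A rough sketch of the cheaper check, assuming block ids are plain longs and activeBlocks is a simple set. ReportedBlockCheck and obsoleteAmongReported are hypothetical names, and the real isValidBlock may do more than a set lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not the actual processReport code.
class ReportedBlockCheck {
    /**
     * Walks only the blocks in this datanode's report and returns those
     * that are not in activeBlocks, i.e. the ones that would be obsolete.
     */
    static List<Long> obsoleteAmongReported(long[] reported, Set<Long> activeBlocks) {
        List<Long> obsolete = new ArrayList<>();
        for (long id : reported) {
            if (!activeBlocks.contains(id)) {
                obsolete.add(id);
            }
        }
        return obsolete;
    }
}
```

The cost is then proportional to the size of one report instead of the number of all blocks. It also shows the risk: if activeBlocks is incomplete after a restart, every reported block that is missing from it ends up in the obsolete list.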

Writing a test case for this behavior is very difficult since it involves a change of the machine name.

Does my observation make sense, or have I overlooked a detail, and the problem Ken describes is caused by another problem? In any case, I suggest making the data node name persistent, so that in case the port or host name changes, the name node will not handle the same datanode as a new one.


Stefan





On 26.03.2006 at 23:11, Doug Cutting wrote:

Ken Krugler wrote:
Anyway, curious if anybody has insights here. We've done a fair amount of poking around, to no avail. I don't think there's any way to get the blocks back, as they definitely seem to be gone, and file recovery on Linux seems pretty iffy. I'm mostly interested in figuring out if this is a known issue ("Of course you can't change the server names and expect it to work"), or whether it's a symptom of lurking NDFS bugs.
It's hard to tell, after the fact, whether stuff like this is pilot error or a bug. Others have reported similar things, so it's either a bug or it's too easy to make pilot errors. So something needs to change. But what?
We need to start testing stuff like this systematically. A reproducible test case would make this much easier to diagnose.
I'm sorry I can't be more helpful.  I'm sorry you lost data.

Doug


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com
