[jira] Commented: (HADOOP-124) Files still rotting in DFS of latest Hadoop

Bryan Pendleton (JIRA) Fri, 07 Apr 2006 11:30:45 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-124?page=comments#action_12373677 ]


Bryan Pendleton commented on HADOOP-124:
----------------------------------------

I have certainly seen some weirdness with my cluster where a stop-all seemed to 
think it had succeeded but there were still datanode and tasktracker instances 
running. Locking at the directory level seems like a good hedge against that 
sort of problem. As a bonus - it'll prevent blocks from getting removed if 
someone mistakenly starts up a second set of datanodes against a different 
namenode, as long as the datanode daemon for the competing DFS instance is 
running on the machine.
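
To make the directory-level locking idea concrete, here is a minimal sketch using the standard java.nio file-locking API. It is not the actual datanode code: the class name StorageDirLock and the lock file name "in_use.lock" are assumptions made up for this illustration. A second daemon started against the same data directory would fail to get the lock and could refuse to start.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: holds an exclusive lock on a datanode storage directory. */
public class StorageDirLock implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    private StorageDirLock(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    /**
     * Tries to lock dataDir. Returns null if some other holder (another
     * process, or another caller in this JVM) already owns the lock.
     */
    public static StorageDirLock tryAcquire(Path dataDir) throws IOException {
        // "in_use.lock" is an assumed file name, chosen only for this sketch.
        Path lockFile = dataDir.resolve("in_use.lock");
        FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock l;
        try {
            l = ch.tryLock(); // null when another process holds the lock
        } catch (OverlappingFileLockException e) {
            l = null;         // already locked from within this JVM
        }
        if (l == null) {
            // Caveat: on some platforms, closing any channel on a file drops
            // all of the JVM's locks on it, so a real daemon should make
            // exactly one attempt per process.
            ch.close();
            return null;
        }
        return new StorageDirLock(ch, l);
    }

    /** Releases the lock so that a later daemon can take over the directory. */
    @Override
    public void close() throws IOException {
        lock.release();
        channel.close();
    }
}
```

A daemon would call StorageDirLock.tryAcquire(dir) at startup, exit with an error if it gets null, and hold the returned object until shutdown. Since the operating system drops file locks when a process exits, a crashed datanode should not leave the directory permanently locked.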

Keeping track of only one copy of a block for a given IP seems like a good 
start, but might be too simple. What if, by some unfortunate process, a lot of 
duplicates get stored on a node? If there's no way to detect this, a node could 
end up holding many redundant copies of a block. One way out of that would be 
to notice when a host reports more than one copy of a block, and kick off a 
knee-jerk fix: 1) extra-replicate the block, 2) ask the node with dups to 
remove its copies of the block. Of course, in the case where two datanode 
instances are somehow serving from the same directory, the knee-jerk reaction 
would kick off for every block that gets stored into that space, so locking 
would definitely be necessary to make this work.
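
The detection step could be sketched roughly as follows. This is not the namenode's real block accounting: the Replica type, the datanodeId field and both method names are hypothetical, and the sketch assumes the namenode can see which host and which datanode instance reported each block. It counts at most one replica per host and flags the blocks that more than one instance on the same host reports.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of per-host duplicate detection over block reports. */
public class DuplicateReplicaDetector {

    /** One reported replica: a datanode instance on a host holding a block. */
    public static final class Replica {
        final String host;
        final String datanodeId; // assumed unique per datanode process
        final long blockId;

        public Replica(String host, String datanodeId, long blockId) {
            this.host = host;
            this.datanodeId = datanodeId;
            this.blockId = blockId;
        }
    }

    /** Per host, the blocks reported by more than one datanode instance. */
    public static Map<String, Set<Long>> findDuplicates(List<Replica> reports) {
        // host -> block id -> datanode instances reporting that block
        Map<String, Map<Long, Set<String>>> seen = new TreeMap<>();
        for (Replica r : reports) {
            seen.computeIfAbsent(r.host, h -> new TreeMap<>())
                .computeIfAbsent(r.blockId, b -> new TreeSet<>())
                .add(r.datanodeId);
        }
        Map<String, Set<Long>> dups = new TreeMap<>();
        seen.forEach((host, blocks) -> blocks.forEach((blockId, holders) -> {
            if (holders.size() > 1) {
                dups.computeIfAbsent(host, h -> new TreeSet<>()).add(blockId);
            }
        }));
        return dups;
    }

    /** Replicas that count toward the replication target: one per host. */
    public static int effectiveReplicas(List<Replica> reports, long blockId) {
        Set<String> hosts = new HashSet<>();
        for (Replica r : reports) {
            if (r.blockId == blockId) {
                hosts.add(r.host);
            }
        }
        return hosts.size();
    }
}
```

Each block returned by findDuplicates would be a candidate for the knee-jerk fix: re-replicate it elsewhere first, and only then ask the offending host to drop its extra copies. Using effectiveReplicas for the replication count should keep the namenode from treating two instances on one host as two independent replicas.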

> Files still rotting in DFS of latest Hadoop
> -------------------------------------------
>
>          Key: HADOOP-124
>          URL: http://issues.apache.org/jira/browse/HADOOP-124
>      Project: Hadoop
>         Type: Bug

>   Components: dfs
>  Environment: ~30 node cluster
>     Reporter: Bryan Pendleton

>
> DFS files are still rotting.
> I suspect that there's a problem with block accounting/detecting identical 
> hosts in the namenode. I have 30 physical nodes, with various numbers of 
> local disks, meaning that my current 'bin/hadoop dfs -report' shows 80 nodes 
> after a full restart. However, when I discovered the problem (which resulted 
> in losing about 500 GB worth of temporary data because of missing blocks in 
> some of the larger chunks) -report showed 96 nodes. I suspect somehow there 
> were extra datanodes running against the same paths, and that the namenode 
> was counting those as replicated instances, which then showed up 
> over-replicated, and one of them was told to delete its local block, leading 
> to the block actually getting lost.
> I will debug it more the next time the situation arises. This is at least the 
> 5th time I've had a large amount of file data "rot" in DFS since January.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
