On Wed, Dec 9, 2009 at 4:15 PM, Nick Bailey <ni...@mailtrust.com> wrote:
> That seems to make sense, our collection process for putting data in hadoop sees a 'Hadoop exited with: 65280' error fairly regularly where it will fail to put a file and queue it to be retried.
>

Yep, that sounds like it could be it. I just filed HDFS-821: https://issues.apache.org/jira/browse/HDFS-821

-Todd

> -----Original Message-----
> From: "Todd Lipcon" <t...@cloudera.com>
> Sent: Wednesday, December 9, 2009 7:02pm
> To: common-user@hadoop.apache.org
> Cc: core-u...@hadoop.apache.org
> Subject: Re: Hadoop dfs usage and actual size discrepancy
>
> Hi Nick,
>
> My guess is that the tmp/ directories of the DNs were rather full. I've occasionally seen this on clusters where writes have been failing.
>
> There should be some kind of thread which garbage collects partial blocks from the DN's tmp dirs, but it's not implemented, as far as I know. This comment is in FSDataset.java:
>
> // REMIND - mjc - eventually we should have a timeout system
> // in place to clean up block files left by abandoned clients.
> // We should have some timer in place, so that if a blockfile
> // is created but non-valid, and has been idle for >48 hours,
> // we can GC it safely.
>
> This comment is from April 2007 ;-)
>
> I'll file a JIRA to consider implementing this.
>
> Thanks
> -Todd
>
> On Wed, Dec 9, 2009 at 3:57 PM, Nick Bailey <ni...@mailtrust.com> wrote:
>
> > Actually looks like restarting has helped. DFS used has gone down to 43TB from 50TB and appears to still be going down.
> >
> > Don't know what was wrong with the DataNode process. Possibly a Cloudera problem. Thanks for the help Brian.
> >
> > -Nick
> >
> > -----Original Message-----
> > From: "Nick Bailey" <ni...@mailtrust.com>
> > Sent: Wednesday, December 9, 2009 5:55pm
> > To: common-user@hadoop.apache.org
> > Cc: common-user@hadoop.apache.org, core-u...@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > One interesting thing is the output of the command to restart the datanode.
> >
> > $ sudo service hadoop-datanode restart
> > Stopping Hadoop datanode daemon (hadoop-datanode): no datanode to stop
> > [ OK ]
> > Starting Hadoop datanode daemon (hadoop-datanode): starting datanode, logging to /log/location
> > [ OK ]
> >
> > Notice when stopping the datanode it says 'no datanode to stop'. It says this even though the datanode is definitely running. Also there is only 1 datanode process, and it isn't getting stopped by this command, so basically I didn't actually restart anything. I checked and at least a few of the other nodes are also exhibiting this behavior.
> >
> > I don't know if it's related, but after killing the process and actually restarting the datanode, it still doesn't appear to be clearing out any extra data. I'll manually restart the datanodes by killing processes for now and see if maybe that helps.
> >
> > -Nick
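The 'no datanode to stop' message above usually means the daemon script found no pid file, or the pid it recorded no longer matches a live process, so the stop step is silently skipped. A quick check along those lines, as a sketch only: the pid file path below is an assumption (Cloudera's packaging of that era typically keeps it under /var/run/hadoop; check HADOOP_PID_DIR on your nodes), and jps ships with the JDK.

$ jps | grep DataNode                                # pid of the DataNode JVM that is actually running
$ cat /var/run/hadoop/hadoop-hadoop-datanode.pid     # pid the daemon script recorded (path is a guess); a missing file or a mismatch explains 'no datanode to stop'

If the recorded pid is missing or stale, 'restart' never kills the old DataNode, which would match the behavior described above.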
> > -----Original Message-----
> > From: "Nick Bailey" <ni...@mailtrust.com>
> > Sent: Wednesday, December 9, 2009 5:44pm
> > To: common-user@hadoop.apache.org
> > Cc: common-user@hadoop.apache.org, core-u...@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > Well for that specific machine, du pretty much matches the report. Not all of our nodes are at 4.11TB; that one is actually overloaded and we are running the balancer currently to fix it.
> >
> > Restarting the datanode on that machine didn't seem to clear out any data. I'll probably go ahead and restart all the datanodes, but I'm not hopeful that it will clear out all the data.
> >
> > Thanks for helping out though. Any other ideas out there?
> >
> > -Nick
> >
> > -----Original Message-----
> > From: "Brian Bockelman" <bbock...@cse.unl.edu>
> > Sent: Wednesday, December 9, 2009 4:57pm
> > To: common-user@hadoop.apache.org
> > Cc: core-u...@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > Hey Nick,
> >
> > Non-DFS Used must be something new in 19.x, I guess.
> >
> > What happens if you do "du -hs" on the datanode directory? Are they all approximately 4.11TB? What happens after you restart a datanode? Does it clean out a bunch of data?
> >
> > Never seen this locally, and we beat the bejesus out of our cluster...
> >
> > Brian
> >
> > On Dec 9, 2009, at 10:54 PM, Nick Bailey wrote:
> >
> > > Brian,
> > >
> > > Hadoop version 18.3. More specifically, Cloudera's version. Our dfsadmin -report doesn't contain any lines with "Non DFS Used", so that grep won't work. Here is an example of the report for one of the nodes:
> > >
> > > Name: XXXXXXXXXXXXX
> > > State : In Service
> > > Total raw bytes: 4919829360640 (4.47 TB)
> > > Remaining raw bytes: 108009550121 (100.59 GB)
> > > Used raw bytes: 4520811248473 (4.11 TB)
> > > % used: 91.89%
> > > Last contact: Wed Dec 09 16:50:10 EST 2009
> > >
> > > Besides what I already posted, the rest of the report is just a repeat of that for every node.
> > >
> > > Nick
> > >
> > > -----Original Message-----
> > > From: "Brian Bockelman" <bbock...@cse.unl.edu>
> > > Sent: Wednesday, December 9, 2009 4:48pm
> > > To: common-user@hadoop.apache.org
> > > Cc: core-u...@hadoop.apache.org
> > > Subject: Re: Hadoop dfs usage and actual size discrepancy
> > >
> > > Hey Nick,
> > >
> > > What's the output of this:
> > >
> > > hadoop dfsadmin -report | grep "Non DFS Used" | grep -v "0 KB" | awk '{sum += $4} END {print sum}'
> > >
> > > What version of Hadoop is this?
> > >
> > > Brian
> > >
> > > On Dec 9, 2009, at 10:25 PM, Nick Bailey wrote:
> > >
> > >> Output from bottom of fsck report:
> > >>
> > >> Total size: 8711239576255 B (Total open files size: 3571494 B)
> > >> Total dirs: 391731
> > >> Total files: 2612976 (Files currently being written: 3)
> > >> Total blocks (validated): 2274747 (avg. block size 3829542 B) (Total open file blocks (not validated): 1)
> > >> Minimally replicated blocks: 2274747 (100.0 %)
> > >> Over-replicated blocks: 75491 (3.3186548 %)
> > >> Under-replicated blocks: 36945 (1.6241367 %)
> > >> Mis-replicated blocks: 0 (0.0 %)
> > >> Default replication factor: 3
> > >> Average block replication: 3.017153
> > >> Corrupt blocks: 0
> > >> Missing replicas: 36945 (0.53830105 %)
> > >> Number of data-nodes: 25
> > >> Number of racks: 1
> > >>
> > >> Output from top of dfsadmin -report:
> > >>
> > >> Total raw bytes: 110689488793600 (100.67 TB)
> > >> Remaining raw bytes: 46994184353977 (42.74 TB)
> > >> Used raw bytes: 55511654282643 (50.49 TB)
> > >> % used: 50.15%
> > >>
> > >> Total effective bytes: 0 (0 KB)
> > >> Effective replication multiplier: Infinity
> > >>
> > >> Not sure what the last two lines of the dfsadmin report mean, but we have a negligible amount of over-replicated blocks according to fsck. The rest of the dfsadmin report confirms what the web interface says in that the nodes have way more data than 8.6TB * 3.
> > >>
> > >> Thoughts?
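Plugging the figures from the two reports above into a quick back-of-the-envelope check makes the size of the gap explicit. This is just arithmetic on the quoted output (no cluster access needed): fsck's total size times its average block replication, compared against dfsadmin's used raw bytes.

$ awk 'BEGIN {
    expected = 8711239576255 * 3.017153;       # fsck: Total size x Average block replication
    used     = 55511654282643;                 # dfsadmin: Used raw bytes
    tb       = 1024 * 1024 * 1024 * 1024;
    printf "expected %.1f TB, reported %.1f TB, unaccounted %.1f TB\n", expected / tb, used / tb, (used - expected) / tb
  }'
expected 23.9 TB, reported 50.5 TB, unaccounted 26.6 TB

In other words, roughly half of the reported usage is not explained by replicated file data, which is consistent with the leftover-block explanation given at the top of the thread.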
> > >> -----Original Message-----
> > >> From: "Brian Bockelman" <bbock...@cse.unl.edu>
> > >> Sent: Wednesday, December 9, 2009 3:35pm
> > >> To: common-user@hadoop.apache.org
> > >> Cc: core-u...@hadoop.apache.org
> > >> Subject: Re: Hadoop dfs usage and actual size discrepancy
> > >>
> > >> Hey Nick,
> > >>
> > >> Try:
> > >>
> > >> hadoop fsck /
> > >> hadoop dfsadmin -report
> > >>
> > >> Should give you information about, for example, the non-HDFS data and the average replication factor.
> > >>
> > >> Or is this how you determined you had a replication factor of 3?
> > >>
> > >> Brian
> > >>
> > >> On Dec 9, 2009, at 9:33 PM, Nick Bailey wrote:
> > >>
> > >>> We have a hadoop cluster with a 100TB capacity, and according to the dfs web interface we are using 50% of our capacity (50TB). However doing 'hadoop fs -dus /' says the total size of everything is about 8.6TB. Everything has a replication factor of 3 so we should only be using around 26TB of our cluster.
> > >>>
> > >>> I've verified the replication factors and I've also checked the datanode machines to see if something non hadoop related is accidentally being stored on the drives hadoop is using for storage, but nothing is.
> > >>>
> > >>> Has anyone had a similar problem and have any debugging suggestions?
> > >>>
> > >>> Thanks,
> > >>> Nick Bailey
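For anyone hitting the same symptoms: Todd's diagnosis at the top of the thread (abandoned partial blocks accumulating in the datanodes' tmp/ directories, later filed as HDFS-821) suggests a simple on-node check. The path below is only an example; substitute whatever dfs.data.dir points at on your datanodes, and treat this as read-only inspection rather than cleanup, since files under tmp/ may belong to writes that are still in progress.

$ du -sh /data/dfs/data/tmp                            # space held by in-progress or abandoned block files (path is an example)
$ find /data/dfs/data/tmp -type f -mtime +2 | wc -l    # block files untouched for more than 48 hours, i.e. likely abandoned

If that directory accounts for most of the gap between 'hadoop fs -dus /' (times the replication factor) and the used space in 'hadoop dfsadmin -report', then restarting the datanodes, as described earlier in the thread, is what reclaimed the space here.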