On Wed, Dec 9, 2009 at 4:15 PM, Nick Bailey <ni...@mailtrust.com> wrote:

> That seems to make sense. Our collection process for putting data into Hadoop
> fairly regularly sees a 'Hadoop exited with: 65280' error, where it fails to
> put a file and queues it to be retried.
>
>
Yep, that sounds like it could be it. I just filed HDFS-821:
https://issues.apache.org/jira/browse/HDFS-821
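
Side note: 65280 is 255 shifted left by 8 bits, i.e. the raw wait status for an
exit code of 255, which is what the fs shell exits with when a put fails. You
will see 65280 rather than 255 if the collection process logs the raw status
(for example, the return value of Python's os.system() or Perl's system()). A
rough sketch for checking the plain exit code from a shell script, with
placeholder paths:

  hadoop fs -put somefile /target/path
  echo $?    # 255 when the put fails, 0 on success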

-Todd


>
>
> -----Original Message-----
> From: "Todd Lipcon" <t...@cloudera.com>
> Sent: Wednesday, December 9, 2009 7:02pm
> To: common-user@hadoop.apache.org
> Cc: core-u...@hadoop.apache.org
> Subject: Re: Hadoop dfs usage and actual size discrepancy
>
> Hi Nick,
>
> My guess is that the tmp/ directories of the DNs were rather full. I've
> occasionally seen this on clusters where writes have been failing.
>
> There should be some kind of thread which garbage collects partial blocks
> from the DN's tmp dirs, but it's not implemented, as far as I know. This
> comment is in FSDataset.java:
>
>  // REMIND - mjc - eventually we should have a timeout system
>  // in place to clean up block files left by abandoned clients.
>  // We should have some timer in place, so that if a blockfile
>  // is created but non-valid, and has been idle for >48 hours,
>  // we can GC it safely.
>
> This comment is from April 2007 ;-)
>
> I'll file a JIRA to consider implementing this.
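>
> In the meantime, a rough way to see how much is sitting in a DN's tmp
> directory, and how many of those partial block files are more than two days
> old (the path below is a guess; the tmp dir lives under whatever dfs.data.dir
> points at):
>
> $ du -sh /data/dfs/data/tmp
> $ find /data/dfs/data/tmp -type f -name 'blk_*' -mtime +2 | wc -l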
>
> Thanks
> -Todd
>
> On Wed, Dec 9, 2009 at 3:57 PM, Nick Bailey <ni...@mailtrust.com> wrote:
>
> > Actually, it looks like restarting has helped.  DFS used has gone down from
> > 50TB to 43TB and appears to still be going down.
> >
> > Don't know what was wrong with the DataNode process.  Possibly a Cloudera
> > problem.  Thanks for the help, Brian.
> >
> > -Nick
> >
> >
> >
> > -----Original Message-----
> > From: "Nick Bailey" <ni...@mailtrust.com>
> > Sent: Wednesday, December 9, 2009 5:55pm
> > To: common-user@hadoop.apache.org
> > Cc: common-user@hadoop.apache.org, common-user@hadoop.apache.org,
> > core-u...@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > One interesting thing is the output of the command to restart the
> > datanode.
> >
> > $ sudo service hadoop-datanode restart
> > Stopping Hadoop datanode daemon (hadoop-datanode): no datanode to stop
> >                                                           [  OK  ]
> > Starting Hadoop datanode daemon (hadoop-datanode): starting datanode,
> > logging to /log/location
> >                                                           [  OK  ]
> >
> > Notice that when stopping the datanode it says 'no datanode to stop'.  It
> > says this even though the datanode is definitely running.  Also, there is
> > only one datanode process, and it isn't getting stopped by this command, so
> > basically I didn't actually restart anything.  I checked, and at least a few
> > of the other nodes are also exhibiting this behavior.
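> >
> > (I'm guessing the 'no datanode to stop' message means the init script's pid
> > file is stale or missing, since the stop action checks the pid file rather
> > than the process list.  A quick way to compare the two, with the pid file
> > path below being a guess that depends on the packaging:
> >
> > $ cat /var/run/hadoop/hadoop-hadoop-datanode.pid
> > $ ps -ef | grep -i datanode | grep -v grep
> >
> > If they disagree, the stop command won't touch the real process.)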
> >
> > I don't know if it's related, but after killing the process and actually
> > restarting the datanode, it still doesn't appear to be clearing out any
> > extra data.  For now I'll manually restart the datanodes by killing the
> > processes and see if that helps.
> >
> > -Nick
> >
> >
> > -----Original Message-----
> > From: "Nick Bailey" <ni...@mailtrust.com>
> > Sent: Wednesday, December 9, 2009 5:44pm
> > To: common-user@hadoop.apache.org
> > Cc: common-user@hadoop.apache.org, core-u...@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > Well, for that specific machine, du pretty much matches the report.  Not
> > all of our nodes are at 4.11TB; that one is actually overloaded, and we are
> > currently running the balancer to fix it.
> >
> > Restarting the datanode on that machine didn't seem to clear out any data.
> > I'll probably go ahead and restart all the datanodes, but I'm not hopeful
> > that it will clear out all the data.
> >
> > Thanks for helping out though. Any other ideas out there?
> >
> > -Nick
> >
> > -----Original Message-----
> > From: "Brian Bockelman" <bbock...@cse.unl.edu>
> > Sent: Wednesday, December 9, 2009 4:57pm
> > To: common-user@hadoop.apache.org
> > Cc: core-u...@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > Hey Nick,
> >
> > Non-DFS Used must be something new in 19.x, I guess.
> >
> > What happens if you do "du -hs" on the datanode directory?  Are they all
> > approximately 4.11TB?  What happens after you restart a datanode?  Does it
> > clean out a bunch of data?
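> >
> > Something like this on each node, swapping in whatever your dfs.data.dir
> > actually is (the path below is just a placeholder):
> >
> > $ du -hs /data/dfs/data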
> >
> > Never seen this locally, and we beat the bejesus out of our cluster...
> >
> > Brian
> >
> > On Dec 9, 2009, at 10:54 PM, Nick Bailey wrote:
> >
> > > Brian,
> > >
> > > Hadoop version 18.3, more specifically Cloudera's version.  Our dfsadmin
> > > -report doesn't contain any lines with "Non DFS Used", so that grep won't
> > > work.  Here is an example of the report for one of the nodes:
> > >
> > >
> > > Name: XXXXXXXXXXXXX
> > > State          : In Service
> > > Total raw bytes: 4919829360640 (4.47 TB)
> > > Remaining raw bytes: 108009550121(100.59 GB)
> > > Used raw bytes: 4520811248473 (4.11 TB)
> > > % used: 91.89%
> > > Last contact: Wed Dec 09 16:50:10 EST 2009
> > >
> > > Besides what I already posted, the rest of the report is just a repeat of
> > > that for every node.
> > >
> > > Nick
> > >
> > > -----Original Message-----
> > > From: "Brian Bockelman" <bbock...@cse.unl.edu>
> > > Sent: Wednesday, December 9, 2009 4:48pm
> > > To: common-user@hadoop.apache.org
> > > Cc: core-u...@hadoop.apache.org
> > > Subject: Re: Hadoop dfs usage and actual size discrepancy
> > >
> > > Hey Nick,
> > >
> > > What's the output of this:
> > >
> > > hadoop dfsadmin -report | grep "Non DFS Used" | grep -v "0 KB" | awk '{sum += $4} END {print sum}'
> > >
> > > What version of Hadoop is this?
> > >
> > > Brian
> > >
> > > On Dec 9, 2009, at 10:25 PM, Nick Bailey wrote:
> > >
> > >> Output from bottom of fsck report:
> > >>
> > >> Total size:    8711239576255 B (Total open files size: 3571494 B)
> > >> Total dirs:    391731
> > >> Total files:   2612976 (Files currently being written: 3)
> > >> Total blocks (validated):      2274747 (avg. block size 3829542 B) (Total open file blocks (not validated): 1)
> > >> Minimally replicated blocks:   2274747 (100.0 %)
> > >> Over-replicated blocks:        75491 (3.3186548 %)
> > >> Under-replicated blocks:       36945 (1.6241367 %)
> > >> Mis-replicated blocks:         0 (0.0 %)
> > >> Default replication factor:    3
> > >> Average block replication:     3.017153
> > >> Corrupt blocks:                0
> > >> Missing replicas:              36945 (0.53830105 %)
> > >> Number of data-nodes:          25
> > >> Number of racks:               1
> > >>
> > >>
> > >>
> > >> Output from top of dfsadmin -report:
> > >>
> > >> Total raw bytes: 110689488793600 (100.67 TB)
> > >> Remaining raw bytes: 46994184353977 (42.74 TB)
> > >> Used raw bytes: 55511654282643 (50.49 TB)
> > >> % used: 50.15%
> > >>
> > >> Total effective bytes: 0 (0 KB)
> > >> Effective replication multiplier: Infinity
> > >>
> > >>
> > >> Not sure what the last two lines of the dfsadmin report mean, but we
> > >> have a negligible amount of over-replicated blocks according to fsck.  The
> > >> rest of the dfsadmin report confirms what the web interface says, in that
> > >> the nodes have way more data than 8.6TB * 3.
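> > >>
> > >> (Doing the arithmetic from the fsck numbers: 8,711,239,576,255 B times the
> > >> average replication of 3.017 is roughly 26.3 * 10^12 bytes, or about 24 TB
> > >> in the units dfsadmin uses, versus the 50.49 TB it reports as used.)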
> > >>
> > >> Thoughts?
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: "Brian Bockelman" <bbock...@cse.unl.edu>
> > >> Sent: Wednesday, December 9, 2009 3:35pm
> > >> To: common-user@hadoop.apache.org
> > >> Cc: core-u...@hadoop.apache.org
> > >> Subject: Re: Hadoop dfs usage and actual size discrepancy
> > >>
> > >> Hey Nick,
> > >>
> > >> Try:
> > >>
> > >> hadoop fsck /
> > >> hadoop dfsadmin -report
> > >>
> > >> Should give you information about, for example, the non-HDFS data and
> > the average replication factor.
> > >>
> > >> Or is this how you determined you had a replication factor of 3?
> > >>
> > >> Brian
> > >>
> > >> On Dec 9, 2009, at 9:33 PM, Nick Bailey wrote:
> > >>
> > >>> We have a Hadoop cluster with a 100TB capacity, and according to the
> > >>> dfs web interface we are using 50% of our capacity (50TB).  However,
> > >>> 'hadoop fs -dus /' says the total size of everything is about 8.6TB.
> > >>> Everything has a replication factor of 3, so we should only be using
> > >>> around 26TB of our cluster.
> > >>>
> > >>> I've verified the replication factors, and I've also checked the
> > >>> datanode machines to see if something unrelated to Hadoop is accidentally
> > >>> being stored on the drives Hadoop is using for storage, but nothing is.
> > >>>
> > >>> Has anyone had a similar problem and have any debugging suggestions?
> > >>>
> > >>> Thanks,
> > >>> Nick Bailey
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
