[ https://issues.apache.org/jira/browse/HADOOP-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577719#action_12577719 ]

aw edited comment on HADOOP-2991 at 3/11/08 11:07 PM:
--------------------------------------------------------------------


Ahh file systems.  Can't live with them, can't live with them.

First off: I'm not a big fan of percentages when dealing with file systems.

Back in the day, UFS would reserve 10% of the file system for root's usage.  So on a 
10G disk, it would save 1G for itself.  Not a big deal, and when the file system 
had issues, that reserve worked out well.  But a 100G disk would give up 10G.  Ugh.  
Not cool.  Go even bigger and the amounts get insane.  So many implementations changed 
this to a sliding scale rather than a single flat percentage.  Some food for thought.
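
Purely as an illustration of what I mean by a sliding scale (a hard cap is the 
crudest version of one; the 5G ceiling and the rest of the numbers below are made 
up, not any real file system's policy):

// Classic UFS-style 10% reservation, but capped so big disks don't give up
// absurd amounts.  Illustrative numbers only.
public final class ReserveCalc {
    private static final long GB = 1024L * 1024L * 1024L;

    static long reservedBytes(long diskBytes) {
        long tenPercent = diskBytes / 10;
        long cap = 5L * GB;                     // arbitrary ceiling for illustration
        return Math.min(tenPercent, cap);
    }

    public static void main(String[] args) {
        System.out.println(reservedBytes(10L * GB) / GB + "G held back on a 10G disk");   // 1G
        System.out.println(reservedBytes(1024L * GB) / GB + "G held back on a 1T disk");  // 5G, not 102G
    }
}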

Secondly, df.  A great source of cross-platform trouble... Let me throw out one 
of my favorite real-world examples, this time from one of my home machines:

Filesystem             size   used  avail capacity  Mounted on
int                    165G    28K    21G     1%    /int
int/home               165G    68G    21G    77%    /export/home
int/mii2u              165G  1014K    21G     1%    /int/mii2u
int/squid-cache        5.0G   4.4G   591M    89%    /int/squid-cache
int/local              165G   289M    21G     2%    /usr/local

Stop.  Go back and look carefully at those numbers.  

In case you haven't guessed, this is (partial) output of df -h on a 
Solaris machine running ZFS.  It is pretty clear that, with the exception of 
the file system under a hard quota (int/squid-cache), size != used + available.  
Instead, size = (used across all file systems in the pool) + available.  Using 
"used" in any sort of capacity calculation isn't going to tell you anything about 
how much space is actually available.  This type of output is fairly common for 
any pool-based storage system.
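
To make that concrete, here's a minimal sketch of the arithmetic a df-style 
heuristic implicitly leans on, using plain java.io.File and whatever mount point 
you hand it (nothing Hadoop-specific):

import java.io.File;

// Sanity-check the "size == used + available" assumption for a mount point.
public class DfSanity {
    public static void main(String[] args) {
        File mount = new File(args.length > 0 ? args[0] : "/");
        long total = mount.getTotalSpace();
        long usable = mount.getUsableSpace();
        long derivedUsed = total - usable;   // what a df-based heuristic would call "used"

        System.out.printf("total=%dG usable=%dG derived-used=%dG%n",
                total >> 30, usable >> 30, derivedUsed >> 30);
        // On a plain local file system these numbers roughly add up.  On a
        // pooled file system the "available" figure is shared by every dataset
        // in the pool, so per-mount math based on "used" tells you very little
        // about how much space you can actually write.
    }
}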

Then there are file system quotas, which, depending upon the OS, may or may not 
show up in df output.  The same goes for the aforementioned reserved-space 
percentages.

Anyway, what does this all mean?

Well, in my mind, it means that all of the above suggestions in the JIRA just 
don't work out well... and that's just on UNIX.  Heck, even a heterogeneous 
UNIX environment makes me shudder.  How do you handle pooled storage *and* 
traditional file systems if you want a single config?

Quite frankly, you can't.  As much as I hate to say it, I suspect the answer 
(as unpopular as it might be) is probably to set a hard limit on how much space 
HDFS will use rather than trying to second-guess what the operating system 
is doing.  Does this suck?  Yes.  Does it suck less than all of the 
gymnastics around trying to figure this out dynamically?  I think so.

Let's face it: in order to keep an app like Hadoop from eating more space than 
you've decided to give it, regardless of what the file system is configured for, 
you are essentially looking at partitioning it.  At that point, you might as well 
just configure the limit in the app and be done with it.  In the end, this means 
that HDFS needs to keep track of how much space it is using at all times and 
never go over that limit.  It likely also means implementing high and low water 
marks: hitting the low water mark means writes to the file system get 
deferred/deprioritized, and hitting the high water mark means start rebalancing 
blocks or declare the file system full.
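
Roughly what I have in mind, strictly as a sketch (the class, field names, and 
thresholds below are mine, not anything that exists in the code today):

// HDFS-side bookkeeping: track usage against an admin-configured hard limit
// plus low/high water marks, instead of asking the OS how much space is left.
enum SpaceState { OK, DEFER_WRITES, REBALANCE_OR_FULL }

final class SpaceTracker {
    private final long hardLimitBytes;   // the most HDFS is ever allowed to use
    private final long lowWaterBytes;    // past this, defer/deprioritize writes
    private final long highWaterBytes;   // past this, rebalance or report full
    private long usedBytes;              // updated on every block add/delete

    SpaceTracker(long hardLimitBytes, double lowPct, double highPct) {
        this.hardLimitBytes = hardLimitBytes;
        this.lowWaterBytes  = (long) (hardLimitBytes * lowPct);
        this.highWaterBytes = (long) (hardLimitBytes * highPct);
    }

    synchronized void blockAdded(long bytes)   { usedBytes += bytes; }
    synchronized void blockRemoved(long bytes) { usedBytes -= bytes; }

    synchronized boolean canAccept(long blockBytes) {
        return usedBytes + blockBytes <= hardLimitBytes;   // the hard stop
    }

    synchronized SpaceState state() {
        if (usedBytes >= highWaterBytes) return SpaceState.REBALANCE_OR_FULL;
        if (usedBytes >= lowWaterBytes)  return SpaceState.DEFER_WRITES;
        return SpaceState.OK;
    }
}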

Now, I know that it might be difficult to calculate what the max space should 
be.  On reflection, though, I'm not really sure that's true.  If I know what 
size my slice is and I have an idea of how much of it I want to give to HDFS, 
then I can calculate that max value.  If an admin gets in trouble with the 
space being allocated, they can lower the high and low water marks, which 
should trigger a rebalance and thus free space.  This is essentially how apps 
like squid work, and it works quite well.  [Interestingly enough, squid's 
on-disk cache structure is quite similar to how the data node stores its 
blocks.... Hmm... ]
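
For example, with made-up numbers plugged into the sketch above:

// A 500G slice, 80% of it handed to HDFS; everything else follows from that.
long sliceBytes = 500L * 1024 * 1024 * 1024;
long hardLimit  = (long) (sliceBytes * 0.80);                    // 400G is the most HDFS may use
SpaceTracker tracker = new SpaceTracker(hardLimit, 0.90, 0.95);  // defer at 360G, rebalance/full at 380G
// If the admin gets squeezed later, lowering the limit and water marks and
// letting the rebalance run frees space, much like trimming squid's cache_dir.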

One thing to point out with this solution: if the admin overcommits the space 
on the drive then, quite frankly, they've hung themselves.  They know how much 
space they gave HDFS.  If they go over it, oh well.  I'd much rather have 
MapRed blow up than HDFS blow up, since it is much easier to pick up the pieces 
of a broken job than of a broken file system, especially when there are 
under-replicated blocks.

Again, I totally admit that this solution is likely to be unpopular.  But I 
can't see a way out of this mess that works with the multiple types of storage 
systems in use.

P.S., while I'm here, let me throw one more of my own personal prejudices into 
this: putting something like Hadoop in / or some other file system (but not 
necessarily device) that is used by the OS is just *begging* for trouble.  
That's just bad practice for a real, production system.  If someone does 
that, they rightly deserve any pain it causes.

> dfs.du.reserved not honored in 0.15/16 (regression from 0.14+patch for 2549)
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-2991
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2991
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.15.0, 0.15.1, 0.15.2, 0.15.3, 0.16.0
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> changes for https://issues.apache.org/jira/browse/HADOOP-1463
> have caused a regression. Earlier:
> - we could set dfs.du.reserved to 1G and be *sure* that 1G would not be used.
> Now this is no longer true. I am quoting Pete Wyckoff's example:
> <example>
> Let's look at an example. 100 GB disk and /usr using 45 GB and dfs using 50 
> GBs now
> Df -kh shows:
> Capacity = 100 GB
> Available = 1 GB (remember ~4 GB chopped out for metadata and stuff)
> Used = 95 GBs   
> remaining = 100 GB - 50 GB - 1GB = 49 GB 
> Min(remaining, available) = 1 GB
> 98% of which is usable for DFS apparently - 
> So, we're at the limit, but are free to use 98% of the remaining 1GB.
> </example>
> this is broken. Based on the discussion on 1463, it seems like the notion of 
> 'capacity' being the first field of 'df' is problematic. For example, 
> here's what our df output looks like:
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda3             130G  123G   49M 100% /
> As you can see, 'Size' is a misnomer: that much space is not available. 
> Rather, the actual usable space is 123G+49M ~ 123G. (Not entirely sure what 
> the discrepancy is due to, but have heard this may be due to space reserved 
> for file system metadata.) Because of this discrepancy, we end up in a 
> situation where the file system is out of space.
