Greg Folkert wrote:
On Thu, 2007-05-03 at 10:36 +0800, Bob wrote:
Greg Folkert wrote:
8< massive snip
Auto adding of space has been done. Trust me on this one. Don't do it.
I've seen the after effects. A typo in a script can do more harm in 20
seconds (or less) than any one person could do in 20 years when dealing
with disk space. You would be well advised to set up an existing project's
disk space monitoring system and have it urgently mail you when disk
space is becoming a premium commodity on the /blah filesystem.
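
As a rough illustration of the "monitor and mail, don't auto-grow" advice, here is a minimal Python sketch. It is not any particular monitoring project; the function names and the 85% threshold are assumptions made for the example, and the mail/SMS step is left as a plain print.

```python
import shutil

GIB = 2**30

def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total (0.0 if total is zero)."""
    if total_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes

def check_filesystem(path, warn_at_percent=85.0):
    """Return a warning string if `path` is fuller than the threshold, else None.

    Hypothetical helper: it only reports, it never changes anything on disk.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    if pct >= warn_at_percent:
        free_gib = usage.free // GIB
        return f"WARNING: {path} is {pct:.1f}% full ({free_gib} GiB free)"
    return None

if __name__ == "__main__":
    message = check_filesystem("/")
    if message:
        print(message)  # in practice, hand this to your mail or SMS gateway
```

Run from cron, something like this tells a human that space is getting tight and leaves the decision to extend with the human.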
As long as it's not possible for a growing partition to take space away
from another partition, I can't see how this would do anything but give
you more time to react. Say you run your messed-up script and it starts
pushing the contents of /dev/zero into a text file and eating hard drive
space. If you're sitting next to the machine, you see the HDD lights come
on and hear the seeking. If you're remote, the first clue you get is an
email or SMS saying /blah has been automatically grown by 5GB because it
was more than 85% full. [0] You can kill your script, delete the text
file, reduce the partition again, and nothing crashed. If all it had done
was email you a message pointing out /blah was running out of space, and
/blah was also required by some other vital process that ran out of
space and crashed before you could kill the script, you'd wish you had
auto growth.
No, you don't get the fact that saturation of IO is a bad thing. Even on
8 multi-path IO sub-systems. Sometimes things spin and spin and spin.
When they do... it gets icky.
To give you an example:
I was managing an n-tier (2-tier and 3 to 5 tier) setup. We had
18 Terabytes of spare mirrored disk allocated to this machine.
We had only 8 TB of mirrored disk already for DATA. One day the
Oracle DB starts CRANKING and CRANKING on transaction logs. But
it's not really doing any work. The Oracle DBA had disk
allocation rights from the spare 18TB of disk. So, he made a
minutely cronjob, checking for percentage of space free. If the
LV and Filesystem didn't have X% free add another X amount of
space.
Unfortunately, at 8TB, 1% of space is ~85GB. The problem was that
he was adding 32GB chunks and extending the filesystem by 32GB
at a time. Once a minute. The LV extending commands were never
completing, let alone the filesystem extension. You can guess
how well that went.
In about 11 hours, all 18TB was allocated but not usable: it was
not committed, and it wasn't recoverable until the commit process
completed. And the filesystem was never extended properly.
We had to switch over to a fail-over machine and pray the DB was
good on the hot-copy. It was... so not much was really bad.
So, you see, it is all about understanding scale. When you start talking
about terabytes or petabytes you need to have completion checks before
the next step or a repeat. And do things in large enough "chunks" to
make a difference.
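
To put numbers on the scale point, here is a small Python sketch under stated assumptions. `chunk_needed`, `grow_once`, `extend_volume` and `resize_filesystem` are hypothetical names, not LVM or filesystem APIs; the two callables stand in for whatever actually extends the LV and the filesystem, and the `state` dict stands in for a lock file that survives between cron runs. The 32 GiB granule is taken from the story above.

```python
import math

GIB = 2**30

def chunk_needed(total_bytes, free_bytes, target_free_percent,
                 granule_bytes=32 * GIB):
    """Smallest multiple of the granule that restores the free-space target.

    Growing by c gives a free fraction of (free + c) / (total + c).
    Requiring that to be >= t gives c >= (t * total - free) / (1 - t).
    """
    t = target_free_percent / 100.0
    shortfall = (t * total_bytes - free_bytes) / (1.0 - t)
    if shortfall <= 0:
        return 0  # already at or above the target, nothing to add
    return math.ceil(shortfall / granule_bytes) * granule_bytes

def grow_once(extend_volume, resize_filesystem, chunk_bytes, state):
    """Run one grow step, refusing to start while an earlier one is unfinished."""
    if state.get("in_progress"):
        return "skipped: previous extension has not completed"
    state["in_progress"] = True
    if not extend_volume(chunk_bytes):
        return "failed: volume extension"   # flag stays set until a human looks
    if not resize_filesystem():
        return "failed: filesystem resize"  # flag stays set until a human looks
    state["in_progress"] = False            # only a completed step clears it
    return "ok"
```

With these assumptions, a completely full 8 TiB volume and a 1% free target give `chunk_needed(8192 * GIB, 0, 1.0) == 96 * GIB`, so a single 32 GiB step can never reach the target. And because a failed or unfinished step leaves the flag set, the next cron run is skipped instead of piling another extension on top.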
That sounds like poor implementation. The critical bit is not to let it
iterate without it having worked the first time. I still don't see
anything wrong with the concept as long as it's implemented right,
although I've never dealt with a system on the scale you're talking about.
I think the best way to implement it would be to set limits on how much
a partition can grow. Once /blah has eaten 90% of the remaining
space, it's not allowed to grow any more, and the processes responsible
for the expansion, along with any others that require space on /blah,
crash. Hopefully that doesn't take the whole system with it, and it still
leaves a good chunk of space for other vital partitions to grow into,
should they need to. So when you get off your train, plane or hobby horse,
hopefully you can still ssh in and fix things.
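
A growth cap like that could be sketched roughly as follows. `may_grow` and its parameters are made-up names for illustration, and the 90% figure is just the one from the paragraph above; sizes can be in any unit as long as they are all the same.

```python
def may_grow(already_granted, request, pool_at_start, max_share_percent=90):
    """True if granting `request` keeps this filesystem within its share.

    already_granted: spare space this filesystem has been given so far
    request:         size of the extension being asked for now
    pool_at_start:   spare pool size before any automatic growth happened
    Integer arithmetic avoids floating-point surprises right at the limit.
    """
    return (already_granted + request) * 100 <= max_share_percent * pool_at_start
```

For a 100GB spare pool, a 5GB request is allowed while the total handed out stays at or below 90GB and refused after that, which leaves the last 10GB for other partitions.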
The machine I mentioned was SO BUSY, ssh took over 10 minutes to get a
login and then the key-auth expired before the login process was
finished. Even the console was that slow, but at least it didn't time
out.
I've never seen anything that busy. I had a MythTV backend that would sit
with the load average up at 3 or 5, and I thought that was "getting my
money's worth".
--
Garrr, do your bit for global warming, become a pirate, you can "borrow" my
copy of Windows 95 if you want.