On Mon, Mar 18, 2013 at 7:13 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> I wasn't trying to flog EBS as any more or less reliable than other types
> of storage.  What I was trying to emphasize, similarly to your "quite a
> stretch" comment, was the uncertainty involved when such deployments fail.
> Failures happen due to many causes outside of just EBS itself.  But people
> are so far removed from the physical objects that fail, it's harder now to
> point blame the right way when things fail.
I didn't mean to imply you personally were going out of your way to flog
EBS, but there is a sufficient vacuum in the narrative that someone could
reasonably interpret it that way, so I want to set it straight.  The
problem is the quantity of databases per human.  The Pythons said it best:
"A simple question of weight ratios."

> A quick example will demonstrate what I mean.  Let's say my server at home
> dies.  There's some terrible log messages, it crashes, and when it comes
> back up it's broken.  Troubleshooting and possibly replacement parts
> follow.  I will normally expect an eventual resolution that includes data
> like "the drive showed X SMART errors" or "I swapped the memory with a
> similar system and the problem followed the RAM".  I'll learn something
> about what failed that I might use as feedback to adjust my practices.
> But an EC2+EBS failure doesn't let you get to the root cause effectively
> most of the time, and that makes people nervous.

Yes, the layering makes it tougher to do vertical treatment of obscure
issues.  Redundancy has often been the preferred solution here: bugs come
and go all the time, and everyone at each level tries to fix what they can
without much coordination from the layer above or below.  There are
hopefully benefits in throughput of progress at each level from this
abstraction, but predicting when any one particular issue will be
understood top to bottom is even harder than it already was.

Also, I think the line of reasoning presented is biased towards a certain
class of database: there are many, many databases with minimal funding and
oversight being run in the traditional way, and the odds that they'll get a
vigorous root-cause analysis in the event of an obscure issue are already
close to nil.  Although there are other considerations at play (like not
just leaving those users with nothing more than a "bad block" message),
checksums open some avenues that gradually benefit those use cases, too.
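To make the partitioning point concrete, here's a minimal sketch — hypothetical, not PostgreSQL's actual page layout, with zlib's CRC32 standing in for whatever checksum the database actually uses — of how a checksum computed by the database at write time lets a later failure be pinned on the layers below it:

```python
import zlib

def write_page(data: bytes) -> bytes:
    # The database computes a checksum over the page contents just before
    # handing the bytes to the storage stack beneath it.
    checksum = zlib.crc32(data).to_bytes(4, "big")
    return checksum + data

def read_page(page: bytes) -> bytes:
    # On read-back, a mismatch means the bytes changed somewhere below the
    # database: filesystem, volume manager, EBS, or the physical media.
    stored, data = page[:4], page[4:]
    if zlib.crc32(data).to_bytes(4, "big") != stored:
        raise IOError("checksum mismatch: corruption below the database")
    return data

page = write_page(b"tuple data")
assert read_page(page) == b"tuple data"

# Simulate a single bit flip somewhere in the storage stack:
corrupted = page[:5] + bytes([page[5] ^ 0x01]) + page[6:]
try:
    read_page(corrupted)
except IOError as e:
    print(e)
```

The point isn't that the checksum tells you *which* layer mangled the page — only that it cleanly separates "the database wrote something wrong" from "something underneath changed what the database wrote," which is exactly the partition that's hard to draw on opaque cloud storage.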
> I can already see "how do checksums alone help narrow the blame?" as the
> next question.  I'll post something summarizing how I use them for that
> tomorrow, just out of juice for that tonight.

Not from me.  It seems pretty intuitive from here how database-maintained
checksums assist in partitioning the problem.

--
fdr

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers