On 29 Feb 2012, at 01:13, Daevid Vincent wrote:

>> -----Original Message-----
>> From: Stuart Dallas [mailto:stu...@3ft9.com]
>>
>> Seriously? Errors like this should not be getting anywhere near your
>> production servers. This is especially true if you're really getting 30k
>> hits/s.
>
> Don't get me started. I joined here almost a year ago. They didn't even
> have an RCS or a wiki at the time. Nothing was OOP. There were no PHPDoc
> blocks or even plain comments in the code. They used to make each site by
> copying an existing one and modifying it (i.e. no shared libraries or
> resources). I could go on and on. Suffice it to say we've made HUGE leaps
> and bounds (thanks to me!), but there are only three of us developers
> here and no official test person, let alone a test team.
>
> It is what it is. I'm doing the best I can with the limited resources
> available to me.
Good stuff, but the idea that you need an official test person or a test
team to produce solid code that minimises runtime errors is, in my opinion,
completely the wrong attitude. I've been in similar situations several
times and have found that the key is not to try to solve the problem in one
big push; the key to taming a large codebase with minimal testing is simply
to start somewhere. Put in the infrastructure for unit testing, then make
writing tests a part of your standard development process. Over time you
will find that you are unit testing the majority of the code. Yes, that
will make things take longer, but you can also be confident that when you
fix a bug it stays fixed, because there's a unit test that verifies the bug
has not returned (there's a minimal example of such a test below).

In my experience and opinion, limited resources are a big reason to
implement some level of unit testing as soon as humanly possible, not a
reason why you can't.

Once you have the unit testing infrastructure in place, make running the
tests the first step in your deployment process. You mention you now use a
version control system; consider adding a hook that requires the unit tests
to pass before allowing code to be committed (a sketch of such a hook also
follows below). Alternatively, implement a CI environment which publicly
ridicules anyone who checks in code that breaks the unit tests - it's
amazing how much of a motivator this can be, even in a small team of
seasoned professionals.

> And let me tell you a little secret, when you get to that scale, you see
> all kinds of errors you don't see on your VM or even with a test team. DB
> connections go away. Funny things happen to memcache. Concurrency issues
> arise. Web bots and search engines rape, pillage and ravage your site in
> ways that make you feel dirty. So yeah, you do hit weird situations and
> cases you can't possibly test for, but they show up in error logs.

Not a secret. Not even close to being a secret. I'm no stranger to sites
with the sort of traffic you have, and then some, and I'm fully aware that
it presents a unique set of challenges, but there are simple steps you can
take to make life easier.

Most of the specific issues you mention (DB connections, memcache
weirdness, concurrency problems, and crawler activity) are the result of
poor architecture and/or flawed server configuration. Where the
architecture is poor I'd recommend you design a new architecture and find a
way to start moving towards it, piece by piece, without having too much
impact on your day-to-day activities. This is not always easy, but I've
done it several times and know it can be done in most situations.

There will always be issues that crop up that you couldn't possibly have
known would happen, but you can load test your app to see what happens at
levels of traffic an order of magnitude above what you expect. One of my
current clients has a test tool that can generate traffic levels that hit
the limit of EC2 network throughput, specifically to see what would happen
if they had a sudden and dramatic increase in usage. Knowing the
application can cope at 10x the expected level of traffic is the only way
to be sure it can cope with 1x the expected traffic without breaking a
sweat. There will always be situations that you don't foresee, and
conditions that are difficult to test, but saying that you "can't possibly
test" for them is simply wrong.
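As an illustration of the regression test idea above, here's a minimal
PHPUnit sketch. The Article class and its slug() method are entirely
hypothetical - the point is simply that every fixed bug gets a test that
pins the fix down:

<?php
// ArticleTest.php -- run with: phpunit ArticleTest.php
// Article and slug() are hypothetical; substitute your own class.

class ArticleTest extends PHPUnit_Framework_TestCase
{
    // Regression test: empty titles used to produce a broken slug.
    // If the bug ever returns, this test fails and (with the hook
    // below) the offending commit is rejected.
    public function testEmptyTitleYieldsPlaceholderSlug()
    {
        $article = new Article('');
        $this->assertSame('untitled', $article->slug());
    }

    public function testTitleIsLowercasedAndHyphenated()
    {
        $article = new Article('Hello World');
        $this->assertSame('hello-world', $article->slug());
    }
}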
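And here's a sketch of the VCS hook, assuming git and PHPUnit (the paths
are illustrative). Save it as .git/hooks/pre-commit and make it executable:

#!/usr/bin/env php
<?php
// .git/hooks/pre-commit -- refuse the commit if the test suite fails.
// Assumes phpunit is on the PATH; point it at your own install if not.

passthru('phpunit --stop-on-failure', $exitCode);

if ($exitCode !== 0) {
    fwrite(STDERR, "Commit rejected: the unit tests are failing.\n");
    exit(1); // any non-zero exit status aborts the commit
}

exit(0);

A server-side equivalent (a pre-receive hook on the central repository) is
stronger, since client-side hooks can be skipped, but the principle is the
same.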
>> For a commercial, zero-hassle solution I can't recommend
>> http://newrelic.com/ highly enough. Simple installation followed by
>> highly detailed reports with zero issues (so far). They do a free trial
>> of all the pro features so you can see if it gets you what you need. And
>> no, I don't work for them, I just think they've built a freakin' awesome
>> product that's invaluable when diagnosing issues that only occur in
>> production. I've never used it on a site with that level of traffic, and
>> I'm sure it won't be a problem, but you may want to only deploy it to a
>> fraction of your infrastructure.
>
> A quick look at that product seems interesting, but not what I really
> need. We have a ton of monitoring solutions in place to get metrics and
> performance data. I just need a good 'hook' to get details when errors
> occur.

You obviously didn't look closely enough. NewRelic is a PHP extension which
hooks into errors (and many other things) and provides detailed information
for everything that happens while your application is running. Do yourself
a favour and try it. As an example, I recently diagnosed a snowballing
performance problem with a ColdFusion application by simply installing
NewRelic and waiting for the next time the server came crashing down.
Without the insights that tool gave me it would have taken me many times
longer to identify and fix the root cause.

>> If you want a homemade solution, the uncaught exceptions are easily
>> dealt with... CATCH THEM, do something useful with them, and then die
>> gracefully. Rocket science this ain't!
>
> Thanks captain obvious. :)

If it was obvious, why did you feel the need to ask the question?

> I can do that (and did do that), but again, at these scales, all the
> text-book code you think you know starts to go out the window. Frameworks
> break down. RDBMSs topple over. You have to write things creatively and
> leanly (and sometimes err on the side of 'assume something is there'
> rather than 'assume the worst', or your code spends too much time
> checking the edge cases). Hit it and quit it! Get in. Get out. I can't
> put try/catch around everything everywhere; it's just not efficient or
> practical. Even the SQL queries we write are 'wide' and we pull in as
> much logical stuff as we can in one DB call, get it into a memcache slab
> and then pull it out of there over and over, rather than surgical queries
> to get small chunks of data, which would murder MySQL.

This reeks of architectural problems. I understand that it's a codebase
you've inherited and that you're doing your best with it, but what you're
describing are not features of a well-designed, scalable web application.
The idea that MySQL is best used to pull large datasets rather than exactly
what you need makes my skin crawl. You may want to consider having an
offline process populate memcache (there's a rough sketch of this below),
or look at your DB schema to see if there's a better way to store the data
with a view to optimising access to it.

Oh, and assumptions are the mother of all screw-ups. Your logs would be far
more useful if the code caught and properly dealt with problems as they
occurred. Yes, there will be a slight (and I mean very slight) performance
hit for catching exceptions and checking for error conditions, but do you
really believe that most of the large, complex applications out there are
not doing these things? Solid code is far more important than fast code.
Servers are cheap; your time is not. Do the maths!
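To make the offline-population idea concrete, here's a rough sketch. All of
the names (the table, the cache key, the connection details) are made up
for illustration; the point is that the expensive 'wide' query runs once,
from cron, instead of in the request path:

<?php
// populate_cache.php -- hypothetical cron job, run every few minutes.
// Executes the heavy query once and pushes the result into memcache so
// the web tier only ever reads from the cache.

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

try {
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $rows = $pdo->query('SELECT id, title, body FROM articles')
                ->fetchAll(PDO::FETCH_ASSOC);

    // TTL comfortably longer than the cron interval, so the web tier
    // never sees an empty cache between runs.
    $mc->set('articles:all', $rows, 600);
} catch (Exception $e) {
    // Log and bail; the previous cache entry continues to be served.
    error_log('Cache refresh failed: ' . $e->getMessage());
    exit(1);
}

The web code then only ever calls $mc->get('articles:all') and never
touches MySQL for this data, no matter how hard the crawlers hit you.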
> Part of the reason I took this job is exactly because of these
> challenges, and I've learned an incredible amount here (I've also had to
> wash the guilt off of me some nights, as some code I've written goes
> against everything I was taught and thought I knew for the past decade --
> but it works and works well -- it just FEELS wrong). We do a lot of
> things that would make my college professors cringe. THAT is the
> difference between the REAL world and the LAB. ;-)

Granted, in the "real world" you cut corners, but these are now accepted
techniques and should no longer feel wrong. For example, fully normalised
databases are painful in a web context, so schemas get de-normalised and
data gets duplicated to optimise for access - an idea that would make
anyone who prefers the "right" way feel dirty, but sometimes it's
necessary. Not doing things the academic way should not make you feel
dirty. If it does, I'd suggest you look at specifically what it is about
what you're doing that makes you feel that way, because if it's backed up
by valid reasons there's nothing to feel dirty about.

>> See the set_exception_handler function for an easy way to set up a
>> global function to catch uncaught exceptions if you don't have a limited
>> number of entry points.
>>
>> You can similarly catch the warnings using the set_error_handler
>> function, though be aware that this won't be triggered for fatal errors.
>
> And this is the meat of the solution. Thanks! I'll look into these
> handlers and see if I can inject them into someplace useful. I have high
> hopes for this now.

I still maintain that using NewRelic would be far more efficient and
controllable, but you can certainly roll your own solution. You may also
want to check out tools like Scribe and Splunk to assist with managing and
examining your log files.
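For reference, here's a minimal sketch of the roll-your-own approach using
those two handlers, plus a shutdown function to catch the fatal errors that
set_error_handler never sees. The logging destination and the response body
are placeholders; wire them into whatever your monitoring already uses:

<?php
// Log warnings/notices with context instead of letting them leak into
// the output or disappear.
set_error_handler(function ($severity, $message, $file, $line) {
    error_log(sprintf('[%d] %s in %s:%d', $severity, $message, $file, $line));
    return true; // don't run PHP's internal error handler as well
});

// Catch otherwise-uncaught exceptions and die gracefully.
set_exception_handler(function ($e) {
    error_log('Uncaught ' . get_class($e) . ': ' . $e->getMessage() .
              "\n" . $e->getTraceAsString());
    if (!headers_sent()) {
        header('HTTP/1.1 500 Internal Server Error');
    }
    echo 'Something went wrong; it has been logged.';
    exit(1);
});

// set_error_handler is never called for fatal errors, so inspect the
// last error at shutdown as a final safety net.
register_shutdown_function(function () {
    $err = error_get_last();
    $fatal = array(E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR);
    if ($err !== null && in_array($err['type'], $fatal)) {
        error_log(sprintf('Fatal: %s in %s:%d',
                          $err['message'], $err['file'], $err['line']));
    }
});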
>> But seriously... a minimal level of structured testing would prevent
>> issues like this being deployed to your production servers. Sure,
>> instrument to help resolve these issues now, but if I were you I'd be
>> putting a lot of effort into improving your development process. Contact
>> me off-list if you'd like to talk about this in more detail.
>
> See above. I have begged for even a single dedicated tester. I have
> offered to sacrifice the open req I had for a junior developer to get a
> tester. That resulted in them taking away the req because "clearly I
> didn't need the developer then" and "we can just test it ourselves".
> You're preaching to the choir, my friend. I've been doing this for 15+
> years at various companies. ;-)

You don't need a dedicated tester, and even if you did have one, that
wouldn't mean you don't need to test your code yourselves. I rarely find
myself on the same side as an employer [unless it's me :)], but yours is
spot on: firstly because if you were happy to give up a developer req to
get a tester then you didn't really need another developer, but primarily
because you should be testing your own work. If you think a dedicated
tester would absolve you of the responsibility to test your own stuff, you
have a lot more to learn. Tools that automate much of what a dedicated
tester would do are legion: unit tests, CI systems, Selenium and others
will propel your organisation towards building solid software that doesn't
fill your logs with repeated messages arising simply because the developer
didn't test a variety of inputs, both valid and invalid, for a given
function.

I hope I haven't come across as too preachy or rude in these two emails,
but I've heard the arguments you're making many times and they just don't
hold water for me. You may have been doing this for 15+ years, but have you
done it at this scale before? Have you done it in a small company that has
the proper processes and tools in place?

I hope my comments prove useful, and my offer to discuss this off-list
stands.

-- 
Stuart Dallas
3ft9 Ltd
http://3ft9.com/

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php