On Jul 20, 2010, at 7:00 AM, robert mena wrote:

I am in the process of developing a new version of my application. The new version must be highly available and able to scale, so I am evaluating both hardware and software techniques to cope with increasing traffic and an SLA.

High availability and high scalability are two different goals. Vincent de Lau mentions MySQL Cluster (the NDB storage engine) but this is meant to solve high availability more than it's meant to solve high scalability.

You should probably define (for yourself--I don't need to know) specifically what you mean by high availability and high scalability. What amount of downtime can you tolerate? What's the average & max number of concurrent requests you need to serve? What's the average & max bytes per request? What is the ratio of database reads versus writes? Many other such questions...

The point is that once you know a specific goal, you can build to it. Qualitative goals like "high scalability" mean different things to different people, and there's no way to verify whether you've met the goal, fallen far short, or greatly over-engineered.

Have you read "High Performance MySQL, 2nd Edition" (http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716/ ) and "Scalable Internet Architectures" (http://www.amazon.com/Scalable-Internet-Architectures-Theo-Schlossnagle/dp/067232699X/)? If not, do read them.

I already use Zend_Cache to reduce unnecessary calls to the database, use headers to cache elements (like images and CSS) on the browser side, and separate the web server from the database server.
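As a concrete illustration of that browser-side caching, a typical Apache fragment (lifetimes here are arbitrary examples; tune them to how often your assets actually change) might look like:

```apache
# Give static assets a far-future lifetime so browsers stop re-requesting them.
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png   "access plus 1 month"
    ExpiresByType image/jpeg  "access plus 1 month"
    ExpiresByType text/css    "access plus 1 week"
    ExpiresByType application/javascript "access plus 1 week"
</IfModule>
```

The trade-off to keep in mind: the longer the lifetime, the more you need versioned filenames (style.v2.css) to force an update.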

Have you read the books by Steve Souders (http://www.amazon.com/Steve-Souders/e/B001I9TVJS/ ) and his blog (http://www.stevesouders.com/blog/)?

Learn to use YSlow or Google PageSpeed. These are performance analyzers for the web front end, analogous to using EXPLAIN to analyze SQL queries. It's amazing to run high-profile websites (e.g. HuffingtonPost) through these tools and see how many bone-headed things they're doing.

You need to have a strategy for testing availability and scalability during development, as much as you need a strategy for building for those goals. What load testing tools are you using? ApacheBench, Siege, HP httperf, JMeter, etc. What code profiling tools are you using?
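To make the load-testing question concrete, here is a minimal sketch of ApacheBench and Siege runs. The staging URL is a placeholder, and each command is guarded so the script is harmless where the tools aren't installed:

```shell
#!/bin/sh
# Hypothetical staging endpoint -- substitute your own app's URL.
URL="http://staging.example.com/"

# ApacheBench: 1000 requests at 50 concurrent; note "Requests per second"
# and the percentile table, which map directly onto an SLA target.
command -v ab >/dev/null && ab -n 1000 -c 50 "$URL"

# Siege: 50 simulated users for a short run; its "Availability" figure
# speaks to the high-availability goal as well as raw throughput.
command -v siege >/dev/null && siege -c 50 -t 10S "$URL"

true  # placeholders only; point URL at a reachable staging host for a real run
```

Whichever tool you pick, run it against every build, not just once: a regression in requests-per-second is far cheaper to find the week it's introduced.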

You need a strategy for monitoring your infrastructure availability and scalability at runtime too (e.g. Nagios). There's no such thing as a perfectly available system (or at least it's cost-prohibitive). Part of the strategy of high availability is being able to respond to and recover from interruptions.
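As a concrete (hypothetical) example of the monitoring side, Nagios-style checks are just small scripts with conventional exit codes -- 0 for OK, 2 for CRITICAL:

```shell
#!/bin/sh
# Minimal Nagios-style HTTP health check. The default URL is a deliberately
# dead endpoint for demonstration; point it at your app's real /healthcheck.
URL="${1:-http://127.0.0.1:1/healthcheck}"

# Nagios plugin convention: exit code 0 = OK, 2 = CRITICAL.
if curl -sf -m 5 -o /dev/null "$URL"; then
    STATUS=0; echo "OK - $URL responded"
else
    STATUS=2; echo "CRITICAL - $URL did not respond"
fi
# A real plugin would end with: exit $STATUS
echo "exit status would be $STATUS"
```

Hook a script like this into Nagios (or even cron plus mail to start with) and you find out about interruptions before your users tell you.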

So where does ZF enter here? Well, the app will itself be developed using ZF, but I have some questions about:

Keep in mind the role of a framework. A framework helps to speed up development efficiency, not runtime efficiency. These are two separate priorities.

High-scale websites may start by using a framework during the prototyping phase. Once the site is working and they can measure where the bottlenecks are, they start refactoring, which generally means decoupling their critical-path code from the framework one piece at a time. Eventually they rewrite much of the web app as custom code optimized for their specific application, gradually reducing their use of the framework.

Don't try to implement a perfectly optimized site from the ground up. You need a prototype that you can run and measure using real users and real data. Then you'll refactor incrementally as you identify the real bottlenecks. This is the reason many high-scale sites start with a beta-access period.

a) database

Should I use regular databases -- in my case MySQL in a master-slave setup? Or should I try Mongo/Cassandra because of their auto-sharding features? This is something I don't need right now, but it's some sort of future planning.

Don't adopt Mongo/Cassandra or any of the other NoSQL solutions (or any other technology, for that matter) until you've demonstrated a need. Used properly, RDBMS solutions can give you the scalability you need, unless you're up at the level of Facebook.

Are there any plans to have those NoSQL adapters for ZF 1.x?

There are proposals, but I don't see them getting much traction yet. It's going to be months or years before they're ready to use in ZF. But you can use those NoSQL databases directly through PHP extensions, without integrating with ZF. That's the advantage of ZF being a library of loosely-coupled components, instead of a tightly integrated framework -- you can choose to swap out usage of any component.

b) content management

In my app I have to upload files (images) besides text. The text is stored in the database, so if I stick with database replication I'll be fine. But what about the other files? I think I need to track which files have been uploaded to the management node and whether they were correctly replicated to all the client nodes.

I'm not sure copying the media files to *all* the client nodes is the right architecture. I'm not saying it is or isn't, just encouraging you to question that assumption. A single server running nginx or lighttpd can serve an awful lot of static images, working in parallel to Apache running your PHP app. Then you don't have to transfer image files around at all.
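For illustration, a minimal nginx fragment for that split (hostnames and paths are hypothetical): nginx serves the images straight from disk and proxies everything else to Apache running the PHP app:

```nginx
server {
    listen 80;
    server_name static.example.com;

    # Serve uploaded images straight from disk, with long browser caching.
    location /images/ {
        root /var/www/media;
        expires 30d;
    }

    # Everything else goes to Apache/PHP on another port or host.
    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}
```

With this shape, "replicating the media" reduces to keeping one static host (or a small pool behind it) current, rather than pushing files to every client node.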

For the replication of the files, I am considering some sort of queue to generate a list of files, then calling rsync to do the actual transfer.

I wouldn't use rsync for this; it's too intelligent, trying to discover which files need to be synced.

You could use a BitTorrent type technology to do your file transfer. I haven't tried this, but this blog was intriguing: http://engineering.twitter.com/2010/07/murder-fast-datacenter-code-deploys.html

c) cache management

In a similar situation, when I add/modify/delete content, I'd like to signal each client node that it should "invalidate" that cache object (if it has it).

You should use memcached, which already has a distributed architecture. You shouldn't need to signal multiple caches that way.

d) Which version use?

Development will begin within the next three months and will probably last for six more months after that. So should I start using ZF 2.0-dev, or is it at too early a stage? This will be the first time I have the opportunity to start from scratch, so I'll probably have to maintain this code base for at least 5 years...

It doesn't matter. You're going to be doing continuous refactoring anyway, so don't wait for ZF2.0.

e) CDN

This is probably OT (so feel free to reply to me directly), but I am looking for companies where I could host the servers (client nodes) and/or that could provide a 'director' service, redirecting requests to the closest active client node (so I can add new nodes or remove them for maintenance).

Have you read about the free CoralCDN (http://www.coralcdn.org/)? This may be good for small static resources like small images, javascript, and CSS.

Use a paid delivery network (e.g. Akamai) for large media like videos. They should give you a simple interface to request a given media file. Then they take care of optimizing for geography/topology. Don't try to do that yourself.

Regards,
Bill Karwin
