On Jul 20, 2010, at 7:00 AM, robert mena wrote:
> I am in the process of developing a new version of my application.
> The new version must be highly available and able to scale, so I am
> evaluating both hardware and software techniques to cope with the
> increasing traffic and an SLA.
High availability and high scalability are two different goals.
Vincent de Lau mentions MySQL Cluster (the NDB storage engine), but
it is meant to solve high availability more than it is meant to solve
high scalability.
You should probably define (for yourself--I don't need to know)
specifically what you mean by high availability and high scalability.
What amount of downtime can you tolerate? What's the average & max
number of concurrent requests you need to serve? What's the average &
max bytes per request? What is the ratio of database reads versus
writes? Many other such questions...
The point is once you know a specific goal, you can build to it.
Qualitative goals like "high scalability" mean different things to
different people, and there's no way to verify that you've met the
goal, fallen far short, or greatly over-engineered.
Have you read "High Performance MySQL, 2nd Edition" (http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716/
) and "Scalable Internet Architectures" (http://www.amazon.com/Scalable-Internet-Architectures-Theo-Schlossnagle/dp/067232699X/)?
If not, do read them.
> I already use Zend_Cache to reduce unnecessary calls to the
> database, use headers to cache elements (like images and CSS) on the
> browser side, and separate the web server from the database server.
Have you read the books by Steve Souders (http://www.amazon.com/Steve-Souders/e/B001I9TVJS/
) and his blog (http://www.stevesouders.com/blog/)?
Learn to use YSlow or Google PageSpeed. These are performance
analyzers for the web front-end that are analogous to using EXPLAIN to
analyze SQL queries. It's amazing to run high-profile websites (e.g.
HuffingtonPost) through these tools and see how many bone-headed
things they're doing.
You need to have a strategy for testing availability and scalability
during development, as much as you need a strategy for building for
those goals. What load testing tools are you using? ApacheBench,
Siege, HP httperf, JMeter, etc. What code profiling tools are you
using?
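For example (the hostname, request counts, and urls.txt file below are placeholders, not from this thread), a first baseline run with ApacheBench and Siege might look like:

```shell
# Hypothetical baseline against a staging host (placeholder URL).
# 1000 requests, 50 concurrent; note "Requests per second" and the
# percentile latency table in the output.
ab -n 1000 -c 50 http://staging.example.com/

# Siege can replay a mixed list of URLs (one per line in urls.txt)
# for 60 seconds to approximate real traffic patterns.
siege -c 50 -t 60S -f urls.txt
```

Record the numbers from each run; the point is to compare the same test before and after every change, not to chase an absolute target.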
You need a strategy for monitoring your infrastructure availability
and scalability at runtime too (e.g. Nagios). There's no such thing
as a perfectly available system (or at least it's cost-prohibitive).
Part of the strategy of high availability is being able to respond to
and recover from interruptions.
> So where does ZF enter here? Well, the app itself will be
> developed using ZF, but I have some questions about:
Keep in mind the role of a framework. A framework helps to speed up
development efficiency, not runtime efficiency. These are two
separate priorities.
High-scale websites may start by using a framework during the
prototyping phase, but then after they have it working so they can
measure where the bottlenecks exist, they start refactoring, which
generally means separating their critical path code from using the
framework, one piece at a time. Eventually, they rewrite a lot of
their web app in their own custom code, designed to be optimized for
their specific web app. They gradually reduce their usage of the
framework in this way.
Don't try to implement a perfectly optimized site from the ground up.
You need a prototype that you can run and measure using real users and
real data. Then you'll refactor incrementally as you identify the
real bottlenecks. This is the reason many high-scale sites start with
a beta-access period.
> a) database
> Should I use regular databases? In my case MySQL with a
> master-slave setup. Or should I try Mongo/Cassandra because of the
> auto-sharding features? This is something that I don't need right
> now, but some sort of future planning.
Don't adopt Mongo/Cassandra or any of the other NoSQL solutions (or
any other technology, for that matter) until you've demonstrated a
need. Used properly, RDBMS solutions can give you the scalability you
need, unless you're up at the level of Facebook.
> Are there any plans to have those NoSQL adapters for ZF 1.x?
There are proposals, but I don't see them getting much traction yet.
It's going to be months or years before they're ready to use in ZF.
But you can use those NoSQL databases directly through PHP extensions,
without integrating with ZF. That's the advantage of ZF being a
library of loosely-coupled components, instead of a tightly integrated
framework -- you can choose to swap out usage of any component.
> b) content management
> In my app I have to upload files (images) besides text. The text is
> stored in the database, so if I stick with database replication
> I'll be fine. But what about the other files? I think I need to
> control which files have been uploaded to the management node and
> whether they were correctly replicated to all the client nodes.
I'm not sure copying the media files to *all* the client nodes is the
right architecture. I'm not saying it is or isn't, just encouraging
you to question that assumption. A single server running nginx or
lighttpd can serve an awful lot of static images, working in parallel
to Apache running your PHP app. Then you don't have to transfer image
files around at all.
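As a sketch of that split (the hostname and paths are assumptions, not from the original), a minimal nginx server block dedicated to static media might look like:

```nginx
# Hypothetical static-media vhost; Apache/PHP runs on a separate host
# or port and never touches these requests.
server {
    listen 80;
    server_name static.example.com;   # placeholder hostname
    root /var/www/media;              # placeholder document root

    # Long-lived browser caching for assets that rarely change
    location ~* \.(jpg|jpeg|png|gif|css|js)$ {
        expires 30d;
        add_header Cache-Control "public";
    }
}
```

With this split, uploads only need to reach the one media host, and the PHP app just emits URLs pointing at it.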
> For the replication of the files I am considering some sort of queue
> to generate a list of files and call rsync to do the actual transfer.
I wouldn't use rsync for this; it's too intelligent for the job. It
spends effort discovering which files need to be synced, but your
queue already tells you exactly which files changed, so a simple push
of those named files is cheaper.
You could use a BitTorrent type technology to do your file transfer.
I haven't tried this, but this blog was intriguing: http://engineering.twitter.com/2010/07/murder-fast-datacenter-code-deploys.html
> c) cache management
> In a similar situation, when I add/modify/delete content I'd like
> to signal each client node that it should "invalidate" that cache
> object (if it has it).
You should use memcached, which already has a distributed
architecture: every web node talks to the same cache pool, so a
delete performed by one node is immediately visible to all of them.
You shouldn't need to signal multiple caches that way.
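As a sketch (the server addresses and cache key are placeholders; this assumes ZF 1.x's Zend_Cache with its Memcached backend), invalidation becomes a single remove() call that every node sees:

```php
<?php
// Hypothetical setup: all web nodes point at the same memcached pool,
// so there is one logical cache rather than one cache per node.
$frontendOptions = array(
    'lifetime'                => 3600,
    'automatic_serialization' => true,
);
$backendOptions = array('servers' => array(
    array('host' => '10.0.0.10', 'port' => 11211),  // placeholder hosts
    array('host' => '10.0.0.11', 'port' => 11211),
));

$cache = Zend_Cache::factory('Core', 'Memcached',
                             $frontendOptions, $backendOptions);

// On add/modify/delete, remove the entry once; every node gets a
// cache miss on its next load() and refetches from the database.
$cache->remove('article_' . $articleId);
```

No per-node signaling is needed because the nodes never had private copies to begin with.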
> d) Which version to use?
> Development will begin within three months and will probably last
> for six more months after that. So should I start using ZF 2.0-dev,
> or is it at too early a stage? This will be the first time that
> I'll have the opportunity to start from scratch, so I'll probably
> have to maintain this code base for at least 5 years...
It doesn't matter. You're going to be doing continuous refactoring
anyway, so don't wait for ZF 2.0.
> e) CDN
> This is probably OT (so feel free to reply to me directly), but I am
> looking for companies that could host the servers (client nodes)
> and/or provide a 'director' service, redirecting requests to the
> closest active client node (so I can add new nodes or remove them
> for maintenance).
Have you read about the free CoralCDN (http://www.coralcdn.org/)?
This may be good for small static resources like small images,
JavaScript, and CSS.
Use a paid delivery network (e.g. Akamai) for large media like
videos. They should give you a simple interface to request a given
media file. Then they take care of optimizing for geography/
topology. Don't try to do that yourself.
Regards,
Bill Karwin