Hi Bill.

Thanks for the long reply.

On Tue, Jul 20, 2010 at 1:06 PM, Bill Karwin <[email protected]> wrote:

>
> On Jul 20, 2010, at 7:00 AM, robert mena wrote:
>
>  I am in a process of developing a new version of my application.  The new
>> version must be able to have high availability and scale so I am evaluating
>> both hardware and software techniques so I can cope with the increasing
>> traffic and a SLA.
>>
>
> High availability and high scalability are two different goals.  Vincent de
> Lau mentions MySQL Cluster (the NDB storage engine) but this is meant to
> solve high availability more than it's meant to solve high scalability.
>
>
I know that.  I mentioned both because I have to address both goals in the
new project.


> You should probably define (for yourself--I don't need to know)
> specifically what you mean by high availability and high scalability.  What
> amount of downtime can you tolerate?  What's the average & max number of
> concurrent requests you need to serve?  What's the average & max bytes per
> request?  What is the ratio of database reads versus writes?  Many other
> such questions...
>
>
Thanks for that. I certainly need to define those parameters, and they will
determine which solutions I'm forced to use.  But for the sake of argument,
let's say I define:
- 99.95% uptime
- from 256 to 1024 concurrent HTTP requests
- the HTML served is around 60KB uncompressed (the rest is images), with a
single heavy page at roughly 150KB to 300KB
- the ratio is hard to estimate, but is 1 write for every 10 reads
reasonable?  In practice the writes would be for logging, so each page view
produces one log write plus several reads.  I would probably use a separate
database for the writes.
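As a sanity check on the first of those numbers, here is a small sketch of what a 99.95% uptime target actually buys in downtime (the figures are just the ones proposed above, not a recommendation):

```php
<?php
// Rough availability budget for a given uptime SLA.
function downtimeBudgetMinutes(float $uptimePercent, float $periodDays): float
{
    $totalMinutes = $periodDays * 24 * 60;
    return $totalMinutes * (1 - $uptimePercent / 100);
}

// 99.95% over a 30-day month: about 21.6 minutes of allowed downtime.
printf("Monthly: %.1f minutes\n", downtimeBudgetMinutes(99.95, 30));
// Over a 365-day year: about 4.4 hours.
printf("Yearly: %.1f hours\n", downtimeBudgetMinutes(99.95, 365) / 60);
```

That is tight enough that, as Bill says below, recovery strategy matters as much as prevention.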



> The point is once you know a specific goal, you can build to it.
>  Qualitative goals like "high scalability" mean different things to
> different people, and there's no way to verify that you've met the goal,
> fallen far short, or greatly over-engineered.
>
> Have you read "High Performance MySQL, 2nd Edition" (
> http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716/)
> and "Scalable Internet Architectures" (
> http://www.amazon.com/Scalable-Internet-Architectures-Theo-Schlossnagle/dp/067232699X/)?
>  If not, do read them.
>
>
Thanks, I'll have a look at those.


>
>  I already use Zend_Cache to reduce unnecessary calls to the database, use
>> Headers to cache elements (like images, css) in the browser side, separate
>> web server from the database server.
>>
>
> Read books by Steve Souders (
> http://www.amazon.com/Steve-Souders/e/B001I9TVJS/), and his blog (
> http://www.stevesouders.com/blog/)?
>
>
Thanks, I already follow and use his "rules" in the current incarnation of
the software.


> Learn to use Yslow or Google PageSpeed.  These are performance analyzers
> for the web front-end that are analogous to using EXPLAIN to analyze SQL
> queries.  It's amazing to run high-profile websites (e.g. HuffingtonPost)
> through these tools and see how many bone-headed things they're doing.
>

You got that right, especially the sites that include the same javascript
multiple times.


>
> You need to have a strategy for testing availability and scalability during
> development, as much as you need a strategy for building for those goals.
>  What load testing tools are you using?  ApacheBench, Siege, HP httperf,
> JMeter, etc.  What code profiling tools are you using?
>

I will be using ApacheBench (ab), and Webgrind for profiling the PHP code.


>
> You need a strategy for monitoring your infrastructure availability and
> scalability at runtime too (e.g. Nagios).  There's no such thing as a
> perfectly available system (or at least it's cost-prohibitive).  Part of the
> strategy of high availability is being able to respond to and recover from
> interruptions.
>
>
>  So where does ZF enters here?  Well, the app will be itself developed
>> using ZF but I have some questions about:
>>
>
> Keep in mind the role of a framework.  A framework helps to speed up
> development efficiency, not runtime efficiency.  These are two separate
> priorities.
>
> High-scale websites may start by using a framework during the prototyping
> phase, but then after they have it working so they can measure where the
> bottlenecks exist, they start refactoring, which generally means separating
> their critical path code from using the framework, one piece at a time.
>  Eventually, they rewrite a lot of their web app in their own custom code,
> designed to be optimized for their specific web app.  They gradually reduce
> their usage of the framework in this way.
>
>
Bill, my question was more along these lines: which components of ZF (or
others built on top of it) could be used for that starting point?  For
example, I've read about using queues as a strategy to handle some
situations.  Should I use Zend_Queue? etc.



> Don't try to implement a perfectly optimized site from the ground up.  You
> need a prototype that you can run and measure using real users and real
> data.  Then you'll refactor incrementally as you identify the real
> bottlenecks.  This is the reason many high-scale sites start with a
> beta-access period.
>
>
>  a) database
>>
>> Should I use regular databases?  In my case MySQL with the master - slave
>> situation.  Or should I try Mongo/Cassandra because of the auto-sharding
>> features?   This is something that I don't need right now but some sort
>> future planning.
>>
>
> Don't adopt Mongo/Cassandra or any of the other NoSQL solutions (or any
> other technology, for that matter) until you've demonstrated a need.  Used
> properly, RDBMS solutions can give you the scalability you need, unless
> you're up at the level of Facebook.
>
>
Ok.


>
>  Are there any plans to have those Nosql adapters for ZF 1.x?
>>
>
> There are proposals, but I don't see them getting much traction yet.  It's
> going to be months or years before they're ready to use in ZF.  But you can
> use those NoSQL databases directly through PHP extensions, without
> integrating with ZF.  That's the advantage of ZF being a library of
> loosely-coupled components, instead of a tightly integrated framework -- you
> can choose to swap out usage of any component.
>
>
Ok.


>
>  b) content management
>>
>> in my app I have to upload files (images) besides text.  The text is
>> stored in the database so If I stick with the database replication I'll be
>> fine.  But how about the other files?  I think I need to control which files
>> have been uploaded to the management node and if they were correctly
>> replicated to all the client nodes.
>>
>
> I'm not sure copying the media files to *all* the client nodes is the right
> architecture.  I'm not saying it is or isn't, just encouraging you to
> question that assumption.  A single server running nginx or lighttpd can
> serve an awful lot of static images, working in parallel to Apache running
> your PHP app.  Then you don't have to transfer image files around at all.
>
>
Ok. The idea of copying to all client nodes is an attempt to treat all
clients as equals and to simplify the administration of the "director" of
the requests.  But using a lightweight web server to serve all the static
content for better performance is indeed a great approach.
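For what it's worth, the split Bill describes can be as small as one extra server block; this is a hypothetical nginx fragment (host names and paths invented) where nginx answers static requests directly and proxies everything else to Apache/PHP:

```
# Hypothetical: nginx serves static files itself, proxies the rest
# to the Apache/PHP backend on port 8080.
server {
    listen 80;
    server_name example.com;

    location ~* \.(png|jpe?g|gif|css|js|ico)$ {
        root /var/www/static;
        expires 30d;        # far-future caching, per Souders' rules
    }

    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}
```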


>
>  For the replication of the files I am considering some sort of queue to
>> generate a list of files and call rsync to do the actual transfer.
>>
>
> I wouldn't use rsync for this; it's too intelligent, trying to discover
> which files need to be synced.
>
>
I would use the queue to generate the list of files and use rsync only as
the copying "tool".


> You could use a BitTorrent type technology to do your file transfer.  I
> haven't tried this, but this blog was intriguing:
> http://engineering.twitter.com/2010/07/murder-fast-datacenter-code-deploys.html
>
>
I'll investigate that.


>
>  c) cache management
>>
>> in a similar situation when I add/modify/delete a content I'd like to
>> signal each client node that they should "invalidate" that cache object (if
>> they have)
>>
>
> You should use memcached, which already has a distributed architecture.
>  You shouldn't need to signal multiple caches that way.
>
>
That's great, I'll have a look at that.
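One common way to get group invalidation with a single shared cache, without signalling any node, is the namespace-version trick: every content key embeds a version number, and bumping the version makes all nodes miss the old entries at once.  This is only a sketch of the pattern; a plain array class stands in for the Memcached client, and the key names are hypothetical:

```php
<?php
// Stand-in for a Memcached client (get/set/increment only).
class FakeCache
{
    private $data = [];
    public function get($k) { return isset($this->data[$k]) ? $this->data[$k] : false; }
    public function set($k, $v) { $this->data[$k] = $v; }
    public function increment($k) { return ++$this->data[$k]; }
}

// Build a content key that embeds the namespace's current version.
function nsKey(FakeCache $c, $ns, $key)
{
    if ($c->get("ns:$ns") === false) {
        $c->set("ns:$ns", 1);
    }
    return sprintf('%s:v%d:%s', $ns, $c->get("ns:$ns"), $key);
}

$cache = new FakeCache();
$cache->set(nsKey($cache, 'articles', '42'), '<html>old copy</html>');

// Editing article content?  Bump the namespace version; every node
// now computes a new key and misses the stale entries.
$cache->increment('ns:articles');
var_dump($cache->get(nsKey($cache, 'articles', '42')));  // bool(false)
```

The stale entries are never deleted; they simply fall out of memcached's LRU once nothing reads them.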


>
>  d) Which version use?
>>
>> The development will begin in up to three months and probably last for six
>> more months after that.  So should I start using the ZF 2.0-dev or is it in
>> too early stages?  This will be the first time that I'll have the
>> opportunity to start from scratch so I'll probably have to maintain this
>> code base for at least 5 years...
>>
>
> It doesn't matter.  You're going to be doing continuous refactoring anyway,
> so don't wait for ZF2.0.
>
>
OK.


>
>  e) CDN
>>
>> This is probably OT (so feel free to reply me directly) but I am looking
>> for companies that I could host the servers (client nodes) and/or that could
>> provide a 'director' service, redirecting the requests to the closest active
>> client node (so I can add new nodes or remove for maintenance).
>>
>
> Have you read about the free CoralCDN (http://www.coralcdn.org/)?  This
> may be good for small static resources like small images, javascript, and
> CSS.
>
> Use a paid delivery network (e.g. Akamai) for large media like videos.
>  They should give you a simple interface to request a given media file.
>  Then they take care of optimizing for geography/topology.  Don't try to do
> that yourself.
>
>
Thanks again.


> Regards,
> Bill Karwin
>
>
