Dear Willy,
       Thank you for your help. We have a clear performance goal for our
cluster: high availability and maximum throughput under a predefined,
constant latency. However, we do not yet have a clear idea of which
architecture or software would allow us to achieve that. Let me provide
more details and try to answer your questions.
        We have over three million files. Each static file is rather
small (< 5 MB) and has a unique identifier that also serves as its URL. As
a result, we are in the second case you mentioned. In particular, we should
be concerned about what happens if everybody downloads the same file
simultaneously. We replicate each file on at least two servers to provide
failover and load balancing. In particular, if a server temporarily fails,
users can retrieve the files kept on the failed server from another server.
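
To make the two-replica placement concrete, here is a minimal sketch in
Python. It assumes rendezvous (highest-random-weight) hashing, which is
chosen here only for illustration and is not something HAProxy or our
current setup is known to provide as written. The server names and the
functions score, replicas, and pick_server are hypothetical.

```python
import hashlib
import random


def score(server: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (server, key) pair."""
    digest = hashlib.sha256(f"{server}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas(key: str, servers: list, count: int = 2) -> list:
    """Return the `count` servers with the highest weight for this key."""
    ranked = sorted(servers, key=lambda s: score(s, key), reverse=True)
    return ranked[:count]


def pick_server(key: str, servers: list, alive: set, count: int = 2):
    """Balance across the live replicas of a key; None if all are down."""
    live = [s for s in replicas(key, servers, count) if s in alive]
    return random.choice(live) if live else None


# Hypothetical four-server cluster and file key.
servers = ["web1", "web2", "web3", "web4"]
print(replicas("/files/abc", servers))
print(pick_server("/files/abc", servers, alive=set(servers)))
```

Every front end that knows the same server list computes the same two
replicas for a key, so no lookup table is needed. When one replica is down,
requests for its keys fall through to the surviving replica.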
       We do not have a caching layer at the moment. More precisely, every
request is served directly by the web servers. We want the system to
scale linearly with its size. In particular, when a new server is added,
we want it to receive the same share of traffic as each of the existing
servers.
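
One rough way to check this scaling goal is to count how many keys change
their primary server when a server is added. The sketch below again assumes
rendezvous hashing, chosen only for illustration. The key names, server
names, and the functions owner and moved_fraction are hypothetical. With
four servers growing to five, roughly one fifth of the keys should move,
and they should all move to the new server.

```python
import hashlib


def owner(key: str, servers: list) -> str:
    """Primary server for a key: the one with the highest hash weight."""
    def weight(server: str) -> int:
        digest = hashlib.sha256(f"{server}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(servers, key=weight)


def moved_fraction(keys: list, old: list, new: list) -> float:
    """Fraction of keys whose primary server differs between two layouts."""
    moved = sum(1 for k in keys if owner(k, old) != owner(k, new))
    return moved / len(keys)


# Hypothetical keys and servers, for illustration only.
keys = [f"/files/{i}" for i in range(10000)]
old_servers = ["web1", "web2", "web3", "web4"]
new_servers = old_servers + ["web5"]

frac = moved_fraction(keys, old_servers, new_servers)
print(f"fraction of keys that moved: {frac:.3f}")
```

Note that this only measures an equal share of keys. An equal share of
traffic additionally depends on how popular the individual files are.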
       I will investigate Varnish and see whether it fits our system.

On Wed, Nov 23, 2011 at 8:15 AM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi,
>
> On Fri, Nov 18, 2011 at 05:48:54PM +0100, Rerngvit Yanggratoke wrote:
> > Hello All,
> >         First of all, pardon me if I'm not communicating very well.
> > English is not my native language. We are running a static file
> > distribution cluster. The cluster consists of many web servers serving
> > static files over HTTP. We have a very large number of files, such that
> > a single server simply cannot keep all of them (there is not enough disk
> > space). In particular, a file can be served only from a subset of
> > servers. Each file is uniquely identified by its URI. I will refer to
> > this URI later as a key.
> >         I am investigating deploying HAProxy as a front end to this
> > cluster. We want HAProxy to provide load balancing and automatic
> > failover. In other words, a request comes first to HAProxy, and HAProxy
> > should forward the request to an appropriate backend server. More
> > precisely, for a particular key, there should be at least two servers
> > that HAProxy can forward to, for the sake of load balancing. My question
> > is: what load balancing strategy should I use?
> >         I could use hashing (based on the key) or consistent hashing.
> > However, each file would end up being served by a single server at any
> > particular moment. That means I wouldn't have load balancing and
> > failover for a particular key.
>
> This question is much more a question of architecture than of
> configuration. What is important is not what you can do with haproxy, but
> how you want your service to run. I suspect that if you acquired hardware
> and bandwidth to build your service, you have pretty clear ideas of how
> your files will be distributed and/or replicated between your servers. You
> also know whether you'll serve millions of files or just a few tens, which
> means in the first case that you can safely have one server per URL, and
> in the latter that you would risk overloading a server if everybody
> downloads the same file at the same time. Maybe you have installed caches
> to avoid overloading some servers. You have probably planned what will
> happen when you add new servers, and what is supposed to happen when a
> server temporarily fails.
>
> All of these are very important questions, they determine whether your site
> will work or fail.
>
> Once you're able to respond to these questions, it becomes much more
> obvious what the LB strategy can be, if you want to dedicate server farms
> to some URLs, or load-balance each hash among a few servers because you
> have a particular replication strategy. And once you know what you need,
> then we can study how haproxy can respond to this need. Maybe it can't at
> all, maybe it's easy to modify it to respond to your needs, maybe it does
> respond pretty well.
>
> My guess from what you describe is that it could make a lot of sense to
> have one layer of haproxy in front of Varnish caches. The first layer of
> haproxy chooses a cache based on a consistent hash of the URL, and each
> varnish is then configured to address a small bunch of servers in round
> robin. But this means that you need to assign servers to farms, and that
> if you lose a varnish, all the servers behind it are lost too.
>
> If your files are present on all servers, it might make sense to use
> varnish as explained above but which would round-robin across all servers.
> That way you make the cache layer and the server layer independent of each
> other. But this can imply complex replication strategies.
>
> As you see, there is no single response, you really need to define how you
> want your architecture to work and to scale first.
>
> Regards,
> Willy
>
>
>


-- 
Best Regards,
Rerngvit Yanggratoke
