Hello Bhaskar,

On Wed, Oct 23, 2013 at 06:15:31PM -0400, Bhaskar Maddala wrote:
> Hello,
> 
>   Apologies for the delay in responding. The trouble is largely to do with
> being able to reproduce the results with the test harness. I made a couple
> of changes to the test harness you provided, for standard deviation and
> variance; the results are at [1]. The sheet titles hopefully make sense,
> let me know in case they don't.

Great, could you please post the updated source somewhere? It would
help others contribute and test with their own logs.

> It took a while to find the data that we based our decision on; the last
> sheet contains this data. The std dev for SDBM was 47 and for DJB2 was 30,
> and this difference correlates with connections from varnish to our
> application backends, as shown in [2]. I realize this is a second-order
> metric, however it makes sense in that the smaller standard deviation
> relates to a smoother distribution of loads to varnish from haproxy, which
> in turn relates to connection counts to the application web servers
> converging across the varnish pool.
> 
> I spent some time looking for how this data was obtained, and just found
> out that it was done by performing HTTP requests. I used this data against
> the test harness [3] and immediately noticed the difference in the std
> dev results.
> 
> I will, as a next step, use the same methodology used previously
> (performing GET requests) to determine the efficacy of the algorithms.
> However, are there substantial differences between the test harness and
> the code in haproxy that would in any way explain this difference? Any
> other thoughts?

Yes, there is a simple reason: some host names probably generate many more
requests than others. So if you did like me (sort -u on the host names),
you're assuming they're all used at the exact same frequency, which is not
true. That said, I still think that this should not be considered when
designing a hash function for the job. Even if a hash algorithm is perfect,
you will still see differences caused by the variations between the
smallest and largest sites you're hosting.
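To illustrate the point, here is a small sketch in the spirit of the test
harness (Python is just an assumption for the harness language, and the
host names and request counts are invented): the same djb2/sdbm hashes give
a very different per-server spread once each host is weighted by its
request frequency instead of being counted once as sort -u does.

```python
# Sketch: per-server load spread, counting each host once (as "sort -u"
# does) versus weighting each host by its request count. Host names and
# traffic figures are invented; djb2 and sdbm are the classic string hashes.
import statistics

def djb2(s):
    h = 5381
    for c in s.encode():
        h = ((h * 33) + c) & 0xffffffff
    return h

def sdbm(s):
    h = 0
    for c in s.encode():
        h = (c + (h << 6) + (h << 16) - h) & 0xffffffff
    return h

def load_stddev(hashfn, traffic, nservers):
    """Std dev of per-server request counts with hash(host) % nservers."""
    loads = [0] * nservers
    for host, requests in traffic.items():
        loads[hashfn(host) % nservers] += requests
    return statistics.pstdev(loads)

# Invented traffic: one host dominates, the other 200 are small.
traffic = {"blog%04d.example.com" % i: 10 for i in range(200)}
traffic["popular.example.com"] = 5000
uniform = {host: 1 for host in traffic}   # what "sort -u" measures

for fn in (djb2, sdbm):
    print(fn.__name__,
          "uniform:", round(load_stddev(fn, uniform, 8), 1),
          "weighted:", round(load_stddev(fn, traffic, 8), 1))
```

The weighted spread dwarfs the uniform one for both hashes, regardless of
how good the hash itself is, which is exactly the effect described above.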

In my opinion, the best solution would probably be to hash the full request
and not just the host. But with many blogs, I suspect that if one of them
becomes very popular, it's still possible to see differences in cache loads.
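A quick sketch of why full-request hashing helps (the host and path names
are made up, and djb2 stands in for whichever hash is used): with
host-only hashing, every request for one popular site lands on a single
cache, while hashing the full URI spreads its pages across all of them.

```python
# Sketch: host-only hashing pins one popular site to a single cache;
# hashing the full request URI spreads its pages over all caches.
# Host/path names are invented; djb2 is the hash discussed in this thread.
def djb2(s):
    h = 5381
    for c in s.encode():
        h = ((h * 33) + c) & 0xffffffff
    return h

NSERVERS = 4
requests = ["popular.example.com/post/%d" % i for i in range(1000)]

by_host = set(djb2(r.split("/", 1)[0]) % NSERVERS for r in requests)
by_full = set(djb2(r) % NSERVERS for r in requests)

print("caches used, hashing host only:", len(by_host))  # always 1
print("caches used, hashing full URI: ", len(by_full))
```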

One solution, when you know that *some* blogs are very popular, is to
dedicate a backend section to them in which you use another algorithm,
probably round robin. Indeed, if a few host names constitute more than a
few percent of your traffic, it makes sense to spread them over all caches
and allow their contents to be replicated across the caches.
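For instance, it could look something like this in the configuration (the
section names, host names and addresses are of course invented, and the
exact list of popular hosts is yours to fill in):

```
# Sketch: route a few known-popular hosts to a dedicated round-robin
# backend, and hash the Host header for everything else.
frontend fe_http
    bind :80
    acl popular_host hdr(host) -i blog1.example.com blog2.example.com
    use_backend bk_popular if popular_host
    default_backend bk_hashed

backend bk_popular
    balance roundrobin
    server cache1 10.0.0.1:6081 check
    server cache2 10.0.0.2:6081 check

backend bk_hashed
    balance hdr(host)
    server cache1 10.0.0.1:6081 check
    server cache2 10.0.0.2:6081 check
```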

I have not yet tried murmurhash3 or siphash, both of which seem very
promising. Maybe you would like to experiment with them?
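In case it helps you get started, here is a plain transcription of the
public-domain MurmurHash3 algorithm (the x86 32-bit variant) that could be
dropped into a test harness; Python is only an assumption for the harness
language, and this is a sketch for offline experiments, not the form the
code would take inside haproxy.

```python
# Sketch: MurmurHash3 x86_32, transcribed for harness experiments.
def murmur3_32(data, seed=0):
    """Hash a byte string with MurmurHash3 (x86, 32-bit variant)."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed
    nblocks = len(data) - (len(data) % 4)
    # Body: process 4-byte little-endian blocks.
    for i in range(0, nblocks, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff   # rotl32(k, 15)
        k = (k * c2) & 0xffffffff
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xffffffff   # rotl32(h, 13)
        h = (h * 5 + 0xe6546b64) & 0xffffffff
    # Tail: 1 to 3 remaining bytes.
    tail, k = data[nblocks:], 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        k = (k * c2) & 0xffffffff
        h ^= k
    # Finalization: mix in the length and avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xffffffff
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xffffffff
    h ^= h >> 16
    return h

print(hex(murmur3_32(b"www.example.com")))
```

It could be plugged into the harness in place of djb2/sdbm (e.g.
`murmur3_32(host.encode()) % nservers`) to compare the resulting std devs
on the same logs.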

Best regards,
Willy
