Hello,

    This is a little long, please bear with me. I was able to run more
tests; the results are at [1]. As I mentioned before, yesterday I was not
able to reproduce with hashcount.c the distribution we had seen when we
implemented our patch in Haproxy; the only remaining difference was
actually having the requests served via an instance of haproxy.

   I broke down the tests into the following 3 categories:
   (a) ~10K requests
   (b) all unique host names obtained from 1MM requests grepped from
haproxy production logs (48259 blogs)
   (c) 1MM requests

   Case (a) used the data set we had based our decision on; cases (b) and
(c) used a data set I obtained at the beginning of this conversation.

    I then ran these 3 scenarios through both haproxy and hashcount.c for
DJB2 and SDBM. When going through haproxy, I was able to reproduce* the
distribution we had based our decision on. Interestingly, the test harness
(hashcount.c) produces a wildly different distribution when its results
are compared to those of an instance of Haproxy, more so for SDBM.

   * As a sanity check, I verified that the results were reproducible
across repeated executions.
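
   For reference, here is a minimal sketch of the two algorithms under
test, in the textbook form I assume hashcount.c and the haproxy patch both
use (the real code may differ in details such as signedness or how the
header value is terminated):

    /* DJB2: hash = hash * 33 + c */
    unsigned long hash_djb2(const char *str)
    {
        const unsigned char *s = (const unsigned char *)str;
        unsigned long hash = 5381;
        int c;

        while ((c = *s++))
            hash = ((hash << 5) + hash) + c;
        return hash;
    }

    /* SDBM: hash = c + (hash << 6) + (hash << 16) - hash */
    unsigned long hash_sdbm(const char *str)
    {
        const unsigned char *s = (const unsigned char *)str;
        unsigned long hash = 0;
        int c;

        while ((c = *s++))
            hash = c + (hash << 6) + (hash << 16) - hash;
        return hash;
    }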

   My test haproxy configuration is available at [2]. With the goal of
eliminating as many variables as possible, I set up 17 backends to all use
a single nginx server serving static files from the file system; the nginx
config is also available at [2].
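
   To save a click on the gist, the setup amounts to roughly the sketch
below. This is only an illustration, not the actual config at [2]; the
server names and ports are made up, and I am assuming plain hdr-based
balancing:

    frontend blogs_fe
        bind *:80
        mode http
        default_backend blogs_be

    backend blogs_be
        mode http
        balance hdr(Host)              # hash on the custom Host header
        server cache01 127.0.0.1:8080  # all 17 "backends" point at the
        server cache02 127.0.0.1:8080  # same local nginx instance
        # ... cache03 through cache17, identical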

   A request flow looks like this:
   curl with custom Host header -> Haproxy on 80 -> hash on custom header
to select a backend -> (all backends point to the same nginx instance) ->
request forwarded to the selected backend nginx -> nginx responds
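
   My understanding (an assumption on my part, since 1.4 only supports
map-based hashing) is that both the harness and haproxy should boil down
to the equivalent of the following, using the hash functions sketched
above:

    /* Pick one of the 17 servers from the header hash. Any divergence
     * between hashcount.c and a running haproxy would have to come from
     * either the exact bytes fed into the hash (the header value) or
     * from this reduction step. */
    unsigned int pick_server(const char *host, unsigned int nbsrv)
    {
        return hash_djb2(host) % nbsrv;   /* or hash_sdbm(host) % nbsrv */
    }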

  Something similar to [3] was executed from the machine housing haproxy
and nginx.

  Finally, a note on the results [1]. Each sheet corresponds to one of
the cases (a)/(b)/(c). On the first sheet, under the heading "executing
via haproxy", you will see the std dev numbers that we based our decision
on. The raw results are also available at [2].

  Let me know if you would like to see more numbers with a changed
methodology etc., or if, after taking a look at the config, you have an
explanation for the difference in distribution when using haproxy vs.
using hashcount.c.

  As next steps, unless I hear otherwise from you, I am going to do the
following: implement the patch that I had suggested when I started this
thread, in which I will include wt_hash6 (correct?). I think these results
show that there is a justification for it. Once I have the patch building,
I will rerun these tests as a sanity check. After that, I will try to
determine why there is a difference between the test harness and a running
haproxy instance. My tests do not include avalanche because I am testing
only on 1.4.

Thanks
Bhaskar

[1] http://tinyurl.com/ke4g4ph
[2] https://gist.github.com/maddalab/7148672
[3] cat 1m_requests.log | xargs -I hh curl -s -H "Host: hh"
http://localhost/index.html >> results.txt


On Thu, Oct 24, 2013 at 9:15 AM, Bhaskar Maddala <[email protected]> wrote:

> Hello,
>
>    I took into account the uniqueness of the hostnames/actual traffic in
> the test. That is to say, I did a test using both unique and non-filtered
> hostnames. The results you see in the spreadsheet are from the case where
> I did not filter the names. We also made sure to capture the logs from 3
> different perimeter proxies, each servicing a different dns record type
> (*.tumblr.com, A-records ex: ablog.com, cnames ex: willy.tarreau.com).
>
>    For blogs that get extremely popular, we use an entirely different
> backend of varnish nodes with consistent hashing over the whole uri. We
> have a model that is used to determine these blogs, and traffic to them
> is redirected to the "special" backend by use of an ACL in the haproxy
> config. At the time I grabbed these logs, there were no blogs being
> directed to the heavy-traffic backend.
>
>   Even in the scenario that a blog were getting more traffic than others,
> I would expect that the algorithm with the higher standard deviation when
> using unique hostnames would have worse outcomes when testing with
> non-unique hostnames.
>
>   I will let you know once I run these through haproxy and capture from
> the logs which backend was selected, probably later today. Code is at [1].
>
>
> Thanks
> -Bhaskar
>
> [1] https://gist.github.com/maddalab/7136792
>
>
> On Thu, Oct 24, 2013 at 2:01 AM, Willy Tarreau <[email protected]> wrote:
>
>> Hello Bhaskar,
>>
>> On Wed, Oct 23, 2013 at 06:15:31PM -0400, Bhaskar Maddala wrote:
>> > Hello,
>> >
>> >   Apologies for the delay in responding. The trouble is largely to do
>> > with being able to reproduce the results with the test harness. I made
>> > a couple of changes to the test harness you provided, for standard
>> > deviation and variance; the results are at [1]. The sheet titles
>> > hopefully make sense, lmk in case they don't.
>>
>> Great, could you please post the updated source somewhere, it would
>> help others contribute and test with their logs.
>>
>> > It took a while to find the data that we based our decision on; the
>> > last sheet contains this data. The std dev for SDBM was 47 and for
>> > DJB2 was 30, and this difference correlates to connections from
>> > varnish to our application backends as [2]. I realize this is a
>> > second-order metric, however it makes sense in that the smaller
>> > standard deviation relates to a smoother distribution of load to
>> > varnish from haproxy, which in turn relates to connection counts to
>> > the application webs converging across the varnish pool.
>> >
>> > I spent some time looking for how this data was obtained, and just
>> > found out that it was done by performing http requests. I used this
>> > data against the test harness [3] and immediately noticed the
>> > difference in the std dev results.
>> >
>> > As a next step I will use the same methodology used previously
>> > (performing GET requests) to determine the efficacy of the algorithms.
>> > However, are there substantial differences between the test harness
>> > and the code in haproxy that would in any way explain this difference?
>> > Any other thoughts?
>>
>> Yes there is a simple reason which is that some host names probably cause
>> many more requests than other ones. So if you did like me (sort -u on the
>> host names), you're considering that they're all used at the exact same
>> frequency which is not true. That said, I still think that this should not
>> be considered when designing a hash function for the job. If a hash
>> algorithm is perfect, you will still see differences caused by the
>> variations
>> between the smallest and largest sites you're hosting.
>>
>> In my opinion, the best solution should probably be to hash the full
>> request
>> and not just the host. But with many blogs I suspect that if one of them
>> becomes very popular, it's still possible to see differences in cache
>> loads.
>>
>> One solution when you know that *some* blogs are very popular is to
>> dedicate
>> them a backend section in which you use another algorithm, probably round
>> robin. Indeed, if a few host names constitute more than a few percent of
>> your traffic, it makes sense to spread them over all caches and allow
>> their
>> contents to be replicated over the caches.
>>
>> I have not yet tried to apply murmurhash3 nor siphash, both of which seem
>> very promising. Maybe you would like to experiment with them ?
>>
>> Best regards,
>> Willy
>>
>>
>
