Replying to myself :)

I think I spotted a bug in HAProxy as well. For some reason, when I run HAProxy in debug mode, I never see the issue (all my servers are properly populated and maintained).
I ran strace on the process running in daemon mode in the container, and I can confirm the following behavior:
- the first request is sent honoring the EDNS extension (to get big responses), up to 8K
- the first response comes back with the full list (around 2K of data)
- the second request is sent with the default accepted payload size (around 1K)
- the second response comes back with partial records
- the mess starts from there

Now that I can reproduce the bug, I'm going to investigate what's happening and provide a fix asap.
Thanks again Mike for reporting!!!

Baptiste

On Mon, Feb 12, 2018 at 10:17 AM, Baptiste <bed...@gmail.com> wrote:

> Continuing my investigation, I found another interesting piece of
> information:
> I run HAProxy and my consul environment in a docker host, through
> docker-compose, and I can reproduce the same issue as you.
> Basically, I have a service delivered by 20 containers, and HAProxy in
> docker can see only 10 of them and switches all their IPs all the time...
> That said, if I run the same HAProxy binary on my laptop, pointing its
> DNS resolvers to the consul client running in my docker host, everything
> works smoothly!!!
>
> In my case, there is one thing that might happen: docker drops DNS
> responses (UDP) that are too big, and my HAProxy falls back to 512 bytes,
> where only 10 SRV records can fit (consul also returns A and TXT records
> for each SRV response).
>
> I tested both the latest 1.8 and 1.9-dev and can report the same issue in
> both cases.
>
> Could you tell me more about your environment (drop the ML if there is
> too much sensitive information)?
>
> Baptiste
>
>
> On Mon, Feb 12, 2018 at 9:25 AM, Baptiste <bed...@gmail.com> wrote:
>
>> First, I confirm the following bug in consul 1.0.5:
>> - start X instances of a service
>> - scale the service to X+Y (with Y > 1)
>> ==> then consul crashes...
>> From time to time, I also saw HAProxy getting only 10 servers out of 20
>> for a given service.
>>
>> I'll revert to 1.0.2 for now.
>>
>> The order of the returned SRV records is ignored by HAProxy.
>> Can you confirm the number of servers associated with the service
>> 'mfm-monitor-opentsdb' in consul?
>> On the HAProxy box, can you run the following command and return the
>> output (obfuscating the IPs and other sensitive information):
>> dig +notcp @127.0.0.1 -p 8600 -t SRV _mfm-monitor-opentsdb._tcp.service.consul
>>
>> Baptiste
>>
>>
>> On Mon, Feb 12, 2018 at 8:27 AM, Чепайкин Михаил <mchepay...@gmail.com>
>> wrote:
>>
>>> I'm on Consul 1.0.2.
>>>
>>> Why do you think this issue is about serving SRV over UDP, rather than
>>> about the different order of SRV or A records returned by Consul DNS on
>>> consecutive requests?
>>>
>>> On 11 February 2018 at 18:46, Baptiste <bed...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> What consul version are you using?
>>>> I'm facing the same issue in my consul lab. That said, it seems to be a
>>>> bug in consul, which is not able to serve too many SRV records over UDP.
>>>> I even triggered a consul crash (using version 1.0.5).
>>>> I'm still investigating this issue and will come back to you as soon as
>>>> I have more reliable information.
>>>>
>>>> Note: please ensure the number of servers created by the server-template
>>>> directive (5 in your case) is above the expected number of servers
>>>> available in your service.
>>>>
>>>> Baptiste
>>>>
>>>>
>>>> On Thu, Feb 8, 2018 at 12:32 AM, Чепайкин Михаил <mchepay...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I’ve changed the configuration as you suggested:
>>>>>
>>>>> backend tsdb_backend_query
>>>>>     server-template tsdb_query 5 _mfm-monitor-opentsdb._tcp.service.mfmconsul:4242 check resolvers dns inter 1000
>>>>>
>>>>> The logs are somewhat different (backend servers now go UP and DOWN),
>>>>> but the behavior looks the same, with IP addresses changing in the
>>>>> same way:
>>>>>
>>>>> time="2018-02-08T02:12:53+03:00" level=info msg="[WARNING] 038/021253 (18208) : Server tsdb_backend_query/tsdb_query1 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:53+03:00" level=info msg="[WARNING] 038/021253 (18208) : tsdb_backend_query/tsdb_query1 changed its IP from 10.182.161.223 to 10.182.161.211 by DNS cache." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:53+03:00" level=info msg="[WARNING] 038/021253 (18208) : Server tsdb_backend_query/tsdb_query1 administratively READY thanks to valid DNS answer." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:53+03:00" level=info msg="[WARNING] 038/021253 (18208) : Server tsdb_backend_query/tsdb_query1 ('0ab6a1d3.addr.dc1.mfmconsul') is UP/READY (resolves again)." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:55+03:00" level=info msg="[WARNING] 038/021255 (18208) : Server tsdb_backend_query/tsdb_query3 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:55+03:00" level=info msg="[WARNING] 038/021255 (18208) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.98 to 10.182.161.223 by DNS cache." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:55+03:00" level=info msg="[WARNING] 038/021255 (18208) : Server tsdb_backend_query/tsdb_query3 administratively READY thanks to valid DNS answer." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:55+03:00" level=info msg="[WARNING] 038/021255 (18208) : Server tsdb_backend_query/tsdb_query3 ('0ab6a1df.addr.dc1.mfmconsul') is UP/READY (resolves again)." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:57+03:00" level=info msg="[WARNING] 038/021257 (18208) : Server tsdb_backend_query/tsdb_query3 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:57+03:00" level=info msg="[WARNING] 038/021257 (18208) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.223 to 10.182.161.98 by DNS cache." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:57+03:00" level=info msg="[WARNING] 038/021257 (18208) : Server tsdb_backend_query/tsdb_query3 administratively READY thanks to valid DNS answer." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:12:57+03:00" level=info msg="[WARNING] 038/021257 (18208) : Server tsdb_backend_query/tsdb_query3 ('0ab6a162.addr.dc1.mfmconsul') is UP/READY (resolves again)." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:13:01+03:00" level=info msg="[WARNING] 038/021301 (18208) : Server tsdb_backend_query/tsdb_query1 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:13:01+03:00" level=info msg="[WARNING] 038/021301 (18208) : tsdb_backend_query/tsdb_query1 changed its IP from 10.182.161.211 to 10.182.161.223 by DNS cache." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:13:01+03:00" level=info msg="[WARNING] 038/021301 (18208) : Server tsdb_backend_query/tsdb_query1 administratively READY thanks to valid DNS answer." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:13:01+03:00" level=info msg="[WARNING] 038/021301 (18208) : Server tsdb_backend_query/tsdb_query1 ('0ab6a1df.addr.dc1.mfmconsul') is UP/READY (resolves again)." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:13:05+03:00" level=info msg="[WARNING] 038/021305 (18208) : Server tsdb_backend_query/tsdb_query2 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue." job=mfm-monitor-haproxy pid=18208
>>>>> time="2018-02-08T02:13:05+03:00" level=info msg="[WARNING] 038/021305 (18208) : tsdb_backend_query/tsdb_query2 changed its IP from 10.182.161.163 to 10.182.161.211 by DNS cache." job=mfm-monitor-haproxy pid=18208
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> On 8 February 2018 at 01:25, Baptiste <bed...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> You're not using SRV records and that may be the root cause of your
>>>>>> issue.
>>>>>> Please try something like this:
>>>>>>
>>>>>> backend tsdb_backend_query
>>>>>>     server-template tsdb_query 5 _mfm-monitor-opentsdb._tcp.service.mfmconsul:4242 check resolvers dns inter 1000
>>>>>>
>>>>>> if "mfm-monitor-opentsdb" is your service name in consul.
>>>>>>
>>>>>> Baptiste
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 7, 2018 at 2:52 PM, Чепайкин Михаил <mchepay...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I have Consul as a service discovery tool and HAProxy as a load
>>>>>>> balancer.
>>>>>>>
>>>>>>> A service running on a number of servers is registered in Consul, and
>>>>>>> this service can be scaled by adding and removing nodes and by moving
>>>>>>> nodes from one server to another.
>>>>>>>
>>>>>>> Consul has a DNS service which randomizes responses for services like
>>>>>>> that:
>>>>>>>
>>>>>>> [bux] michep@bux:~$ dig +short mfm-monitor-opentsdb.service.mfmconsul
>>>>>>> 10.182.161.239
>>>>>>> 10.182.161.152
>>>>>>> 10.182.161.240
>>>>>>> 10.182.161.92
>>>>>>> [bux] michep@bux:~$ dig +short mfm-monitor-opentsdb.service.mfmconsul
>>>>>>> 10.182.161.92
>>>>>>> 10.182.161.152
>>>>>>> 10.182.161.240
>>>>>>> 10.182.161.239
>>>>>>>
>>>>>>> In HAProxy 1.8.3 I'm using a server-template configuration like this:
>>>>>>>
>>>>>>> resolvers dns
>>>>>>>     nameserver dns1 ${HAPROXY_NAMESERVER}
>>>>>>>     hold valid 2s
>>>>>>>
>>>>>>> backend tsdb_backend_query
>>>>>>>     server-template tsdb_query 5 mfm-monitor-opentsdb.service.mfmconsul:4242 check resolvers dns inter 1000
>>>>>>>
>>>>>>> And in that case I get a lot of warnings in the haproxy log:
>>>>>>>
>>>>>>> time="2018-02-02T15:44:32+03:00" level=info msg="[WARNING] 032/154432 (32983) : tsdb_backend_query/tsdb_query1 changed its IP from 10.182.161.240 to 10.182.161.239 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:44:42+03:00" level=info msg="[WARNING] 032/154442 (32983) : tsdb_backend_query/tsdb_query1 changed its IP from 10.182.161.239 to 10.182.161.240 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:44:46+03:00" level=info msg="[WARNING] 032/154446 (32983) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.152 to 10.182.161.239 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:44:50+03:00" level=info msg="[WARNING] 032/154450 (32983) : tsdb_backend_query/tsdb_query2 changed its IP from 10.182.161.92 to 10.182.161.152 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:44:52+03:00" level=info msg="[WARNING] 032/154452 (32983) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.239 to 10.182.161.92 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:44:56+03:00" level=info msg="[WARNING] 032/154456 (32983) : tsdb_backend_query/tsdb_query1 changed its IP from 10.182.161.240 to 10.182.161.239 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:00+03:00" level=info msg="[WARNING] 032/154500 (32983) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.92 to 10.182.161.240 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:02+03:00" level=info msg="[WARNING] 032/154502 (32983) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.240 to 10.182.161.92 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:04+03:00" level=info msg="[WARNING] 032/154504 (32983) : tsdb_backend_query/tsdb_query2 changed its IP from 10.182.161.152 to 10.182.161.240 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:06+03:00" level=info msg="[WARNING] 032/154506 (32983) : tsdb_backend_query/tsdb_query1 changed its IP from 10.182.161.239 to 10.182.161.152 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:10+03:00" level=info msg="[WARNING] 032/154510 (32983) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.92 to 10.182.161.239 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:18+03:00" level=info msg="[WARNING] 032/154518 (32983) : tsdb_backend_query/tsdb_query3 changed its IP from 10.182.161.239 to 10.182.161.92 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>> time="2018-02-02T15:45:20+03:00" level=info msg="[WARNING] 032/154520 (32983) : tsdb_backend_query/tsdb_query2 changed its IP from 10.182.161.240 to 10.182.161.239 by DNS cache." job=mfm-monitor-haproxy pid=32983
>>>>>>>
>>>>>>> This doesn't really break the service, but I think it is not quite
>>>>>>> normal.
>>>>>>>
>>>>>>> Any advice on how to resolve this issue?
>>>>>>>
>>> --
>>> Mike Chepaykin
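
Note for readers of this thread: the "accepted payload size" discussed in the most recent message is announced by a DNS client in an EDNS0 OPT record appended to the query (RFC 6891), and without that record a server has to assume the classic 512-byte UDP limit. The Python sketch below only illustrates that wire format. It is not HAProxy's code, and the function names (encode_name, build_srv_query, advertised_payload_size) and query IDs are made up for the example; it also assumes the smaller second request simply carries no OPT record, which the thread does not confirm.

```python
import struct

SRV_QTYPE = 33   # DNS record type SRV
OPT_RTYPE = 41   # EDNS0 pseudo-record type OPT (RFC 6891)
IN_CLASS = 1     # Internet class


def encode_name(name):
    """Encode a dotted name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"  # root label terminates the name


def build_srv_query(name, query_id, payload_size=None):
    """Build a DNS SRV query; append an OPT record if payload_size is set."""
    arcount = 1 if payload_size is not None else 0
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    msg = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    msg += encode_name(name) + struct.pack("!HH", SRV_QTYPE, IN_CLASS)
    if payload_size is not None:
        # OPT record: root name, type 41, the CLASS field carries the
        # accepted UDP payload size, TTL (extended flags) 0, RDLENGTH 0.
        msg += b"\x00" + struct.pack("!HHIH", OPT_RTYPE, payload_size, 0, 0)
    return msg


def advertised_payload_size(msg):
    """Return the size announced by a trailing empty OPT record, or None."""
    arcount = struct.unpack("!H", msg[10:12])[0]
    if arcount == 0:
        return None  # no EDNS: the server must assume 512 bytes
    rtype, size = struct.unpack("!HH", msg[-10:-6])
    return size if rtype == OPT_RTYPE else None


if __name__ == "__main__":
    name = "_mfm-monitor-opentsdb._tcp.service.mfmconsul"
    first = build_srv_query(name, 1, payload_size=8192)  # EDNS, up to 8K
    second = build_srv_query(name, 2)                    # no EDNS at all
    print(advertised_payload_size(first), advertised_payload_size(second))
```

Comparing the two queries in a packet capture (or with the helper above) is one way to check whether a resolver keeps announcing the larger size on every request.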