Re: [pfSense] DNS resolution issues under heavy load

David Noel Wed, 19 Mar 2014 17:56:24 -0700

Thanks.

I do recall seeing some notifications in the Web Configurator about
sync failing when I had everything up and running. I'm pretty sure
there's still an issue with either the modem or line itself though.
I've plugged the servers directly into the modem, run the crawlers,
and DNS queries still fail. If it were purely an ALIX or pfSense issue,
bypassing them should have fixed it. It's strange that only DNS
queries fail... once the addresses resolve the throughput is fine. At
any rate I contacted my provider and they agreed to send out a newer,
heavier-duty modem to try. Hopefully that fixes it.
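
One way to put a number on the failures, independent of HTMLUnit, is a
small lookup probe. What follows is only a minimal sketch: the DnsProbe
class, its Resolver interface, and the host names are made up for
illustration. It assumes the JVM's lookup cache can be switched off
through the networkaddress.cache.ttl security properties, so that
repeated rounds actually reach the resolver instead of the cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.Arrays;
import java.util.List;

final class DnsProbe {
    /** Hypothetical seam so the lookup can be replaced by a stub. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /** Resolves every host `rounds` times; returns the fraction of lookups that failed. */
    static double failureRate(Resolver resolver, List<String> hosts, int rounds) {
        int total = 0;
        int failed = 0;
        for (int i = 0; i < rounds; i++) {
            for (String host : hosts) {
                total++;
                try {
                    resolver.resolve(host);
                } catch (UnknownHostException e) {
                    failed++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) failed / total;
    }

    public static void main(String[] args) {
        // Assumption: disable JVM caching of both successful and failed
        // lookups, set before the first lookup is made.
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        // Placeholder names; substitute hosts the crawlers actually visit.
        List<String> hosts = Arrays.asList("example.com", "example.org");
        double rate = failureRate(InetAddress::getByName, hosts, 50);
        System.out.printf("failed lookups: %.1f%%%n", rate * 100);
    }
}
```

Running the same probe once behind the firewalls and once plugged
straight into the modem, while the crawlers are active, would give two
failure rates to compare. The operating system's own resolver cache, if
any, may still absorb some lookups.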


-David

On 3/19/14, Chris Buechler <c...@pfsense.com> wrote:
> It sounds like you don't have state sync enabled on the secondary; it
> won't accept the primary's states without that.
>
> Depending on how much load you're generating with the crawlers, you
> could be hitting the limits of the ALIX in new connections per sec.
> I've seen one customer blast out 10K+ emails (and 10K+ SMTP
> connections) in less than a second, which put enough load on their
> ALIX pair that it failed over CARP because the primary was under too
> much load to send its advertisements.
>
> Though the modem theory is just as plausible, especially if the modem
> is doing any kind of NAT or filtering. If you're not hitting it so
> hard you're failing over CARP, that points to it being something other
> than the firewall. Packet capture on WAN filtered on port 53 would be
> more telling. If you see DNS queries leaving there that get no reply
> back, it's not the firewall.
>
>
> On Wed, Mar 19, 2014 at 9:50 AM, David Noel <david.i.n...@gmail.com> wrote:
>> Well, it may not be the ALIX boards after all. I connected the servers
>> directly to the modem, ran the crawlers, and I'm still getting
>> UnknownHostExceptions. I'm guessing my modem's to blame... I'll have
>> to upgrade it and find out.
>>
>> On 3/18/14, David Noel <david.i.n...@gmail.com> wrote:
>>> Well, I bumped Maximum State Table from the default of 23,000 to
>>> 75,000, and now it's throwing fewer UnknownHostExceptions. But
>>> they're still being thrown. My resource utilization is getting pretty
>>> high though. I don't think these ALIX boards can handle much more of a
>>> load, and I still have 2 more servers I need to scale these crawlers
>>> out to. I do see there's a "Firewall Adaptive Timeouts" setting in the
>>> web configurator. This seems like it might be useful. Can anyone
>>> recommend any settings I should try to free up some system resources?
>>> I'm not clear on the consequences of purging pf state entries and
>>> whether that's something I'd want to do though.
>>>
>>> The state table on my primary router (alix1) is at roughly 50%
>>> utilization, or 40,000 states. The state table on my secondary router
>>> (alix2) is at 0%, roughly 250 states. This seems odd. Is this to be
>>> expected under CARP? Why is the load not distributed evenly?
>>>
>>> Memory usage on my primary router (alix1) is hovering around 55% (of
>>> 235MB). On my backup (alix2) it's pushing 85-90%. Does this make sense
>>> to anyone? Top output looks roughly the same... and now alix2 has gone
>>> down. 95% packet loss. Web Configurator unresponsive. ... It's back up
>>> but throwing "500 - Internal Server Error"s periodically. I've ssh'd
>>> in to alix2 and am looking at top output. tcpdump seems to be running
>>> for pflog purposes, and it's hogging quite a bit of CPU. Is this
>>> necessary? Can I disable it somehow?
>>>
>>> -David
>>>
>>> On 3/18/14, David Noel <david.i.n...@gmail.com> wrote:
>>>> I've encountered a strange issue while scaling a Java project that I'm
>>>> not quite sure how to resolve. Any thoughts would be appreciated.
>>>>
>>>> The code is a crawler that uses HTMLUnit to crawl a bunch of pages
>>>> concurrently. It uses HTMLUnit's getPage method to do the crawling. I'm
>>>> running 100 threads per instance. When I have 1 instance up and
>>>> running on 1 machine everything is fine. When I scale it to a second
>>>> machine, though, I start having trouble. Calls to getPage keep throwing
>>>> UnknownHostExceptions (DNS resolution errors). With 2 servers running,
>>>> roughly 1 out of every 20 calls to getPage throws this exception. For
>>>> some reason it's unable to resolve domain names, and it's not just
>>>> the crawlers: my entire network starts to fail on DNS queries. On
>>>> different systems on the same network I get 'unable to resolve host'
>>>> errors in my web browser periodically when loading URLs. Usually when
>>>> I retry it goes through, but it keeps happening sporadically as long
>>>> as the crawlers are running.
>>>>
>>>> So many things could be going wrong here. Thinking maybe it was my
>>>> provider throttling DNS queries, I've tried changing DNS servers, but
>>>> that's done nothing. Thinking it might be a bandwidth issue, I checked
>>>> systat, but the cumulative load is well under what my line can handle.
>>>> What else could be causing this? My network is pretty simple: Provider
>>>> <--> modem <--> 2 ALIX boards running pfSense <--> Servers and
>>>> workstations. The servers are running FreeBSD, and the workstations
>>>> run FreeBSD, Windows, and OS X.
>>>>
>>>> Has anyone encountered this before? Does anyone have any thoughts on
>>>> what might be causing it?
>>>>
>>>> My only other thought is that maybe pfSense is doing something strange,
>>>> so if I can't come up with any better ideas I'll try plugging the
>>>> servers directly into the modem. I'd rather have them behind the
>>>> routers though, so this would be a less-than-ideal solution.
>>>>
>>>> UPDATE: Ok, so it seems to be a pfSense issue. I launched the crawlers
>>>> on 2 servers as before and waited for UnknownHostExceptions to be
>>>> thrown. I then took a spare laptop and connected it directly into my
>>>> modem, bypassing my 2 pfSense routers. All DNS queries have gone
>>>> through without a hitch, so something strange is going on with
>>>> pfSense. Can anyone think of what might be causing this? I'm guessing
>>>> there's some tunable that needs to be tweaked, but I'm not sure where
>>>> to start. I also might have configured pfSense incorrectly, but I
>>>> think that's less likely to be the case than some default tunable
>>>> being set too low because at low volumes all DNS queries go through
>>>> just fine. If it were a configuration error it seems more likely that
>>>> no DNS queries would be going through. If it's relevant, with 200
>>>> active threads I'm probably querying DNS a minimum of 10 times per
>>>> second.
>>>>
>>>> Can anyone think of anything I might have done wrong setting up
>>>> pfSense? Does anyone know of any tunables that might be causing this
>>>> error?
>>>>
>>>
>> _______________________________________________
>> List mailing list
>> List@lists.pfsense.org
>> https://lists.pfsense.org/mailman/listinfo/list
>
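
The original post notes that a failed lookup usually goes through on a
retry. A crawler could lean on that with a bounded retry and backoff
around the lookup, sketched below. The RetryingLookup class and its
Resolver interface are hypothetical names, and how such a wrapper would
be hooked into HTMLUnit is not shown. Note that the JVM caches failed
lookups for a short time by default (networkaddress.cache.negative.ttl,
10 seconds), so an immediate retry through InetAddress.getByName may
just return the cached failure unless the backoff outlasts that window
or the property is lowered.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class RetryingLookup {
    /** Hypothetical seam so the lookup can be replaced by a stub. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /**
     * Tries the lookup up to maxAttempts times, doubling the pause after
     * each failure, and rethrows the last failure if none succeeds.
     */
    static InetAddress resolveWithRetry(Resolver resolver, String host,
                                        int maxAttempts, long initialBackoffMillis)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // wait before the next attempt
                    backoff *= 2;
                }
            }
        }
        throw last; // non-null here because maxAttempts >= 1
    }
}
```

A call might look like
resolveWithRetry(InetAddress::getByName, host, 3, 500). This only masks
the symptom: every retry is one more DNS query through the same path, so
it does not replace finding out where the queries are being dropped.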
