Ciprian Marius Vizitiu wrote:
> Hi listers,
>
> I have a strange firewall problem with Bacula 2.2.6 running on RHEL4
> (2.6.9-67 but it happens on other RHEL4 kernels too) clients and CentOS5
> server. The description of the problem is... long and ugly so I've
> managed to narrow it down to the following easy (for me) to reproduce
> scenario:
>
> 1. One RHEL4 Bacula 2.2.6 client, 192.168.1.25. Relevant iptables in
> this client:
>
> -A RH-Firewall-1-INPUT -p tcp --dport 9101:9103 -j ACCEPT
> -A RH-Firewall-1-INPUT -p udp --dport 9101:9103 -j ACCEPT
>
> 2. One Bacula 2.2.6 server, 192.168.1.48. Relevant iptables in this server:
>
> -A RH-Firewall-1-INPUT -p tcp --dport 9101:9103 -j ACCEPT
> -A RH-Firewall-1-INPUT -p udp --dport 9101:9103 -j ACCEPT
>
> Although there is no 3Com router involved "Heartbeat Interval" is set to
> 60s.
>
> Now, simply start a 23GB restore (full plus a differential) consisting
> of ~70.000 files on the client... everything works as expected for like
> 30 minutes during which the client writes 23GB. Then things start to go
> strange:
>
> 1. On the client there is no activity
> 2. On the server bacula-sd is busy on CPU and I/O most likely searching
> through the 10 x 200GB disk volumes for the differential files to restore.
>
> This "state" will last for another ~30 minutes during which a tcpdump
> will only hear the pings from the heartbeat. Depending on whether the
> firewalls are started or not the end can be one of the following:
>
> No firewall: restore job always ends successfully.
> With firewall: Depending on the positions of the planets either the job
> will succeed THREE HOURS later =:-o or (more likely...) it'll fail with
> a "no route to host" error. Tcpdump started when bacula-sd's job is
> nearing the end will clearly show the culprit:
>
> [... Heartbeat...]
>
> 18:32:01.504760 IP server.gbif.org.9103 > client.gbif.org.32776: P 1560794395:1560794427(32) ack 1414218623 win 181 <nop,nop,timestamp 4070418385 22509939>
> 18:32:01.504801 IP client.gbif.org > server.gbif.org: icmp 92: host client.gbif.org unreachable - admin prohibited
> 18:32:01.505214 IP server.gbif.org.9103 > client.gbif.org.32776: . 32:1480(1448) ack 1 win 181 <nop,nop,timestamp 4070418386 22509939>
> 18:32:01.505231 IP client.gbif.org > server.gbif.org: icmp 556: host client.gbif.org unreachable - admin prohibited
> 18:32:01.505236 IP server.gbif.org.9103 > client.gbif.org.32776: . 1480:2928(1448) ack 1 win 181 <nop,nop,timestamp 4070418386 22509939>
> 18:32:01.505249 IP client.gbif.org > server.gbif.org: icmp 556: host client.gbif.org unreachable - admin prohibited
>
> To me it looks like the essence of the problem is the fact that the
> restore session has a long "network idle" period and somehow the RELATED
> mechanism of the firewall no longer works. WHY would this happen? And
> more important, isn't this what HeartBeat was supposed to prevent in the
> first place? One more detail: if the client is RHEL5 everything works
> perfectly.
>
> Has anyone seen something like this before? Any ideas will be
> appreciated! :-|
>
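The capture quoted above shows a pattern worth confirming on a longer trace: each data segment sent from server port 9103 is answered by an ICMP "admin prohibited" message from the client. A minimal Python sketch for counting those rejects per host pair follows. It assumes the tcpdump text has been unwrapped to one packet per line, and the function name and regular expression are illustrative only, not part of Bacula or tcpdump.

```python
import re
from collections import Counter

# Matches tcpdump text lines of the form seen in the quoted capture, e.g.
#   18:32:01.504801 IP client.gbif.org > server.gbif.org: icmp 92: host ... admin prohibited
# Assumption: one packet per line (mail-wrapped lines must be rejoined first).
ICMP_REJECT = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>[^\s:]+): "
    r"icmp \d+: .*admin prohibited"
)


def count_admin_prohibited(lines):
    """Count ICMP 'admin prohibited' replies per (sender, receiver) pair."""
    counts = Counter()
    for line in lines:
        match = ICMP_REJECT.match(line.strip())
        if match:
            counts[(match.group("src"), match.group("dst"))] += 1
    return counts


if __name__ == "__main__":
    # Two lines taken from the quoted capture: one TCP segment, one ICMP reject.
    capture = [
        "18:32:01.504760 IP server.gbif.org.9103 > client.gbif.org.32776: "
        "P 1560794395:1560794427(32) ack 1414218623 win 181",
        "18:32:01.504801 IP client.gbif.org > server.gbif.org: icmp 92: "
        "host client.gbif.org unreachable - admin prohibited",
    ]
    for (src, dst), n in count_admin_prohibited(capture).items():
        print(f"{src} sent {n} admin-prohibited reply(ies) to {dst}")
```

If the count climbs only after the long idle phase, that would be consistent with the client firewall no longer recognising the connection, though it does not by itself say why the state was lost.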
I'm not 100% sure, but this looks a bit like a TCP session timeout (TTL)
on the firewall. I don't think the firewall will wait that long, and it
has nothing to do with the heartbeat. My guess is that your firewall
treats the session as closed or timed out because of the idle time.
Check whether you can adjust the TTL for TCP sessions on the firewall.

--
bEsT rEgArDs            | "Confidence is what you have before you
tomasz dereszynski      | understand the problem." -- Woody Allen
                        |
Spes confisa Deo        | "In theory, theory and practice are much
numquam confusa recedit | the same. In practice they are very
                        | different." -- Albert Einstein

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users