Hi, as previously written, we've seen a lot of problems/unavailability/strange behavior wrt. our Internet-facing, i.e. more or less public, services running on Solaris machines after upgrading to S10u4 or higher.
Finally (in July 2009, i.e. almost 2 years later!) it turned out that the state table size is by far too small - see fr_statemax in:

ipf -T list | awk '/fr_state/ { print $1, $7 }'

The Sun case engineer explained that if ipf cannot insert an entry into the state table, it just _continues_ evaluating the rules that follow. I couldn't believe my eyes! What a crap! So if one has e.g.:

###
pass in quick proto tcp from any to any port = 22 flags S keep state keep frag
block return-rst in proto tcp all
###

ipf renders the machine inaccessible/locked - AND THIS WITHOUT ANY NOTICE! Other horror scenarios come to mind... So IMHO Solaris can't be considered enterprise grade/ready if one uses ipf with 'keep state' rules. It's like writing data to a disk which is full: the data gets silently discarded without any notice.

BTW: the case engineer responded that one is required to monitor/dtrace such a "complex" firewall like ipf and, depending on the outcome, to adjust fr_statemax in a trial-and-error manner! Hello? Why should we build new apps to monitor state just because the OS is incapable of issuing a single notice or of tuning itself? He probably thinks we have nothing else to do, or that the TCO of Solaris is so low that this burden can be pushed onto the owner of the machine. Wrt. the required syslog message he responded that a counter increment (ipfstat: packet state*lost) costs only 2 cycles on SPARC, whereas a syslog message costs 2000 cycles and would cause ipf to "hang"/be unusable - and closed the case.

So:

1) What about Solaris' so-called self-tuning capabilities?

2) The '2000 cycles' argument doesn't hold water: even on a 1.4 GHz machine, 2000 cycles take ~1.5 µs, which still allows ~700,000 pkt/s, i.e. ~5.6 Gbps at an average packet size of 1 KB - so even with such a poor implementation, what's the deal?

3) Actually I would expect some kind of global SW register (perhaps called a log indicator table), to which a "state table full" counter could be added and incremented by ipf; and I assume there is also a kernel log daemon, or even a user-space logger, which could read this table at certain intervals and log the problem. So the 2 vs. 2000 cycles reason for making Solaris users' lives harder than necessary is IMHO a very poor one and implies a not very well thought-out SW design (at least in the eyes of a normal human being ;-) without much ipf insight, thanks to the shallow documentation). Usually an admin looks into /var/adm/messages first if the cause of a problem cannot be determined or is not really reproducible.

4) If

`ipfstat -s | awk '/active/ { print $1 * 1.05 }'` > `ipf -T fr_statemax | awk '{ print $NF }'`

it is obvious that fr_statemax is too small. But 'active' is a snapshot value, and thus I've seen this yield "true" only very rarely, and on 1 server only. However, "ipfstat | grep '^packet'" clearly indicates that there is a problem. E.g. from 2 production servers:

                server 1   server 2
fr_statemax        40129      40129
active             31091      39542
in lost              366        365
out lost             146     103320

So is this because at some point in time the state table was full, or because ipf tried to insert a state which was already present in the table? Sure, ipf's behavior of processing the rules list as if 'the rule is ignored' is, for my taste, more than a minor security issue; but anyway, should one just keep raising fr_statemax, making it bigger and bigger, until one finds out that ipf actually has a different problem? And what is also not clear: are the 'lost' counters snapshots as well (over what interval/time?), or do they accumulate from when ipf got started/refreshed?
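FWIW, something like the following trivial cron job would already do the work of the missing user-space logger described under 3). It is just a sketch, of course: the awk patterns/field positions are assumptions derived from the commands above and need to be checked against the ipfstat/ipf output format of the release actually in use.

###
#!/bin/sh
# Sketch: log ipf state table pressure via syslog, e.g. from cron.
# NOTE: the awk patterns/fields are assumptions - verify them against
#       the ipfstat/ipf output of the release in question.

STATEMAX=`ipf -T fr_statemax | awk '{ print $NF }'`
ACTIVE=`ipfstat -s | awk '/active/ { print $1; exit }'`
# sum the 'lost' columns of the 'packet state(in/out)' lines
LOST=`ipfstat | awk '/^packet state/ { n += $NF } END { print n+0 }'`

[ -z "$STATEMAX" ] && exit 1
[ -z "$ACTIVE" ] && ACTIVE=0

STAMP=/var/tmp/ipf-lost.last            # 'lost' total seen on the last run
OLD=`cat $STAMP 2>/dev/null`
[ -z "$OLD" ] && OLD=0

if [ "$LOST" -gt "$OLD" ]; then
        logger -p daemon.warning \
            "ipf: packet state lost counters grew $OLD -> $LOST (active=$ACTIVE fr_statemax=$STATEMAX)"
fi
echo "$LOST" > $STAMP

# complain as well when >90% of the state slots are in use
if [ `expr $ACTIVE \* 10` -ge `expr $STATEMAX \* 9` ]; then
        logger -p daemon.notice \
            "ipf: state table ${ACTIVE}/${STATEMAX} entries in use - fr_statemax nearly reached"
fi
###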
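And if raising the limits really is the only workaround, 'ipf -T' itself seems to be the knob for that, too. The values below are placeholders only, chosen to keep the ratio of the shipped defaults (fr_statesize=5737, a prime, and fr_statemax=4013, i.e. ~0.7 * fr_statesize) - and whether fr_statesize can be changed at all while ipf is enabled, I don't know:

###
# show the current values
ipf -T fr_statesize
ipf -T fr_statemax

# bump them (placeholders: 131071 is prime, 91750 ~ 0.7 * 131071);
# fr_statesize may be settable only while ipf is disabled
ipf -T fr_statesize=131071
ipf -T fr_statemax=91750
###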
5) And last but not least: the ipf documentation is very shallow, and the FAQs taste of being out of date. So: should fr_statesize (the hash table size) be a prime number, or is that more or less negligible? And should one still keep fr_statemax at ~0.7 * fr_statesize, as the defaults suggest?

Regards,
jel.

BTW: Why does 'ipfstat -t' show so many entries with negative ttls? It appears that when the min(?) value of -59:-59 is reached, the ttl gets reset to 0:00 and restarts decrementing... - strange.

-- 
Otto-von-Guericke University      http://www.cs.uni-magdeburg.de/
Department of Computer Science    Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany          Tel: +49 391 67 12768