On Fri, Nov 24, 2006 at 05:22:05PM +0100, Håkan Olsson wrote:
> 5. the selected SPI (or "larval" SA state) on the local system is
> updated with the keying material, timeouts etc - i.e the "real" SA is
> finalized
>
> This continues until all negotiations are complete -- however there
> is a limit on how long this "larval" SA lives in the kernel... as you
> may guess it's 60 seconds. (The idea being if a negotiation has not
> completed in 60 seconds something has probably failed.)
>
> Since the hosts seem to be a bit slow at running IKE negotiations,
> you hit the 60 second limit before all negotiations are complete: all
> remaining "larval" SAs are dropped, and when isakmpd tries to "update"
> them into real SAs this of course fails. ("No such process" here
> roughly means "no SA found".)
Thank you for that very clear description.
Is this 60 second timeout a tunable? Or can you point me to where it's
defined in the kernel? I'd like to try increasing it.
However, at this stage I don't really understand why setting -D 5=99, which
generates copious logs, makes it work. In fact I can get to 3,000 tunnels
(6,000 flows) within a couple of minutes with this flag set. Perhaps the
extra logging delays the start of some of the negotiations, somehow
spreading the workload.
(Maybe a workload-spreading option, so that no more than N exchanges are
outstanding at once, would be a useful control anyway.)
> PS
> When I tried between two ~700 MHz P-III machines a while back, setting
> up 4096 (or was it 8k) SAs was no problem. Another developer had a
> scenario setting up 40960 SAs over loopback on his laptop -- mainly a
> test of kernel memory usage, but he did not hit the 60s larval-SA
> time limit there either.
I can think of several possibilities as to why some negotiations are taking
more than 60 seconds. For instance:
(1) The Cisco 7301 may be slow to respond. It does have a VAM2+ crypto
accelerator installed, but I don't know whether that's used for ISAKMP
exchanges or just for symmetric encryption/decryption. (However, 'show
proc cpu history' suggests CPU load is no more than about 25%.)
(2) There may be packet loss and retransmissions, perhaps due to a network
buffer overflowing on either the OpenBSD or the Cisco side.
The OpenBSD box is using a nasty rl0 card, because that's the only spare
interface I had available to go into the test LAN. Having said that,
watching with 'top' I don't see the interrupt load go above 10%.
I'm not sure how to probe deeper to get a handle on what's actually
happening though. Perhaps isakmpd -L logging might shed some light, although
I don't fancy decoding QM exchanges by hand :-(
Regards,
Brian.