> The other question that came to my mind was, what do you do about overlapping
> or duplicate fragments?  I thought about this a little more and realized that
> what you're doing is very similar to one of fragment crop or fragment
> drop-ovl, but you haven't specified which.  Once you choose a definite method
> for dealing with fragmentation oddities, then your procedure reduces to:

This is another option that's probably better left to sysadmins.  Basically,
"wormhole" reassembles enough fragments to make filtration decisions reliably. 
After than, it effectively falls back on EITHER crop or drop, which should be
up to the firewall's admin to choose (since it's equally simple to implement
either).  So you'd have something like "wormhole reassemble-drop-ovl" or
"wormhole reassemble-crop-ovl".

As for arguments that these kinds of tiny frags, such as ones too small to
contain proper headers, or overlapping frags, cannot be generated by legitimate
applications: I strongly urge you all not to underestimate the human capacity
for utter stupidity.  Don't forget that it violates standards to generate
fragments with the don't-fragment bit set, yet there is a significant entity
(whose name need not be mentioned) which has nevertheless accomplished this
amazing feat of utter brainlessness.  I do recall watching a certain amount of
debate ensue as people's dependence on this service caused inconveniences to pf
users and developers alike.  By remaining more flexible, we stand to be less
affected by third-party idiocy.

> But this is almost the same as fragment crop and fragment drop-ovl; all
> you've
> done is add the TCP segment part to the front, and that part is the same as
> fragment reassemble.
> 
> I am not a pf hacker, and I know nothing about the internals, but I suspect
> that it would be fairly easy to implement this.  Furthermore I don't think
> that it carries any additional risks on top of those already in place for
> fragment crop and fragment drop-ovl, in fact it should decrease them
> somewhat.
> 
> > Wormhole reassembly is mainly intended for low-throughput low-latency
> networks,
> > where store-and-forward time can be greatly reduced.
> 
> Well, yes, but it also assumes you're on a medium-to-high trust network.  If
> you have a firewalled network with a VPN concentrator inside, and you have a
> separate pf firewall between the VPN concentrator and the rest of your
> network, then you can trust (hopefully) the first firewall to clear out most
> IP layer oddities.  But it's a bad idea to expose this sort of scheme to the
> outside world.

I'm sorry I didn't specify this in my first posting, but the fact that this is
intended ONLY for medium trust networks is a consideration every administrator
needs to take into account before selecting any normalization method.  If your
network is low-trust and you expect users to try to punch through firewalls or
attack services using layer-3 or -4 techniques such as fragmentation, you
really have no reason to use anything other than fragment reassemble.  In fact,
if you're on a very low trust network, you should probably consider some kind
of TCP proxy for intense normalization of traffic all the way up to layer 4 or
5.  I can't name any specific solution like this (other than inetd+nc, which is
probably a stupid idea in a production environment) but I'm certain solutions
exist.

> Furthermore, one of your implicit assumptions is that both directions of the
> TCP stream are slow.  You gave an example of a star-configuration VPN each of
> whose endpoints is a dialup link; but the VPN concentrator itself is probably
> not on a dialup!  In that case the benefit provided by this method is mostly
> negated; either the sending side will be fast, so all the fragments will
> arrive quickly, or the receiving side will be fast, so the reassembled
> segment can be sent quickly.

This is not necessarily so.  The major factor to consider here is the *path*
latency and throughput.  The throughput achieved at an endpoint can be no
better than the throughput of the path (i.e. its slowest link).  The latency of
the path is the sum of the latencies of each segment of the path, and is
largely determined by the size of the data quantum (the smallest unit that can
be forwarded as soon as it has finished being stored).  The sole reason for
using wormholing is to reduce latency by reducing the effective data quantum.
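To make the quantum argument concrete, here's a minimal latency model in
Python (link speeds, hop count, and the uniform-speed assumption are all
illustrative, not taken from any real network):

```python
# A toy model of path latency: every hop either buffers the whole packet
# (store-and-forward) or forwards each quantum as soon as it completes.
# Uniform link speed is assumed for simplicity.

def store_and_forward_latency(packet_bits, hops, bps):
    # Each hop must receive the entire packet before forwarding it,
    # so the packet is serialized once per hop.
    return hops * packet_bits / bps

def wormhole_latency(packet_bits, quantum_bits, hops, bps):
    # Each hop forwards a quantum as soon as it is complete: the packet
    # is serialized once, plus one quantum of delay at each extra hop.
    return packet_bits / bps + (hops - 1) * quantum_bits / bps

# A 1500-byte packet crossing 3 hops, each limited to a 128 kbit/s upstream:
full = store_and_forward_latency(1500 * 8, 3, 128_000)   # 281.25 ms
worm = wormhole_latency(1500 * 8, 576 * 8, 3, 128_000)   # 165.75 ms
```

Shrinking the quantum from the whole packet down to a single fragment is
exactly what removes the per-hop serialization penalty; the path's throughput
is unchanged either way.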

When I say my idea "doesn't compromise security," I was referring to the fact
that it does not allow attackers to reach any point beyond a firewall that they
could not reach using fragment reassembly, and I still believe that is true.

> You don't need a high-throughput connection for a memory exhaustion attack
> like I'm proposing to do damage; it depends mostly on how the receiving
> system
> handles fragments.  I can't remember when it was fixed--I think 3.2--but pf
> used to try to use an infinite amount of space to reassemble fragments.  But
> kernel memory can't be paged out, so if an attacker sent enough fragments,
> pf would panic.  Even if the attacker didn't send that many packets, pf could
> easily steal 64M of memory that it couldn't give back, and that was without
> 64M of data being sent; that's enough to seriously hurt a system!  A more
> modern pf will start dropping incomplete segments, but if you send enough
> fragments, you'll be able to attack legitimately fragmented segments.

In this case, wormhole reassembly does not really affect pf's response to an
assault of floating fragments (although less memory is now consumed, it wasn't
an issue to begin with) except for one thing.  Now the system behind pf will be
expected to cache some reasonably sized, non-overlapping, consecutive fragments
(hopefully, a smaller number than were originally received by pf).  The system
being protected *will* be loaded down caching these fragments, but a good
sysadmin should test the application to determine whether this risk is
significant.  If the throughput is low enough, the fragments on the protected
machine will time out quickly enough that they never accumulate to a level that
causes problems.

> It sounds to me like you have in mind some specific application for this
> technique, but I'm not sure what it is.  If you shared what it was, we might
> be able to find a better solution.

Yes, I do.  I'm constructing a star-configuration VPN out of a series of home
LANs, each on a DSL or cable-modem link.  I know that star is probably not the
best configuration for performance reasons, but other configurations would be
too complex to organize (since I'm managing all these home LANs alone, I need
some simplicity).  Also, there are not too many LANs, so I'm not too worried
about throughput issues; I don't need much, I just really want the virtual
privacy.

Now the problem comes down to latency.  Since all the LANs will be on ADSL or
cable-modem links, each will only be able to transmit to the others at its
slower upstream speed (some of these have extreme asymmetry, like 3M down vs.
128k up).  So if one non-central LAN tries to transmit to another non-central
LAN, the throughput of the connection will probably be reasonable for my needs,
but the latency would be double what it would be if one of the parties were the
central LAN.  I would like to achieve consistent (and preferably minimal)
latency between each pair of endpoints.

An attack using heavily fragmented packets can NOT be done, since fragment
reassembly is already performed by the local OpenBSD router at each VPN site.
This means that a single packet may pass through this route:

transmitter -> VPN client pf -> VPN master pf -> VPN client pf -> receiver

The first client pf will reassemble this packet and forward it to the master.
The master likewise reassembles all fragments by default, to ensure that these
packets can be filtered even if any VPN client router is compromised.  The
master then forwards to the other client, which has to reassemble again before
sending to the receiver.

The major latency issue is caused by the master needing to cache the full
packet before forwarding it to the other client.  If I could eliminate this, I
would have acceptable latencies throughout my VPN.  Switching the master, or
even all systems, over to wormholing would improve latency between endpoints,
and yet I could be certain that all IP-layer and transport-layer filtration
rules in place were still being followed.

Even if a DoS attack *were* possible through this configuration, I would not be
concerned as most of my VPN clients are either OpenBSD systems, with their own
fragment reassembly firewalls, or Windows systems, which do a fine job of
crashing on their own without an attacker :-D.

As you can see, wormholing is not necessary for most applications, but for that
one system which is forwarding packets between two narrow links, latency can be
much improved.

So why don't I use drop or crop on that middle router, instead of suggesting
this new scrub form?  As I understand it, scrub rules are fairly simple in
their implementation, and there isn't much flexibility in deciding which
fragments go to which scrubber.  I don't think I could create a ruleset that
scrubs all packets using the optimal scrubber and yet does not create leaks.

There are some other reasons too.  I want to ensure that at least the first
fragment is normalized for remote filters, but I trust my pf's and client
stacks to handle the fragments after that reasonably well.  I also think that
the min-frag argument could be useful for reducing overhead with minimal
sacrifice in latency; I'll need to do some tuning, but I'll probably strike a
good balance between latency and throughput at some point.

> I don't think that you could practically time things well enough; I think
> there are too many random variables here.  Even if you can, I'm not sure of
> the utility of this method.

Let's consider this.  You have an SSH server on a 100BASE-TX LAN.  You
habitually remote-admin the system from a location far away (fairly high
latency).  An attacker wishes to compromise the root account, but has only
managed to acquire a non-root user's account.  Also, this same attacker has
found a way onto the LAN (very low, fairly predictable latency), although not
close enough to the server itself.  Now, in the course of trying to root the
system, your attacker does not want you (root) to be able to intervene; he does
not care whether he's detected in the process, so he can be as unsubtle as
necessary.  I know that this is a rather odd scenario (why can't the security
guards just throw him out?), but there are possible cases where this could
happen.  You need to connect to sshd as root to lock the system down, go
securelevel->2, kick the user, whatever.

Now, this clever person has been testing your firewall and has figured out the
logic that goes into connection timeouts (you just connect, send a test packet
after a certain interval, and see whether the connection has timed out; this
can be automated).  The attacker also knows that a linear function is used to
reduce state timeouts as connections are established.  And the attacker also
has a good guess as to the latency you'll be connecting from (say, he can guess
you have at least 100ms of round-trip time).  Of course, the sysadmin has
state max enabled for sshd, to prevent an attacker from exhausting all sshd
connections and locking out the admin, right?
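The timeout-probing step the attacker automates can be sketched as a simple
search (the function name and the binary-search strategy are my own
illustration; `probe` stands in for actually opening a connection, idling, and
testing it):

```python
def measure_timeout(probe, lo=0.0, hi=120.0, tolerance=0.1):
    """Estimate the firewall's idle state timeout by binary search.

    probe(wait) is a stand-in for: open a connection, stay idle for
    `wait` seconds, then send a test packet and report whether the
    state still existed (True) or had already expired (False).
    """
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if probe(mid):
            lo = mid   # state survived: the timeout is longer than mid
        else:
            hi = mid   # state expired: the timeout is shorter than mid
    return (lo + hi) / 2

# Against a hypothetical fixed 30-second timeout:
estimate = measure_timeout(lambda wait: wait < 30.0)   # ~30.0
```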

So this attacker opens a single SSH connection to the machine and logs in (now
the state is established, and timeout reductions will never reduce the timeout
to less than the attacker's keepalive interval).  The attacker then runs a
"script" (though it would have to be well-coded indeed) to occupy all remaining
states available to sshd.  When all states are occupied, naturally, pf reduces
the timeouts for these connections, and all connections start to drop.  Legit
clients may be disconnected over time (esp. if the attacker finds some way to
increase their latency), and the attacker need only maintain his state by
sending keepalives fast enough.

Now, the attacker's "plug" connections start timing out.  So what can the
attacker do?  By timing the sending of a short burst of SYNs, based on his
latency and what he knows about the timeouts, he can keep almost all states
filled.  Though this does not prevent the administrator from creating a state,
the attacker can drive state timeouts very low indeed, below the 100ms
round-trip time the sysadmin needs.  So by the time the SYN/ACK arrives at the
sysadmin's ssh client and his ACK arrives back at pf, the state is gone.  The
sysadmin could always try to forge an ACK timed to arrive at the firewall just
after 50ms (half the round-trip time) and try to sneak in, but thanks to state
modulation, this won't work :-D.  Although you may think that the latency
involved is unpredictable, I find that one-way latencies on most networks have
a small variance.  And besides, the sysadmin only needs to manage to get a
single connection and log in to stop the attacker.

Now, suppose pf does not allow timeouts to be reduced so close to 0.  Let's say
that timeouts reach a certain "floor" value when the number of states fills up
(such as 1 second).  How does this affect the attacker?  He can now occupy ALL
states for up to a second.  Since he has gathered timeout information, he
should have an excellent guess as to when his connections will time out, and,
being on a high-throughput, low-latency network, he can time his burst of SYN
packets to shut that tiny opening very fast.  As the network conditions
approach ideal, this window gets smaller and smaller, and even on real networks
it can be extremely small (100 microseconds? less?).  The sysadmin will have a
much more difficult time timing SYN bursts to snatch one of these states
(although, if he gets a SYN through, he'll probably beat the timeout and get
in).

Thus, the attacker likely has quite some time to get his root kit working
before the admin can sneak in and catch him.

So, now we add timeout modulation to pf.  Nothing fancy.  Just that each time a
state is created, a random percentage value is generated, between 0 and a
user-specified value (probably 0-10%).  When setting a new timeout for a state
(e.g. at creation, or when traffic has just arrived), instead of setting the
state's timeout to its stock value (e.g. 86400 for an established TCP
connection with optimization=normal), you set the timeout to that value minus
the percentage for that state, and start counting down normally from there
(that same state may end up at 81257, for instance).  Instead of a percentage,
one could use absolute seconds.
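In Python-flavored sketch form (the class and variable names are hypothetical;
pf itself would do this in C inside the state-table code), the proposal
amounts to:

```python
import random

MAX_JITTER = 0.10   # admin-specified ceiling, i.e. the "0-10%" above

class State:
    def __init__(self, stock_timeout):
        self.stock = stock_timeout
        # One random percentage per state, drawn at creation and fixed
        # for the state's lifetime.
        self.jitter = random.uniform(0.0, MAX_JITTER)

    def refresh(self):
        # On creation or on new traffic, the countdown restarts from the
        # stock timeout minus this state's percentage, not from the
        # stock value itself.
        return self.stock * (1.0 - self.jitter)

s = State(86_400)   # established TCP, optimization=normal
t = s.refresh()     # anywhere in [77760, 86400], e.g. ~81257
```

Because the jitter is drawn once per state rather than per refresh, an
attacker cannot average it away by watching his own connections; every new
state is a fresh unknown.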

Now, instead of knowing within a few hundred or even a few tens of milliseconds
when a timeout will expire, your attacker is faced with a probabilistic
function.  Although he may know the long-term behavior of the system (e.g. the
average timeout, the variance, or even the probability distribution), he can't
time a burst very well now, since timeouts can be entire seconds off his guess.
So the sysadmin, using otherwise normal methods, can force a connection after a
reasonable number of attempts (I wouldn't be surprised if he made it within 100
or even 10 attempts, which could be made very close together).  This means that
if the root-kit attempt takes longer than the attacker expected, he's in quite
a bit of trouble.

The first time I thought about using probabilistic approaches to foil state
exhaustion attacks was when I was researching the question "why aren't there
syncookies for *BSD?"  As it turns out, BSD TCP/IP stacks have a much simpler
and more effective solution.  When being SYN-flooded, the default behavior
[according to some postings I googled up a long time ago] is to drop one random
old half-open connection for each new one received.  Consider the alternatives.
A stupid method would be to ignore all new connections altogether, which means
that an attacker can totally close off a system by maintaining enough of a
SYN flood.  Another method is syncookies, which can eat up a lot of CPU power
that a server may not be able to spare.  Another method is the first method
with adaptive timeout reduction (as in pf right now), which works much better,
but can behave similarly to the first method if the attacker has enough power
(i.e. is close enough to the victim) and can predict timeout values.  The best
method I've seen implemented so far is the "drop a random old connection"
method, which is the only one that does not necessarily punish clients for
having high latency -- after all, your sysadmin may have a much longer
round-trip time than your attacker.
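A toy model of that random-old-drop behavior (not the actual BSD code; the
class and names are illustrative):

```python
import random

class SynQueue:
    """Half-open connection table that evicts one random old entry per
    new SYN once full, instead of refusing new connections."""

    def __init__(self, limit):
        self.limit = limit
        self.half_open = []

    def on_syn(self, conn):
        if len(self.half_open) >= self.limit:
            # Evict a victim chosen uniformly at random: a flooder can't
            # predict which entries survive, and a slow legitimate client
            # is no more likely to be dropped than a flood entry.
            victim = random.randrange(len(self.half_open))
            self.half_open[victim] = self.half_open[-1]
            self.half_open.pop()
        self.half_open.append(conn)

q = SynQueue(limit=128)
for syn in range(1000):   # a flood of 1000 SYNs
    q.on_syn(syn)
# The table never exceeds its limit; survivors are a random sample of
# everything seen, so legitimate handshakes still have a chance to finish.
```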

In fact, the timing attack I mentioned above can even be used over
higher-latency connections (though high latency usually implies high latency
variance, it's still possible).  However, I'm pretty sure high throughput is
necessary, since the SYNs within a burst need to be closely spaced, and the
attacker's SSH client connection also needs to be able to work around the SYN
bursts.

So why couldn't an attacker SYN-flood the server to the point where timeouts
are low enough that the attacker can't generate SYNs fast enough, and the
sysadmin can get in?  Other than the sysadmin's own latency, there is the
problem of state quantization.  If there are a maximum of 100 states occupied,
and one drops, then there are 99 states occupied.  This means that not all
timeout values are possible, but only a certain discrete set.  This set could
be such that at 99 states the timeouts are less than the admin's round-trip
time (50ms) but not less than the attacker's ability to generate SYNs (the
attacker could easily generate SYN bursts at 50ms intervals).
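With assumed numbers (the linear reduction function here is hypothetical,
chosen only to show the stepping), the quantization looks like this:

```python
def adaptive_timeout(base, occupied, limit):
    # Hypothetical linear reduction: scale the base timeout by the free
    # fraction of the state table.
    return base * (limit - occupied) / limit

LIMIT = 100
reachable = sorted({adaptive_timeout(5.0, n, LIMIT) for n in range(LIMIT + 1)})
# Only 101 discrete values exist, stepping in 50 ms increments.  At 99
# occupied states the timeout is exactly 0.05 s; nothing between 0.05 s
# and 0.10 s is ever used, so the "gap" the admin needs may simply not
# exist in the reachable set.
```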

The goal here, remember, isn't an all-out SYN flood to take down the server,
but merely to hammer out legit connections while the attacker pokes around in
an otherwise healthy server.

So, without state timeout modulation, there is a certain degree of doubt as to
whether a sysadmin can always get in to his sshd in a time of crisis.  But with
state timeout modulation, the doubt is now on the shoulders of the attacker,
who can't effectively predict state timeouts.  It seems to me that it should
always be the attacker doubting.  Total predictability [determinism] is
generally a bad thing in attack resistance, and can often be exploited for SOME
devious purpose, regardless of where it's found.  After all, I didn't always
wear a white hat.

I might also add that an alternative, or perhaps complementary, solution to
state timeout modulation (it may be even more effective if combined) is random
early drop for states.  After reaching a threshold, new connections could
randomly be denied, with a probability proportional to the number of states
occupied beyond the threshold, reaching 100% when all states are full.  This
means that an attacker must create exponentially [I think] larger SYN bursts to
occupy each additional state.  All three methods (state timeout modulation,
random state drop, and state timeout modulation with random early drop) are
effective at stopping the attack I've described, though they have slightly
different effects, and are probably most effective when properly combined.
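A sketch of the random-early-drop idea (the thresholds, names, and the linear
drop probability are my own illustration, not an existing pf option):

```python
import random

def admit_state(occupied, start, limit, rng=random.random):
    """Decide whether a new state may be created.

    Below `start` occupied states, always admit; at `limit`, never.
    In between, the drop probability rises linearly, so each additional
    state costs the attacker more and more attempts on average.
    """
    if occupied < start:
        return True
    if occupied >= limit:
        return False
    # Drop probability grows linearly from 0 at `start` to 1 at `limit`.
    p_drop = (occupied - start) / (limit - start)
    return rng() >= p_drop

# The expected number of attempts to win one state at occupancy n is
# 1 / (1 - p_drop(n)), which grows without bound as the table fills.
```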
