First, I want to clear up what the intention of this email is.  This is NOT a
"feature request" (I know you all hate those).  This is a "feature idea," i.e.
an idea I had but am not yet skilled enough to implement, which I am posting
only because I think developers out there may find it interesting and relevant.  I
don't want to put any pressure on anybody to do anything, and you may feel free
to stop reading at any time if you think this is dumb.

My knowledge of pf does not extend much further than reading all the
documentation, listening in on discussions, and applying some common sense.  I
haven't RTFS, so sorry if some of my assumptions are wrong.

What I intend to propose in this posting is a new method of handling badly
fragmented IP datagrams which will reduce the latency of a store-and-forward
solution like "fragment reassemble", but will NOT sacrifice any security in
order to do so.

Currently, there are two major ways to handle fragmented IP datagrams in pf:
"fragment reassembly," and "those other ones."  I say "those other ones"
because fragment reassembly is [seems to be] the recommended method of handling
fragments, since only a fully reassembled datagram is guaranteed to contain
enough header information to filter properly.  For instance, nmap has a command
line option that will chop packets up into ridiculously small fragments, not
one of which contains enough header information to sufficiently filter.  So if
you demand high security, you have to use fragment reassemble, right?

Well, there are some problems with fragment reassembly (other than the memory
exhaustion issues).  Let's say you have a 64k ISDN line (we'll use 6.4 kbyte/s
of bandwidth, with an MTU of 1500 for mathematical simplicity).  Some person
[idiot] writes an application that needs to send UDP datagrams of maximum
length (64kbyte) across this wire.  So you send about 44 fragments of about
1500 bytes each across the wire.  The total transmission time is the time from
the first fragment going out to the last fragment arriving, which is 10
seconds.

Now, you insert an OpenBSD firewall in the middle of the ISDN line (assume it
doesn't affect latency or throughput of the wire), and configure it to be able
to reassemble and filter these 64k datagrams.  Now you start your transmission,
and the entire packet needs to be stored and forwarded through the OpenBSD
system.  Your total throughput of 64k packets is not affected (you can send the
2nd to the firewall while the first is still in transit) but your latency has
doubled, jumping to 20 seconds.
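The arithmetic behind those numbers, using the example's simplified figures (6.4 kbyte/s, 1500-byte MTU, a maximum-length 64-kbyte datagram), works out like this:

```python
# Back-of-the-envelope figures for the 64k-datagram-over-ISDN example.
# All values are the text's simplified numbers, not real ISDN parameters.

bandwidth = 6400            # bytes/s (the simplified "64k ISDN" figure)
mtu = 1500                  # bytes carried per fragment
datagram = 65536            # a maximum-length 64-kbyte UDP datagram

fragments = -(-datagram // mtu)       # ceiling division -> 44 fragments
transit = datagram / bandwidth        # ~10.24 s to push the whole datagram

# With store-and-forward reassembly in the middle, the firewall must hold
# the entire datagram before retransmitting it, so per-datagram latency
# roughly doubles (throughput is unaffected, since the next datagram can
# be received while the previous one is still in transit on the far side).
reassembly_latency = 2 * transit      # ~20.5 s

print(fragments, round(transit, 2), round(reassembly_latency, 2))
```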

Of course, this is a very extreme example, but you can understand how some
network topologies would suffer additional latency because of this.

Can we fix this latency problem?  Yes, we could switch to drop or crop
scrubbing, but there are serious security implications of doing this.  pf will
only be able to guarantee filtration of IP-layer data, such as IP addresses,
and will not necessarily be able to filter on, for instance, TCP or UDP header
fields.

But is there an alternative method we could use, which will give us the latency
and memory consumption advantages of drop and crop methods, while giving us the
security of reassembly?

Well, my theory is that we adapt the concept of "wormhole routing" as "wormhole
defragmentation and filtration" for pf.  Basically, wormhole routing means that
information is stored in a routing device only until sufficient data has
arrived to make a routing decision, then all remaining data is forwarded
immediately as it arrives.  This works because all of the information needed to
route a packet [almost always] arrives first.

So, how can this be used to improve pf?  I suggest that we add a "fragment
wormhole reassemble" or similar directive to the "scrub" rule, as an
alternative to "fragment reassemble", "fragment drop-ovl", etc.
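In pf.conf terms, the proposed directive might look something like this (the "fragment wormhole" keyword is purely hypothetical, my own invented syntax; the other lines are the existing scrub options it would sit alongside):

```
# existing fragment handling options
scrub in on ne3 all fragment reassemble
scrub in on ne3 all fragment crop
scrub in on ne3 all fragment drop-ovl

# proposed (hypothetical) wormhole option
scrub in on ne3 all fragment wormhole
```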

The way wormhole reassembly works is this:

First, each fragment reassembly buffer contains these variables:
A. a "frag->filter" state, which points to a rule which this packet fragment
has been determined to match (initially, this points nowhere).
B. a "start" value, which is the offset within the final packet of the first
byte of the fragment in this buffer.  For normal reassembly this is implicitly
always 0; for wormhole reassembly it starts at 0 and advances as data is
flushed.
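As a sketch of that per-buffer state (in Python for readability; "FragBuffer" and its field names are my own invention, not pf's actual structs):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FragBuffer:
    """Hypothetical reassembly-buffer state for one in-progress datagram."""
    # A. the rule/action chain this datagram has been matched to;
    #    None until a filterable prefix has been through the ruleset.
    filter_rule: Optional[object] = None
    # B. offset within the final datagram of the first byte held here;
    #    always 0 for full reassembly, advances as data is flushed
    #    under wormhole reassembly.
    start: int = 0
    # payload bytes buffered but not yet forwarded
    data: bytearray = field(default_factory=bytearray)
```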

So, when a frame arrives on our interface, we first check to see if it's a full
packet and can be filtered, as we already do, and if it's a fragment, we look
for a buffer it matches to.  If it matches a buffer, we add the data into the
buffer (offset_of_frag_in_buffer = offset_of_frag - offset_of_buffer, crop as
necessary).

Now, after updating the reassembly buffer, we check to see if the packet has
already been matched to a filter rule.  If it hasn't, then we check to see if
it contains all headers that are necessary to filter the packet to the full
ability of pf (i.e. is there a full IP and TCP/UDP header?).  If we fail this
condition too, then we can't do anything and will have to wait for the next
fragment.

If the buffer doesn't match a rule action, but does have enough data in it to
filter, then we can send the fragment through the pf filter engine the same way
we would if it were a full packet.  Any rules [including translation and
queueing] it matches are remembered, and the chain of actions to perform on the
packet is saved in the frag->filter structure attached to the fragment buffer. 
Then the buffer can be flushed into the outgoing queue, and the offset of the
buffer set to start on the byte following the last byte just sent.

As more fragments arrive into a buffer that has already been passed through the
ruleset, the actions that were performed on the first fragment can be performed
on these remaining fragments, and they can immediately be forwarded to the
right interface.
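Putting the steps above together, the per-fragment decision logic might look roughly like this (a simplified Python sketch; "has_full_headers", "run_ruleset", and "forward" stand in for the real pf machinery and are assumptions of mine, and hole-tracking and timeouts are omitted):

```python
def handle_fragment(buf, frag_offset, frag_data,
                    has_full_headers, run_ruleset, forward):
    """Wormhole reassembly sketch: filter the datagram once, then stream.

    buf has three fields: .filter_rule (matched action chain or None),
    .start (offset of the first buffered byte), .data (a bytearray).
    """
    # Merge the fragment into the buffer, cropping any bytes that fall
    # before buf.start (those were already flushed downstream).
    rel = frag_offset - buf.start
    if rel < 0:
        frag_data = frag_data[-rel:]
        rel = 0
    end = rel + len(frag_data)
    if len(buf.data) < end:
        buf.data.extend(b"\0" * (end - len(buf.data)))
    buf.data[rel:end] = frag_data

    if buf.filter_rule is None:
        # Not yet matched: do we have full IP + transport headers yet?
        if not has_full_headers(bytes(buf.data)):
            return                      # wait for the next fragment
        buf.filter_rule = run_ruleset(bytes(buf.data))
        if buf.filter_rule is None:
            return                      # blocked by the ruleset; drop

    # Matched: flush everything buffered and slide the window forward.
    forward(buf.filter_rule, bytes(buf.data))
    buf.start += len(buf.data)
    buf.data.clear()
```

The first fragment that satisfies has_full_headers pays the cost of a ruleset lookup; every later fragment just merges, forwards, and advances the window.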

It might also be a good idea to add a parameter to scrub rules like "min-frag
x", which will force fragments to be buffered until a fragment of at least x
bytes can be sent (to reduce interface overhead), except, of course, the last
fragment, which is just whatever the rest of the datagram is.
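The batching test itself would be trivial (again hypothetical; "min-frag" is proposed syntax, not an existing scrub parameter):

```python
def should_flush(buffered_len, min_frag, is_last_fragment):
    """Hold data back until at least min_frag bytes are buffered, except
    that the final fragment always goes out with whatever remains."""
    return is_last_fragment or buffered_len >= min_frag
```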

So, back to the example of the pf between 64k ISDN lines...  If this firewall
were now switched to fragment wormhole reassembly, then the first fragment of
1500 bytes would contain enough data to match rules to this fragment, and could
be retransmitted immediately.  Each subsequent fragment will match the fragment
state and also be forwarded immediately, like the first.  So now,
instead of quantizing transmissions by entire packets, you can use fragments
instead (which makes sense, since packets *have* to be fragmented for this
application on this interface).  The latency for transmitting an entire packet
is now equal to the line latency plus the latency of only a single fragment,
rather than all fragments.  So it would take about 10.2 seconds for the first
packet to arrive, which should be roughly the latency of fragment crop or
drop-ovl scrub rules.  Compare this to 20 seconds, from before. Yet we have
guaranteed that we are filtering based on ALL transport-layer headers, similar
to fragment reassembly.  Also, the pf firewall never has to cache more than
one fragment's worth of data (at most 1500 bytes) per packet it's reassembling;
with four packets in flight, that's 4x1500=6000 bytes rather than
4x65536=262144 bytes for this one link.
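Those closing numbers work out as follows (same simplified figures as before; the four concurrent reassembly buffers are the example's own assumption):

```python
# Wormhole-reassembly latency and memory for the same example numbers.
bandwidth = 6400            # bytes/s
mtu = 1500                  # bytes per fragment
datagram = 65536            # 64-kbyte datagram

# The firewall now only delays the stream by one fragment's
# store-and-forward time, not the whole datagram's.
frag_delay = mtu / bandwidth                           # ~0.23 s
wormhole_latency = datagram / bandwidth + frag_delay   # ~10.5 s vs ~20.5 s
# (the text rounds the 10.24 s transit time to 10 s, hence its ~10.2 s)

# Memory: at most one MTU-sized fragment buffered per in-progress datagram,
# instead of a full 64-kbyte reassembly buffer; the example assumes four
# datagrams being reassembled at once on this link.
wormhole_mem = 4 * mtu              # 6000 bytes
reassembly_mem = 4 * datagram       # 262144 bytes
```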

I'm sure there are things I haven't thought through clearly enough when
considering this topic, but it sounds like it will work to me.  And, though the
payoff for a "typical" computer network is not spectacular, there are potential
benefits for this kind of approach.  It could, effectively, render all current
fragment handling methods obsolete (am I being too ambitious?).  
So, please tell me what you think, and feel free to direct any flames to my
other account on /dev/null.
