On Thu, Jul 21, 2005 at 05:55:55PM +0400, Alexey Lobanov scratched on the wall:
> Does anyone know a way for additional optimization of raw netflow
> records, by merging all events during the *specified* period (i.e., 1
> hour) having same src, dst and ports?
Assuming you want to maintain the same definition of "flow", you'd
need to match on a lot more than that... src/dest IPs, protocol,
src/dst ports (if applicable), and src/dst interfaces (very important
in the spec) for starters. You'd also want to look for things like
TCP FIN flags, just like the exporter does.
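Just to make the matching requirement concrete, here is a minimal sketch of that match key in Python. The field names are my own invention, not from any particular NetFlow toolkit:

```python
from collections import namedtuple

# A sketch of the match key described above.  Field names are
# illustrative, not from any particular NetFlow toolkit.
FlowKey = namedtuple("FlowKey", [
    "src_ip", "dst_ip",       # source / destination addresses
    "protocol",               # IP protocol number (6 = TCP, 17 = UDP, ...)
    "src_port", "dst_port",   # meaningful only for TCP/UDP
    "input_if", "output_if",  # src / dst interface indices (very important)
])

def same_flow(a, b):
    """Two records can belong to the same logical flow only if every
    one of these fields matches exactly."""
    return FlowKey(*a) == FlowKey(*b)
```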
So assuming you want to keep the same semantics for a flow, we need to
look at why a flow is exported. In general, there are four reasons
why a flow is exported:
1. Network transport protocol transaction ends (e.g. TCP FIN flags seen)
2. Max-lifetime timeout
3. No activity timeout.
4. Netflow cache overflow (auto reduction of inactivity
timer, on most exporters).
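For a post-processor, reason 1 at least gives you a hard rule: a record that ended with a TCP FIN should never be stitched to a later record. A minimal check, with illustrative field names:

```python
TCP = 6         # IP protocol number for TCP
TCP_FIN = 0x01  # FIN bit in the TCP flags byte

def mergeable(rec):
    # A record that ended because the transport transaction finished
    # (reason 1 above) should not be stitched to a later record.
    # "proto" and "tcp_flags" are illustrative field names.
    return not (rec["proto"] == TCP and rec["tcp_flags"] & TCP_FIN)
```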
We can ignore the first one, since a flow is a flow and it is
supposed to end if that happened. The last one is just hard luck and
means you're not running a device configuration that is matched to
your traffic patterns (or are under a big DOS). While you could
work around this, it would be better to add memory and/or upgrade
your devices.
That leaves the middle two issues: maximum cache entry lifetime
timeout, and the no-activity timeout.
> The aim is to save disk space not
> loosing important information regarding traffic details. Actually, same
> operation is done inside of cisco box - but the aggregation time is too
> small in most cases. And further optimisation in a dedicated
> high-performance computer seems to be quite feasible.
I think you can address this problem much more easily by adjusting
the timeout values in your netflow exporters. As you say, some of
the aggregation is already done on the exporters; if you don't
like the timeouts, it is easiest to just change them rather
than trying to post-process around them.
IIRC, by default on Cisco exporters, the max-lifetime timeout is 30
min., and the no-activity timeout is on the order of 15 seconds.
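On classic Cisco IOS exporters, for example, those two timers are adjusted with something along these lines (syntax varies by platform and software version, so treat this as a sketch and check your documentation):

```
ip flow-cache timeout active 30     ! max-lifetime timer, in minutes
ip flow-cache timeout inactive 15   ! no-activity timer, in seconds
```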
First, let's look at the max-lifetime timer.
While I obviously can't speak for your traffic situation, in our flow
records the max-lifetime timer is hit by considerably less than 1% of
our flows (although a lot of this depends on your uplink speed--
slower links will have a larger number of longer flows). In
addition, of those flows that do tend to hit the max-lifetime timer,
a large percentage are connections that are constant and more or less
never-ending (streaming data, research data transfer from off-site
facilities, etc., in our case). In other words, adjusting the
max-lifetime timer up even higher to 60 or 90 minutes is not likely to
significantly reduce the total number of flows that hit the
max-lifetime timeout. It may catch a few long-lived downloads
(e.g. 40 min. downloading a new ISO image), but not some of the
larger stuff.
So two points: first, the total number of flows hitting this timer is
usually very very small. On top of that, adjusting the timer upwards
is not likely to greatly reduce or zero out these timeouts (although
adjusting it downwards to five min. or so will show a noticeable
increase in the number of exported flows).
The end result is very very little savings from "stitching" these
types of flows back together (and a whole lot of work to do so).
Even if you open up your window to 60 min., you're only seeing a
savings of 50% in those (very) few flows that fall into this
category.
OK, that leaves just the no-activity timeouts.
Now here you have something. If you have a lot of stuttered traffic,
there is some possibility to stitch these flows back together,
although it will trip up some of the statistics packages that make
assumptions about the max size of a "hole" in the flow (e.g. periods
of inactivity, which affect averaging of flows, bytes, and packets
over the lifetime of the flow; that's some pretty advanced
analysis, but we've got some people doing research along those lines).
Adjustment of the timer up to something like 60 seconds is likely to
catch a fair number of long-duration flows that are very start/stop,
but not all of them. Stuff like ssh sessions that may stay open for
days, with just the occasional keep-alive packet, are always going to
get broken up.
There's a catch to raising this timer, however-- IIRC, with the exception
of TCP flows (which can look for the FIN flag), all other protocols use
the inactivity timer to "end" and export the flow. If you crank up
the no-activity timer to something like 60 seconds, every DNS lookup
(for example) will keep two flow records in the cache for 60 seconds.
Compared to the default 15 seconds, this could quadruple
the resource requirements of the cache.  The end result
is that you need a lot of memory in your netflow exporter, and if
you've got a major uplink, you're going to need a HUGE on-device cache.
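A back-of-the-envelope way to see this: by Little's law, the number of entries resident in the cache is the flow arrival rate times the average residence time, and for short flows (DNS and friends) the residence time is essentially the inactivity timeout. The flow rate below is purely illustrative:

```python
def cache_entries(flows_per_sec, inactive_timeout_sec):
    # Little's law: resident entries = arrival rate * residence time.
    # For short flows the residence time is roughly the inactivity
    # timeout, since nothing else flushes them from the cache.
    return flows_per_sec * inactive_timeout_sec

# Illustrative rate of 1000 short flows/sec:
#   15 s timeout -> 15,000 resident entries; 60 s -> 60,000 (4x).
```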
It is a bit more difficult to predict how much savings this would
result in. We can get a rough idea by looking at TCP flows, since we
can look at the headers to see how many "complete" flows there are.
That's not a strictly valid sub-sample, since protocols with very
short expected lifetimes (i.e. DNS) are built on top of UDP-- on one
hand, short stuff isn't expected to be affected by the inactivity
timers, but on the other, UDP depends on the inactivity timer to flush from
the cache. Still, these numbers should give us some ballpark figures.
So... let's look at just the TCP flows for some random hour:
    Total TCP flows:                              6469234
      (not filtering out flows with a duration
      of 30 min.-- i.e. flows that hit the max
      lifetime timer-- since this is less than 1%)

    Flows with SYN and FIN flags ("full" flows):  3486305  (53% of total)
    Flows with just SYN flag ("start" flows):     1602225  (25% of total)
    Flows without SYN or FIN ("middle" flows):    1023851  (16% of total)
    Flows with just FIN flag ("end" flows):        356853  ( 6% of total)
                                                           --------------
                                                            100% of total

    Flows with RST (reset) flag:                   716783  (11% of total)
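The bucketing in that table comes straight from the SYN and FIN bits of the cumulative TCP flags field that exporters keep per flow; a sketch:

```python
TCP_FIN, TCP_SYN, TCP_RST = 0x01, 0x02, 0x04  # standard TCP flag bits

def classify(tcp_flags):
    """Bucket a TCP flow record as in the table above, based on the
    cumulative OR of the TCP flags seen over the flow's lifetime."""
    syn = bool(tcp_flags & TCP_SYN)
    fin = bool(tcp_flags & TCP_FIN)
    if syn and fin:
        return "full"
    if syn:
        return "start"
    if fin:
        return "end"
    return "middle"
```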
What this says is that around 53% of flows are "complete"-- i.e. the
TCP transaction runs start to end within one flow. The rough balance
is made up of "start", "middle", and "end" flows. While this looks
like it points to at best 22% savings for stitching all those middle
and end flows together with their start flows, I'm not sure that's the
case. Things are muddled by the fact that the number of "start" and
"end" flows don't match (even if you add resets to ends); further,
the large number of reset flags muddies what is going on--
especially since resets can be found in every combo of SYN and FIN
flags. Then again, this is pushing my understanding of TCP just a bit.
TCP is roughly 66% of our flows. I think it unlikely that this kind
of stitching would reduce the UDP stuff very much. The type of
traffic patterns seen in UDP just aren't the same.
So our rough numbers are that TCP will show something on the order of
20% reduction, which might be about 15% total (since you'll get some
UDP savings). I'll be the first to say there's a lot of fluff in that
number, but it shouldn't be radically off from the true value
(assuming your traffic looks more or less like ours). Returns will
be further reduced if you use a static window (e.g. one hour, every
hour) rather than a sliding window.  The static window is much easier
to program, but it won't give you as much reduction as a sliding window.
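For what it's worth, the static-window version really is only a few lines; a hedged sketch, with a deliberately simplified record layout (key, first-seen, last-seen, packets, bytes):

```python
def stitch_static_window(records):
    """Merge all records within one static window (e.g. one hour) that
    share the same flow key.  Real flow records carry many more fields;
    this layout is simplified for illustration."""
    merged = {}
    for key, first, last, pkts, octets in records:
        if key in merged:
            f, l, p, o = merged[key]
            # Widen the time span and sum the counters.
            merged[key] = (min(f, first), max(l, last), p + pkts, o + octets)
        else:
            merged[key] = (first, last, pkts, octets)
    return merged
```

A sliding window instead has to expire keys as time advances rather than draining everything at the hour boundary, which is where most of the extra complexity (and the extra reduction) lives.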
Although numbers are difficult to predict, if you have a healthy
amount of memory on your export device and your flow cache is not
very full, you might try increasing the inactivity timer and see
what kind of flow reductions you get from that. You aren't going to
see the full 15% savings (assuming that's a reasonable number), but
you might see 10ish or 12ish percent. Bumping up the timer may give
you a better picture of what kind of returns you can get from a
stitching program without having to go out and build it. I'm guessing
you'll see most of any possible returns from that timer adjustment,
and assuming it doesn't overload your export cache, that seems a lot
easier than engineering some complex system to stitch flows back
together. Personally, the returns just don't seem worth it, unless
you have some radically strange traffic patterns that generate a large
number of "broken" flows.
You might also consider that the only practical way of doing this for
larger chunks of diverse flows is to hold the whole thing in memory.
While I have no idea what your traffic patterns are like, we can
easily see 5GB of raw binary flow data per hour. Assuming data
structure overhead, you're looking at around 8GB or more to hold the
data structure required to stitch all that together. I question if
that can be done in virtual memory (since you need to do everything
within an hour), so you're looking at perhaps 4 to 6GB of RAM (if not
actually 8 or more). That's a pretty serious investment of several
thousand dollars (not only for the RAM, but a nice system that will
hold that much), and for what? Assuming your traffic is really
strange and you save 50% of your disk space and have turned that
5GB into 2.5GB, you've saved a whopping $5 or so in Fibre Channel RAID
space, or about $2 in traditional hard drive space. Obviously
your data volume may not be that big or require that much memory,
but if that's the case, your savings will be much smaller as well.
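Spelled out, that sizing guess is just the raw hourly volume times a structure-overhead fudge factor; the 1.6 below is my own guess chosen to match the 5GB-to-roughly-8GB figure above, and your overhead will differ:

```python
def stitch_ram_gb(raw_gb_per_hour, overhead=1.6):
    # overhead: guessed multiplier for in-memory data structures
    # (pointers, hash buckets, alignment) over the raw binary records.
    return raw_gb_per_hour * overhead
```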
In the whole chain of expenses-- the export router, the collection
server, and the disk attached to it-- the cheapest item is usually
going to be the disks, unless you get into a nicer and larger RAID.
There's also the thought that if you're running things so close to
the edge that you need to save 10% to 15% just to get it on the disk,
you're in no condition to absorb a DOS or worm. We can see huge
fluctuations in our flow export rates when new worms hit the net, or
when our own network is under a DOS (from the inside OR the outside).
Unless you've got a very controlled network with some type of IDS/IPS
in front of your netflow export device, you need the ability to
absorb those kind of network events. When things are running the way
I want, our disks are always 20% or more free "just in case"-- and I'm
talking about a setup that can save a fair fraction of a year.
It is also worth saying that something as simple as gzip'ing the
flow files will give you a ~70% reduction. All that costs is a bit
of CPU time and access speed. Plus, it is simple and easy to automate.
In our case we gzip flow files after two weeks, when they go into a
quasi-archive state for a few months until we delete them.
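That kind of aging job is trivial to script; here's a sketch using Python's standard gzip module (the filename pattern in the test is just illustrative):

```python
import gzip
import os
import shutil

def gzip_flow_file(path):
    """Compress a flow file and remove the original -- the simple
    space-saver described above (~70% for typical flow data)."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return path + ".gz"
```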
* * *
I'll be the first to admit these numbers are uniquely ours.  I've got
a lot of netflow experience on our net, but I've not had much time to
look at numbers and traffic patterns for a "typical" business.
University networks, especially large ones, are unique beasts. I'm
not going to flatly say you will NOT see a significant reduction in
total flows by stitching, but I'd be doubtful.  Maybe you can run
some of these numbers on your own data and get a better idea of
how it would work out for you.
We have actually looked at writing something that cross-correlated
flows and tried to match flows from the same network transaction back
together, but our goal in that had more to do with traffic pattern
analysis and some heavy-duty research into traffic shapes and such
nonsense. In some cases, we actually wanted to reduce the inactivity
timer to just a few seconds so that we could more clearly see "holes"
in the traffic (and then use a stitching program to recreate the
whole flow pattern). We never wrote it; first, because the complexity
and size of the problem was just too out of control for the amount of
data we generate, and second, the people working on the research
(unofficially) never really had the time to set up a proper research
proposal. But it did sound like interesting stuff.
I'd love to see some of these numbers for a different style of
network.
-j
--
Jay A. Kreibich | CommTech, Emrg Net Tech Svcs
[EMAIL PROTECTED] | Campus IT & Edu Svcs
<http://www.uiuc.edu/~jak> | University of Illinois at U/C
_______________________________________________
Flow-tools mailing list
[EMAIL PROTECTED]
http://mailman.splintered.net/mailman/listinfo/flow-tools