On Tue, Oct 26, 2004 at 05:09:04PM -0400, Ari Leichtberg scratched on the wall:
> I'm wondering if anybody has any experience running flow-tools on a
> Linux cluster. I have a dedicated Sun box running flow-capture,
> collecting from around 60 Ciscos campus-wide, totaling over 16 GB of
> level-6 compressed data per day. The flows are written to the
> collector's local storage, and I have enough space to hold around 12
> days' worth of data.
We're only collecting off our exit routers and do ~14GB per day,
although that's uncompressed.
> My plan is to have a separate Linux cluster, nfs mounted to the
> collector's storage, which runs daily and hourly flow-reports,
> flow-dscans, and other analyses. It's not uncommon for a router to
> collect over 2 GB per day, so the flow-report processes get pretty
> I/O- and memory-heavy.
Consider this: which requires more disk I/O, the collector, which has
an hour to do one pass over one hour's worth of data, or the analyzers,
which have one hour to run all of your reports? Reports often require
multiple passes and ideally don't take the whole hour.
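To put rough numbers on that argument: the 16 GB/day figure is from the
original post, but the report and pass counts below are made-up
illustrations, so treat this as a back-of-the-envelope sketch only.

```python
# Back-of-the-envelope I/O budget: collector vs. analyzers.
# Only the 16 GB/day figure comes from the original post; the
# report/pass counts are illustrative assumptions.

GB = 1e9
data_per_day = 16 * GB              # compressed flow data per day
data_per_hour = data_per_day / 24

# Collector: one sequential pass over one hour's data, per hour.
collector_mb_s = data_per_hour / 3600 / 1e6

# Analyzers: suppose 5 reports, each needing 2 passes, all in the hour.
reports, passes = 5, 2
analyzer_mb_s = data_per_hour * reports * passes / 3600 / 1e6

print(f"collector: {collector_mb_s:.2f} MB/s")
print(f"analyzers: {analyzer_mb_s:.2f} MB/s")  # 10x the collector here
```

Even with modest assumptions, the analyzers' required throughput is a
multiple of the collector's, which is the whole point.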
With that in mind, if you are going to write everything to disk and
then do post-analysis, put the disks on the analyzers, not the
collectors. The analyzers do far more I/O and will benefit much more
from direct disk attachment. You definitely don't want the collector
wasting resources serving NFS traffic!
In the bigger picture, one of the problems with clusters for flow
analysis is the volume of data involved. Most people run reports that
are fairly simple, so they tend to be I/O bound (or bound on
decompression) on any modern machine. That's the worst case for
clustering, since clusters inherently add I/O overhead; for I/O-bound
work a cluster can actually make everything run slower, although
compression helps a little there.
> Has anybody ever tried this with Mosix, or any other ideas for a
> clustering solution?
Because of these problems, one of the things we're looking at is putting
in a SAN infrastructure with a clustered file system so that multiple
machines can access the same fibre-channel attached filesystem at
the same time. The collector writes and everything else does what it
needs. More or less what you're talking about, but using
fibre-channel rather than NFS. Once you remove most of the file
transport problems, how you want to split up or distribute your
computation is up to you. We're looking at static assignments, not
load balanced clustering, mostly because we aren't looking at process
migration type stuff.
The other option is to pre-distribute the data. Have one (or more)
collectors with big disks that act as your main collectors and
archivers.
Configure them to filter and split their data streams up to multiple
machines in the cluster. Have each cluster node keep only one or two
hours' worth of data, or better yet, do the reports in real time so
they need almost no storage at all. The constant data rates are not
exciting-- even with spikes you're only looking at roughly 45 Mb/s. If
you back-channel multicast the data across the cluster, that's no
problem. If you pre-filter it, so each cluster node only services a
few routers, it is even easier.
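For the stock tool set, flow-fanout(1) is the piece that does this kind
of stream replication. A sketch of the idea follows; the addresses and
ports are made up, and you should check the flow-fanout(1) man page for
the exact option set (per-router splitting may need separate fanout
instances or listener ports rather than one invocation):

    # Hypothetical layout: routers export to the archiver on port 9991;
    # flow-fanout replicates the raw PDU stream to two analysis nodes.
    # Arguments are local_ip/remote_ip/port triplets, 0 = wildcard.
    flow-fanout 0/0/9991 \
                0/10.0.1.11/9991 \
                0/10.0.1.12/9991

Each analysis node can then run its own flow-capture with a small
expire size so it only ever holds an hour or two of data.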
There are lots of games to play here, but the big thing is to
remember that collection data rates are almost always smaller than
required analysis data rates.
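The collection rate is easy to sanity-check from the numbers above
(16 GB/day is from the original post; the 45 Mb/s spike figure is
quoted, not derived):

```python
# Average line rate implied by 16 GB/day of (compressed) flow data.
gigabytes_per_day = 16
bits_per_day = gigabytes_per_day * 1e9 * 8
avg_mbit_s = bits_per_day / 86400 / 1e6   # seconds/day, bits -> megabits

print(f"{avg_mbit_s:.2f} Mb/s average")   # ~1.48 Mb/s, well under 45 Mb/s
```

An average well under 2 Mb/s is trivial to re-transmit around a
cluster, which is why distributing the raw stream is cheap compared to
distributing the analysis I/O.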
I should also say that we use a custom collector and tool set, so I
have no idea how easy/hard it would be to do some of these things
with the public tools.
-j
--
Jay A. Kreibich | Comm. Technologies, R&D
[EMAIL PROTECTED] | Campus IT & Edu. Svcs.
<http://www.uiuc.edu/~jak> | University of Illinois at U/C
_______________________________________________
Flow-tools mailing list
[EMAIL PROTECTED]
http://mailman.splintered.net/mailman/listinfo/flow-tools