On Tue, Oct 26, 2004 at 05:09:04PM -0400, Ari Leichtberg scratched on the wall:

> I'm wondering if anybody has any experience running flow-tools on a
> Linux cluster.  I have a dedicated Sun box running flow-capture,
> collecting from around 60 Ciscos campus wide, totaling over 16 GB of
> level-6 compressed data per day.  The flows are written to the
> collector's local storage, and I have enough space to hold around 12
> days' worth of data.

  We're only collecting off our exit routers and do ~14GB per day,
  although that's uncompressed.

> My plan is to have a separate Linux cluster, nfs mounted to the
> collector's storage, which runs daily and hourly flow-reports,
> flow-dscans, and other analyses.  It's not uncommon for a router to
> collect over 2GB per day, so the flow-report processes get pretty IO and
> memory heavy.

  Consider this: what requires more disk I/O, the collector, which has
  an hour to do one pass over one hour's worth of data, or the
  analyzers, which have one hour to do all of your reports?  Reports
  often require multiple passes and ideally don't take the whole hour.
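  Back-of-envelope, with some assumed numbers (the 16 GB/day is from
  your post; the 3 passes and 30-minute target are illustrative
  guesses):

```python
# Compare the disk throughput the collector needs against what the
# analyzers need.  16 GB/day is from the original post; the pass
# count and report window are assumptions for illustration.

GB = 10**9

daily_volume = 16 * GB
hourly_volume = daily_volume / 24

# Collector: one write pass, spread over the full hour.
collector_rate = hourly_volume / 3600  # bytes/sec

# Analyzers: several read passes, ideally done well inside the hour.
passes = 3
report_window = 30 * 60  # seconds
analyzer_rate = (hourly_volume * passes) / report_window

print(f"collector: {collector_rate / 1e6:.2f} MB/s")
print(f"analyzers: {analyzer_rate / 1e6:.2f} MB/s")
print(f"ratio: {analyzer_rate / collector_rate:.0f}x")
```

  Even with these mild assumptions the analyzers need several times
  the collector's sustained throughput, which is the whole point.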

  With that in mind, if you are going to write everything to disk and
  then do post-analysis, put the disk on the analyzers, not the
  collectors. They do even more I/O and will benefit a lot more from
  the direct disk attachment.  You definitely don't want the collector
  wasting lots of resources doing NFS server traffic!

  In the bigger picture, one of the problems with clusters for flow
  analysis is the volume of data involved.  Most people run reports
  that are fairly simple, so they tend to be I/O bound (or compression
  processing bound) on any modern machine.  That's the worst case for
  clustering, since clusters inherently add I/O inefficiencies; for
  I/O-bound work you can actually make everything run slower on a
  cluster, although the compression helps a little there.
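  A toy model of that effect, with made-up (but plausible-for-2004)
  throughput numbers:

```python
# Toy model: an I/O-bound report can run slower on a cluster because
# every node pulls its slice through the one NFS server, whose
# aggregate throughput is below a local disk's.  All rates here are
# illustrative guesses, not measurements.

data_mb = 2000            # one busy router-day (figure from the post)
local_disk_mbs = 60       # local sequential read, MB/s (assumed)
nfs_server_mbs = 10       # aggregate NFS throughput, MB/s (assumed)
cpu_sec = 30              # CPU cost of a simple report (assumed)
nodes = 8

single_box = data_mb / local_disk_mbs + cpu_sec
cluster = data_mb / nfs_server_mbs + cpu_sec / nodes

print(f"single box: {single_box:.0f} s")
print(f"cluster:    {cluster:.0f} s")
```

  The cluster divides the 30 seconds of CPU nicely, but the shared
  file server turns the cheap part of the job into the bottleneck.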

> Has anybody ever tried this with Mosix, or any other ideas for a
> clustering solution? 

  Because of these problems, one of the things we're looking at is putting
  in a SAN infrastructure with a clustered file system so that multiple
  machines can access the same fibre-channel attached filesystem at
  the same time.  The collector writes, and everything else reads what
  it needs.  More or less what you're talking about, but using
  fibre-channel rather than NFS.  Once you remove most of the file
  transport problems, how you want to split up or distribute your
  computation is up to you.  We're looking at static assignments, not
  load balanced clustering, mostly because we aren't looking at process
  migration type stuff.

  The other option is to pre-distribute the data.  Have one (or more)
  collectors with big disks that act as your main collectors and
  archivers.  Configure them to filter and split their data streams
  across multiple machines in the cluster.  Have each cluster node
  keep only an hour or two of data, or better yet, do the reports in
  real time so they need almost no storage at all.  The sustained data
  rates are not exciting-- even with spikes you're only looking at
  something like 45 Mb/s.  If you back-channel multicast the data
  across the cluster, that's no problem.  If you pre-filter it so each
  cluster node only services a few routers, it's even easier.
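  The arithmetic on those rates, assuming the 16 GB/day and 45 Mb/s
  figures above and a gigabit back-channel:

```python
# Rough check that the raw flow stream is easy to fan out on a LAN.
# 16 GB/day and the 45 Mbit/s spike are figures from the discussion;
# the gigabit back-channel link is an assumption.

GB = 10**9
daily_volume = 16 * GB

avg_mbps = daily_volume * 8 / 86400 / 1e6   # average Mbit/s
spike_mbps = 45                              # stated worst case
link_mbps = 1000                             # gigabit back-channel

print(f"average rate:   {avg_mbps:.1f} Mbit/s")
print(f"spike headroom: {link_mbps / spike_mbps:.0f}x")
```

  The average is under 2 Mbit/s, and even the spikes leave over 20x
  headroom on a gigabit link, so multicasting the whole stream to
  every node is cheap.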

  There are lots of games to play here, but the big thing is to
  remember that collection data rates are almost always smaller than
  required analysis data rates.

  I should also say that we use a custom collector and tool set, so I
  have no idea how easy/hard it would be to do some of these things
  with the public tools.

   -j

-- 
                     Jay A. Kreibich | Comm. Technologies, R&D
                        [EMAIL PROTECTED] | Campus IT & Edu. Svcs.
          <http://www.uiuc.edu/~jak> | University of Illinois at U/C
_______________________________________________
Flow-tools mailing list
[EMAIL PROTECTED]
http://mailman.splintered.net/mailman/listinfo/flow-tools