Re: [Flow-tools] Flow-tools aggregation perfomance

Jay A. Kreibich Sun, 04 Jul 2004 21:22:59 -0700

On Thu, Jul 01, 2004 at 09:32:03AM -0400, Joe Loiacono scratched on the wall:
> 
> 
> For some insight see:
> 
> http://mailman.splintered.net/pipermail/flow-tools/2004-March/002019.html
> 
> I *think* Jay does flow collection for Abilene.


  Actually, no.  The flow data we collect is from the DMZ router that
  feeds only our campus-- that doesn't even include data to/from NCSA
  and the rest of the world (they've got a really big Abilene pipe).

> I have a question regarding practical usage of flow-tools with high
> volumes of traffic :)

> And performance is too low to process my volumes.
> A 15-minute binary file takes at least 18 minutes to process (100Mb
> binary) and, at worst, more than one hour (300Mb binary) on a
> single-processor P4 1800 with 256M RAM.
> Of course, in a production configuration the machine could be a little
> stronger :)

  Unless you are doing fantastically complex analysis (including things
  like flow reassembly and stuff of that nature), you will almost
  always be I/O bound.  If not, the other problem would be running out
  of RAM if you are aggregating at a detailed level.  That's going to
  cause lots of paging, which gets back to I/O being a serious problem.
  Non-linear scaling of analysis time makes me think you might be
  paging, but I'd have to understand more about what you are trying to
  do to say that with any confidence.
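
  As a rough check of the scaling, the figures quoted above can be
  reduced to a per-megabyte cost.  This is only a sketch: it assumes
  "Mb" means megabytes and treats "more than one hour" as exactly 60
  minutes, so the second figure is a lower bound.

```python
# Figures quoted in the thread: (input size in MB, run time in minutes).
# Assumption: "more than one hour" is taken as 60 minutes (a lower bound).
runs = [(100, 18.0), (300, 60.0)]

def minutes_per_mb(size_mb, minutes):
    """Processing cost per megabyte of flow data."""
    return minutes / size_mb

def throughput_kb_per_s(size_mb, minutes):
    """Effective read-and-aggregate rate in KB/s."""
    return size_mb * 1024 / (minutes * 60)

rates = [minutes_per_mb(s, m) for s, m in runs]
size_ratio = runs[1][0] / runs[0][0]   # how much more data
time_ratio = runs[1][1] / runs[0][1]   # how much more time (at least)

for (size_mb, minutes), rate in zip(runs, rates):
    kbs = throughput_kb_per_s(size_mb, minutes)
    print(f"{size_mb} MB: {rate:.2f} min/MB, about {kbs:.0f} KB/s")
print(f"size x{size_ratio:.1f}, time x{time_ratio:.2f} or more")
```

  The per-megabyte cost goes up with the larger file rather than
  staying flat, which fits the paging guess but does not prove it.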

> Are there any ways to optimize aggregation? For example, to tag flows by
> exporter ip-address on one machine, then flow-send it to another, and then
> aggregate by networks? :) 

  If you did that, you may as well have the router send it to the
  individual machines directly (unless you want a central repository
  for all original flow data)-- although a fan-out to "network
  aggregators" is a definite possibility.
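
  One way such a fan-out could pick a destination is a stable hash of
  the exporter address, so every flow from one router lands on the same
  aggregator.  A minimal sketch, not part of flow-tools; the aggregator
  names and the pick_aggregator helper are made up for illustration.

```python
import ipaddress
import zlib

# Hypothetical aggregator hosts; replace with real destinations.
AGGREGATORS = ["agg-a", "agg-b", "agg-c"]

def pick_aggregator(exporter_ip, aggregators=AGGREGATORS):
    """Map an exporter address to one aggregator, the same one every time."""
    packed = ipaddress.ip_address(exporter_ip).packed
    # crc32 is stable across runs and machines, unlike Python's hash().
    return aggregators[zlib.crc32(packed) % len(aggregators)]

for exporter in ["192.0.2.1", "192.0.2.2", "198.51.100.7"]:
    print(exporter, "->", pick_aggregator(exporter))
```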

> Is there anyone who uses flow-tools for big
> volume calculations?

  The other thought would be to do the aggregation in real time, and
  not as post analysis.  I'm not sure that will help if it takes 15+
  minutes to look at 15 min. of data, but I also assume you're
  collecting the next 15 min. on the same machine at the same time.
  Only having to deal with one data stream at a time may recover enough
  resources to let you do this, although you're walking a pretty thin
  line.  On the other hand, if that line is I/O, it could be a large
  win.
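
  To make the real-time idea concrete, here is a minimal sketch of an
  aggregator that updates running per-network totals as each flow
  record arrives, instead of re-reading a capture file afterwards.  It
  is not flow-tools code; the class, the field names (src_ip, octets,
  packets) and the /24 grouping are assumptions for illustration.

```python
import ipaddress
from collections import defaultdict

class StreamingAggregator:
    """Keep running byte/packet totals per source network."""

    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len
        # network -> [bytes, packets]
        self.totals = defaultdict(lambda: [0, 0])

    def add(self, src_ip, octets, packets):
        """Fold one flow record into the totals for its network."""
        net = ipaddress.ip_network(f"{src_ip}/{self.prefix_len}", strict=False)
        entry = self.totals[net]
        entry[0] += octets
        entry[1] += packets

    def flush(self):
        """Return the current totals and start a new, empty interval."""
        snapshot = {str(net): tuple(v) for net, v in self.totals.items()}
        self.totals.clear()
        return snapshot

agg = StreamingAggregator()
agg.add("10.1.2.3", 1500, 1)
agg.add("10.1.2.200", 500, 2)
agg.add("10.9.0.1", 40, 1)
print(agg.flush())
```

  Memory use here grows with the number of distinct networks seen in an
  interval, not with the number of flows, which is the point of
  aggregating on arrival.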

  If you are having a memory problem, you could save out smaller
  snapshots-- perhaps 5 min. or even 1 min.-- and then just reassemble
  those into whatever time slice you want (realtime or post-processing).
  If you are indeed having a RAM shortage caused by a large aggregation
  tree, that might help.  Or, obviously, get boatloads more RAM.
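
  Reassembling small snapshots into a larger time slice can be as
  simple as summing the counters key by key.  A minimal sketch, assuming
  each snapshot is a mapping from an aggregation key (here a network
  prefix) to a byte count; the merge_snapshots helper and the sample
  data are made up for illustration.

```python
from collections import Counter

def merge_snapshots(snapshots):
    """Sum per-key byte counts from several short-interval snapshots."""
    total = Counter()
    for snap in snapshots:
        total.update(snap)   # Counter.update adds counts, it does not replace
    return dict(total)

# Three hypothetical 1-minute snapshots rolled up into one slice.
one_minute = [
    {"10.1.2.0/24": 100, "10.9.0.0/24": 7},
    {"10.1.2.0/24": 50},
    {"10.9.0.0/24": 3, "172.16.0.0/24": 20},
]
print(merge_snapshots(one_minute))
```

  This works for plain sums (bytes, packets, flow counts).  Figures
  such as distinct-host counts cannot be rebuilt by adding the
  per-interval values and would need the underlying sets kept.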

  When we first started to do data analysis on our flowdata, we looked
  at a lot of different aggregation patterns and calculations.  Beyond
  a few simple 24 hour summaries, we pretty much stopped doing it.  It
  was too much data for anyone to look at, and it seemed that every
  time we did want to look at the data, we needed a new aggregation
  or analysis pattern for whatever question we were looking to answer.
  Whatever we set up last month did not address the questions we had
  this month.

  In the end, we've found it is easier to just let most of the data sit
  where it is.  If we have a question that needs to be answered, we
  will set up and run the queries by hand, as needed.  Saves us machine
  time, programming, and maintenance, and since every question seems to
  require a custom query, it doesn't really take us any longer.

  The big exception to this is if you are building a billing system,
  and really do need to run the same analysis over and over and over.
  We're not (yet) doing that.

  I should also point out that all the software we use for flow
  management is in-house custom stuff that was designed from day one to
  deal with very high volumes of traffic.  It isn't always pretty and
  has somewhat limited functionality, but it tends to be very careful
  about memory management and performance issues.

   -j

-- 
                     Jay A. Kreibich | Integration & Software Eng.
                        [EMAIL PROTECTED] | Campus IT & Edu. Svcs.
          <http://www.uiuc.edu/~jak> | University of Illinois at U/C
_______________________________________________
Flow-tools mailing list
[EMAIL PROTECTED]
http://mailman.splintered.net/mailman/listinfo/flow-tools
