I agree with Ari's assessment of the way tools like flow-report and flow-stat handle temporary storage of data: the issue is that they hold everything in memory until processing is complete.
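For what it's worth, my mental model of these tools (a guess on my part, not flow-report's actual code) is the classic accumulate-then-dump pattern: build a hash keyed by the report's key fields, bump counters per flow, and print everything at the end. A toy Perl sketch over a hypothetical "srcaddr,dstaddr,dstport,octets" CSV dump (e.g. something flow-export could produce in ASCII mode):

    #!/usr/bin/perl
    # Toy model of accumulate-then-dump (NOT flow-report's real code).
    # Assumes hypothetical "srcaddr,dstaddr,dstport,octets" CSV input.
    use strict;
    use warnings;

    my %octets;    # one entry per distinct key -- this is what grows

    while (<>) {
        chomp;
        my ($src, $dst, $dport, $oct) = split /,/;
        next unless defined $oct;
        $octets{"$src,$dst,$dport"} += $oct;   # memory scales with the
    }                                          # number of distinct keys

    print "$_,$octets{$_}\n" for keys %octets; # stats only appear at EOF

The point is that memory grows with the number of distinct keys, not the number of flows, which is why blowing up is unlikely but still possible on a wide key space like full src/dst address pairs.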
Now the issue is: what if this memory runs out? (I presume this is not very likely, given the limited number of possible IP, port, and other key combinations out there, but it can happen.) A strategy I was toying with is to do partial runs on each set of data, taking advantage of the fact that flow-tools has a beautiful pipe-friendly design (thanks, Mark). With this, we can let each node run a report on a partial data set and aggregate the results later. A quick and dirty solution would be to run flow-report on each subset of data and then use Perl (or some other simple scripting language) to aggregate the reports; a rough sketch follows below. A more elegant solution would be to hack flow-report to allow it to dump its table to a file, and have another flow-report load and aggregate the dumped tables.
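Here is a rough Perl sketch of the quick-and-dirty version. The assumptions are mine: each node has produced a partial report with something like "flow-cat <files> | flow-report -s stat.cfg -S my-report > nodeN.rpt" (see the flow-report man page for the exact invocation), and each partial is comma-separated "key,flows,octets,packets" lines; adjust the split to whatever fields your report actually emits.

    #!/usr/bin/perl
    # merge-reports.pl -- sum several partial flow-report outputs.
    # Hypothetical input format: "key,flows,octets,packets" per line;
    # match this to your report's real field order before using it.
    use strict;
    use warnings;

    my (%flows, %octets, %packets);

    while (<>) {
        chomp;
        next if /^#/;                        # skip comment headers
        my ($key, $fl, $oc, $pk) = split /,/;
        next unless defined $pk;
        $flows{$key}   += $fl;
        $octets{$key}  += $oc;
        $packets{$key} += $pk;
    }

    # emit the merged table, biggest talkers first
    for my $key (sort { $octets{$b} <=> $octets{$a} } keys %octets) {
        print join(',', $key, $flows{$key}, $octets{$key},
                   $packets{$key}), "\n";
    }

Run it as "./merge-reports.pl node1.rpt node2.rpt > combined.rpt". Note that this only works for additive counters; anything derived (percentages, averages) has to be recomputed from the merged sums, which is exactly what a "dump and reload" mode hacked into flow-report would have to do as well.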
my $0.02

Quoting Ari Leichtberg <[EMAIL PROTECTED]>:

> Thanks Jay.
>
> Regarding the question of disk placement, that would probably depend on
> how hard you're pushing the machines. In our case, we have a single
> collector handling 60 routers. That's been working just fine, but the
> collector is doing plenty of time-sensitive I/O, so it might make sense
> to leave the disk there. Also, our nodes will be connected with Gb
> Ethernet.
>
> The bigger question is how to manage the cluster's workload. We
> considered the static approach, as you suggested below, where each node
> is assigned a dedicated list of routers. The concern is scalability,
> since router loads change over time. The process-migration or
> load-balancing solutions are nicer because they distribute load
> dynamically, but I guess they're unreasonable with I/O. We might go
> with the static approach after all.
>
> As far as you know, are all clustering solutions inherently I/O
> inefficient, or are some of them OK? Did you check out OpenSSI?
>
> By the way, we're running reports on daily aggregations, so on some of
> the routers we can have 2GB of compressed data for a single report.
> Sometimes these processes run out of memory, and sometimes they take
> four hours. I'm probably pushing this system harder than I should, but
> the results are usually pretty good; I just need more processing power.
>
> On that note, does anybody know about the inner workings of
> flow-report? My general understanding is that it loads a huge hashtable
> (or other data structure) into memory and then basically dumps out
> quick stats. Not very CPU-intensive. Is that accurate?
>
> Ari
>
>
> -----Original Message-----
> From: Jay A. Kreibich [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 27, 2004 4:12 PM
> To: Ari Leichtberg
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Flow-tools] Flow-tools on linux cluster (Mosix)
>
> On Tue, Oct 26, 2004 at 05:09:04PM -0400, Ari Leichtberg scratched on
> the wall:
>
> > I'm wondering if anybody has any experience running flow-tools on a
> > Linux cluster. I have a dedicated Sun box running flow-capture,
> > collecting from around 60 Ciscos campus-wide, totaling over 16 GB of
> > level-6 compressed data per day. The flows are written to the
> > collector's local storage, and I have enough space to hold around 12
> > days' worth of data.
>
> We're only collecting off our exit routers and do ~14GB per day,
> although that's uncompressed.
>
> > My plan is to have a separate Linux cluster, NFS-mounted to the
> > collector's storage, which runs daily and hourly flow-reports,
> > flow-dscans, and other analyses. It's not uncommon for a router to
> > collect over 2GB per day, so the flow-report processes get pretty
> > I/O- and memory-heavy.
>
> Consider this: what requires more disk I/O -- the collector, which has
> an hour to do one pass over one hour's worth of data, or the analyzers,
> which have one hour to do all of your reports? Reports often require
> multiple passes and ideally don't take the whole hour.
>
> With that in mind, if you are going to write everything to disk and
> then do post-analysis, put the disks on the analyzers, not the
> collectors. The analyzers do even more I/O and will benefit much more
> from direct disk attachment. You definitely don't want the collector
> wasting resources serving NFS traffic!
>
> In the bigger picture, one of the problems with clusters for flow
> analysis is the volume of data involved. Most reports are fairly
> simple, so they tend to be I/O-bound (or compression-bound) on any
> modern machine. That's the worst case for clusters, since clusters
> inherently add I/O inefficiencies; for I/O-bound work you can actually
> make everything run slower on a cluster, although compression helps a
> little there.
>
> > Has anybody ever tried this with Mosix, or any other ideas for a
> > clustering solution?
>
> Because of these problems, one of the things we're looking at is
> putting in a SAN infrastructure with a clustered file system, so that
> multiple machines can access the same fibre-channel-attached
> filesystem at the same time. The collector writes, and everything else
> does what it needs. More or less what you're talking about, but using
> fibre channel rather than NFS. Once you remove most of the file
> transport problems, how you want to split up or distribute your
> computation is up to you. We're looking at static assignments, not
> load-balanced clustering, mostly because we aren't looking at
> process-migration type stuff.
>
> The other option is to pre-distribute the data. Have one (or more)
> collectors with big disks that act as your main collectors and
> archivers. Configure them to filter and split their data streams out
> to multiple machines in the cluster. Have each cluster node keep only
> one to two hours' worth of data, or better yet, do the reports in real
> time so the nodes need almost no storage at all. The sustained data
> rates are not exciting -- even with spikes you're only looking at
> something like 45Mb/s. If you back-channel multicast the data across
> the cluster, that's no problem. If you pre-filter it so each cluster
> node only services a few routers, it's even easier.
>
> There are lots of games to play here, but the big thing to remember is
> that collection data rates are almost always smaller than the required
> analysis data rates.
>
> I should also say that we use a custom collector and tool set, so I
> have no idea how easy/hard it would be to do some of these things with
> the public tools.
>
> -j
>
> --
> Jay A. Kreibich              | Comm. Technologies, R&D
> [EMAIL PROTECTED]            | Campus IT & Edu. Svcs.
> <http://www.uiuc.edu/~jak>   | University of Illinois at U/C
-------------------------------------------------------
William Emmanuel S. Yu
Department of Information Systems and Computer Science
Ateneo de Manila University
email : wyu at ateneo dot edu
web   : http://CNG.ateneo.net/cng/wyu/
phone : +63(2)4266001-4186
GPG   : http://CNG.ateneo.net/cng/wyu/wyy.pgp

_______________________________________________
Flow-tools mailing list
[EMAIL PROTECTED]
http://mailman.splintered.net/mailman/listinfo/flow-tools
