Re: [pmacct-discussion] Classification

Jaime Nebrera Thu, 16 Nov 2006 23:44:52 -0800

  Hi Sven and the rest,

> > A direct flow to sql translation is a dead end no matter threads or no
> > threads, the database won't be able to support this.
> 
> this conclusion is a bit too fast for me. What is a "direct flow to sql
> translation"? In my terminology, whatever is written to the database in
> the end is a flow by definition. Of course, you can aggregate several
> flows to a new flow (for example by reducing their dimensions or time
> resolution), but it stays a flow in the end.


  OK, I will try to explain this better. IMHO, just inserting flows into the
database as they arrive will kill performance. We found some OSS projects
doing this and the results were not nice. So we decided to compute the data
every X minutes (in our case 5 min) and try to reduce the number of rows in it.
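
  To make the "every X minutes" idea concrete, here is a minimal sketch in
Python. It is only an illustration, not our actual code: the tuple layout
(timestamp, key, bytes) and the names window_start and bin_flows are
assumptions made up for this example.

```python
from collections import defaultdict

# Assumption: 5-minute windows, expressed in seconds.
WINDOW_SECONDS = 300

def window_start(ts, window=WINDOW_SECONDS):
    """Round a Unix timestamp down to the start of its window."""
    return ts - (ts % window)

def bin_flows(flows, window=WINDOW_SECONDS):
    """Sum bytes per (window, key) so each key yields one row per window.

    flows: iterable of (timestamp, key, bytes) tuples (hypothetical layout).
    """
    rows = defaultdict(int)
    for ts, key, nbytes in flows:
        rows[(window_start(ts, window), key)] += nbytes
    return dict(rows)

flows = [
    (1000, "a", 10),
    (1100, "a", 20),   # same 5-minute window as the flow above
    (1250, "a", 5),    # next window
    (1250, "b", 7),
]
rows = bin_flows(flows)
```

  Four arriving flows become three rows here. How much you gain depends
entirely on how often the same key repeats inside one window.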

  As you say, the level of reduction clearly depends on the kind of traffic
you are trying to monitor. If in essence every flow is unique, then you can't
reduce it much. Besides the things I will discuss below, one of the things we
did was to remove or aggregate the "unprivileged port" information in the
flows, so a typical browser opening multiple connections becomes just one
entry (yes, multiple instances arrive, but they are then translated to a
single unprivileged -> 80).
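
  A rough sketch of that port collapsing, in Python. This is an illustration
under assumptions, not our implementation: I assume "unprivileged" means
ports above 1023, I use 0 as the placeholder for any unprivileged port, and
the tuple layout and function names are made up for the example.

```python
from collections import defaultdict

# Assumption: ports 0-1023 are "privileged" and kept as they are.
PRIVILEGED_MAX = 1023

def normalize_port(port):
    """Keep privileged ports, replace any unprivileged port with 0."""
    return port if port <= PRIVILEGED_MAX else 0

def collapse_ports(flows):
    """Aggregate bytes after normalizing both ports of each flow.

    flows: iterable of (src_ip, src_port, dst_ip, dst_port, bytes) tuples.
    """
    totals = defaultdict(int)
    for src_ip, src_port, dst_ip, dst_port, nbytes in flows:
        key = (src_ip, normalize_port(src_port),
               dst_ip, normalize_port(dst_port))
        totals[key] += nbytes
    return dict(totals)

# A browser opening three connections to the same web server.
flows = [
    ("10.0.0.5", 40001, "192.0.2.10", 80, 1200),
    ("10.0.0.5", 40002, "192.0.2.10", 80, 800),
    ("10.0.0.5", 40003, "192.0.2.10", 80, 500),
]
collapsed = collapse_ports(flows)
```

  The three connections end up as a single "unprivileged -> 80" entry
carrying the summed byte count.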

> Mhh, ok, you are probably talking about reducing time resolution. This
> works great if you have mainly persistent flows. But if you have high flow
> fluctuation (in the extreme case every flow is unique), you don't reduce
> the data by reducing the time resolution. Reality is somewhere between and
> depends on the type of traffic of course. But I remember, that I was
> disappointed, when I had the idea to reduce data that way. I think it was
> like going from 5 min intervals to 1 hour intervals just reduced the
> number of flows by just the half. Not exciting.

  This is one of the things we do. More above and below.

> What works a lot better in general is removing the small flows. You can
> remove about 95% of the flows by aggregating only 5% of the small-flow
> traffic to one single flow (by my own observation). What you lose is the
> detailed information about the "noise", but for the normal network
> engineering this is probably not as important.

  Well, this is precisely what we do. This is called Pareto :) From data we
have collected at client sites, we can confirm that 95% of the traffic volume
is concentrated in ~5% of the flows (in our case already processed a bit).

  We configure a fixed number of entries that can enter the database as they
are, the biggest X flows (again, the prior computation is done using flow
stat, I think that is the command). For example, for a typical central office
(around 50 Mbps) we can get around 40,000 entries; of these we allow just
3,000 in as they are, and the rest are "processed".
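
  The "keep the biggest X as they are" step could be sketched like this in
Python. It is a simplified illustration, not what we run: entries are assumed
to be a dict mapping some key to a byte count, and split_top_n is a
hypothetical name.

```python
import heapq

def split_top_n(entries, n):
    """Split entries into the n biggest (kept untouched) and the rest.

    entries: dict mapping an entry key to its byte count.
    Returns (kept, rest) as two dicts.
    """
    top_keys = set(heapq.nlargest(n, entries, key=entries.get))
    kept = {k: v for k, v in entries.items() if k in top_keys}
    rest = {k: v for k, v in entries.items() if k not in top_keys}
    return kept, rest

entries = {"a": 9000, "b": 50, "c": 7000, "d": 10, "e": 30}
kept, rest = split_top_n(entries, 2)
```

  In our case n would be the configured limit (3,000 in the example above);
only the entries in rest go through the further reduction.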

  The remaining entries (remember they are already not direct flows) are then
further reduced: the Internet IP is converted to 0.0.0.0 and the entries are
aggregated. If the number of entries is still too big, we take the internal
IP and transform it into "its network" to reduce it even further. This way
you lose detail on the precise IP, but the application and network
information is still valid and 100% precise.
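
  The two reduction stages described above can be sketched as follows in
Python. Treat it as an illustration under assumptions: the key layout
(internal_ip, external_ip, port), the /24 prefix used for "its network", the
max_rows threshold and all function names are invented for this example.

```python
import ipaddress
from collections import defaultdict

def aggregate(entries, rewrite):
    """Rewrite every key and sum the bytes of entries that now collide."""
    totals = defaultdict(int)
    for key, nbytes in entries.items():
        totals[rewrite(key)] += nbytes
    return dict(totals)

def mask_external(key):
    """Stage 1: replace the Internet-side IP with 0.0.0.0."""
    internal_ip, _external_ip, port = key
    return (internal_ip, "0.0.0.0", port)

def to_network(key, prefix_len=24):
    """Stage 2: replace the internal IP with its network (assumed /24)."""
    internal_ip, external_ip, port = key
    net = ipaddress.ip_network(f"{internal_ip}/{prefix_len}", strict=False)
    return (str(net), external_ip, port)

def reduce_rest(rest, max_rows, prefix_len=24):
    """Apply stage 1 always, stage 2 only if there are still too many rows."""
    reduced = aggregate(rest, mask_external)
    if len(reduced) > max_rows:
        reduced = aggregate(reduced, lambda k: to_network(k, prefix_len))
    return reduced

rest = {
    ("10.1.1.5", "198.51.100.7", 80): 100,
    ("10.1.1.5", "203.0.113.9", 80): 200,
    ("10.1.1.8", "198.51.100.7", 80): 50,
    ("10.1.2.3", "203.0.113.9", 25): 10,
}
reduced = reduce_rest(rest, max_rows=2)
```

  The port (application) is never touched, which is why the per-application
and per-network totals stay exact even though the individual IPs are gone.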

  Of course, Paolo showed us some papers from ATT and you just sent us this
paper. They are probably much more "mathematically correct", but ours is not
working badly at this time.

> Two guys from the UCSD developed an algorithm with adaptive filters to
> identify the big flows in real time:
> http://www.cs.ucsd.edu/~cestan/papers/estan-elephantsandmice.pdf

  OK, we will give this a look. Thanks for the reference.

> But no matter what method you use to reduce the data, it should certainly
> not happen in the SQL plugin. Data reduction can also be useful for other
> exports, like Netflow.

  I think I said that. One thread to "receive data", maybe another to compute
the information, surely a different one for any kind of classification stuff
(I mean pattern matching and such). From there, a different process could
read the data and summarize it, and a different thread could store or send it
(be it to RAM, NetFlow, SQL, ...).

  So I agree with you that those "reduction" techniques are valid not only
for SQL but in general, but I also think that in the probe the standard
"sampling techniques" might suffice (I mean those already available in
pmacct, based on the ATT docs).

  We are really looking forward to porting our solution to pmacct so we can
really contribute to this great project. We especially like the fact that it
is gaining quite a bit of momentum, and it seems Paolo is going to get quite
a bit of help and ideas :)

  Regards

--------------------------------------------
Jaime Nebrera - [EMAIL PROTECTED]
Consultor TI - ENEO Tecnologia SL
Pol. PISA - C/ Manufactura 6, P1, 3B
Mairena del Aljarafe - 41927 - Sevilla
Telf.- (+34) 955 60 11 60 / 619 04 55 18

_______________________________________________
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists
