Hi again,

>Right. I forgot to mention the SQL Pre-processing layer, which importance is
>often underestimated - sql_preprocess directive in CONFIG-KEYS. Specifically,
>'fss' and 'fsrc' keys are those of interest here: both are based on papers
>coming out AT&T Labs about smart sampling of network data.

  Yep, we came up with ours from our own ideas, whereas you based yours on 
existing and much better validated research.

>Flow stitching - i think we can safely summarize the feature with this term.
>It's rather consolidated approach in the field of network security, pmacct
>is still missing it, but it has prime position in my todo list.

  Mmm, from a security point of view this might not be desirable, as you lose 
detail (for example, this might affect the chance of detecting P2P using L3/L4 
information). So in future work we plan to apply this only to tables that are 
not "close to real time", like, say, the second table you described in detail.

  One thing I forgot to mention, and which further reduces the number of rows, 
is to "consolidate" small flows as coming to or from "the internal network" 
instead of the precise IP inside that network. In this case you lose the 
details at the computer level but still have precise data at the network level. 
This is configurable and should not be used in the "first" table.
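
  To make the idea concrete, here is a minimal sketch in Python. It is not our 
actual code nor anything in pmacct: the network range, the byte threshold and 
the "internal" label are all made-up values for illustration.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# All three values below are hypothetical, chosen only for the example.
INTERNAL_NET = ip_network("192.168.0.0/16")
SMALL_FLOW_BYTES = 10_000
INTERNAL_LABEL = "internal"


def consolidate(rows):
    """Aggregate (src_ip, dst_ip, bytes) rows into {(src, dst): bytes}.

    For flows below the threshold, an endpoint inside INTERNAL_NET is
    replaced by a single label, so small flows from many internal hosts
    end up in the same row. Larger flows keep their exact addresses.
    """
    totals = defaultdict(int)
    for src, dst, nbytes in rows:
        if nbytes < SMALL_FLOW_BYTES:
            if ip_address(src) in INTERNAL_NET:
                src = INTERNAL_LABEL
            if ip_address(dst) in INTERNAL_NET:
                dst = INTERNAL_LABEL
        totals[(src, dst)] += nbytes
    return dict(totals)
```

  The byte total is the same before and after, only the per-host detail of 
the small flows is gone.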

  All of this is based on the Pareto principle: roughly 20% of the rows 
represent 80% of the traffic. In our experience, most of the time fewer than 
10% of the rows represent more than 90% of the traffic.
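
  If you want to check how concentrated your own data is, something like the 
following rough sketch would do. The function name and its parameters are 
mine, just for illustration; it takes the per-row byte counts and tells you 
what fraction of the rows, largest first, is needed to reach a given share of 
the traffic.

```python
def rows_needed_fraction(byte_counts, traffic_share=0.9):
    """Fraction of rows (largest first) needed to cover traffic_share of bytes."""
    ordered = sorted(byte_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    target = traffic_share * total
    running = 0
    for count, nbytes in enumerate(ordered, start=1):
        running += nbytes
        if running >= target:
            return count / len(ordered)
    # Only reached through floating point rounding when traffic_share is 1.0.
    return 1.0
```

  A result well below 0.1 for traffic_share=0.9 would match what we have seen.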

  The AT&T papers are very well thought out on a mathematical basis. They do 
lose information but are able to know the expected error. In our case, we 
decided to lose certain data but maintain higher-level precision: you lose the 
detail of the particular internal IP but keep the precision of the totals per 
network and application, and you lose detail on the exact "non privileged 
port" but the flow data is still consistent.
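
  The port side of this can be sketched the same way. Again this is only an 
illustration under my own assumptions: I take "non privileged" to mean ports 
above 1023 and use 0 as a made-up placeholder for "some high port".

```python
from collections import defaultdict

PRIVILEGED_MAX = 1023  # ports above this are treated as non privileged
HIGH_PORT = 0          # hypothetical placeholder for "any non privileged port"


def collapse_ports(rows):
    """Aggregate (src_port, dst_port, bytes) rows into {(src, dst): bytes}.

    Every non privileged port is replaced by the same placeholder, so
    rows that differ only in their ephemeral port are merged, while the
    well-known (application) port and the byte totals are kept.
    """
    totals = defaultdict(int)
    for src_port, dst_port, nbytes in rows:
        if src_port > PRIVILEGED_MAX:
            src_port = HIGH_PORT
        if dst_port > PRIVILEGED_MAX:
            dst_port = HIGH_PORT
        totals[(src_port, dst_port)] += nbytes
    return dict(totals)
```

  Note that a flow between two high ports collapses to a single generic row, 
which is exactly the kind of detail you need for P2P detection.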

  Also from our experience, and as part of the ideas for porting our system to 
pmacct, IMHO most of these "reductions" should be applied from the second 
precision table onward. That is, as data lasts longer you are quite interested 
in applying different techniques to reduce the number of rows, but for 
immediate data you won't want to lose any detail at all (to be able to use it 
with a correlation system, detect P2P, do security analysis, etc.). If that 
first table becomes too big for your environment and resources, you have to 
choose: improve the hardware or reduce the duration of that table (say, 1h 
instead of 4h).

  Regards

_______________________________________________
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists
