Hi again,

>Right. I forgot to mention the SQL Pre-processing layer, whose importance is
>often underestimated - sql_preprocess directive in CONFIG-KEYS. Specifically,
>'fss' and 'fsrc' keys are those of interest here: both are based on papers
>coming out of AT&T Labs about smart sampling of network data.

Yep, we built ours from our own ideas, whereas you based yours on existing and
much better validated research.

>Flow stitching - I think we can safely summarize the feature with this term.
>It's a rather consolidated approach in the field of network security, pmacct
>is still missing it, but it has prime position in my todo list.

Mmm, from a security point of view this might not be desirable, as you lose
detail (for example, this might affect the chance to detect P2P using L3/4
information). So in future work we plan to apply this only to tables that are
not "close to real time", like, say, the second table you use in detail.

One thing I forgot to mention, and that further reduces the number of rows, is
to "consolidate" small flows as coming to or from "the internal network"
instead of the precise IP inside that network. In this case you lose the
details at the computer level but still have precise data at the network
level. This is configurable and should not be used in the "first" table.

All of this is based on Pareto's idea that 20% of the rows represent 80% of
the traffic. In our experience, most of the time less than 10% of the rows
represent more than 90% of the traffic.

The AT&T papers are very well thought out from a mathematical standpoint. They
do lose information but are able to know the expected error. In our case, we
decided to lose certain data but maintain higher-level precision: you lose
the detail of the particular internal IP but maintain the precision of the
totals at the network and application level, and you lose the details of the
exact "non privileged port" but the flow data is still consistent.

Also from our experience, and as part of the ideas to port our system to
pmacct, IMHO most of these "reductions" should be applied from the second
precision table onward.
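
To illustrate the consolidation of small flows described above, here is a
minimal Python sketch. It is neither pmacct code nor the system discussed in
this thread: the 10.0.0.0/8 internal network, the 1000-byte threshold, the row
layout and the function name are all assumptions made up for the example.

```python
import ipaddress
from collections import defaultdict

# Hypothetical values, chosen only for illustration.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")
SMALL_FLOW_BYTES = 1000


def consolidate(rows, internal_net=INTERNAL_NET, threshold=SMALL_FLOW_BYTES):
    """Replace internal hosts of small flows with the internal network.

    rows: iterable of (src_ip, dst_ip, dst_port, bytes) tuples.
    Returns a dict mapping (src, dst, dst_port) -> total bytes.
    """
    def mask(ip, size):
        # Large flows and external addresses keep their exact IP.
        if size >= threshold or ipaddress.ip_address(ip) not in internal_net:
            return ip
        # Small flows lose the host detail but keep the network.
        return str(internal_net)

    totals = defaultdict(int)
    for src, dst, port, size in rows:
        totals[(mask(src, size), mask(dst, size), port)] += size
    return dict(totals)


rows = [
    ("10.0.0.5", "192.0.2.10", 80, 5_000_000),  # large flow, kept as is
    ("10.0.0.6", "192.0.2.10", 80, 300),        # small, merged
    ("10.0.0.7", "192.0.2.10", 80, 200),        # small, merged
    ("192.0.2.10", "10.0.0.8", 443, 150),       # small, inbound
]
result = consolidate(rows)
```

In this sketch the four input rows collapse to three, the per-host detail of
the small flows is gone, and the byte total over the whole table is unchanged.
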
As data ages, you become more interested in applying different techniques to
reduce the number of rows, but for immediate data you won't want to lose any
detail at all (to be able to use it with a correlation system, detect P2P,
security, etc). If that first table becomes too big for your environment and
resources, you have to choose: improve the hardware or reduce the duration of
that table (say, 1h instead of 4h).

Regards
_______________________________________________
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists
