Mark,
> On 20 Dec 2017, at 13:25, Mark Petronic <[email protected]> wrote:
>
> I am running with nprobe 8.2 in collector mode. I am currently designing a
> collection infrastructure so I want to try to understand what nprobe is doing
> internally so as to better understand how data is being processed. I have a
> number of questions in regard to this. I have read the latest version of the
> user guide PDF but still have some questions. I tried to organize my
> questions in blocks to hopefully allow for easier commenting on each
> question. This is fairly long but I figured asking this all together, in
> context, would be better. Thanks in advance to whoever takes this on - I
> really appreciate it. :)
>
> Is there any detailed documentation on what is going on internally with
> nprobe? In particular, I am using it as a collector to forward UDP netflow v9
> from our Cisco routers to Kafka. I am particularly interested in
> understanding some of these stats and what they "infer" is happening under
> the hood:
>
> 19/Dec/2017 13:36:09 [nprobe.c:3202] Average traffic: [0.00 pps][All Traffic 0 b/sec][IP Traffic 0 b/sec][ratio -nan]
> 19/Dec/2017 13:36:09 [nprobe.c:3210] Current traffic: [0.00 pps][0 b/sec]
> 19/Dec/2017 13:36:09 [nprobe.c:3216] Current flow export rate: [1818.5 flows/sec]
> 19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: [export queue too long=0][too many flows=0][ELK queue flow drops=0]
> 19/Dec/2017 13:36:09 [nprobe.c:3224] Export Queue: 0/512000 [0.0 %]
> 19/Dec/2017 13:36:09 [nprobe.c:3229] Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
> 19/Dec/2017 13:36:09 [nprobe.c:3235] Kafka [flows exported=366299/1818.5 flows/sec][msgs sent=366299/1.0 flows/msg][send errors=0]
> 19/Dec/2017 13:36:09 [nprobe.c:3260] Collector Threads: [167203 pkts@0]
> 19/Dec/2017 13:36:09 [nprobe.c:3052] Processed packets: 0 (max bucket search: 8)
> 19/Dec/2017 13:36:09 [nprobe.c:3035] Fragment queue length: 0
> 19/Dec/2017 13:36:09 [nprobe.c:3061] Flow export stats: [0 bytes/0 pkts][0 flows/0 pkts sent]
> 19/Dec/2017 13:36:09 [nprobe.c:3068] Flow collection: [collected pkts: 167203][processed flows: 4561802]
> 19/Dec/2017 13:36:09 [nprobe.c:3071] Flow drop stats: [0 bytes/0 pkts][0 flows]
> 19/Dec/2017 13:36:09 [nprobe.c:3076] Total flow stats: [0 bytes/0 pkts][0 flows/0 pkts sent]
> 19/Dec/2017 13:36:09 [nprobe.c:3087] Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
>
> For these two stats:
>
> Flow collection: [collected pkts: 167203][processed flows: 4561802]
> Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
>
> I am thinking they mean that 167203 UDP packets were received from routers
> comprising a total of 4561802 individual flow records. However, I see only
> 366299 flows exported to Kafka. So, am I correct in assuming that nprobe is
> doing some internal aggregation of flow records that is essentially squashing
> the 4561802 received flow records into 366299 aggregates?

Yes, it does some internal aggregation based on the 5-tuple and the configured timeouts (see --lifetime-timeout and --idle-timeout). If you want to disable internal aggregation altogether, use the --disable-cache option.

>
> A follow on question to this, then, is related to:
>
> Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
>
> What are these and how are they utilized? Again, I am assuming these are hash
> buckets used for internal aggregation per the user's guide.

Correct.

> I have seen a warning indicating that the allotment of these buckets is too
> small and to expect drops. So, my guess is, based on flows/sec ingested,
> these have to be sized appropriately to support the flow volume. Is that a
> correct assumption?

Correct. Increase it with --hash-size if necessary.

>
> I also notice that, when I start up nprobe in collector mode publishing to
> Kafka, it takes about 30 or more seconds before any flows actually are
> published to Kafka.
> This leads me to believe internal aggregations are occurring that are
> delaying publishing of data.

Yes, this depends on the configured timeouts, so you should use --disable-cache if you want nProbe to simply act as a transparent proxy.

> If I crank up the --verbose to 2, I can see UDP packets being processed and
> then, after some time, I start to see log messages indicating flows are being
> exported to Kafka. It is not as much the latency issue I am concerned with
> here but rather just understanding what is happening so that I can properly
> monitor and configure/size the system.
>
> Do these parameters impact the utilization of the flow buckets in collector
> mode or just when running in sniffer mode? I ask because I know the routers
> are already doing aggregations, meaning they accumulate counts for flows over
> time before emitting a flow record that is active. Does it mean that nprobe
> is then doing the same thing again for these flows and essentially
> aggregating already aggregated flow records coming from my routers?
>
> [--lifetime-timeout|-t] <timeout>  It specifies the maximum (seconds) flow lifetime [default=120]
> [--idle-timeout|-d] <timeout>      It specifies the maximum (seconds) flow idle lifetime [default=30]
> [--queue-timeout|-l] <timeout>     It specifies how long expired flows (queued before delivery) are emitted [default=30]
>
>
> Also, based on the assumption of aggregating already aggregated data and the
> type of traffic on the network I am monitoring (lots of short-lived
> transactions, like credit card swipe processing by vendors and DNS lookups),
> does it even make sense to have nprobe aggregating this traffic that I know
> is NOT going to consist of more than one flow record anyway?
>
> The user document does not mention anything about monitoring nprobe
> programmatically. What is the best way to monitor nprobe for internal packet
> drops?
> I can get various OS stats from /proc/xxx, like UDP queue size, drops,
> etc, but I need nprobe internal stats to round out the picture. I see that
> there is information like this on stdout:
>
> Flow drops: [export queue too long=0][too many flows=0][ELK queue flow drops=0]

Use -b=1 to have traffic stats periodically output to the terminal or log file.

>
> However, I want to monitor my nprobe instances with Nagios and generate
> alerts on threshold checks as well as track utilization over time by posting
> periodic stats to our InfluxDB/Grafana setup. Is there some way (other than
> parsing stdout in a log) to gain programmatic access to these stats for
> monitoring tools to use?
>
> Regarding Kafka, the producer has many configuration options but only very
> few are exposed for configuration in nprobe. Let me ask these one by one:
>
> batch.size, linger.ms, buffer.memory - These are essential to controlling
> batching in Kafka. nprobe has options --kafka-enable-batch and
> --kafka-batch-len. However, these end up wrapping N messages into a JSON
> array of size N and publishing that to Kafka. I feel this is a wrong
> approach. Consider the downstream Kafka consumer. It expects to receive a
> series of messages off a topic. The format of those messages should not
> change due to batching. When batching is not enabled in nprobe, the consumer
> sees a series of JSON dictionaries - each a single flow record. When
> batching is enabled, the consumer now sees a series of JSON arrays, each
> with N JSON dictionaries. IMO, the proper way to do this is to use the Kafka
> configuration values to control batching. In that case, the producer simply
> queues up messages (each a dictionary) and, when configured thresholds are
> met, emits those messages. This results in a batch of dictionaries being sent
> and the consumer ONLY sees dictionaries.
> Changing the message structure due to batching complicates things for
> consumers and is not a typical pattern in Kafka processing.

Our experiments showed a 4x speedup with the implemented batching.

> Options topic - Your documentation does not even mention this (nprobe --help
> does) but I don't understand what it means. What is a Kafka options topic?

It contains messages for the NetFlow option templates.

> Partitioning - If we want to perform stream processing of netflow data, then
> we want to ensure that all flow records from a given n-tuple are placed on
> the same Kafka partition. We need to partition the data because it is the
> only way to scale consumers in Kafka. If I want to perform some aggregations
> on the data stream then I have to be sure that all netflow records for a
> given conversation, for example, are on the same topic partition. A simple
> example that will make that happen would be to use the IPV4_SRC_ADDR field of
> the flow record as the partition key. Or, maybe an N-tuple of (IPV4_SRC_ADDR,
> IPV4_DST_ADDR, L4_SRC_PORT, L4_DST_PORT) as the partition key. In Java, a
> producer would do this by hashing the string that comprises the partition key
> desired then doing a hash % num-partitions to figure out the partition to
> send the message on. I am guessing that nprobe relies on the default
> partitioning scheme in the producer which is a simple round-robin approach
> based on the number of partitions that exist for the topic being used. This,
> however, would randomly distribute flow records for a given conversation
> across multiple partitions and, therefore, across multiple consumers in a
> downstream consumer group. That would break the aggregations. So, my request
> is that you consider allowing a configuration option that enables the user to
> define the partition key. This might be done, for example, by allowing the
> user to define a CSV list of template fields to use to form the partition key
> string.
> You could just concatenate them together and hash that value then modulo
> divide by the number of partitions for the topic being used and use that to
> enable the producer to publish on the appropriate topic partition. This
> gives the user the freedom to define the partition key while making the
> implementation in nprobe fairly generic. Maybe this could also be done via
> some sort of "partition plugin" to make it even more extensible? Have you
> considered any such capability? Without such a capability, we will have to
> initially publish all flows on, say, a "netflow-raw" topic (using
> round-robin) then consume this topic in a consumer group only to republish it
> by repartitioning it (as described above using some N-tuple of fields) only
> to then be consumed by another consumer group who will be doing the
> aggregations and enrichments needed. Sure, we can make it work but
> partitioning should "really" be done at the source. The approach I just
> described necessarily doubles our broker traffic which I would not like to
> have to do.

I see. Being able to control which data ends up in which shard will give a lot more flexibility. Currently this feature is not implemented, but if you contact us privately we can try to work together toward a more controllable hashing of the flows for Kafka.

Simone

> Producer Options in General - Why not just make them all configurable? For
> example, allow the user to define a name=value config file using any
> supported producer configuration options and provide the path to the file as
> an nprobe Kafka configuration option. Then, when you instantiate the producer
> in nprobe, read in those configuration values and pass them into the
> producer. This gives the users access to all options available and not just
> the current topic, acks, and compression values.
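
Coming back to the partition key for a moment: to make sure we mean the same thing, here is a rough sketch of the scheme described above (join the chosen template fields, hash the string, take it modulo the number of partitions). This is only an illustration and not anything nProbe does today. The function name, the default field list and the "|" separator are made up for the example, and CRC32 is just a convenient stable hash, not necessarily what any Kafka client uses internally.

```python
import zlib

# Hypothetical default key: the 4-tuple mentioned in the discussion above.
KEY_FIELDS = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "L4_DST_PORT"]


def partition_for(flow, num_partitions, key_fields=KEY_FIELDS):
    """Map a flow record (a dict) to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Concatenate the selected fields; a missing field contributes an empty string.
    key = "|".join(str(flow.get(field, "")) for field in key_fields)
    # CRC32 gives the same value in every process, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_partitions


flow = {
    "IPV4_SRC_ADDR": "10.0.0.1",
    "IPV4_DST_ADDR": "10.0.0.2",
    "L4_SRC_PORT": 40000,
    "L4_DST_PORT": 53,
    "IN_BYTES": 120,
}
print(partition_for(flow, 8))
```

Note that with this key the two directions of a conversation hash differently, since source and destination are swapped. If both directions must land on the same partition, the endpoints would have to be put in a canonical order before joining.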
>
> Miscellaneous Notes:
> The v8.1 users guide lists "New Options --kafka-enable-batch and
> --kafka-batch-len to batch flow export to kafka" but does not provide any
> detailed documentation on these. Looks like someone forgot to add the
> description of these later in the document.
> nprobe --help shows this under the Kafka options: "<options topic> Flow
> options topic" but the v8.1 user's guide makes no mention of it. I have no
> idea what an options topic is.
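
On the Nagios question, one stopgap is to read the periodic stats from the log and turn the "Flow drops" line into a check result. A minimal sketch follows, assuming the line looks exactly like the one quoted above. The function names and the threshold parameter are invented for the example. The return codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Matches the bracketed counters that follow "Flow drops:" in the stats output.
FLOW_DROPS_RE = re.compile(r"Flow drops:\s*((?:\[[^\]=]+=\d+\])+)")
COUNTER_RE = re.compile(r"\[([^\]=]+)=(\d+)\]")


def parse_flow_drops(line):
    """Return {counter name: value} for a 'Flow drops' line, or None otherwise."""
    match = FLOW_DROPS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in COUNTER_RE.findall(match.group(1))}


def check_flow_drops(line, max_drops=0):
    """Return a Nagios-style (exit_code, message) pair for one log line."""
    counters = parse_flow_drops(line)
    if counters is None:
        return 3, "UNKNOWN - no flow drop counters found"
    total = sum(counters.values())
    detail = ", ".join(f"{name}={value}" for name, value in counters.items())
    if total > max_drops:
        return 2, f"CRITICAL - {total} flow drops ({detail})"
    return 0, f"OK - {total} flow drops ({detail})"


sample = ("19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: "
          "[export queue too long=0][too many flows=0][ELK queue flow drops=0]")
print(check_flow_drops(sample))
```

The sketch treats the counters as plain numbers. To alert on a rate rather than a total, keep the previous sample and compare the difference.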
_______________________________________________ Ntop-misc mailing list [email protected] http://listgateway.unipi.it/mailman/listinfo/ntop-misc
