Thank you for all the responses and suggestions. As you requested, I will open tracking tickets for the issues we discussed. Again, I very much appreciate the timely and thorough responses - they are very helpful.
On Mon, Jan 1, 2018 at 12:28 PM, Luca Deri <[email protected]> wrote:

> Hi Mark,
> sorry for the late reply but we've been on vacation lately.
>
> Please see below.
>
>> On 20 Dec 2017, at 13:25, Mark Petronic <[email protected]> wrote:
>>
>> I am running nprobe 8.2 in collector mode. I am currently designing a
>> collection infrastructure, so I want to try to understand what nprobe is
>> doing internally, to better understand how data is being processed. I
>> have a number of questions in this regard. I have read the latest
>> version of the user guide PDF but still have some questions. I tried to
>> organize my questions in blocks to hopefully allow for easier commenting
>> on each question. This is fairly long, but I figured asking this all
>> together, in context, would be better. Thanks in advance to whoever
>> takes this on - I really appreciate it. :)
>>
>> Is there any detailed documentation on what is going on internally in
>> nprobe? In particular, I am using it as a collector to forward UDP
>> NetFlow v9 from our Cisco routers to Kafka.
>> I am particularly interested in understanding some of these stats and
>> what they imply is happening under the hood:
>>
>> 19/Dec/2017 13:36:09 [nprobe.c:3202] Average traffic: [0.00 pps][All Traffic 0 b/sec][IP Traffic 0 b/sec][ratio -nan]
>> 19/Dec/2017 13:36:09 [nprobe.c:3210] Current traffic: [0.00 pps][0 b/sec]
>> 19/Dec/2017 13:36:09 [nprobe.c:3216] Current flow export rate: [1818.5 flows/sec]
>> 19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: [export queue too long=0][too many flows=0][ELK queue flow drops=0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3224] Export Queue: 0/512000 [0.0 %]
>> 19/Dec/2017 13:36:09 [nprobe.c:3229] Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3235] Kafka [flows exported=366299/1818.5 flows/sec][msgs sent=366299/1.0 flows/msg][send errors=0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3260] Collector Threads: [167203 pkts@0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3052] Processed packets: 0 (max bucket search: 8)
>> 19/Dec/2017 13:36:09 [nprobe.c:3035] Fragment queue length: 0
>> 19/Dec/2017 13:36:09 [nprobe.c:3061] Flow export stats: [0 bytes/0 pkts][0 flows/0 pkts sent]
>> 19/Dec/2017 13:36:09 [nprobe.c:3068] Flow collection: [collected pkts: 167203][processed flows: 4561802]
>> 19/Dec/2017 13:36:09 [nprobe.c:3071] Flow drop stats: [0 bytes/0 pkts][0 flows]
>> 19/Dec/2017 13:36:09 [nprobe.c:3076] Total flow stats: [0 bytes/0 pkts][0 flows/0 pkts sent]
>> 19/Dec/2017 13:36:09 [nprobe.c:3087] Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
>>
>> For these two stats:
>>
>> Flow collection: [collected pkts: 167203][processed flows: 4561802]
>> Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
>>
>> I am thinking they mean that 167203 UDP packets were received from the
>> routers, comprising a total of 4561802 individual flow records. However,
>> I see only 366299 flows exported to Kafka.
>> So, am I correct in assuming that nprobe is doing some internal
>> aggregation of flow records that is essentially squashing the 4561802
>> received flow records into 366299 aggregates?
>
> Yes, your assumption is correct. If you want to avoid that, please use
> --disable-cache.
>
>> A follow-on question to this, then, is related to:
>>
>> Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
>>
>> What are these and how are they utilized? Again, I am assuming these are
>> hash buckets used for internal aggregation, per the user's guide. I have
>> seen warnings indicating that the allotment of these buckets is too
>> small and to expect drops. So, my guess is that, based on flows/sec
>> ingested, these have to be sized appropriately to support the flow
>> volume. Is that a correct assumption?
>
> When you see these messages we need to investigate. This happens, for
> instance, when too many flows fall into the same hash bucket. Enlarging
> the hash (-w) can help if it is too small compared to the number of
> collected flows, but to reply in more detail I need some extra context.
>
>> I also notice that, when I start up nprobe in collector mode publishing
>> to Kafka, it takes about 30 or more seconds before any flows are
>> actually published to Kafka. This leads me to believe internal
>> aggregations are occurring that delay publishing of the data. If I crank
>> --verbose up to 2, I can see UDP packets being processed and then, after
>> some time, I start to see log messages indicating flows are being
>> exported to Kafka. It is not so much the latency I am concerned with
>> here, but rather just understanding what is happening so that I can
>> properly monitor and configure/size the system.
>
> Yes, correct. By default flows are aggregated in the cache and, as you
> write below, the minimum timeout is 30 sec.
>
>> Do these parameters impact the utilization of the flow buckets in
>> collector mode, or just when running in sniffer mode?
>> I ask because I know the routers are already doing aggregation, meaning
>> they accumulate counts for active flows over time before emitting a flow
>> record. Does that mean nprobe is then doing the same thing again for
>> these flows, essentially aggregating already-aggregated flow records
>> coming from my routers?
>>
>> [--lifetime-timeout|-t] <timeout>  It specifies the maximum (seconds) flow lifetime [default=120]
>> [--idle-timeout|-d] <timeout>      It specifies the maximum (seconds) flow idle lifetime [default=30]
>> [--queue-timeout|-l] <timeout>     It specifies how long expired flows (queued before delivery) are emitted [default=30]
>
> They affect the cache regardless of the mode (collector or probe). As you
> use the cache (unless --disable-cache is used), these defaults also apply
> to you.
>
>> Also, based on the assumption of aggregating already-aggregated data,
>> and the type of traffic on the network I am monitoring (lots of
>> short-lived transactions, like credit card swipe processing by vendors
>> and DNS lookups), does it even make sense to have nprobe aggregating
>> this traffic, which I know is NOT going to consist of more than one flow
>> record anyway?
>
> The answer depends on the environment you are monitoring.
>
>> The user guide does not mention anything about monitoring nprobe
>> programmatically. What is the best way to monitor nprobe for internal
>> packet drops? I can get various OS stats from /proc/xxx, like UDP queue
>> size, drops, etc., but I need nprobe internal stats to round out the
>> picture. I see that there is information like this on stdout:
>>
>> Flow drops: [export queue too long=0][too many flows=0][ELK queue flow drops=0]
>>
>> However, I want to monitor my nprobe instances with Nagios and generate
>> alerts on threshold checks, as well as track utilization over time by
>> posting periodic stats to our InfluxDB/Grafana setup.
>> Is there some way (other than parsing stdout in a log) to gain
>> programmatic access to these stats for monitoring tools to use?
>
> Nobody has asked for this before, so in short no API is available.
> Instead, people use --dump-stats to generate dumps, or the /proc stats.
> If they are not enough, please file a ticket on
> https://github.com/ntop/nProbe/issues and explain what you need. Please
> open one ticket per request.
>
>> Regarding Kafka, the producer has many configuration options but only
>> very few are exposed for configuration in nprobe. Let me ask about these
>> one by one:
>>
>> 1. batch.size, linger.ms, buffer.memory - These are essential to
>> controlling batching in Kafka. nprobe has the options
>> --kafka-enable-batch and --kafka-batch-len. However, these end up
>> wrapping N messages into a JSON array of size N and publishing that to
>> Kafka. I feel this is the wrong approach. Consider the downstream Kafka
>> consumer. It expects to receive a series of messages off a topic. The
>> format of those messages should not change due to batching. When
>> batching is not enabled in nprobe, the consumer sees a series of JSON
>> dictionaries - each a single flow record. When batching is enabled, the
>> consumer now sees a series of JSON arrays, each with N JSON
>> dictionaries. IMO, the proper way to do this is to use the Kafka
>> configuration values to control batching. In that case, the producer
>> simply queues up messages (each a dictionary) and, when the configured
>> thresholds are met, emits those messages. This results in a batch of
>> dictionaries being sent, and the consumer ONLY sees dictionaries.
>> Changing the message structure due to batching complicates things for
>> consumers and is not a typical pattern in Kafka processing.
>> 2. Options topic - Your documentation does not even mention this (nprobe
>> --help does), and I don't understand what it means. What is a Kafka
>> options topic?
>> 3.
>> Partitioning - If we want to perform stream processing of NetFlow data,
>> then we want to ensure that all flow records from a given n-tuple are
>> placed on the same Kafka partition. We need to partition the data
>> because it is the only way to scale consumers in Kafka. If I want to
>> perform some aggregations on the data stream, then I have to be sure
>> that all NetFlow records for a given conversation, for example, are on
>> the same topic partition. A simple example that would make that happen
>> is to use the IPV4_SRC_ADDR field of the flow record as the partition
>> key. Or, maybe an n-tuple of (IPV4_SRC_ADDR, IPV4_DST_ADDR, L4_SRC_PORT,
>> L4_DST_PORT) as the partition key. In Java, a producer would do this by
>> hashing the string that comprises the desired partition key, then
>> computing hash % num-partitions to figure out the partition to send the
>> message on. I am guessing that nprobe relies on the default partitioning
>> scheme in the producer, which is a simple round-robin approach based on
>> the number of partitions that exist for the topic being used. This,
>> however, would randomly distribute flow records for a given conversation
>> across multiple partitions and, therefore, across multiple consumers in
>> a downstream consumer group. That would break the aggregations. So, my
>> request is that you consider adding a configuration option that enables
>> the user to define the partition key. This might be done, for example,
>> by allowing the user to define a CSV list of template fields to use to
>> form the partition key string. You could just concatenate them together,
>> hash that value, take it modulo the number of partitions for the topic
>> being used, and use that to enable the producer to publish on the
>> appropriate topic partition. This gives the user the freedom to define
>> the partition key while keeping the implementation in nprobe fairly
>> generic. Maybe this could also be done via some sort of "partition
>> plugin" to make it even more extensible?
>> Have you considered any such capability? Without it, we will have to
>> initially publish all flows on, say, a "netflow-raw" topic (using
>> round-robin), then consume that topic in a consumer group only to
>> republish it, repartitioned (as described above, using some n-tuple of
>> fields), only for it then to be consumed by another consumer group that
>> will do the aggregations and enrichments needed. Sure, we can make it
>> work, but partitioning should "really" be done at the source. The
>> approach I just described necessarily doubles our broker traffic, which
>> I would not like to have to do.
>> 4. Producer Options in General - Why not just make them all
>> configurable? For example, allow the user to define a name=value config
>> file using any supported producer configuration options and provide the
>> path to the file as an nprobe Kafka configuration option. Then, when you
>> instantiate the producer in nprobe, read in those configuration values
>> and pass them into the producer. This gives users access to all
>> available options and not just the current topic, acks, and compression
>> values.
>>
>> Miscellaneous Notes:
>>
>> 1. The v8.1 user's guide lists "New Options --kafka-enable-batch and
>> --kafka-batch-len to batch flow export to kafka" but does not provide
>> any detailed documentation on these. It looks like someone forgot to add
>> the description of these later in the document.
>> 2. nprobe --help shows this under the Kafka options: "<options topic>
>> Flow options topic", but the v8.1 user's guide makes no mention of it. I
>> have no idea what an options topic is.
>
> As for the above notes on Kafka, I will let my colleague Simone answer
> you, as he is the Kafka expert on our team.
>
> Simone, can you please answer Mark and, if there are changes to be made
> (I think so, from what I understand), file individual tickets?
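[For reference, the partition-key scheme proposed in point 3 above can be sketched as follows. This is an illustrative sketch, not nprobe code: the field names come from the flow records quoted in this thread, and CRC32 stands in for whatever stable hash an implementation would actually choose.]

```python
import zlib

# Sketch of the proposed scheme: concatenate a configurable CSV list of
# template fields into a key string, hash it, and take the hash modulo
# the number of partitions for the topic.

def partition_for(flow, key_fields, num_partitions):
    """Return the Kafka partition index for one flow record (a dict)."""
    key = "|".join(str(flow[f]) for f in key_fields)
    # crc32 is used here only because it is stable across processes
    # (Python's built-in hash() is salted); any stable hash would do.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical flow record using the field names from this thread.
flow = {
    "IPV4_SRC_ADDR": "10.0.0.1",
    "IPV4_DST_ADDR": "192.168.1.5",
    "L4_SRC_PORT": 49152,
    "L4_DST_PORT": 53,
}
key_fields = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "L4_DST_PORT"]
p = partition_for(flow, key_fields, num_partitions=8)
```

[Because the hash is deterministic, every record of the same conversation lands on the same partition, so each consumer in a downstream group sees complete conversations.]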
> Thanks, Luca
>
> _______________________________________________
> Ntop-misc mailing list
> [email protected]
> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
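[As a footnote to point 4 above: the name=value config file proposed there is essentially a Java-style properties file. A minimal sketch of reading one, assuming the parsed dict would then be handed to the producer; the option names shown (batch.size, linger.ms, buffer.memory) are the ones mentioned earlier in the thread.]

```python
import tempfile

def load_producer_config(path):
    """Parse a name=value (Java properties style) file into a dict,
    ignoring blank lines and '#' comments."""
    config = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            name, sep, value = line.partition("=")
            if sep:  # skip malformed lines that have no '='
                config[name.strip()] = value.strip()
    return config

# Example: a hypothetical config file exposing producer options that
# nprobe does not currently surface on the command line.
with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
    f.write("# producer tuning\n")
    f.write("batch.size=65536\n")
    f.write("linger.ms=50\n")
    f.write("buffer.memory=33554432\n")
    path = f.name

config = load_producer_config(path)
```

[The appeal of this design is that nprobe would not need to know about individual producer options at all; any option the underlying Kafka client supports would pass straight through.]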
_______________________________________________ Ntop-misc mailing list [email protected] http://listgateway.unipi.it/mailman/listinfo/ntop-misc
