Thank you for all the responses and suggestions. As you requested, I will open tracking tickets for the issues we discussed. Again, I very much appreciate the timely and thorough responses - they are very helpful.
On Mon, Jan 1, 2018 at 12:28 PM, Luca Deri <[email protected]> wrote:

> Hi Mark,
> sorry for the late reply but we've been on vacation lately.
>
> Please see below.
>
>> On 20 Dec 2017, at 13:25, Mark Petronic <[email protected]> wrote:
>>
>> I am running nprobe 8.2 in collector mode. I am currently designing a
>> collection infrastructure, so I want to try to understand what nprobe is
>> doing internally, to better understand how data is being processed. I
>> have a number of questions in this regard. I have read the latest
>> version of the user guide PDF but still have some questions. I tried to
>> organize my questions in blocks to hopefully allow for easier commenting
>> on each question. This is fairly long, but I figured asking this all
>> together, in context, would be better. Thanks in advance to whoever
>> takes this on - I really appreciate it. :)
>>
>> Is there any detailed documentation on what is going on internally in
>> nprobe? In particular, I am using it as a collector to forward UDP
>> NetFlow v9 from our Cisco routers to Kafka.
>> I am particularly interested in understanding some of these stats and
>> what they imply is happening under the hood:
>>
>> 19/Dec/2017 13:36:09 [nprobe.c:3202] Average traffic: [0.00 pps][All Traffic 0 b/sec][IP Traffic 0 b/sec][ratio -nan]
>> 19/Dec/2017 13:36:09 [nprobe.c:3210] Current traffic: [0.00 pps][0 b/sec]
>> 19/Dec/2017 13:36:09 [nprobe.c:3216] Current flow export rate: [1818.5 flows/sec]
>> 19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: [export queue too long=0][too many flows=0][ELK queue flow drops=0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3224] Export Queue: 0/512000 [0.0 %]
>> 19/Dec/2017 13:36:09 [nprobe.c:3229] Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3235] Kafka [flows exported=366299/1818.5 flows/sec][msgs sent=366299/1.0 flows/msg][send errors=0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3260] Collector Threads: [167203 pkts@0]
>> 19/Dec/2017 13:36:09 [nprobe.c:3052] Processed packets: 0 (max bucket search: 8)
>> 19/Dec/2017 13:36:09 [nprobe.c:3035] Fragment queue length: 0
>> 19/Dec/2017 13:36:09 [nprobe.c:3061] Flow export stats: [0 bytes/0 pkts][0 flows/0 pkts sent]
>> 19/Dec/2017 13:36:09 [nprobe.c:3068] Flow collection: [collected pkts: 167203][processed flows: 4561802]
>> 19/Dec/2017 13:36:09 [nprobe.c:3071] Flow drop stats: [0 bytes/0 pkts][0 flows]
>> 19/Dec/2017 13:36:09 [nprobe.c:3076] Total flow stats: [0 bytes/0 pkts][0 flows/0 pkts sent]
>> 19/Dec/2017 13:36:09 [nprobe.c:3087] Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
>>
>> For these two stats:
>>
>> Flow collection: [collected pkts: 167203][processed flows: 4561802]
>> Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
>>
>> I am thinking they mean that 167203 UDP packets were received from the
>> routers, comprising a total of 4561802 individual flow records. However,
>> I see only 366299 flows exported to Kafka.
>> So, am I correct in assuming that nprobe is doing some internal
>> aggregation of flow records that is essentially squashing the 4561802
>> received flow records into 366299 aggregates?
>
> Yes, your assumption is correct. If you want to avoid that, please use
> --disable-cache.
>
>> A follow-on question to this, then, is related to:
>>
>> Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
>>
>> What are these and how are they utilized? Again, I am assuming these are
>> hash buckets used for internal aggregation, per the user's guide. I have
>> seen warnings indicating that the allotment of these buckets is too
>> small and to expect drops. So, my guess is that, based on flows/sec
>> ingested, these have to be sized appropriately to support the flow
>> volume. Is that a correct assumption?
>
> When you see these messages we need to investigate. This happens, for
> instance, when too many flows fall into the same hash bucket. Enlarging
> the hash (-w) can help if it is too small compared to the number of
> collected flows, but to reply in more detail I need some extra context.
>
>> I also notice that, when I start up nprobe in collector mode publishing
>> to Kafka, it takes about 30 or more seconds before any flows are
>> actually published to Kafka. This leads me to believe internal
>> aggregations are occurring that delay publishing of the data. If I crank
>> --verbose up to 2, I can see UDP packets being processed and then, after
>> some time, I start to see log messages indicating flows are being
>> exported to Kafka. It is not so much the latency I am concerned with
>> here, but rather just understanding what is happening so that I can
>> properly monitor and configure/size the system.
>
> Yes, correct. By default flows are aggregated in the cache and, as you
> write below, the minimum timeout is 30 sec.
>
>> Do these parameters impact the utilization of the flow buckets in
>> collector mode, or just when running in sniffer mode?
>> I ask because I know the routers are already doing aggregation, meaning
>> they accumulate counts for active flows over time before emitting a flow
>> record. Does that mean nprobe is then doing the same thing again for
>> these flows, essentially aggregating already-aggregated flow records
>> coming from my routers?
>>
>> [--lifetime-timeout|-t] <timeout>  It specifies the maximum (seconds) flow lifetime [default=120]
>> [--idle-timeout|-d] <timeout>      It specifies the maximum (seconds) flow idle lifetime [default=30]
>> [--queue-timeout|-l] <timeout>     It specifies how long expired flows (queued before delivery) are emitted [default=30]
>
> They affect the cache regardless of the mode (collector or probe). As you
> use the cache (unless --disable-cache is used), these defaults also apply
> to you.
>
>> Also, based on the assumption of aggregating already-aggregated data,
>> and the type of traffic on the network I am monitoring (lots of
>> short-lived transactions, like credit card swipe processing by vendors
>> and DNS lookups), does it even make sense to have nprobe aggregating
>> this traffic, which I know is NOT going to consist of more than one flow
>> record anyway?
>
> The answer depends on the environment you are monitoring.
>
>> The user guide does not mention anything about monitoring nprobe
>> programmatically. What is the best way to monitor nprobe for internal
>> packet drops? I can get various OS stats from /proc/xxx, like UDP queue
>> size, drops, etc., but I need nprobe internal stats to round out the
>> picture. I see that there is information like this on stdout:
>>
>> Flow drops: [export queue too long=0][too many flows=0][ELK queue flow drops=0]
>>
>> However, I want to monitor my nprobe instances with Nagios and generate
>> alerts on threshold checks, as well as track utilization over time by
>> posting periodic stats to our InfluxDB/Grafana setup.
>> Is there some way (other than parsing stdout in a log) to gain
>> programmatic access to these stats for monitoring tools to use?
>
> Nobody has asked for this before, so in short no API is available.
> Instead, people use --dump-stats to generate dumps, or the /proc stats.
> If they are not enough, please file a ticket on
> https://github.com/ntop/nProbe/issues and explain what you need. Please
> open one ticket per request.
>
>> Regarding Kafka, the producer has many configuration options but only
>> very few are exposed for configuration in nprobe. Let me ask about these
>> one by one:
>>
>> 1. batch.size, linger.ms, buffer.memory - These are essential to
>> controlling batching in Kafka. nprobe has the options
>> --kafka-enable-batch and --kafka-batch-len. However, these end up
>> wrapping N messages into a JSON array of size N and publishing that to
>> Kafka. I feel this is the wrong approach. Consider the downstream Kafka
>> consumer. It expects to receive a series of messages off a topic. The
>> format of those messages should not change due to batching. When
>> batching is not enabled in nprobe, the consumer sees a series of JSON
>> dictionaries - each a single flow record. When batching is enabled, the
>> consumer now sees a series of JSON arrays, each with N JSON
>> dictionaries. IMO, the proper way to do this is to use the Kafka
>> configuration values to control batching. In that case, the producer
>> simply queues up messages (each a dictionary) and, when the configured
>> thresholds are met, emits those messages. This results in a batch of
>> dictionaries being sent, and the consumer ONLY sees dictionaries.
>> Changing the message structure due to batching complicates things for
>> consumers and is not a typical pattern in Kafka processing.
>> 2. Options topic - Your documentation does not even mention this (nprobe
>> --help does), and I don't understand what it means. What is a Kafka
>> options topic?
>> 3.
>> Partitioning - If we want to perform stream processing of NetFlow data,
>> then we want to ensure that all flow records from a given n-tuple are
>> placed on the same Kafka partition. We need to partition the data
>> because it is the only way to scale consumers in Kafka. If I want to
>> perform some aggregations on the data stream, then I have to be sure
>> that all NetFlow records for a given conversation, for example, are on
>> the same topic partition. A simple example that would make that happen
>> is to use the IPV4_SRC_ADDR field of the flow record as the partition
>> key. Or, maybe an n-tuple of (IPV4_SRC_ADDR, IPV4_DST_ADDR, L4_SRC_PORT,
>> L4_DST_PORT) as the partition key. In Java, a producer would do this by
>> hashing the string that comprises the desired partition key, then
>> computing hash % num-partitions to figure out the partition to send the
>> message on. I am guessing that nprobe relies on the default partitioning
>> scheme in the producer, which is a simple round-robin approach based on
>> the number of partitions that exist for the topic being used. This,
>> however, would randomly distribute flow records for a given conversation
>> across multiple partitions and, therefore, across multiple consumers in
>> a downstream consumer group. That would break the aggregations. So, my
>> request is that you consider adding a configuration option that enables
>> the user to define the partition key. This might be done, for example,
>> by allowing the user to define a CSV list of template fields to use to
>> form the partition key string. You could just concatenate them together,
>> hash that value, take it modulo the number of partitions for the topic
>> being used, and use that to enable the producer to publish on the
>> appropriate topic partition. This gives the user the freedom to define
>> the partition key while keeping the implementation in nprobe fairly
>> generic. Maybe this could also be done via some sort of "partition
>> plugin" to make it even more extensible?
>> Have you considered any such capability? Without it, we will have to
>> initially publish all flows on, say, a "netflow-raw" topic (using
>> round-robin), then consume that topic in a consumer group only to
>> republish it, repartitioned (as described above, using some n-tuple of
>> fields), only for it then to be consumed by another consumer group that
>> will do the aggregations and enrichments needed. Sure, we can make it
>> work, but partitioning should "really" be done at the source. The
>> approach I just described necessarily doubles our broker traffic, which
>> I would not like to have to do.
>> 4. Producer Options in General - Why not just make them all
>> configurable? For example, allow the user to define a name=value config
>> file using any supported producer configuration options and provide the
>> path to the file as an nprobe Kafka configuration option. Then, when you
>> instantiate the producer in nprobe, read in those configuration values
>> and pass them into the producer. This gives users access to all
>> available options and not just the current topic, acks, and compression
>> values.
>>
>> Miscellaneous Notes:
>>
>> 1. The v8.1 user's guide lists "New Options --kafka-enable-batch and
>> --kafka-batch-len to batch flow export to kafka" but does not provide
>> any detailed documentation on these. It looks like someone forgot to add
>> the description of these later in the document.
>> 2. nprobe --help shows this under the Kafka options: "<options topic>
>> Flow options topic", but the v8.1 user's guide makes no mention of it. I
>> have no idea what an options topic is.
>
> As for the above notes on Kafka, I will let my colleague Simone answer
> you, as he is the Kafka expert on our team.
>
> Simone, can you please answer Mark and, if there are changes to be made
> (I think so, from what I understand), file individual tickets?
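[For reference, the partition-key scheme proposed in point 3 above can be sketched as follows. This is an illustrative sketch, not nprobe code: the field names come from the flow records quoted in this thread, and CRC32 stands in for whatever stable hash an implementation would actually choose.]

```python
import zlib

# Sketch of the proposed scheme: concatenate a configurable CSV list of
# template fields into a key string, hash it, and take the hash modulo
# the number of partitions for the topic.

def partition_for(flow, key_fields, num_partitions):
    """Return the Kafka partition index for one flow record (a dict)."""
    key = "|".join(str(flow[f]) for f in key_fields)
    # crc32 is used here only because it is stable across processes
    # (Python's built-in hash() is salted); any stable hash would do.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical flow record using the field names from this thread.
flow = {
    "IPV4_SRC_ADDR": "10.0.0.1",
    "IPV4_DST_ADDR": "192.168.1.5",
    "L4_SRC_PORT": 49152,
    "L4_DST_PORT": 53,
}
key_fields = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "L4_DST_PORT"]
p = partition_for(flow, key_fields, num_partitions=8)
```

[Because the hash is deterministic, every record of the same conversation lands on the same partition, so each consumer in a downstream group sees complete conversations.]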
> Thanks, Luca
>
> _______________________________________________
> Ntop-misc mailing list
> [email protected]
> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
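[As a footnote to point 4 above: the name=value config file proposed there is essentially a Java-style properties file. A minimal sketch of reading one, assuming the parsed dict would then be handed to the producer; the option names shown (batch.size, linger.ms, buffer.memory) are the ones mentioned earlier in the thread.]

```python
import tempfile

def load_producer_config(path):
    """Parse a name=value (Java properties style) file into a dict,
    ignoring blank lines and '#' comments."""
    config = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            name, sep, value = line.partition("=")
            if sep:  # skip malformed lines that have no '='
                config[name.strip()] = value.strip()
    return config

# Example: a hypothetical config file exposing producer options that
# nprobe does not currently surface on the command line.
with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
    f.write("# producer tuning\n")
    f.write("batch.size=65536\n")
    f.write("linger.ms=50\n")
    f.write("buffer.memory=33554432\n")
    path = f.name

config = load_producer_config(path)
```

[The appeal of this design is that nprobe would not need to know about individual producer options at all; any option the underlying Kafka client supports would pass straight through.]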
_______________________________________________ Ntop-misc mailing list [email protected] http://listgateway.unipi.it/mailman/listinfo/ntop-misc
