Mark,
>> Regarding Kafka, the producer has many configuration options but only very
>> few are exposed for configuration in nprobe. Let me ask about these one by one:
>>
>> batch.size, linger.ms, buffer.memory - These are essential to controlling
>> batching in Kafka. nprobe has options --kafka-enable-batch and
>> --kafka-batch-len. However, these end up wrapping N messages into a JSON
>> array of size N and publishing that to Kafka. I feel this is the wrong
>> approach. Consider the downstream Kafka consumer. It expects to receive a
>> series of messages off a topic. The format of those messages should not
>> change due to batching. When batching is not enabled in nprobe, the
>> consumer sees a series of JSON dictionaries - each a single flow record.
>> When batching is enabled, the consumer instead sees a series of JSON
>> arrays, each with N JSON dictionaries. IMO, the proper way to do this is to
>> use the Kafka configuration values to control batching. In that case, the
>> producer simply queues up messages (each a dictionary) and, when the
>> configured thresholds are met, emits those messages. This results in a
>> batch of dictionaries being sent, and the consumer ONLY sees dictionaries.
>> Changing the message structure due to batching complicates things for
>> consumers and is not a typical pattern in Kafka processing.

This is a good suggestion. For backward-compatibility reasons, when batching
is enabled nProbe accumulates flows and then outputs them as a single Kafka
message. I agree that we should use the librdkafka batching features directly
and make sure flows remain plain JSON dictionaries rather than being
concatenated into an array. Please file an issue at
https://github.com/ntop/nProbe/issues and we will try to accommodate this
request.

>> Options topic - Your documentation does not even mention this (nprobe
>> --help does) but I don't understand what it means. What is a Kafka options
>> topic?
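To illustrate the difference discussed above, here is a small Python sketch contrasting the two payload shapes a consumer would see. The flow field names are illustrative only, not actual nProbe template output, and the point about librdkafka is that its linger.ms / batch.num.messages settings group messages on the wire without changing each message's payload.

```python
import json

# Three flow records as nProbe might export them (field names are
# illustrative, not the actual nProbe template output).
flows = [
    {"IPV4_SRC_ADDR": "10.0.0.1", "IPV4_DST_ADDR": "10.0.0.2", "IN_BYTES": 100},
    {"IPV4_SRC_ADDR": "10.0.0.3", "IPV4_DST_ADDR": "10.0.0.4", "IN_BYTES": 200},
    {"IPV4_SRC_ADDR": "10.0.0.5", "IPV4_DST_ADDR": "10.0.0.6", "IN_BYTES": 300},
]

# Current --kafka-enable-batch behaviour (as described above): one Kafka
# message whose payload is a JSON array of N flow dictionaries.
array_batch = [json.dumps(flows)]

# Proposed behaviour: one Kafka message per flow; batching is left to the
# client library (librdkafka's linger.ms / batch.num.messages), which groups
# messages on the wire without altering their payloads.
per_flow = [json.dumps(f) for f in flows]

# A consumer of the per-flow stream always sees a dictionary ...
assert all(isinstance(json.loads(m), dict) for m in per_flow)
# ... while the array batch forces it to handle a different structure.
assert isinstance(json.loads(array_batch[0]), list)
```

Either way the same three records cross the wire; only the second form lets a consumer keep a single deserialization path whether or not batching is enabled.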
This topic is used to send the data records for the NetFlow Options Template:
https://www.plixer.com/blog/network-traffic-analysis/netflow-overview-netflow-v9-options-template/

>> Partitioning - If we want to perform stream processing of netflow data,
>> then we want to ensure that all flow records from a given n-tuple are
>> placed on the same Kafka partition. We need to partition the data because
>> it is the only way to scale consumers in Kafka. If I want to perform some
>> aggregations on the data stream then I have to be sure that all netflow
>> records for a given conversation, for example, are on the same topic
>> partition. A simple example that would make that happen is to use the
>> IPV4_SRC_ADDR field of the flow record as the partition key. Or, maybe, an
>> N-tuple of (IPV4_SRC_ADDR, IPV4_DST_ADDR, L4_SRC_PORT, L4_DST_PORT) as the
>> partition key. In Java, a producer would do this by hashing the string
>> that comprises the desired partition key, then computing hash %
>> num-partitions to determine the partition to send the message on. I am
>> guessing that nprobe relies on the default partitioning scheme in the
>> producer, which is a simple round-robin approach based on the number of
>> partitions that exist for the topic being used.

Currently, nProbe sends to all available partitions in round-robin.

>> This, however, would randomly distribute flow records for a given
>> conversation across multiple partitions and, therefore, across multiple
>> consumers in a downstream consumer group. That would break the
>> aggregations. So, my request is that you consider allowing a configuration
>> option that enables the user to define the partition key. This might be
>> done, for example, by allowing the user to define a CSV list of template
>> fields to use to form the partition key string.
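The key-hashing scheme requested above can be sketched in a few lines of Python. The `partition_for` helper and the `key_fields` list are hypothetical, standing in for the proposed CSV option; note also that CRC32 is used here for simplicity and is not the murmur2 hash used by the Java client's default partitioner, so a mixed producer fleet would need to agree on one algorithm.

```python
import zlib

def partition_for(flow, key_fields, num_partitions):
    """Pick a partition by hashing a user-chosen list of template fields.

    key_fields is hypothetical: it stands in for the proposed CSV option,
    e.g. "IPV4_SRC_ADDR,IPV4_DST_ADDR,L4_SRC_PORT,L4_DST_PORT".
    """
    # Concatenate the selected field values into the partition key string.
    key = ",".join(str(flow[f]) for f in key_fields)
    # Stable hash (CRC32) modulo the partition count; NOT the murmur2 hash
    # that the Java client's DefaultPartitioner uses.
    return zlib.crc32(key.encode()) % num_partitions

flow_a = {"IPV4_SRC_ADDR": "10.0.0.1", "IPV4_DST_ADDR": "10.0.0.2",
          "L4_SRC_PORT": 40000, "L4_DST_PORT": 443, "IN_BYTES": 1500}
flow_b = dict(flow_a, IN_BYTES=9000)  # same conversation, later record

fields = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "L4_DST_PORT"]
# Records from the same conversation always land on the same partition,
# so a downstream consumer group can aggregate per conversation.
assert partition_for(flow_a, fields, 8) == partition_for(flow_b, fields, 8)
```

This is exactly the "hash the key string, then modulo by the partition count" pattern the message describes for a Java producer.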
>> You could just concatenate them together, hash that value, then take the
>> modulo by the number of partitions for the topic being used, and use that
>> to enable the producer to publish on the appropriate topic partition. This
>> gives the user the freedom to define the partition key while keeping the
>> implementation in nprobe fairly generic. Maybe this could also be done via
>> some sort of "partition plugin" to make it even more extensible? Have you
>> considered any such capability? Without it, we will have to initially
>> publish all flows on, say, a "netflow-raw" topic (using round-robin), then
>> consume this topic in a consumer group only to republish it after
>> repartitioning (as described above, using some N-tuple of fields), only
>> for it then to be consumed by another consumer group that does the
>> aggregations and enrichments needed. Sure, we can make it work, but
>> partitioning should "really" be done at the source. The approach I just
>> described necessarily doubles our broker traffic, which I would not like
>> to have to do.

As already said, this is an interesting feature that we are willing to
consider for implementation. Feel free to contact us privately so we can try
and work together to make the hashing controllable.

>> Producer Options in General - Why not just make them all configurable? For
>> example, allow the user to define a name=value config file using any
>> supported producer configuration options, and provide the path to the file
>> as an nprobe Kafka configuration option. Then, when you instantiate the
>> producer in nprobe, read in those configuration values and pass them to
>> the producer. This gives users access to all available options, not just
>> the current topic, acks, and compression values.

Yes, this is definitely useful as well. Use the same issue tracker mentioned
above and we will prioritise the activity.
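The pass-through config file suggested above could look like the following sketch. The `parse_producer_config` helper and the file format are assumptions based on the suggestion, not an existing nProbe feature; the property names shown (linger.ms, batch.num.messages, compression.codec, acks) are real librdkafka producer properties that would be handed to the client untouched.

```python
def parse_producer_config(text):
    """Parse name=value lines, skipping blanks and '#' comments.

    Hypothetical helper: mirrors how nProbe could read a user-supplied
    properties file and pass every entry straight to librdkafka.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        conf[name.strip()] = value.strip()
    return conf

example = """
# Any librdkafka producer property may appear here.
linger.ms=50
batch.num.messages=1000
compression.codec=lz4
acks=all
"""

conf = parse_producer_config(example)
assert conf["linger.ms"] == "50"
assert conf["compression.codec"] == "lz4"
```

Because the file is passed through verbatim, new librdkafka options become usable without any nProbe code changes, which is the main appeal of this design.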
Regards, Simone

>> Miscellaneous Notes:
>> The v8.1 user's guide lists "New Options --kafka-enable-batch and
>> --kafka-batch-len to batch flow export to kafka" but does not provide any
>> detailed documentation on these. It looks like someone forgot to add the
>> description of these later in the document.
>> nprobe --help shows this under the Kafka options: "<options topic> Flow
>> options topic" but the v8.1 user's guide makes no mention of it. I have no
>> idea what an options topic is.

> As for the above notes on Kafka, I will let my colleague Simone answer you;
> he is the Kafka expert on our team.
>
> Simone, can you please answer Mark and, if there are changes to be made (I
> think so from what I understand), file individual tickets?
>
> Thanks
> Luca
_______________________________________________
Ntop-misc mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-misc
