On 02/11/17 23:58, Azoff, Justin S wrote:
> For an example of a purely broadcast use case, see
>
> scripts/base/frameworks/intel/cluster.bro
>
> You can see the crazy amount of complexity around the Intel::cluster_new_item event.
That's right! It took me some time to figure out how the data should be distributed. I am following this thread trying to keep up with the development. Right now I don't have new ideas to contribute, but as the intel framework was mentioned multiple times as an example, I thought I might sketch its communication behavior so that we have a more complete view of that use case.

Let's assume that in the beginning there is only the manager. The manager reads in the intel file and creates its "in-memory database", a quite complex table (the DataStore), as well as a data structure that contains only the indicators needed for matching on the workers (the MinDataStore). Now, when a worker connects, it receives the current MinDataStore, sent using send_id. (Side note: I am planning to replace the sets used in the MinDataStore with Cuckoo filters. I am not sure how serialization etc. will work out using Broker, but if I remember correctly there is a temporary solution for now.)

If a worker detects a match, it triggers match_no_items on the manager, which generates the hit by combining the seen data from the worker with the metadata from the DataStore. At this point, if the manager functionality is distributed across multiple data nodes, we have to make sure that every data node has the right part of the DataStore to deal with the incoming hit. One could keep the complete DataStore on every data node, but I think that would lead to another scheme in which a subset of workers sends all their requests to a specific data node, i.e. each data node serves a part of the cluster.

Back to the current implementation. So far this is not that complex, but there are two more cases to deal with: inserting new intel items and removing items. A new item can be inserted on the manager or on a worker. As a new item might just be new metadata for an already existing indicator (no update of the MinDataStore needed), the manager is the only node that can handle the insert. So if an item is inserted on a worker, the worker triggers a cluster_new_item event on the manager, which then proceeds as if it had inserted the item itself. Finally, the manager only triggers cluster_new_item on the workers if the inserted item was a new indicator that has to be added to the workers' MinDataStores. Some of the complexity here is due to the fact that the same event, cluster_new_item, is used for communication in both directions (worker2manager and manager2worker); I have put a rough sketch of that pattern at the end of this mail. The removal of items works more or less the same, with the only difference that there is a specific event for each direction (remove_item and purge_item).

Long story short: I think the built-in distribution across multiple data nodes you discussed is a great idea. The only thing to keep in mind would be a suitable way of "initializing" the data nodes with the corresponding subset of data they need to handle. I guess in case of the intel framework the manager will still handle reading the intel files and might use the same mechanisms the workers use to distribute the ingested data to the data nodes. The only thing I am not sure about is how we can/should handle dynamically adding and removing data nodes.

And just to avoid misunderstandings: We won't be able to get rid of the @if (Cluster::local_node_type() != Cluster::MANAGER/DATANODE) statements completely, as different node types have different functionality. It's just about the communication API, right?
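To make the bidirectional use of cluster_new_item and the @if guards a bit more concrete, here is a rough sketch of the pattern as I remember it. This is simplified and not the actual cluster.bro; the real script's redefs and handler logic differ in the details, and the handler bodies below are just placeholder comments:

# Simplified sketch of the bidirectional cluster_new_item pattern
# (not the actual cluster.bro; handler bodies are placeholders).

# The cluster framework forwards the same event in both directions.
redef Cluster::manager2worker_events += /Intel::cluster_new_item/;
redef Cluster::worker2manager_events += /Intel::cluster_new_item/;

@if ( Cluster::local_node_type() == Cluster::MANAGER )
event Intel::cluster_new_item(item: Intel::Item)
	{
	# Insert into the full DataStore. Only if the indicator is actually
	# new does the manager raise the event again, so that it gets
	# forwarded to the workers.
	# (Avoiding event loops between manager and workers is part of the
	# complexity and is left out of this sketch.)
	}
@endif

@if ( Cluster::local_node_type() == Cluster::WORKER )
event Intel::cluster_new_item(item: Intel::Item)
	{
	# Add the new indicator to the local MinDataStore used for matching.
	}
@endif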
I hope this helps when thinking about the API design :)

Jan