Re: [Bro-Dev] Broker data store use case and questions

2018-05-14 Thread Azoff, Justin S

> On May 14, 2018, at 10:12 AM, Jon Siwek  wrote:
> 
> A short-lived cache, separate from the data store, still has problems like 
> the above: there can be times where the local cache contains the key and the 
> master store does not and so you may miss some (re)insertions.

I see what you mean.  I can almost see a solution that uses create_expire 
and expire_func to trigger a re-submit when the local cache entry expires, 
but that may cause the opposite problem.  A record would be sent the first 
time it was seen and then at most once more, N minutes later.  If that 
expiry falls at 00:03, the entry would be logged on the following day even 
if the host had not been seen yet that day.  I suppose if the value in the 
cache table were the network_time of the last time seen, that could be used 
to fill in the HostInfo record.



— 
Justin Azoff


___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Broker data store use case and questions

2018-05-14 Thread Jon Siwek


On 5/11/18 6:33 PM, Michael Dopheide wrote:

> First, can Cluster::default_master_node be changed to default to the 
> name of the current manager node rather than specifying the name as 
> 'manager'?

Maybe.  I'll try having broctl communicate that to Bro via a new 
environment variable.

> Easy to redef to the manager's name, but less easy when you 
> use the same code base on multiple clusters with different names.

If you don't want to wait for me to try the above fix, you could also 
try redef'ing it yourself with a call to getenv(), using an environment 
variable whose value you can set differently for each cluster.
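
A minimal sketch of that redef, assuming a hypothetical environment 
variable named BRO_CLUSTER_MASTER that each cluster exports before Bro 
starts (the variable name is illustrative only, nothing sets it today):

```bro
# Sketch only: BRO_CLUSTER_MASTER is a hypothetical environment variable.
# getenv() returns an empty string when the variable is unset, so fall
# back to the stock node name "manager" in that case.
redef Cluster::default_master_node =
    getenv("BRO_CLUSTER_MASTER") == "" ? "manager"
                                       : getenv("BRO_CLUSTER_MASTER");
```

With something like this in the shared code base, only the environment 
differs between clusters.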

> Second, when during startup should Bro know that its persistent stores 
> exist via Cluster::stores() ?  It appears bro_init may be too soon, but 
> I'm still playing.

The comments for the Cluster::stores table may help in case you missed 
it -- Cluster::create_store() is intended to be called in bro_init() and 
will end up populating Cluster::stores.  You can also pre-populate 
and customize the Cluster::stores table via a redef, and those entries 
will all automatically get picked up during the Cluster::create_store() 
process.
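
As a rough sketch of that flow, using a hypothetical store name and 
global (neither ships with Bro):

```bro
# Sketch only: "example/seen-hosts" and example_store are hypothetical.
global example_store: Cluster::StoreInfo;

event bro_init()
    {
    # Creates the store on this node and records it in Cluster::stores,
    # honoring any entry for this name that was pre-populated via redef.
    example_store = Cluster::create_store("example/seen-hosts");
    }
```

Code that runs before this handler should not expect the entry to be 
usable in Cluster::stores yet.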

> Also, it'd be nice if the persistence of built-in 
> stores (like known/hosts, known/certs, etc) were redef-able.

It should be possible by putting something like this in local.bro:

redef Cluster::stores += {
    [Known::host_store_name] = Cluster::StoreInfo($backend = Broker::SQLITE)
};

- Jon


Re: [Bro-Dev] Broker data store use case and questions

2018-05-14 Thread Jon Siwek


On 5/11/18 1:38 PM, Azoff, Justin S wrote:
> 
>> On May 11, 2018, at 10:13 AM, Jon Siwek  wrote:
>>
>>
>> There's no check against the local cache to first see if the key exists
>> as going down that path leads to race conditions.
> 
> What sort of race conditions?

By "local cache", I mean the data store "clone" here.  And one race with 
checking for existence in the local clone could look like:

(1) master: delete an expired key, "foo", send notification to clones
(2) clone: check for existence of key, "foo" and find it exists locally, 
then suppress further logic based on that
(3) clone: receive expiry notification for key "foo"

In that way, you can miss a (re)insertion that should have taken place 
had the query and insertion been done together, in sequence, directly on 
the master data set.
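
Expressed in script form, the pattern being avoided looks roughly like 
the following.  This is an anti-pattern sketch, not code from 
known-hosts.bro: example_store and example_host_seen are hypothetical 
names, and it assumes the node running it holds a clone whose reads are 
answered from its local copy.

```bro
# Anti-pattern sketch: check-then-insert against a clone.
global example_store: Cluster::StoreInfo;

event bro_init()
    {
    example_store = Cluster::create_store("example/seen-hosts");
    }

event example_host_seen(host: addr)
    {
    when ( local e = Broker::exists(example_store$store, host) )
        {
        # The answer reflects the clone's view.  If the master has
        # already expired the key but the notification is still in
        # flight (step 2 above), this is T and the (re)insertion below
        # is skipped.
        if ( e$status == Broker::SUCCESS && ! (e$result as bool) )
            Broker::put(example_store$store, host, T, 1day);
        }
    timeout 15sec
        {
        Reporter::warning("example store query timed out");
        }
    }
```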

> Things are a bit better off now in that we can use a short lived cache, since 
> the cache doesn't need to be the actual data store anymore like the old known 
> hosts set was.

A short-lived cache, separate from the data store, still has problems 
like the above: there can be times where the local cache contains the 
key and the master store does not and so you may miss some (re)insertions.

The main goal I had when re-writing these was correctness: I can't know 
what network they will run on, and so don't want to assume it will be ok 
to miss an event here or there because "typically those should be seen 
frequently enough that it will get picked up soon after the miss".

If we can optimize the scripts that ship w/ Bro while still maintaining 
correctness, that would be great, else I'd rather sites decide for 
themselves what trade-offs are acceptable and write their own scripts to 
optimize for those.

- Jon


Re: [Bro-Dev] Broker data store use case and questions

2018-05-11 Thread Michael Dopheide
Couple questions.   (Let me know if this isn't appropriate bro-dev content,
but I don't want to cause confusion on the normal bro list.)

First, can Cluster::default_master_node be changed to default to the name
of the current manager node rather than specifying the name as 'manager'?  Easy
to redef to the manager's name, but less easy when you use the same code
base on multiple clusters with different names.

 stderr.log
fatal error in /usr/local/bro/share/bro/base/frameworks/cluster/./main.bro,
lines 399-400: master node 'manager' for cluster store 'bro/known/certs'
does not exist

Second, when during startup should Bro know that its persistent stores
exist via Cluster::stores() ?  It appears bro_init may be too soon, but I'm
still playing.  Also, it'd be nice if the persistence of built-in stores
(like known/hosts, known/certs, etc) were redef-able.

Thanks,
-Dop






Re: [Bro-Dev] Broker data store use case and questions

2018-05-11 Thread Azoff, Justin S

> On May 11, 2018, at 10:13 AM, Jon Siwek  wrote:
> 
> 
> There's no check against the local cache to first see if the key exists 
> as going down that path leads to race conditions.

What sort of race conditions?

Right now I see a lot of events going around so it seems like there may be a 
bit of overhead in this area.

For example, in about a minute one node in a two node cluster sent 2104 
Software::register events (so likely 4k total)
In that time, only 7 new entries were logged to software.log.

It's always good to reduce memory usage, but especially for things like 
known hosts, which are generally kept at LOCAL_HOSTS, I think the small 
amount of memory used for caching already-seen hosts saves more resources 
than would be spent sending redundant events around, especially if those 
events end up queued and buffered in RAM anyway.

Things are a bit better off now in that we can use a short lived cache, since 
the cache doesn't need to be the actual data store anymore like the old known 
hosts set was.



— 
Justin Azoff




Re: [Bro-Dev] Broker data store use case and questions

2018-05-11 Thread Jon Siwek


On 5/11/18 10:15 AM, Michael Dopheide wrote:
> Let me clarify point 4, my goal is just to keep the knownhosts data 
> persistent across restarts.  (Or any data set in the general case.)  So 
> if HRW is the best way to keep data in memory I need a way to write it 
> out to disk on Bro exit so I can read it back in later.

Ok, maybe first try a single data store for persistence (what 
known-hosts will do by default now).

If you then find you overload the centralized manager/master node, my 
next suggestion is to partition the data set across proxies via HRW 
events and combine that with a data store configured for persistence on 
each proxy.

- Jon


Re: [Bro-Dev] Broker data store use case and questions

2018-05-11 Thread Michael Dopheide
Let me clarify point 4, my goal is just to keep the knownhosts data
persistent across restarts.  (Or any data set in the general case.)  So if
HRW is the best way to keep data in memory I need a way to write it out to
disk on Bro exit so I can read it back in later.



Re: [Bro-Dev] Broker data store use case and questions

2018-05-11 Thread Jon Siwek


On 5/10/18 3:53 PM, Michael Dopheide wrote:

> 1) My initial gut feeling was that all of the when() calls for insertion 
> could get really expensive on a brand new cluster before the store is 
> populated.

I've not tried to explicitly measure differences yet, though my hunch is 
that the overhead of needing to use when() to drive data store 
communication patterns could be slightly more expensive than just using 
remote events (or &synchronized as the previous implementation used). 
I'm thinking of overhead more in terms of memory here, as when() needs to 
save the state of the current frame making the call so it can resume 
later.

Another difference is that the data store implementation of known-hosts 
always requires remote communication just to check whether a given key is 
in the data store yet, which may be a bottleneck for some use-cases.

You can also compare/contrast with another implementation of 
known-hosts.bro if you toggle the `Known::use_host_store = F` code path.  
There, instead of using a data store, it sends remote events via 
Cluster::publish_hrw to uniformly partition the data set across proxies 
in a scalable manner but without persistence.
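
The HRW approach can be sketched as follows.  The event and set names 
are hypothetical and this is a simplified illustration, not the code in 
known-hosts.bro:

```bro
# Sketch only: example_hosts, example_host_add and example_host_seen
# are hypothetical names.
@load base/frameworks/cluster

# Each proxy only ever holds the hosts that HRW assigns to it.
global example_hosts: set[addr] &create_expire=1day;

event example_host_add(host: addr)
    {
    # Runs on the proxy that HRW selected for this host.
    if ( host in example_hosts )
        return;

    add example_hosts[host];
    print fmt("new host: %s", host);
    }

event example_host_seen(host: addr)
    {
    # The same key always maps to the same proxy in the pool, so the
    # membership check happens in exactly one place.
    Cluster::publish_hrw(Cluster::proxy_pool, host, example_host_add, host);
    }
```

Nothing here is written to disk, so the set starts empty again after a 
restart.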

Yet another idea for an implementation, if you need persistence + 
scalability, would be combining the HRW stuff with data stores.  e.g. 
partitioning the total data set across proxies while using a data store 
on each one for local storage instead of a table/set.

I don't know if there's a general answer to which way is best.  Likely 
varies per use-case / network.

> 2) Correct me if I'm wrong, but it seems like the check for a host 
> already being in known_hosts (now host_store) no longer exists.  As a 
> result, we try to re-insert the host, calling when(), every time we see 
> an established connection with a local host.

Sounds right.

Specifically, it's Broker::put_unique() that hides the following:

(1) tell master data store "insert this key if it does not exist"
(2) wait for master data store to tell us if the key was inserted, and 
thus did not exist before

There's no check against the local cache to first see if the key exists 
as going down that path leads to race conditions.
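
A minimal sketch of that call pattern, using a hypothetical store and 
event name rather than the ones in known-hosts.bro:

```bro
# Sketch only: example_store and example_host_seen are hypothetical.
global example_store: Cluster::StoreInfo;

event bro_init()
    {
    example_store = Cluster::create_store("example/seen-hosts");
    }

event example_host_seen(host: addr)
    {
    # Ask the master to insert the key only if it is absent, with a
    # one-day expiry, and wait asynchronously for the answer.
    when ( local r = Broker::put_unique(example_store$store, host, T, 1day) )
        {
        if ( r$status == Broker::SUCCESS )
            {
            # The result is T only if the master performed the insert,
            # i.e. the key did not exist before.
            if ( r$result as bool )
                print fmt("new host: %s", host);
            }
        else
            Reporter::error("example store put_unique failed");
        }
    timeout 15sec
        {
        # No way to tell here whether the master inserted the key.
        print fmt("timed out inserting %s", host);
        }
    }
```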

> 3) How do I retrieve values from the store to test for existence?

Broker::exists() to just check existence of a key or Broker::get() to 
retrieve value at a key.  You can also infer existence from the result 
of Broker::get().

Either requires calling inside 'when()'.  Generally, any function in the 
API you see return a Broker::QueryResult needs to use 'when()'.
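
For illustration, both calls might be used like this.  The store and 
event names are hypothetical, and the sketch assumes the store maps 
addresses to string values such as host names:

```bro
# Sketch only: example_store and example_lookup are hypothetical.
global example_store: Cluster::StoreInfo;

event bro_init()
    {
    example_store = Cluster::create_store("example/host-names");
    }

event example_lookup(host: addr)
    {
    # Existence check only.
    when ( local e = Broker::exists(example_store$store, host) )
        {
        if ( e$status == Broker::SUCCESS && (e$result as bool) )
            print fmt("%s is in the store", host);
        }
    timeout 15sec
        {
        print fmt("exists() timed out for %s", host);
        }

    # Value retrieval; a non-SUCCESS status is treated as "absent or
    # failed" here.
    when ( local g = Broker::get(example_store$store, host) )
        {
        if ( g$status == Broker::SUCCESS )
            print fmt("%s -> %s", host, g$result as string);
        else
            print fmt("no value for %s", host);
        }
    timeout 15sec
        {
        print fmt("get() timed out for %s", host);
        }
    }
```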

> 4) Assuming that requires another Broker call inside a when(), does it 
> make sense to pull the data store into memory at bro_init() and do 
> a Cluster::publish_hrw?

Not sure I follow since, in the current implementation of known-hosts, 
the data store and Cluster::publish_hrw code paths don't interact 
(they're alternate implementations of the same thing as mentioned 
before).  If the question is just whether it makes sense to go the 
Cluster::publish_hrw route instead of using a data store: yes, just 
depends on what you prefer.  IMO, the data store approach has downsides 
that make it less preferable to me.

- Jon


[Bro-Dev] Broker data store use case and questions

2018-05-10 Thread Michael Dopheide
Maybe I'm jumping the gun a little bit, but I want to start wrapping my
head around the upcoming changes.  Let's start by stating my use case...  I
wanted to stop the repetitive reverse DNS queries caused by
ssh/interesting-hostnames.bro by rebuilding known-hosts.bro to include the
names, allowing a simple lookup*.  I started re-writing the old one and
Justin pointed me towards the 'new' version of known-hosts in the
topic/actor-system branch.

Looking at the new known-hosts.bro..

1) My initial gut feeling was that all of the when() calls for insertion
could get really expensive on a brand new cluster before the store is
populated.

2) Correct me if I'm wrong, but it seems like the check for a host already
being in known_hosts (now host_store) no longer exists.  As a result, we
try to re-insert the host, calling when(), every time we see an established
connection with a local host.

Which leads me to...

3) How do I retrieve values from the store to test for existence?

4) Assuming that requires another Broker call inside a when(), does it make
sense to pull the data store into memory at bro_init() and do
a Cluster::publish_hrw?

Thanks,
Dop


* - Yes, on the edges this breaks DNS TTLs, but saves thousands of when()
calls to lookup_addr() and our names don't change very frequently.