Re: [Bro-Dev] "delete" of entire tables/sets

2018-12-18 Thread Jim Mellander
To satisfy my curiosity and to provide further info for anyone viewing list
archives, I benchmarked assigning a new table() vs. running clear_table()
on tables of various sizes & complexities.  Assigning a new table() was
clearly the winner by a huge margin.

It should be noted that these two methods are not completely equivalent. I
run a script where a current table is periodically moved to an old table,
and the current table cleared for further use.  So, with A & B being tables
of the same type:

B = A;
A = table();

works as intended, as B now contains the prior contents of A, and A is a
new empty table.

B = A;
clear_table(A);

clears both, as they both point to the same table.
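
A minimal sketch of the comparison, for anyone who wants to reproduce it
(the table size and element type here are just illustrative, not the actual
test set):

event bro_init()
    {
    local A: table[count] of count;
    local i = 0;
    while ( i < 1000000 )
        {
        A[i] = i;
        ++i;
        }

    local t0 = current_time();
    A = table();                # assign a fresh empty table
    print fmt("assign new table(): %s", current_time() - t0);

    i = 0;
    while ( i < 1000000 )       # rebuild, then time clear_table()
        {
        A[i] = i;
        ++i;
        }

    t0 = current_time();
    clear_table(A);
    print fmt("clear_table():      %s", current_time() - t0);
    }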

On Thu, Dec 6, 2018 at 4:17 PM Vern Paxson  wrote:

> > I believe the clear_table() function still exists in zeek, as well...
>
> Hah^2!  Yeah, it does.  Well, glad I consulted the list before diving into
> some hacking :-P.
>
> Vern
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] "delete" of entire tables/sets

2018-12-06 Thread Jim Mellander
I believe the clear_table() function still exists in zeek, as well...

On Thu, Dec 6, 2018 at 3:12 PM Vern Paxson  wrote:

> > I guess re-assigning a new, empty table to the variable could be
> > analogous to deleting all entries and also avoids the iterator
> > invalidation problem ?
>
> Hah!, yeah, that's certainly a simple way to do it.  Maybe I'll
> change my hacking for now to just be adding this observation to
> the "delete" documentation :-).
>
> Vern
> ___
> bro-dev mailing list
> bro-dev@bro.org
> http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Consistent error handling for scripting mistakes

2018-11-12 Thread Jim Mellander
Along the same vein of sensible Bro script error handling, I'm resending an
issue I found in January:

I was tinkering with the sumstats code, and inadvertently deleted the final
"}" closing out the last function.  When running the code, the misleading
error message is received:

error in bro/share/bro/base/frameworks/tunnels/./main.bro, line 8: syntax
error, at or near "module"

presumably due to the function still being open when the next policy script
is loaded.  Wouldn't it be more reasonable to check, at the end of each
script as it is loaded, that there are no dangling functions, expressions, etc.?


==

There are also silent fails which probably should give a warning, such as
failing to include the fully-qualified event name silently preventing the
event from being triggered.
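
A minimal sketch of the kind of silent failure I mean (module and event
names here are made up):

module Example;

export {
    global my_event: event(n: count);
    }

event my_event(n: count)
    {
    print "my_event fired", n;
    }

event bro_init()
    {
    # Accepted without complaint, but the handler never runs: inside a
    # schedule the event name needs to be fully qualified.
    schedule 1 sec { my_event(1) };

    # This one does fire.
    schedule 1 sec { Example::my_event(1) };
    }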


==

The above are more in the area of parsing vs runtime.

My idea on runtime scripting errors would be to apply a sensible default to
the offending expression (null or 0, as the case may be, might be
sufficient), log the error, and continue.


Jim



On Mon, Nov 12, 2018 at 10:27 AM, Jon Siwek  wrote:

> Trying to broaden the scope of:
>
> https://github.com/bro/bro/issues/208
>
> I recently noticed there's a range of behaviors in how various
> scripting mistakes are treated.  They may (1) abort, as in case of bad
> subnet mask or incompatible vector element assignment (2) skip over
> evaluating (sub)expression(s), but otherwise continue current function
> body, as in case of non-existing table index access or (3) exit the
> current function body, as in the classic case of uninitialized record
> field access.
>
> 1st question: should these be made more consistent? I'd say yes.
>
> 2nd question: what is the expected way for these to be handled?  I'd
> argue that (3) is close to expected behavior, but it's still weird
> that it's only the *current function body* (yes, *function*, not
> event) that exits early -- hard to reason about what sort of arbitrary
> code was depending on that function to be fully evaluated and what
> other sort of inconsistent state is caused by exiting early.
>
> I propose, for 2.7, to aim for consistent error handling for scripting
> mistakes and that the expected behavior is to unwind all the way to
> exiting the current event handler (all its function bodies).  That
> makes it easier to explain how to write event handlers such that they
> won't enter too wild/inconsistent of a state should a scripting error
> occur: "always write an event handler such that it makes no
> assumptions about order/priority of other events handlers".  That's
> already close to current suggestions/approaches.
>
> One exception may be within bro_init(), if an error happens at that
> time, I'd say it's fine to completely abort -- it's unlikely or hard
> to say whether Bro would operate well if it proceeded after an error
> that early in initialization.
>
> Thoughts?
>
> - Jon
> ___
> bro-dev mailing list
> bro-dev@bro.org
> http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Underflow Considered Harmful

2018-08-20 Thread Jim Mellander
Or: How 13 billion became 1.844674e+19 before becoming 0.

After sending amounts totaling over 13 billion thru Sumstats, a value of 0
came out the other end of the sausage factory, but only for one specific
data item.  Debugging this required lots of well placed print statements,
and wild speculation on my part as to what possibly could be broken

The values being thrown in to be summed are incremental differences from a
previous observation, which *should* be zero or a positive number in the
range of several K, so a 'count' variable was used.  However, for some
reason, this value came up negative (or should have), due to (in decreasing
likelihood) a logic error in the script, one of bro's dark corners, or a bro
bug.  The reason for the negativity is still TBD.  But in the world of
unsigned 64-bit values (aka bro 'count' variables) there is no negativity,
only positivity, and an unsigned underflow creates a number just below
2**64 ~ 1.844674e+19.

Well, Sumstats tallies in doubles, and naturally this figure (1.844674e+19)
dominated the total.  In fact, additional increments to this total pushed
the total value to be greater than 2**64 (with loss of precision, as
doubles only keep 53 bits).

In the processing step at the Sumstats epoch, the value was converted back
to a count using the double_to_count() function, which the cheatsheet warns
returns 0 if the double value is < 0.0; it actually returns 2**64 - double
(with a runtime error), and for values > 2**64 it returns 0 with no runtime
error :-(

So, there it is, a value that should have been about 13 billion became
1.844674e+19 and then became 0.
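
A minimal sketch of both ends of the chain (the numbers are illustrative,
not the real data):

event bro_init()
    {
    local prev: count = 100000;
    local cur: count = 90000;   # should never be smaller than prev, but was
    local diff = cur - prev;    # unsigned underflow: a hair below 2**64
    print diff;

    # and double_to_count() of anything at or above 2**64 quietly returns 0
    print double_to_count(1.9e19);
    }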

A few suggestions:
1. Conversion routines should saturate at respective minima/maxima of the
type being converted to (possibly with runtime error).

2. Underflow of the 'count' type is almost invariably a bug, and should
trigger a runtime error.  Overflow, similarly, although in practice it
seems much less likely to occur as most scripts are dealing with integers
considerably less than 2**64.  A similar argument could be made for 'int'.
With some operations, it is difficult to detect overflow/underflow, but for
simple add and subtract, it is relatively easy.

3. Documentation to match behavior.

Jim
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Use of 'any' type

2018-08-16 Thread Jim Mellander
Actually, the 'as' operator is useful, since it appears that otherwise 'any'
can currently only be cast into a string.

On Thu, Aug 16, 2018 at 2:58 PM, Jon Siwek  wrote:

> In the master branch, there are also type checking/casting 'is' and
> 'as' operators [1] and type-based switch statement [2] that may be be
> useful.
>
> - Jon
>
> [1] https://www.bro.org/sphinx-git/script-reference/operators.html
> [2] https://www.bro.org/sphinx-git/script-reference/
> statements.html#keyword-switch
>
> On Thu, Aug 16, 2018 at 4:24 PM Jim Mellander  wrote:
> >
> > Thanks, Johanna - I think type_name() may suffice for the purposes I am
> envisioning.
> >
> > On Thu, Aug 16, 2018 at 1:57 PM, Johanna Amann  wrote:
> >>
> >> Hi Jim,
> >>
> >> On 16 Aug 2018, at 13:40, Jim Mellander wrote:
> >>
> >>> It would be most convenient if the 'any' type could defer type checking
> >>> until runtime at the script level.
> >>>
> >>> For instance, if both A & B are defined as type 'any', a compile time
> error
> >>>
> >>> "illegal comparison (A < B)"
> >>>
> >>> occurs upon encountering a bro statement
> >>>
> >>> if (A < B) do_something();
> >>>
> >>> even if the actual values stored in A & B at runtime are integral
> types for
> >>> which comparison makes sense.
> >>
> >>
> >> I think this is a bit hard to do with how things are set up at the
> moment internally - and it also does make type-checking at startup less
> possible-helpful.
> >>
> >> However...
> >>
> >>>
> >>> If the decision could be made at runtime (which could then potentially
> >>> throw an error), a number of useful generic functions could be created
> at
> >>> the script level, rather than creating yet-another-bif.  A useful
> >>> yet-another-bif would be 'typeof' to allow varying code paths based on
> the
> >>> type of value actually stored in 'any'.
> >>
> >>
> >> This already exists and I think you can actually use it to write code
> like that; you just have to cast your any-type to the correct type first.
> The function you want is type_name; it is e.g. used in base/utils/json.bro.
> >>
> >> Johanna
> >
> >
> > ___
> > bro-dev mailing list
> > bro-dev@bro.org
> > http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Use of 'any' type

2018-08-16 Thread Jim Mellander
Thanks, Johanna - I think type_name() may suffice for the purposes I am
envisioning.

On Thu, Aug 16, 2018 at 1:57 PM, Johanna Amann  wrote:

> Hi Jim,
>
> On 16 Aug 2018, at 13:40, Jim Mellander wrote:
>
> It would be most convenient if the 'any' type could defer type checking
>> until runtime at the script level.
>>
>> For instance, if both A & B are defined as type 'any', a compile time
>> error
>>
>> "illegal comparison (A < B)"
>>
>> occurs upon encountering a bro statement
>>
>> if (A < B) do_something();
>>
>> even if the actual values stored in A & B at runtime are integral types
>> for
>> which comparison makes sense.
>>
>
> I think this is a bit hard to do with how things are set up at the moment
> internally - and it also does make type-checking at startup less
> possible-helpful.
>
> However...
>
>
>> If the decision could be made at runtime (which could then potentially
>> throw an error), a number of useful generic functions could be created at
>> the script level, rather than creating yet-another-bif.  A useful
>> yet-another-bif would be 'typeof' to allow varying code paths based on the
>> type of value actually stored in 'any'.
>>
>
> This already exists and I think you can actually use it to write code like
> that; you just have to cast your any-type to the correct type first. The
> function you want is type_name; it is e.g. used in base/utils/json.bro.
>
> Johanna
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Use of 'any' type

2018-08-16 Thread Jim Mellander
It would be most convenient if the 'any' type could defer type checking
until runtime at the script level.

For instance, if both A & B are defined as type 'any', a compile time error

"illegal comparison (A < B)"

occurs upon encountering a bro statement

if (A < B) do_something();

even if the actual values stored in A & B at runtime are integral types for
which comparison makes sense.

If the decision could be made at runtime (which could then potentially
throw an error), a number of useful generic functions could be created at
the script level, rather than creating yet-another-bif.  A useful
yet-another-bif would be 'typeof' to allow varying code paths based on the
type of value actually stored in 'any'.

Any comments?

Jim
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Writing SumStats plugin

2018-08-07 Thread Jim Mellander
It seems that there's some inconsistency in SumStats plugin usage and
implementation.  There appear to be 2 classes of plugins with differing
calling mechanisms and action:

   1. Item to be measured is in the Key, and the measurement is in the
      Observation.
         1. These include Average, Last X Observations, Max, Min, Sample,
            Standard Deviation, Sum, Unique, Variance.
            1. These are exact measurements.
            2. Some of these have dependencies: StdDev depends on Variance,
               which depends on Average.
   2. Item to be measured is in the Observation, and the measurement is
      implicitly 1, and the Key is generally null.
         1. These include HyperLogLog (number of Unique), TopK (top count).
            1. These are probabilistic data structures.

The Key is not passed to the plugin, but is used to allocate a table that
includes, among other things, the processed observations.  Both classes
call the epoch_result function once per key at the end of the epoch.  Since
class 2 plugins often use a null key, there is only one call to
epoch_result, and a special function is used to extract the results from
the probabilistic data structure (
https://www.bro.org/current/exercises/sumstats/sumstats-5.bro).  The
epoch_finished function is called when all keys have been returned to
finish up.  This is unneeded with this sort of class 2 plugin, since all
the work can be done in the sole call to epoch_result.  Multiple keys could
be used with class 2 plugins, which allows for groupings (
https://www.bro.org/current/exercises/sumstats/sumstats-4.bro).
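
Roughly, the two calling styles look like this (the stream names are made
up, and the reducer/SumStats setup is omitted):

@load base/frameworks/sumstats

event connection_state_remove(c: connection)
    {
    # class 1: item of interest in the Key, measurement in the Observation
    SumStats::observe("bytes.per.host", [$host=c$id$orig_h], [$num=c$orig$size]);

    # class 2: item of interest in the Observation, Key left empty
    SumStats::observe("unique.hosts", [], [$str=cat(c$id$orig_h)]);
    }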

I have a use case where I want to pass both a key and measurement to a
plugin maintaining a probabilistic data store [1].  I don't want to
allocate a table for each key, since many/most will not be reflected in the
final results.  Since the Observation is a record containing both a string
& a number, a hack would be to coerce the key to a string, and pass both in
the Observation to a class 2 plugin, with a null key - which is what I am
doing currently.

It would be nice to have a conversation on how to unify these two classes
of plugins.  A few thoughts on this:

   - Pass Key to the plugins - maybe Key could be added to the Observation
   structure.
   - Provide a mechanism to *not* allocate the table structure with every
   new Key (this and the previous can possibly be done with some hackiness
   with the normalize_key function in the reducer record)
   - Some sort of epoch_result factory function that by default just
   performs the class 1 plugin behavior.  For class 2 plugins, the function
   would feed the results one by one into epoch_result.

Incidentally, I think there's a bug in the observe() function.

These two lines are run in the loop through the reducers:

   if ( r?$normalize_key )
       key = r$normalize_key(copy(key));

which has the effect of modifying the key for subsequent loop iterations,
rather than just for the one reducer it applies to.  The fix is easy and
obvious.
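
Something along these lines, presumably (a sketch, not an actual patch):
give each reducer its own normalized copy instead of overwriting the shared
key:

for ( r in ss$reducers )
    {
    local rkey = r?$normalize_key ? r$normalize_key(copy(key)) : key;
    # ... use rkey for this reducer only; the shared "key" is left untouched
    }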

Jim


[1] Implementation of algorithms 4&5 (with enhancements) of
https://arxiv.org/pdf/1705.07001.pdf



On Thu, Aug 2, 2018 at 4:44 PM, Jim Mellander  wrote:

> Hi all:
>
> I'm thinking of writing a SumStats plugin, probably with the initial
> implementation in bro scriptland, with a re-implementation as BIFs if
> initial tests successful.
>
> From examining several plugins, it appears that I need to:
>
>- Add NAME of my plugin as an enum to Calculation
>- Add optional tunables to Reducer
>- Add my data structure to ResultVal
>- In register_observe_plugins, register the function to take an
>observation.
>- In init_result_val_hook, add code to initialize data structure.
>- In compose_resultvals_hook, add code to merge multiple data
>structures
>- Create function to extract
>from data structure either at epoch_result, or epoch_finished
>
> Anything else I should be aware of?
>
> Thanks in advance,
>
> Jim
>
>
>
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Writing SumStats plugin

2018-08-02 Thread Jim Mellander
Hi all:

I'm thinking of writing a SumStats plugin, probably with the initial
implementation in bro scriptland, with a re-implementation as BIFs if
initial tests successful.

From examining several plugins, it appears that I need to:

   - Add NAME of my plugin as an enum to Calculation
   - Add optional tunables to Reducer
   - Add my data structure to ResultVal
   - In register_observe_plugins, register the function to take an
   observation.
   - In init_result_val_hook, add code to initialize data structure.
   - In compose_resultvals_hook, add code to merge multiple data structures
   - Create a function to extract results from the data structure, either at
   epoch_result or epoch_finished

Anything else I should be aware of?

Thanks in advance,

Jim
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Changing epoch timeout in SumStats

2018-06-11 Thread Jim Mellander
Hi all:

I'm working on a statistics gathering script using sumstats.  I would
dearly love to have it synchronized with the log rotation interval.  This
means that the first epoch interval would most likely be short, with future
epochs being at the rotation interval.  Prior Bros (2.5.2) had an
intriguing comment about possible on-demand call for sumstats, which
doesn't seem to be in the latest master.
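
For illustration, the short first interval could presumably be computed from
the rotation schedule (a sketch; whether sumstats could then pick up a new
epoch length on the fly is exactly the open question):

event bro_init()
    {
    # time remaining until the next rotation boundary (assumes
    # Log::default_rotation_interval is set to something non-zero)
    local first_epoch = calc_next_rotate(Log::default_rotation_interval);
    print fmt("first epoch: %s, thereafter: %s",
              first_epoch, Log::default_rotation_interval);
    }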

So, at the moment, what I'm looking for is one of the two options:

1. Allow for a pure 'on-demand' option for sumstats results, allowing the
user script to manage the results, or
2. Allow redefining the epoch timeout on the fly.

Option 3 is hacking on the sumstats code..

Any comments?
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Misleading error message

2018-01-29 Thread Jim Mellander
Hi all:

I was tinkering with the sumstats code, and inadvertently deleted the final
"}" closing out the last function.  When running the code, the misleading
error message is received:

error in
/Users/melland/traces/bro/share/bro/base/frameworks/tunnels/./main.bro,
line 8: syntax error, at or near "module"

presumably due to the function still being open when the next policy script
is loaded.  Wouldn't it be more reasonable to check, at the end of each
script as it is loaded, that there are no dangling functions, expressions, etc.?


Jim Mellander
ESNet
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Buggy bro sort() function

2018-01-24 Thread Jim Mellander
Turns out that a number of other BIFs are not 64-bit safe; rand(), order(),
and to_int() are examples.

Filing a bug report.

On Tue, Jan 23, 2018 at 12:08 PM, Jim Mellander <jmellan...@lbl.gov> wrote:

> Hi all:
>
> The attached brogram demonstrates that the bro sort() function does not
> sort correctly under certain circumstances (tested on OS/X & Linux).  The
> behavior also occurs when using the common function idiom of sort(myvec,
> function(a: int, b: int): int { return a-b;});
>
> I haven't examined bro source code, but since some of the test values are
> larger than 32 bits, I surmise that there is a casting from 64 to 32 bits
> that could change the sign of the comparison, thus causing this problem.
>
> Mitigation is to use a function that returns the sign of subtraction
> results, rather than the actual subtraction results, something like
> sort(myvec, function(a: int, b: int): int { return a<b ? -1 : (a>b ? 1 : 0);});
>
> Cheers,
>
> Jim Mellander
> ESNet
>
>
>
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Buggy bro sort() function

2018-01-23 Thread Jim Mellander
Hi all:

The attached brogram demonstrates that the bro sort() function does not
sort correctly under certain circumstances (tested on OS/X & Linux).  The
behavior also occurs when using the common function idiom of sort(myvec,
function(a: int, b: int): int { return a-b;});

I haven't examined bro source code, but since some of the test values are
larger than 32 bits, I surmise that there is a casting from 64 to 32 bits
that could change the sign of the comparison, thus causing this problem.

Mitigation is to use a function that returns the sign of subtraction
results, rather than the actual subtraction results, something like
sort(myvec, function(a: int, b: int): int { return a<b ? -1 : (a>b ? 1 : 0);});
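
A quick way to see the problem (a sketch, not the attached badsort.bro;
values chosen to exceed 32 bits):

event bro_init()
    {
    local v: vector of int = vector(+6442450944, -5, +1);

    # subtraction-based comparator, as commonly written
    sort(v, function(a: int, b: int): int { return a - b; });
    print v;    # can come back mis-ordered once differences exceed 32 bits

    # sign-only comparator as a workaround
    sort(v, function(a: int, b: int): int { return a < b ? -1 : (a > b ? 1 : 0); });
    print v;
    }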

Cheers,

Jim Mellander
ESNet


badsort.bro
Description: Binary data
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Sumstats bugs & fixes

2018-01-18 Thread Jim Mellander
Seems like either way would work - I've used the idea of keeping a list of
keys to delete after loop exit (not in bro, actually, but in awk, which has
a similar paradigm), but since we're peeling off keys, and discarding them
after usage, avoiding the additional loop seemed more natural/efficient -
although I suppose my way of looping over the keys, just to gather the
first item, and breaking out may not seem too natural or be particularly
efficient.

I'll try your changes to see if they accomplish the same things.  Based on
Justin's post, I'm now more concerned about running my code on a cluster,
but will tackle that next.


On Thu, Jan 18, 2018 at 9:50 AM, Jon Siwek <jsi...@corelight.com> wrote:

> On Wed, Jan 17, 2018 at 6:30 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
>
> > Unfortunately, the reschedule
> > is:
> > schedule 0.01 secs { process_epoch_result(ss, now, data1) };
> > instead of:
> > schedule 0.01 secs { SumStats::process_epoch_result(ss, now, data1) };
> > so it silently fails after the first 50 results.
>
> Thanks, you're right about that.
>
> > Would be nice to have a
> > warning if a script schedules an event that doesn't exist.
>
> Right again, it would be nice since it has resulted in bugs like this,
> though I recall it might not be an easy change to the parser to clear
> up the differences in namespacing rules for event identifiers.
>
> > Attached please find hash_test.bro & (patched) non-cluster.bro
>
> Thanks for those.  I remember you pointing out the potential problem
> in the earlier mail and meant to respond to indicate we should fix it
> and I must have just forgot, so sorry for that.  I had a bit of a
> different idea on how to address the iterator invalidation that might
> be more understandable: keep a separate list of keys to delete later,
> outside the loop.  I have my version of your proposed fixes at [1].
> Can you take a look and let me know if that works for you?
>
> - Jon
>
> [1] https://github.com/bro/bro/commit/3495b2fa9d84e8105a79e24e4e9a2f
> 9181318f1a
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Sumstats bugs & fixes

2018-01-17 Thread Jim Mellander
In a previous email, I asked the question: Does anyone use sumstats outside
of a cluster context?

In my case, the answer is yes, as I perform development on a laptop, and
run bro standalone to test new bro policies.

I found several different bugs in
share/bro/base/frameworks/sumstats/non-cluster.bro,
specifically in SumStats::process_epoch_results.

1. The policy returns sumstats results 50 at a time, and then reschedules
itself for the next 50 after 0.01 seconds.  Unfortunately, the reschedule
is:
schedule 0.01 secs { process_epoch_result(ss, now, data1) };
instead of:
schedule 0.01 secs { *SumStats::*process_epoch_result(ss, now, data1) };
so it silently fails after the first 50 results.  Would be nice to have a
warning if a script schedules an event that doesn't exist.

2. The serious issue with the policy, though, is that the 'for' loop over
the result table is the main loop, with up to 50 items processed and
deleted within the loop, the expectation being that the iteration will not
thus be disturbed.

The attached program (hash_test.bro) demonstrates that this is not the case
(should output 1000, 0, but the 2nd value comes back non-zero), in line
with the documented caveat: "Currently, modifying a container’s membership
while iterating over it may result in undefined behavior, so do not add or
remove elements inside the loop."  I didn't examine bro source code to
appreciate the reason, but surmise that table resizing and rehashing would
account for the issue.
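
The gist of the test (a sketch along the lines of the attachment, not the
attachment itself):

event bro_init()
    {
    local t: table[count] of count;
    local i = 0;
    while ( i < 1000 )
        {
        t[i] = i;
        ++i;
        }

    local n = 0;
    for ( k in t )
        {
        delete t[k];    # modifying membership while iterating: undefined behavior
        ++n;
        }

    print n, |t|;       # expect 1000, 0 -- the second value can come back non-zero
    }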

The consequences of this issue are that, under certain circumstances:
* Not all data will be returned by SumStats at the epoch
* SumStats::finish_epoch may not be run.

To address the issue can be done via a rearrangement of the code, along the
lines of the following pseudocode (boundary conditions, etc. ignored)

original (modifies table inside 'for' loop):

i = 50;
for (foo in bar)
    {
    process bar[foo];
    delete bar[foo];
    --i;
    if (i == 0) break;
    }

to (modifies table outside 'for' loop):

i = 50;
while (i > 0)
    {
    for (foo in bar)
        {
        break;
        }
    process bar[foo];
    delete bar[foo];
    --i;
    }

...  there are a few other subtleties in the code (keeping a closure on the
result table so that sumstats can clear the original table & proceed to the
next epoch, and not running SumStats::finish_epoch if the result table was
empty to begin with).

A bit of rearrangement fixes the bugs while preserving the original
behavior, with the help of a wrapper event that checks for an empty result
table, and if not makes an explicit copy for further processing by the
actual event doing the work.

An additional 'for' loop around the result table could be used to keep it
all in one event, but looks too much like black magic (and still, albeit
probably in a safe way, depending on undefined behavior) - I prefer clear,
understandable code (ha!), rather than "dark corner" features.  Six months
later, when I look at the code, I won't be able to remember the clever
trick I was using :-)


Attached please find hash_test.bro & (patched) non-cluster.bro

Jim Mellander
ESNet


hash_test.bro
Description: Binary data


non-cluster.bro
Description: Binary data
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] 'for loop' variable modification

2018-01-08 Thread Jim Mellander
I got the following idea while perusing non-cluster.bro's
SumStats::process_epoch_result:

i = 1;
while (i <= 1000 && |bar| > 0)
    {
    for (foo in bar)
        {
        break;
        }
    ...
    process bar[foo]
    ...
    optional: baz[foo] = bar[foo]   # if we need to preserve the original data
    delete bar[foo];
    ++i;
    }

This will allow iteration thru the table as I originally desired, although
destroying the original table.

SumStats::process_epoch_result deletes the current item inside the for
loop, so it is relying on undefined behavior, per the documentation:
"Currently, modifying a container’s membership while iterating over it may
result in undefined behavior, so do not add or remove elements inside the
loop."  The above example avoids that.  Does anyone use sumstats outside of
a cluster context?







On Fri, Jan 5, 2018 at 6:04 PM, Jim Mellander <jmellan...@lbl.gov> wrote:

> Thanks, Jon:
>
> I've decided to split the data (a table of IP addresses with statistics
> captured over a time period) based on a modulo calculation against the IP
> address (the important characteristic being that it can be done on the fly
> without an additional pass thru the table), which with an average
> distribution of traffic gives relatively equal size buckets, each of which
> can be processed during a single event, as I described.
>
> I like the idea of co-routines - it would help to address issues like
> these in a more natural manner.
>
> Jim
>
>
>
>
>
>
>
> On Fri, Jan 5, 2018 at 5:28 PM, Jon Siwek <jsi...@corelight.com> wrote:
>
>> On Fri, Jan 5, 2018 at 2:19 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
>>
>> > I haven't checked whether my desired behavior works, but since its not
>> > documented, I wouldn't want to rely on it in any event.
>>
>> Yeah, I doubt the example you gave currently works -- it would just
>> change the local value in the frame without modifying the internal
>> iterator.
>>
>> > I would be interested in hearing comments or suggestions on this issue.
>>
>> What you want, the ability to split the processing of large data
>> tables/sets over time, makes sense.  I've probably also run into at
>> least a couple cases where I've been concerned about how long it would
>> take to iterate over a set/table and process all keys in one go.  The
>> approach that comes to mind for doing that would be adding coroutines.
>> Robin has some ongoing work with adding better support for async
>> function calls, and I wonder if the way that's done would make it
>> pretty simple to add general coroutine support as well.  E.g. stuff
>> could look like:
>>
>> event process_stuff()
>>     {
>>     local num_processed = 0;
>>
>>     for ( local item in foo )
>>         {
>>         process_item(item);
>>
>>         if ( ++num_processed % 1000 == 0 )
>>             yield;  # resume next time events get drained (e.g. next packet)
>>         }
>>
>> There could also be other types of yield instructions, like "yield 1
>> second" or "yield wait_for_my_signal()" which would, respectively,
>> resume after arbitrary amount of time or a custom function says it
>> should.
>>
>> - Jon
>>
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] 'for loop' variable modification

2018-01-05 Thread Jim Mellander
Thanks, Jon:

I've decided to split the data (a table of IP addresses with statistics
captured over a time period) based on a modulo calculation against the IP
address (the important characteristic being that it can be done on the fly
without an additional pass thru the table), which with an average
distribution of traffic gives relatively equal size buckets, each of which
can be processed during a single event, as I described.
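
Roughly the bucketing trick (hypothetical names; the real script differs):

const n_buckets = 8;

function bucket_of(a: addr): count
    {
    local chunks = addr_to_counts(a);           # 1 chunk for IPv4, 4 for IPv6
    return chunks[|chunks| - 1] % n_buckets;    # low-order 32 bits, modulo bucket count
    }

# each scheduled event then walks only the addresses whose bucket_of() value matches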

I like the idea of co-routines - it would help to address issues like these
in a more natural manner.

Jim







On Fri, Jan 5, 2018 at 5:28 PM, Jon Siwek <jsi...@corelight.com> wrote:

> On Fri, Jan 5, 2018 at 2:19 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
>
> > I haven't checked whether my desired behavior works, but since its not
> > documented, I wouldn't want to rely on it in any event.
>
> Yeah, I doubt the example you gave currently works -- it would just
> change the local value in the frame without modifying the internal
> iterator.
>
> > I would be interested in hearing comments or suggestions on this issue.
>
> What you want, the ability to split the processing of large data
> tables/sets over time, makes sense.  I've probably also run into at
> least a couple cases where I've been concerned about how long it would
> take to iterate over a set/table and process all keys in one go.  The
> approach that comes to mind for doing that would be adding coroutines.
> Robin has some ongoing work with adding better support for async
> function calls, and I wonder if the way that's done would make it
> pretty simple to add general coroutine support as well.  E.g. stuff
> could look like:
>
> event process_stuff()
>     {
>     local num_processed = 0;
>
>     for ( local item in foo )
>         {
>         process_item(item);
>
>         if ( ++num_processed % 1000 == 0 )
>             yield;  # resume next time events get drained (e.g. next packet)
>         }
>
> There could also be other types of yield instructions, like "yield 1
> second" or "yield wait_for_my_signal()" which would, respectively,
> resume after arbitrary amount of time or a custom function says it
> should.
>
> - Jon
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] 'for loop' variable modification

2018-01-05 Thread Jim Mellander
Hi all:

I'm working on an interesting Bro policy, and I want to be able to begin a
'for loop' at some point where it previously left off.

Pseudo-code:

for (foo in bar)
{
if (foo == "baz")break;
.
process bar[foo]
.
}
.
.
Do other work (not changing bar)
.
.
first_time = T;
for (foo in bar)
{
if (first_time)
{
first_time = F;
foo = "baz";
}
.
process bar[foo]
.
}



If the loop variable can be reassigned in the loop, and the loop continued
from that point, it would facilitate some of the processing I'm doing.  The
above synthetic code could be refactored to avoid the issue, but...

My real-world issue is that I have a large table to process, and want to
amortize the processing of it on the time domain:

A. Process first N items in table
B. Schedule processing of next N items via an event
C. When event triggers, pick up where we left off, and process next N
items, etc.

(There are inefficient ways of processing these that solve some, but not
all issues, such as putting the indices in a vector, then going thru that -
wont go into the problems with that right now)


I haven't checked whether my desired behavior works, but since it's not
documented, I wouldn't want to rely on it in any event.

I would be interested in hearing comments or suggestions on this issue.

Jim
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Scientific notation?

2017-11-06 Thread Jim Mellander
How about a user redef'able format string for doubles in logs?  Even more
flexible would be to make it a function.  Let the user decide the format
they need, and adapt their scripts accordingly, with the default being the
current behavior.
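
For example, the knob might look something like this (purely hypothetical -
no such option exists today; it's just the shape I have in mind):

module LogAscii;

export {
    ## Hypothetical printf-style format applied to double fields when
    ## writing ASCII logs.
    const double_format = "%.6f" &redef;
    }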

On Mon, Nov 6, 2017 at 10:10 AM, Daniel Thayer wrote:

> On 11/6/17 8:16 AM, Seth Hall wrote:
> > Right now, Bro will print scientific notation in JSON logs but we've
> > always tended to avoid it in the standard Bro log format.  What does
> > everyone think about switching to allow scientific notation in the
> > standard log format?  Daniel recently did some exploration of various
> > versions of awk and they all support scientific notation (I think that
> > was part of my concern a long time ago).
> >
> > Thoughts?
> >
> > .Seth
>
> Actually, right now Bro uses scientific notation in JSON logs only
> for very large values (such as 3.1e+15).  For values very close to
> zero (such as 1.2e-7), Bro will write "0" to a JSON log.
> ___
> bro-dev mailing list
> bro-dev@bro.org
> http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Performance Enhancements

2017-10-14 Thread Jim Mellander
Yeah, the lockless implementation has a bug:

if (size)

s/b

if (size & 1)

I ended up writing a checksum routine that sums 32 bits at a time into a
64-bit register, which avoids the need to check for overflow - it seems to
be faster than the full 64-bit implementation - will test with Bro and
report results.

On Thu, Oct 12, 2017 at 3:08 PM, Azoff, Justin S <jaz...@illinois.edu>
wrote:

>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
> >
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of do the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> -
>
> So I still haven't gotten this to work, but I did some more tests that I
> think show it is worthwhile to look into replacing this function.
>
> I generated a large pcap of a 3 minute iperf run:
>
> $ du -hs iperf.pcap
> 9.6G    iperf.pcap
> $ tcpdump  -n -r iperf.pcap |wc -l
> reading from file iperf.pcap, link-type EN10MB (Ethernet)
> 7497698
>
> Then ran either `bro -Cbr` or `bro -br` on it 5 times and track runtime as
> well as cpu instructions reported by `perf`:
>
> $ python2 bench.py 5 bro -Cbr iperf.pcap
> 15.19 49947664388
> 15.66 49947827678
> 15.74 49947853306
> 15.66 49949603644
> 15.42 49951191958
> elapsed
> Min 15.18678689
> Max 15.7425909042
> Avg 15.5343231678
>
> instructions
> Min 49947664388
> Max 49951191958
> Avg 49948828194
>
> $ python2 bench.py 5 bro -br iperf.pcap
> 20.82 95502327077
> 21.31 95489729078
> 20.52 95483242217
> 21.45 95499193001
> 21.32 95498830971
> elapsed
> Min 20.5184400082
> Max 21.4452238083
> Avg 21.083449173
>
> instructions
> Min 95483242217
> Max 95502327077
> Avg 95494664468
>
>
> So this shows that for every ~7,500,000 packets bro processes, almost 5
> seconds is spent computing checksums.
>
> According to https://locklessinc.com/articles/tcp_checksum/, they run
> their benchmark 2^24 times (16,777,216) which is about 2.2 times as many
> packets.
>
> Their runtime starts out at about 11s, which puts it in line with the
> current implementation in bro.  The other implementations they show are
> between 7 and 10x faster depending on packet size.  A 90% drop in time
> spent computing checksums would be a noticeable improvement.
>
>
> Unfortunately I couldn't get their implementation to work inside of bro
> and get the right result, and even if I could, it's not clear what the
> license for the code is.
>
>
>
>
>
> —
> Justin Azoff
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Performance Enhancements

2017-10-09 Thread Jim Mellander
> than C, and anything garbage collected would need to be *really* carefully
> used, but ... it struck me as an idea that might be worth a look :)
>
>
> And yeah, generally speaking, most of the toolkits I've played with for
> software-based packet processing absolutely do use memory pools for the
> fast path.  They also use burst fetch tricks (to amortize the cost of
> fetching packets over X packets, rather than fetching one packet at a
> time), and I've also seen quite a bit of prefetch / SIMD to try to keep
> things moving quickly once the packets make it to the CPU.
>
>
> Things start to get pretty crazy as packet rates increase, though: once
> you hit about 10 - 15 Gbps, even a *cache miss* on a modern system is
> enough to force a drop ...
>
>
> For what it's worth ...
>
>
> -Gilbert
>
>
> --
> *From:* Azoff, Justin S <jaz...@illinois.edu>
> *Sent:* Monday, October 9, 2017 10:19:26 AM
> *To:* Jim Mellander
> *Cc:* Clark, Gilbert; bro-dev@bro.org
> *Subject:* Re: [Bro-Dev] Performance Enhancements
>
>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
>
> Turns out bro does this most of the time.. unless you use the next_packet
> event.  Normal connections use the sessions cache which holds connection
> objects, but new_packet has its own code path that creates the ip header
> from scratch for each packet.  I tried to pre-allocate PortVal objects, but
> I think I was screwing something up with 'Ref' and bro would just segfault
> on the 2nd connection.
>
>
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of do the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> - this website also has an x64 allocator that is claimed to be faster than
> tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note:
> I haven't tried anything from this source, but find it interesting).
> The TCP/IP Checksum - Lockless Inc
> <https://locklessinc.com/articles/tcp_checksum/>
> locklessinc.com
> The above obvious algorithm handles 16-bits at a time. Usually, if you
> process more data at a time, then performance is better. Since we have a
> 64-bit machine, we ...
>
> Benchmarks of the Lockless Memory Allocator
> <https://locklessinc.com/benchmarks_allocator.shtml>
> locklessinc.com
> The speed of various memory allocators is compared.
>
>
>
> I couldn't get this code to return the right checksums inside bro (some
> casting issue?), but if it is faster it should increase performance by a
> small percentage.  Comparing 'bro -b' runs on a pcap with 'bro -b -C' runs
> (which should show what kind of performance increase we would get if that
> function took 0s to run) shows a decent chunk of time taken computing
> checksums.
>
> > I'm guessing there are a number of such "small" optimizations that could
> provide significant performance gains.
>
> I've been trying to figure out the best way to profile bro.  So far
> attempting to use linux perf, or google perftools hasn't been able to shed
> much light on anything.  I think the approach I was using to benchmark
> certain operations in the bro language is the better approach.
>
> Instead of running bro and trying to profile it to figure out what is
> causing the most load, simply compare the execution of two bro runs with
> slightly different scripts/settings.  I think this will end up being the
> better approach because it answers real questions like "If I load this
> script or change this setting what is the performance impact on the bro
> process".  When I did this last I used this method to compare the
> performance from one bro commit to the next, but I never tried comparing
> bro with one set of scripts loaded to bro with a different set of scripts
> loaded.
>
> For example, the simplest and most dramatic test I came up with so far:
>
> $ time bro -r 2009-M57-day11-18.trace -b
> real    0m2.434s
> user    0m2.236s
> sys     0m0.200s
>
> $ cat np.bro
> event new_packet(c: connection, p: pkt_hdr)
> {
>
> }
>
> $ time bro -r 2009-M57-day11-18.trace -b np.bro
> real    0m10.588s
> user    0m10.3

Re: [Bro-Dev] Performance Enhancements

2017-10-06 Thread Jim Mellander
Interesting info.  The more-than-order-of-magnitude difference in time between
BaseList::remove & BaseList::remove_nth suggests the possibility that the
for loop in BaseList::remove is falling off the end in many cases (i.e.
attempting to remove an item that doesn't exist).  Maybe that's what's broken.



On Fri, Oct 6, 2017 at 3:49 PM, Azoff, Justin S <jaz...@illinois.edu> wrote:

>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
> >
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of do the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> - this website also has an x64 allocator that is claimed to be faster than
> tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note:
> I haven't tried anything from this source, but find it interesting).
> >
> > I'm guessing there are a number of such "small" optimizations that could
> provide significant performance gains.
> >
> > Take care,
> >
> > Jim
>
> I've been messing around with 'perf top', the one's complement function
> often shows up fairly high up.. that, PriorityQueue::BubbleDown, and
> BaseList::remove
>
> Something (on our configuration?) is doing a lot of
> PQ_TimerMgr::~PQ_TimerMgr... I don't think I've come across that class
> before in bro.. I think a script may be triggering something that is
> hurting performance.  I can't think of what it would be though.
>
> Running perf top on a random worker right now with -F 1 shows:
>
> Samples: 485K of event 'cycles', Event count (approx.): 26046568975
> Overhead  Shared Object Symbol
>   34.64%  bro                   [.] BaseList::remove
>    3.32%  libtcmalloc.so.4.2.6  [.] operator delete
>    3.25%  bro                   [.] PriorityQueue::BubbleDown
>    2.31%  bro                   [.] BaseList::remove_nth
>    2.05%  libtcmalloc.so.4.2.6  [.] operator new
>    1.90%  bro                   [.] Attributes::FindAttr
>    1.41%  bro                   [.] Dictionary::NextEntry
>    1.27%  libc-2.17.so          [.] __memcpy_ssse3_back
>    0.97%  bro                   [.] StmtList::Exec
>    0.87%  bro                   [.] Dictionary::Lookup
>    0.85%  bro                   [.] NameExpr::Eval
>    0.84%  bro                   [.] BroFunc::Call
>    0.80%  libtcmalloc.so.4.2.6  [.] tc_free
>    0.77%  libtcmalloc.so.4.2.6  [.] operator delete[]
>    0.70%  bro                   [.] ones_complement_checksum
>    0.60%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
>    0.60%  bro                   [.] RecordVal::RecordVal
>    0.53%  bro                   [.] UnaryExpr::Eval
>    0.51%  bro                   [.] ExprStmt::Exec
>    0.51%  bro                   [.] iosource::Manager::FindSoonest
>    0.50%  libtcmalloc.so.4.2.6  [.] operator new[]
>
>
> Which sums up to 59.2%
>
> BaseList::remove/BaseList::remove_nth seems particularly easy to
> optimize. Can't that loop be replaced by a memmove?
> I think something may be broken if it's being called that much though.
>
>
>
> —
> Justin Azoff
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Performance Enhancements

2017-10-06 Thread Jim Mellander
I particularly like the idea of an allocation pool in which per-packet
information can be stored, and reused by the next packet.

There also are probably some optimizations of frequent operations now that
we're in a 64-bit world that could prove useful - the one's complement
checksum calculation in net_util.cc is one that comes to mind, especially
since it works effectively a byte at a time (and works with even byte
counts only).  Seeing as this is done per-packet on all tcp payload,
optimizing this seems reasonable.  Here's a discussion of doing the checksum
calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/ -
this website also has an x64 allocator that is claimed to be faster than
tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note: I
haven't tried anything from this source, but find it interesting).

I'm guessing there are a number of such "small" optimizations that could
provide significant performance gains.

Take care,

Jim




On Fri, Oct 6, 2017 at 7:26 AM, Azoff, Justin S  wrote:

>
> > On Oct 6, 2017, at 12:10 AM, Clark, Gilbert  wrote:
> >
> > I'll note that one of the challenges with profiling is that there are
> the bro scripts, and then there is the bro engine.  The scripting layer has
> a completely different set of optimizations that might make sense than the
> engine does: turning off / turning on / tweaking different scripts can have
> a huge impact on Bro's relative performance depending on the frequency with
> which those script fragments are executed.  Thus, one way to look at
> speeding things up might be to take a look at the scripts that are run most
> often and seeing about ways to accelerate core pieces of them ... possibly
> by moving pieces of those scripts to builtins (as C methods).
> >
>
> Re: scripts, I have some code I put together to do arbitrary benchmarks of
> templated bro scripts.  I need to clean it up and publish it, but I found
> some interesting things.  Function calls are relatively slow.. so things
> like
>
> ip in Site::local_nets
>
> Is faster than calling
>
> Site::is_local_addr(ip);
>
> inlining short functions could speed things up a bit.
>
> I also found that things like
>
> port == 22/tcp || port == 3389/tcp
>
> Is faster than checking if port in {22/tcp,3389/tcp}.. up to about 10
> ports.. Having the hash class fallback to a linear search when the hash
> only contains few items could speed things up there.  Things like
> 'likely_server_ports' have 1 or 2 ports in most cases.
>
>
> > If I had to guess at one engine-related thing that would've sped things
> up when I was profiling this stuff back in the day, it'd probably be
> rebuilding the memory allocation strategy / management.  From what I
> remember, Bro does do some malloc / free in the data path, which hurts
> quite a bit when one is trying to make things go fast.  It also means that
> the selection of a memory allocator and NUMA / per-node memory management
> is going to be important.  That's probably not going to qualify as
> something *small*, though ...
>
> Ah!  This reminds me of something I was thinking about a few weeks ago.
> I'm not sure to what extent bro uses memory allocation pools/interning for
> common immutable data structures.  Like for port objects or small strings.
> There's no reason bro should be mallocing/freeing memory to create port
> objects when they are only 65536 times 2 (or 3?) port objects... but bro
> does things like
>
> tcp_hdr->Assign(0, new PortVal(ntohs(tp->th_sport),
> TRANSPORT_TCP));
> tcp_hdr->Assign(1, new PortVal(ntohs(tp->th_dport),
> TRANSPORT_TCP));
>
> For every packet.  As well as allocating a ton of TYPE_COUNT vals for
> things like packet sizes and header lengths.. which will almost always be
> between 0 and 64k.
>
> For things that can't be interned, like ipv6 address, having an allocation
> pool could speed things up... Instead of freeing things like IPAddr objects
> they could just be returned to a pool, and then when a new IPAddr object is
> needed, an already initialized object could be grabbed from the pool and
> 'refreshed' with the new value.
>
> https://golang.org/pkg/sync/#Pool
>
> Talks about that sort of thing.
>
> > On a related note, a fun experiment is always to try running bro with a
> different allocator and seeing what happens ...
>
> I recently noticed our boxes were using jemalloc instead of tcmalloc..
> Switching that caused malloc to drop a few places down in 'perf top' output.
>
>
> —
> Justin Azoff
>
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


[Bro-Dev] Performance Enhancements

2017-10-05 Thread Jim Mellander
Hi all:

One item of particular interest to me from BroCon was this tidbit from
Packetsled's lightning talk:

"Optimizing core loops (like net_run() ) with preprocessor branch
prediction macros likely() and unlikely() for ~3% speedup. We optimize for
maximum load."


After conversing with Leo Linsky of Packetsled, I wanted to initiate a
conversation about easy performance improvements that may be within fairly
easy reach:


1. Obviously, branch prediction, as mentioned above.  3% speedup for
(almost) free is nothing to sneeze at.

2. Profiling bro to identify other hot spots that could benefit from
optimization.

3. Best practices for compiling Bro (compiler options, etc.)

4. Data structure revisit (hash functions, perhaps?)

etc.


Perhaps the Bro core team is working on some, all, or a lot more in this
area.  It might be nice to get the Bro community involved too.  Is anyone
else interested?


Jim Mellander

ESNet
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] HTTP Sensitive POST bro policy

2014-04-30 Thread Jim Mellander
Thanks for the input.

1. At some point in the past, ISTR bro did have a connection_end event - at
least that's what my failing memory tells me.  In any event, it seems like a
bug to not even give a warning if you have an event handler for a
non-existent event - a typo could cause difficult-to-detect errors.

2. Good catch, probably should have an expiration.
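
Something along these lines, presumably (the index/yield types here are
placeholders, not the actual script's):

global POST_requests: table[string] of string &read_expire = 5 mins;
global POST_entities: table[string] of string &read_expire = 5 mins;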

The minor inconsistency that I had in mind is the fact that the length of
the entity is checked whilst assembling the POST response without
unescaping, and checked unescaped in the http_end_entity event.




On Wed, Apr 30, 2014 at 11:15 AM, Siwek, Jonathan Luke
jsi...@illinois.edu wrote:


 On Apr 30, 2014, at 12:18 PM, Jim Mellander jmellan...@lbl.gov wrote:

  For a number of reasons, I elected to write the attached bro policy,
 which looks at http POSTs and performs regular expression matching on the
 posted data.

 Thanks for sharing.

  Kudos to the first person who finds the minor inconsistency that I
 elected not to address.

 Maybe not what you were referring to, but I had two concerns:

 (1) “connection_end” doesn’t seem to be a defined event, maybe it's meant
 to be “connection_state_remove”.

 (2) Having the global “POST_entities” and “POST_requests” tables without
 read_expire (or another expiry attribute) makes me nervous.  Though I
 think the clean up in “http_end_entity” should catch everything, if it
 doesn’t, that will lead to memory usage issues over time (especially since
 “connection_end” won’t be a cleanup safety net as intended).

 - Jon
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev