Darn it Nick, you beat me to the punch.  =)  YES, please.  I think I
discussed this a while back in my ES tuning conversations, but that's
_super_ important.  I have this documented here
<https://cwiki.apache.org/confluence/display/METRON/Tuning+the+Search+tier>
under Elasticsearch > On Installation > 4.

Also, I'm a huge fan of option one.  Here's how that would pan out in my
environment almost immediately:

I typically don't want to store POST data, so as things currently stand I
don't write the details of POSTs at all.  However, it may make sense for me
to collect the POST data off the wire and pass it through Metron, and store
it only if, after enrichment/threat intel, one of the IPs is noteworthy;
otherwise it gets tossed.  That gives me a nice mix of privacy/security for
my user population while still giving me the information I need to respond
to possible incidents.  This could look like holding onto POSTs being used
to manipulate web shells (known bad sources, or known compromised hosts as a
very short-term IR information gathering procedure), or users POSTing their
creds to a plaintext phishing site.  Happy to port this discussion to a
separate thread.
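
To make the POST case concrete, here is a rough Python sketch of the
per-writer filtering semantics as I understand them from this thread.  It is
not Metron code: the field names "method" and "is_alert", the WRITER_FILTERS
table, and both function names are assumptions made up for illustration.  In
the real thing each predicate would be a Stellar statement in the indexing
config.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]
Predicate = Callable[[Message], bool]


def keep_for_index(msg: Message) -> bool:
    """Keep every non-POST message; keep a POST only if it was flagged.

    "method" and "is_alert" are assumed field names, not a confirmed schema.
    """
    if msg.get("method") != "POST":
        return True
    return msg.get("is_alert") == "true"


# Hypothetical per-writer filters.  A writer with no entry passes everything
# through, mirroring the "unspecified means write everything" default.
WRITER_FILTERS: Dict[str, Predicate] = {
    "elasticsearch": keep_for_index,
    "hdfs": keep_for_index,
}


def writers_for(msg: Message, writers: List[str]) -> List[str]:
    """Return the writers whose filter evaluates to true for this message."""
    return [w for w in writers if WRITER_FILTERS.get(w, lambda m: True)(msg)]
```

With this, an unflagged POST is written nowhere that has a filter, while a
flagged POST or any other request goes to every writer.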

Regarding fields - I'm for slightly more standardization, without going
overboard.  This could be a long discussion, but in summary my opinion is
that user is a very sane field to add, and URL is slightly less so.  Again,
I'm thinking about this in the context of what data I have going into my
cluster (and that I know others are sending to theirs), which may contrast
with a more general infosec population.

Jon

On Fri, Jan 13, 2017 at 9:40 AM Casey Stella <[email protected]> wrote:

> OH that's a good idea!
>
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <[email protected]> wrote:
>
> > I like the "Index Filtering" option based on the flexibility that it
> > provides.  Should each output (HDFS, ES, etc) have its own configuration
> > settings?  For example, aren't things like batching handled separately
> for
> > HDFS versus Elasticsearch?
> >
> > Something along the lines of...
> >
> > {
> >   "hdfs" : {
> >     "when": "exists(field1)",
> >     "batchSize": 100
> >   },
> >
> >   "elasticsearch" : {
> >     "when": "true",
> >     "batchSize": 1000,
> >     "index": "squid"
> >   }
> > }
> >
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <[email protected]>
> wrote:
> >
> > > Yeah, I tend to like the first option too.  Any opposition to that from
> > > anyone?
> > >
> > > The points brought up are good ones and I think that it may be worth a
> > > broader discussion of the requirements of indexing in a separate dev
> list
> > > thread.  Maybe a list of desires with coherent use-cases justifying
> them
> > so
> > > we can think about how this stuff should work and where the natural
> > > > extension points should be.  After all, we need to toe the line between
> > > engineering and overengineering for features nobody will want.
> > >
> > > I'm not sure about the extensions to the standard fields.  I'm torn
> > between
> > > the notions that we should have no standard fields vs we should have a
> > > boatload of standard fields (with most of them empty).  I exchange
> > > positions fairly regularly on that question. ;)  It may be worth a dev
> > list
> > > discussion to lay out how you imagine an extension of standard fields
> and
> > > how it might look as implemented in Metron.
> > >
> > > Casey
> > >
> > > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > > [email protected]>
> > > wrote:
> > >
> > > > I'll second my preference for the first option. I think the ability
> to
> > > use
> > > > Stellar filters to customize indexing would be a big win.
> > > >
> > > > I'm glad Matt brought up the point about data lake and CEP. I think
> > this
> > > is
> > > > a really important use case that we need to consider. Take a simple
> > > > example... If I have data coming in from 3 different firewall vendors
> > > and 2
> > > > different web proxy/url filtering vendors and I want to be able to
> > > analyze
> > > > that data set, I need the data to be indexed all together (likely in
> > > HDFS)
> > > > and to have a normalized schema such that IP address, URL, and user
> > name
> > > > (to take a few) can be easily queried and aggregated. I can also
> > envision
> > > > scenarios where I would want to index data based on attributes other
> > than
> > > > sensor, business unit or subsidiary for example.
> > > >
> > > > > I've been wanting to propose extending our 7 standard fields to
> include
> > > > things like URL and user. Is there community interest/support for
> > moving
> > > in
> > > > that direction? If so, I'll start a new thread.
> > > >
> > > > Thanks!
> > > >
> > > > -Kyle
> > > >
> > > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <[email protected]>
> wrote:
> > > >
> > > > > Ah, I see.  If overriding the default index name allows using the
> > same
> > > > > name for multiple sensors, then the goal can be achieved.
> > > > > Thanks,
> > > > > --Matt
> > > > >
> > > > >
> > > > > On 1/12/17, 3:30 PM, "Casey Stella" <[email protected]> wrote:
> > > > >
> > > > >     Oh, you could!  Let's say you have a syslog parser with data
> from
> > > > > sources 1
> > > > >     2 and 3.  You'd end up with one kafka queue with 3 parsers
> > attached
> > > > to
> > > > > that
> > > > >     queue, each picking part the messages from source 1, 2 and 3.
> > > They'd
> > > > > go
> > > > >     through separate enrichment and into the indexing topology.  In
> > the
> > > > >     indexing topology, you could specify the same index name
> "syslog"
> > > and
> > > > > all
> > > > >     of the messages go into the same index for CEP querying if so
> > > > desired.
> > > > >
> > > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <[email protected]>
> > > > wrote:
> > > > >
> > > > >     > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > > > previous
> > > > >     > life.  It makes perfect sense to route different lines from
> > > syslog
> > > > > through
> > > > >     > different appropriate parsers.  But a lot of what the parsers
> > do
> > > is
> > > > >     > identify consistent subsets of metadata and annotate it – eg,
> > > > > src_ip_addr,
> > > > >     > event timestamps, etc.  Once those metadata are annotated and
> > > > > available
> > > > >     > with common field names, why doesn’t it make sense to index
> the
> > > > > messages
> > > > >     > together, for CEP querying?  I think Splunk has illustrated
> > this
> > > > > model.
> > > > >     >
> > > > >     > On 1/12/17, 3:00 PM, "Casey Stella" <[email protected]>
> > wrote:
> > > > >     >
> > > > >     >     yeah, I mean, honestly, I think the approach that we've
> > taken
> > > > for
> > > > >     > sources
> > > > >     >     which aggregate different types of data is to provide
> > filters
> > > > at
> > > > > the
> > > > >     > parser
> > > > >     >     level and have multiple parser topologies (with
> different,
> > > > > possibly
> > > > >     >     mutually exclusive filters) running.  This would be a
> > > > completely
> > > > >     > separate
> > > > >     >     sensor.  Imagine a syslog data source that aggregates and
> > you
> > > > > want to
> > > > >     > pick
> > > > >     >     apart certain pieces of messages.  This is why the
> initial
> > > > > thought and
> > > > >     >     architecture was one index per sensor.
> > > > >     >
> > > > >     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <
> > > [email protected]>
> > > > > wrote:
> > > > >     >
> > > > >     >     > I’m thinking that CEP (Complex Event Processing) is
> > > contrary
> > > > > to the
> > > > >     > idea
> > > > >     >     > of silo-ing data per sensor.
> > > > >     >     > Now it’s true that some of those sensors are already
> > > > > aggregating
> > > > >     > data from
> > > > >     >     > multiple sources, so maybe I’m wrong here.
> > > > >     >     > But it just seems to me that the “data lake” insights
> > come
> > > > from
> > > > >     > being able
> > > > >     >     > to make decisions over the whole mass of data rather
> than
> > > > just
> > > > >     > vertical
> > > > >     >     > slices of it.
> > > > >     >     >
> > > > >     >     > On 1/12/17, 2:15 PM, "Casey Stella" <
> [email protected]>
> > > > > wrote:
> > > > >     >     >
> > > > >     >     >     Hey Matt,
> > > > >     >     >
> > > > >     >     >     Thanks for the comment!
> > > > >     >     >     1. At the moment, we only have one index name, the
> > > > default
> > > > > of
> > > > >     > which is
> > > > >     >     > the
> > > > >     >     >     sensor name but that's entirely up to the user.
> This
> > > is
> > > > > sensor
> > > > >     >     > specific,
> > > > >     >     >     so it'd be a separate config for each sensor.  If
> we
> > > want
> > > > > to
> > > > >     > build
> > > > >     >     > multiple
> > > > >     >     >     indices per sensor, we'd have to think carefully
> > about
> > > > how
> > > > > to do
> > > > >     > that
> > > > >     >     > and
> > > > >     >     >     would be a bigger undertaking.  I guess I can see
> the
> > > > use,
> > > > > though
> > > > >     >     > (redirect
> > > > >     >     >     messages to one index vs another based on a
> predicate
> > > for
> > > > > a given
> > > > >     >     > sensor).
> > > > >     >     >     Anyway, not where I was originally thinking that
> this
> > > > > discussion
> > > > >     > would
> > > > >     >     > go,
> > > > >     >     >     but it's an interesting point.
> > > > >     >     >
> > > > >     >     >     2. I hadn't thought through the implementation
> quite
> > > yet,
> > > > > but we
> > > > >     > don't
> > > > >     >     >     actually have a splitter bolt in that topology,
> just
> > a
> > > > > spout
> > > > >     > that goes
> > > > >     >     > to
> > > > >     >     >     the elasticsearch writer and also to the hdfs
> writer.
> > > > >     >     >
> > > > >     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <
> > > > > [email protected]>
> > > > >     > wrote:
> > > > >     >     >
> > > > >     >     >     > Casey, good to have controls like this.  Couple
> > > > > questions:
> > > > >     >     >     >
> > > > >     >     >     > 1. Regarding the “index” : “squid” name/value
> pair,
> > > is
> > > > > the
> > > > >     > index name
> > > > >     >     >     > expected to always be a sensor name?  Or is the
> > given
> > > > > json
> > > > >     > structure
> > > > >     >     >     > subordinate to a sensor name in zookeeper?  Or
> can
> > we
> > > > > build
> > > > >     > arbitrary
> > > > >     >     >     > indexes with this new specification, independent
> of
> > > > > sensor?
> > > > >     > Should
> > > > >     >     > there
> > > > >     >     >     > actually be a list of “indexes”, ie
> > > > >     >     >     > { “indexes” : [
> > > > >     >     >     >         {“index” : “name1”,
> > > > >     >     >     >                 …
> > > > >     >     >     >         },
> > > > >     >     >     >         {“index” : “name2”,
> > > > >     >     >     >                 …
> > > > >     >     >     >         } ]
> > > > >     >     >     > }
> > > > >     >     >     >
> > > > >     >     >     > 2. Would the filtering / writer selection logic
> > take
> > > > > place in
> > > > >     > the
> > > > >     >     > indexing
> > > > >     >     >     > topology splitter bolt?  Seems like that would
> have
> > > the
> > > > >     > smallest
> > > > >     >     > impact on
> > > > >     >     >     > current implementation, no?
> > > > >     >     >     >
> > > > >     >     >     > Sorry if these are already answered in PR-415, I
> > > > haven’t
> > > > > had
> > > > >     > time to
> > > > >     >     >     > review that one yet.
> > > > >     >     >     > Thanks,
> > > > >     >     >     > --Matt
> > > > >     >     >     >
> > > > >     >     >     >
> > > > >     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> > > > >     >     > [email protected]>
> > > > >     >     >     > wrote:
> > > > >     >     >     >
> > > > >     >     >     >     I like the flexibility and expressibility of
> > the
> > > > > first
> > > > >     > option
> > > > >     >     > with
> > > > >     >     >     > Stellar
> > > > >     >     >     >     filters.
> > > > >     >     >     >
> > > > >     >     >     >     M
> > > > >     >     >     >
> > > > >     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey
> Stella <
> > > > >     >     > [email protected]>
> > > > >     >     >     > wrote:
> > > > >     >     >     >
> > > > >     >     >     >     > As of METRON-652 <
> https://github.com/apache/
> > > > >     >     >     > incubator-metron/pull/415>, we
> > > > >     >     >     >     > will have decoupled the indexing
> > configuration
> > > > > from the
> > > > >     >     > enrichment
> > > > >     >     >     >     > configuration.  As an immediate follow-up
> to
> > > > that,
> > > > > I'd
> > > > >     > like to
> > > > >     >     >     > provide the
> > > > >     >     >     >     > ability to turn off and on writers via the
> > > > > configs.  I'd
> > > > >     > like
> > > > >     >     > to get
> > > > >     >     >     > some
> > > > >     >     >     >     > community feedback on how the functionality
> > > > should
> > > > > work,
> > > > >     > if
> > > > >     >     > y'all are
> > > > >     >     >     >     > amenable. :)
> > > > >     >     >     >     >
> > > > >     >     >     >     >
> > > > >     >     >     >     > As of now, we have 3 possible writers which
> > can
> > > > be
> > > > > used
> > > > >     > in the
> > > > >     >     >     > indexing
> > > > >     >     >     >     > topology:
> > > > >     >     >     >     >
> > > > >     >     >     >     >    - Solr
> > > > >     >     >     >     >    - Elasticsearch
> > > > >     >     >     >     >    - HDFS
> > > > >     >     >     >     >
> > > > >     >     >     >     > HDFS is always used, elasticsearch or solr
> is
> > > > used
> > > > >     > depending
> > > > >     >     > on how
> > > > >     >     >     > you
> > > > >     >     >     >     > start the indexing topology.
> > > > >     >     >     >     >
> > > > >     >     >     >     > A couple of proposals come to mind
> > immediately:
> > > > >     >     >     >     >
> > > > >     >     >     >     > *Index Filtering*
> > > > >     >     >     >     >
> > > > >     >     >     >     > You would be able to specify a filter as
> > > defined
> > > > > by a
> > > > >     > stellar
> > > > >     >     >     > statement
> > > > >     >     >     >     > (likely a reuse of the StellarFilter that
> > > exists
> > > > > in the
> > > > >     >     > Parsers)
> > > > >     >     >     > which
> > > > >     >     >     >     > would allow you to indicate on a
> > > > > message-by-message basis
> > > > >     >     > whether or
> > > > >     >     >     > not to
> > > > >     >     >     >     > write the message.
> > > > >     >     >     >     >
> > > > >     >     >     >     > The semantics of this would be as follows:
> > > > >     >     >     >     >
> > > > >     >     >     >     >    - Default (i.e. unspecified) is to pass
> > > > > everything
> > > > >     > through
> > > > >     >     > (hence
> > > > >     >     >     >     >    backwards compatible with the current
> > > default
> > > > > config).
> > > > >     >     >     >     >    - Messages which have the associated
> > stellar
> > > > > statement
> > > > >     >     > evaluate
> > > > >     >     >     > to true
> > > > >     >     >     >     >    for the writer type will be written,
> > > otherwise
> > > > > not.
> > > > >     >     >     >     >
> > > > >     >     >     >     >
> > > > >     >     >     >     > Sample indexing config which would write
> out
> > no
> > > > > messages
> > > > >     > to
> > > > >     >     > HDFS and
> > > > >     >     >     > write
> > > > >     >     >     >     > out only messages containing a field called
> > > > > "field1":
> > > > >     >     >     >     > {
> > > > >     >     >     >     >    "index" : "squid"
> > > > >     >     >     >     >   ,"batchSize" : 100
> > > > >     >     >     >     >   ,"filters" : {
> > > > >     >     >     >     >       "HDFS" : "false"
> > > > >     >     >     >     >      ,"ES" : "exists(field1)"
> > > > >     >     >     >     >                  }
> > > > >     >     >     >     > }
> > > > >     >     >     >     >
> > > > >     >     >     >     > *Index On/Off Switch*
> > > > >     >     >     >     >
> > > > >     >     >     >     > A simpler solution would be to just
> provide a
> > > > list
> > > > > of
> > > > >     > writers
> > > > >     >     > to
> > > > >     >     >     > write
> > > > >     >     >     >     > messages.  The semantics would be as
> follows:
> > > > >     >     >     >     >
> > > > >     >     >     >     >    - If the list is unspecified, then the
> > > default
> > > > > is to
> > > > >     > write
> > > > >     >     > all
> > > > >     >     >     > messages
> > > > >     >     >     >     >    for every writer in the indexing
> topology
> > > > >     >     >     >     >    - If the list is specified, then a
> writer
> > > will
> > > > > write
> > > > >     > all
> > > > >     >     > messages
> > > > >     >     >     > if and
> > > > >     >     >     >     >    only if it is named in the list.
> > > > >     >     >     >     >
> > > > >     >     >     >     > Sample indexing config which turns off HDFS
> > and
> > > > > keeps on
> > > > >     >     >     > Elasticsearch:
> > > > >     >     >     >     > {
> > > > >     >     >     >     >    "index" : "squid"
> > > > >     >     >     >     >   ,"batchSize" : 100
> > > > >     >     >     >     >   ,"writers" : [ "ES" ]
> > > > >     >     >     >     > }
> > > > >     >     >     >     >
> > > > >     >     >     >     > Thanks in advance for the feedback!  Also,
> if
> > > you
> > > > > have
> > > > >     > any
> > > > >     >     > other,
> > > > >     >     >     > better
> > > > >     >     >     >     > ideas than the ones presented here, let me
> > know
> > > > > too.
> > > > >     >     >     >     >
> > > > >     >     >     >     > Best,
> > > > >     >     >     >     >
> > > > >     >     >     >     > Casey
> > > > >     >     >     >     >
> > > > >     >     >     >
> > > > >     >     >     >
> > > > >     >     >     >
> > > > >     >     >     >
> > > > >     >     >     >
> > > > >     >     >
> > > > >     >     >
> > > > >     >     >
> > > > >     >     >
> > > > >     >     >
> > > > >     >
> > > > >     >
> > > > >     >
> > > > >     >
> > > > >     >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Nick Allen <[email protected]>
> >
>
-- 

Jon

Sent from my mobile device
