Re: [DISCUSS] Turning off indexing writers feature discussion

Kyle Richardson Thu, 12 Jan 2017 18:59:08 -0800

I'll second my preference for the first option. I think the ability to use
Stellar filters to customize indexing would be a big win.


I'm glad Matt brought up the point about data lake and CEP. I think this is
a really important use case that we need to consider. Take a simple
example... If I have data coming in from 3 different firewall vendors and 2
different web proxy/url filtering vendors and I want to be able to analyze
that data set, I need the data to be indexed all together (likely in HDFS)
and to have a normalized schema such that IP address, URL, and user name
(to take a few) can be easily queried and aggregated. I can also envision
scenarios where I would want to index data based on attributes other than
sensor: business unit or subsidiary, for example.
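
To make the normalization point concrete, here is a minimal sketch in
Python. The sensor names, the vendor field names, and the common names
(src_ip_addr, url, user_name) are all hypothetical and chosen only for
illustration; this is not Metron parser code.

```python
from collections import Counter

# Hypothetical vendor-specific field names mapped to a shared schema.
# None of these names are taken from Metron; they are assumptions.
FIELD_ALIASES = {
    "firewall_vendor_a": {"src": "src_ip_addr", "usr": "user_name"},
    "proxy_vendor_b": {
        "client_ip": "src_ip_addr",
        "request_url": "url",
        "login": "user_name",
    },
}


def normalize(sensor, message):
    """Rename vendor-specific keys to the shared names; keep other keys as-is."""
    aliases = FIELD_ALIASES.get(sensor, {})
    return {aliases.get(key, key): value for key, value in message.items()}


events = [
    normalize("firewall_vendor_a", {"src": "10.0.0.5", "usr": "alice"}),
    normalize(
        "proxy_vendor_b",
        {"client_ip": "10.0.0.5", "request_url": "http://example.com/", "login": "alice"},
    ),
]

# With shared field names, one aggregation covers every source.
events_per_user = Counter(event["user_name"] for event in events)
```

Once both sources use the same field names, a single query or
aggregation (here, a count per user) works across vendors without
per-sensor special cases.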

I've been wanting to propose extending our 7 standard fields to include
things like URL and user. Is there community interest/support for moving in
that direction? If so, I'll start a new thread.

Thanks!

-Kyle

On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <[email protected]> wrote:

> Ah, I see.  If overriding the default index name allows using the same
> name for multiple sensors, then the goal can be achieved.
> Thanks,
> --Matt
>
>
> On 1/12/17, 3:30 PM, "Casey Stella" <[email protected]> wrote:
>
>     Oh, you could!  Let's say you have a syslog parser with data from
>     sources 1, 2 and 3.  You'd end up with one Kafka queue with 3 parsers
>     attached to that queue, each picking apart the messages from source 1,
>     2 and 3.  They'd go through separate enrichment and into the indexing
>     topology.  In the indexing topology, you could specify the same index
>     name "syslog" and all of the messages go into the same index for CEP
>     querying if so desired.
>
>     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <[email protected]> wrote:
>
>     > Syslog is hell on parsers – I know, I worked at LogLogic in a previous
>     > life.  It makes perfect sense to route different lines from syslog
>     > through different appropriate parsers.  But a lot of what the parsers
>     > do is identify consistent subsets of metadata and annotate it – e.g.,
>     > src_ip_addr, event timestamps, etc.  Once those metadata are annotated
>     > and available with common field names, why doesn’t it make sense to
>     > index the messages together, for CEP querying?  I think Splunk has
>     > illustrated this model.
>     >
>     > On 1/12/17, 3:00 PM, "Casey Stella" <[email protected]> wrote:
>     >
>     >     yeah, I mean, honestly, I think the approach that we've taken for
>     >     sources which aggregate different types of data is to provide
>     >     filters at the parser level and have multiple parser topologies
>     >     (with different, possibly mutually exclusive filters) running.
>     >     This would be a completely separate sensor.  Imagine a syslog data
>     >     source that aggregates and you want to pick apart certain pieces
>     >     of messages.  This is why the initial thought and architecture was
>     >     one index per sensor.
>     >
>     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <[email protected]> wrote:
>     >
>     >     > I’m thinking that CEP (Complex Event Processing) is contrary to
>     >     > the idea of silo-ing data per sensor.
>     >     > Now it’s true that some of those sensors are already aggregating
>     >     > data from multiple sources, so maybe I’m wrong here.
>     >     > But it just seems to me that the “data lake” insights come from
>     >     > being able to make decisions over the whole mass of data rather
>     >     > than just vertical slices of it.
>     >     >
>     >     > On 1/12/17, 2:15 PM, "Casey Stella" <[email protected]> wrote:
>     >     >
>     >     >     Hey Matt,
>     >     >
>     >     >     Thanks for the comment!
>     >     >     1. At the moment, we only have one index name, the default of
>     >     >     which is the sensor name but that's entirely up to the user.
>     >     >     This is sensor specific, so it'd be a separate config for each
>     >     >     sensor.  If we want to build multiple indices per sensor, we'd
>     >     >     have to think carefully about how to do that and it would be a
>     >     >     bigger undertaking.  I guess I can see the use, though
>     >     >     (redirect messages to one index vs another based on a
>     >     >     predicate for a given sensor).  Anyway, not where I was
>     >     >     originally thinking that this discussion would go, but it's an
>     >     >     interesting point.
>     >     >
>     >     >     2. I hadn't thought through the implementation quite yet, but
>     >     >     we don't actually have a splitter bolt in that topology, just
>     >     >     a spout that goes to the Elasticsearch writer and also to the
>     >     >     HDFS writer.
>     >     >
>     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <[email protected]> wrote:
>     >     >
>     >     >     > Casey, good to have controls like this.  Couple questions:
>     >     >     >
>     >     >     > 1. Regarding the "index" : "squid" name/value pair, is the
>     >     >     > index name expected to always be a sensor name?  Or is the
>     >     >     > given JSON structure subordinate to a sensor name in
>     >     >     > ZooKeeper?  Or can we build arbitrary indexes with this new
>     >     >     > specification, independent of sensor?  Should there actually
>     >     >     > be a list of "indexes", i.e.
>     >     >     > { "indexes" : [
>     >     >     >         {"index" : "name1",
>     >     >     >                 …
>     >     >     >         },
>     >     >     >         {"index" : "name2",
>     >     >     >                 …
>     >     >     >         } ]
>     >     >     > }
>     >     >     >
>     >     >     > 2. Would the filtering / writer selection logic take place in
>     >     >     > the indexing topology splitter bolt?  Seems like that would
>     >     >     > have the smallest impact on current implementation, no?
>     >     >     >
>     >     >     > Sorry if these are already answered in PR-415, I haven’t had
>     >     >     > time to review that one yet.
>     >     >     > Thanks,
>     >     >     > --Matt
>     >     >     >
>     >     >     >
>     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <[email protected]> wrote:
>     >     >     >
>     >     >     >     I like the flexibility and expressibility of the first
>     >     >     >     option with Stellar filters.
>     >     >     >
>     >     >     >     M
>     >     >     >
>     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <[email protected]> wrote:
>     >     >     >
>     >     >     >     > As of METRON-652
>     >     >     >     > <https://github.com/apache/incubator-metron/pull/415>, we
>     >     >     >     > will have decoupled the indexing configuration from the
>     >     >     >     > enrichment configuration.  As an immediate follow-up to
>     >     >     >     > that, I'd like to provide the ability to turn off and on
>     >     >     >     > writers via the configs.  I'd like to get some community
>     >     >     >     > feedback on how the functionality should work, if y'all
>     >     >     >     > are amenable. :)
>     >     >     >     >
>     >     >     >     >
>     >     >     >     > As of now, we have 3 possible writers which can be used in
>     >     >     >     > the indexing topology:
>     >     >     >     >
>     >     >     >     >    - Solr
>     >     >     >     >    - Elasticsearch
>     >     >     >     >    - HDFS
>     >     >     >     >
>     >     >     >     > HDFS is always used, elasticsearch or solr is used
>     >     >     >     > depending on how you start the indexing topology.
>     >     >     >     >
>     >     >     >     > A couple of proposals come to mind immediately:
>     >     >     >     >
>     >     >     >     > *Index Filtering*
>     >     >     >     >
>     >     >     >     > You would be able to specify a filter as defined by a
>     >     >     >     > stellar statement (likely a reuse of the StellarFilter
>     >     >     >     > that exists in the Parsers) which would allow you to
>     >     >     >     > indicate on a message-by-message basis whether or not to
>     >     >     >     > write the message.
>     >     >     >     >
>     >     >     >     > The semantics of this would be as follows:
>     >     >     >     >
>     >     >     >     >    - Default (i.e. unspecified) is to pass everything
>     >     >     >     >    through (hence backwards compatible with the current
>     >     >     >     >    default config).
>     >     >     >     >    - Messages which have the associated stellar statement
>     >     >     >     >    evaluate to true for the writer type will be written,
>     >     >     >     >    otherwise not.
>     >     >     >     >
>     >     >     >     >
>     >     >     >     > Sample indexing config which would write out no messages
>     >     >     >     > to HDFS and write out to Elasticsearch only messages
>     >     >     >     > containing a field called "field1":
>     >     >     >     > {
>     >     >     >     >    "index" : "squid"
>     >     >     >     >   ,"batchSize" : 100
>     >     >     >     >   ,"filters" : {
>     >     >     >     >       "HDFS" : "false"
>     >     >     >     >      ,"ES" : "exists(field1)"
>     >     >     >     >                  }
>     >     >     >     > }
>     >     >     >     >
>     >     >     >     > *Index On/Off Switch*
>     >     >     >     >
>     >     >     >     > A simpler solution would be to just provide a list of
>     >     >     >     > writers to write messages.  The semantics would be as
>     >     >     >     > follows:
>     >     >     >     >
>     >     >     >     >    - If the list is unspecified, then the default is to
>     >     >     >     >    write all messages for every writer in the indexing
>     >     >     >     >    topology
>     >     >     >     >    - If the list is specified, then a writer will write
>     >     >     >     >    all messages if and only if it is named in the list.
>     >     >     >     >
>     >     >     >     > Sample indexing config which turns off HDFS and keeps on
>     >     >     >     > Elasticsearch:
>     >     >     >     > {
>     >     >     >     >    "index" : "squid"
>     >     >     >     >   ,"batchSize" : 100
>     >     >     >     >   ,"writers" : [ "ES" ]
>     >     >     >     > }
>     >     >     >     >
>     >     >     >     > Thanks in advance for the feedback!  Also, if you have
>     >     >     >     > any other, better ideas than the ones presented here, let
>     >     >     >     > me know too.
>     >     >     >     >
>     >     >     >     > Best,
>     >     >     >     >
>     >     >     >     > Casey
>     >     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>     >
>     >
>
>
>
>
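
The semantics of the two proposals in Casey's original message (Index
Filtering and Index On/Off Switch) can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function names are hypothetical, plain Python callables stand in for
Stellar statements (no Stellar evaluation happens here), and the writer
names "HDFS" and "ES" are taken from the sample configs in the thread.

```python
# Writers assumed to be running in the indexing topology (from the samples).
AVAILABLE_WRITERS = ("HDFS", "ES")


def writers_for_filters(message, filters, available=AVAILABLE_WRITERS):
    """Index Filtering: a writer without a filter passes everything through;
    otherwise the message is written only if the writer's predicate is true."""
    selected = []
    for writer in available:
        predicate = filters.get(writer)
        if predicate is None or predicate(message):
            selected.append(writer)
    return selected


def writers_for_switch(writers, available=AVAILABLE_WRITERS):
    """Index On/Off Switch: an unspecified list (None) enables every writer;
    otherwise a writer is enabled if and only if it is named in the list."""
    if writers is None:
        return list(available)
    return [writer for writer in available if writer in writers]


# Python stand-ins for the Stellar statements "false" and "exists(field1)".
sample_filters = {
    "HDFS": lambda message: False,
    "ES": lambda message: "field1" in message,
}

with_field = writers_for_filters({"field1": "x"}, sample_filters)
without_field = writers_for_filters({"other": "x"}, sample_filters)
switch_es_only = writers_for_switch(["ES"])
```

The filter variant subsumes the switch: a constant-false predicate turns
a writer off entirely, which is what the first sample config does for
HDFS.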
