Ok, so here's what I'm thinking based on the discussion:

   - Keeping the configs that we have now (batchSize and index) as defaults
   for any writer that has no writer-specific setting
   - Adding the per-writer config Nick suggested

*Base Case*:
{
}

   - all writers write all messages
   - index named the same as the sensor for all writers
   - batchSize of 1 for all writers

*Writer-non-specific case*:
{
  "index" : "foo"
 ,"batchSize" : 100
}

   - All writers write all messages
   - index is named "foo", different from the sensor for all writers
   - batchSize is 100 for all writers

*Writer-specific case without filters*:
{
  "index" : "foo"
 ,"batchSize" : 1
 , "writerConfig" :
   {
      "elasticsearch" : {
                                   "batchSize" : 100
                                 }
   }
}

   - All writers write all messages
   - index is named "foo", different from the sensor for all writers
   - batchSize is 1 for HDFS and 100 for elasticsearch writers
   - NOTE: I could override the index name too
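
To make the override semantics above concrete, here is a minimal sketch of
how a writer's effective settings could be resolved. It is not the Metron
implementation: the function name effective_writer_config, the lowercase
writer names, and the "squid" sensor are assumptions for illustration only.

```python
# Default batchSize from the base case in this proposal.
DEFAULT_BATCH_SIZE = 1


def effective_writer_config(sensor, config, writer):
    """Return the settings one writer would use for one sensor.

    Hypothetical helper: top-level "index" and "batchSize" act as defaults,
    and anything under "writerConfig" for that writer overrides them.
    """
    resolved = {
        # Base case: the index is named after the sensor.
        "index": config.get("index", sensor),
        "batchSize": config.get("batchSize", DEFAULT_BATCH_SIZE),
    }
    # Writer-specific values win over the top-level defaults.
    resolved.update(config.get("writerConfig", {}).get(writer, {}))
    return resolved


config = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {"elasticsearch": {"batchSize": 100}},
}

for writer in ("hdfs", "elasticsearch"):
    print(writer, effective_writer_config("squid", config, writer))
```

With an empty config the same helper falls back to the sensor name and a
batchSize of 1, which matches the base case.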

*Writer-specific case with filters*:
{
  "index" : "foo"
 ,"batchSize" : 1
 , "writerConfig" :
   {
      "elasticsearch" : {
                                   "batchSize" : 100,
                                   "when" : "exists(field1)"
                                 },
      "hdfs" : {
                     "when" : "false"
                  }
   }
}

   - ES writer writes only messages which have field1; HDFS writes nothing
   - index is named "foo", different from the sensor for all writers
   - batchSize is 100 for the elasticsearch writer (the default of 1 would
   apply to HDFS, but it writes nothing here)
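
As a rough illustration of the "when" semantics, the sketch below decides
which writers would accept a message. The evaluate_when function is a toy
stand-in that only understands "true", "false" and "exists(field)"; it is
not Stellar, and the function names and writer list are assumptions made
for this example.

```python
import re


def evaluate_when(expression, message):
    """Toy stand-in for a Stellar filter (assumption, not the real engine)."""
    expr = expression.strip()
    if expr == "true":
        return True
    if expr == "false":
        return False
    match = re.fullmatch(r"exists\((\w+)\)", expr)
    if match:
        # Treat exists(field) as "the message has a key with that name".
        return match.group(1) in message
    raise ValueError(f"unsupported expression: {expression!r}")


def writers_for(message, config, writers=("hdfs", "elasticsearch")):
    """List the writers whose "when" filter accepts the message."""
    writer_config = config.get("writerConfig", {})
    return [
        w for w in writers
        # An unspecified filter defaults to "true" (write everything).
        if evaluate_when(writer_config.get(w, {}).get("when", "true"), message)
    ]


config = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {
        "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
        "hdfs": {"when": "false"},
    },
}

print(writers_for({"field1": "value"}, config))
print(writers_for({"other": "value"}, config))
```

Leaving "when" out for a writer keeps today's behaviour, since the default
filter passes every message through.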

Thoughts?

On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cd...@hortonworks.com> wrote:

> For larger installations you need to control what is indexed so you don’t
> end up with a nasty elastic search situation and so you can mine the data
> later for reports and training ml models.
>
> Thanks
> Carolyn
>
>
>
>
> On 1/13/17, 9:40 AM, "Casey Stella" <ceste...@gmail.com> wrote:
>
> >OH that's a good idea!
> >
> >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <n...@nickallen.org> wrote:
> >
> >> I like the "Index Filtering" option based on the flexibility that it
> >> provides.  Should each output (HDFS, ES, etc) have its own configuration
> >> settings?  For example, aren't things like batching handled separately
> for
> >> HDFS versus Elasticsearch?
> >>
> >> Something along the lines of...
> >>
> >> {
> >>   "hdfs" : {
> >>     "when": "exists(field1)",
> >>     "batchSize": 100
> >>   },
> >>
> >>   "elasticsearch" : {
> >>     "when": "true",
> >>     "batchSize": 1000,
> >>     "index": "squid"
> >>   }
> >> }
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <ceste...@gmail.com>
> wrote:
> >>
> >> > Yeah, I tend to like the first option too.  Any opposition to that
> from
> >> > anyone?
> >> >
> >> > The points brought up are good ones and I think that it may be worth a
> >> > broader discussion of the requirements of indexing in a separate dev
> list
> >> > thread.  Maybe a list of desires with coherent use-cases justifying
> them
> >> so
> >> > we can think about how this stuff should work and where the natural
> >> > extension points should be.  After all, we need to toe the line between
> >> > engineering and overengineering for features nobody will want.
> >> >
> >> > I'm not sure about the extensions to the standard fields.  I'm torn
> >> between
> >> > the notions that we should have no standard fields vs we should have a
> >> > boatload of standard fields (with most of them empty).  I exchange
> >> > positions fairly regularly on that question. ;)  It may be worth a dev
> >> list
> >> > discussion to lay out how you imagine an extension of standard fields
> and
> >> > how it might look as implemented in Metron.
> >> >
> >> > Casey
> >> >
> >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> >> > kylerichards...@gmail.com>
> >> > wrote:
> >> >
> >> > > I'll second my preference for the first option. I think the ability
> to
> >> > use
> >> > > Stellar filters to customize indexing would be a big win.
> >> > >
> >> > > I'm glad Matt brought up the point about data lake and CEP. I think
> >> this
> >> > is
> >> > > a really important use case that we need to consider. Take a simple
> >> > > example... If I have data coming in from 3 different firewall
> vendors
> >> > and 2
> >> > > different web proxy/url filtering vendors and I want to be able to
> >> > analyze
> >> > > that data set, I need the data to be indexed all together (likely in
> >> > HDFS)
> >> > > and to have a normalized schema such that IP address, URL, and user
> >> name
> >> > > (to take a few) can be easily queried and aggregated. I can also
> >> envision
> >> > > scenarios where I would want to index data based on attributes other
> >> than
> >> > > sensor, business unit or subsidiary for example.
> >> > >
> >> > > I've been wanting to propose extending our 7 standard fields to
> include
> >> > > things like URL and user. Is there community interest/support for
> >> moving
> >> > in
> >> > > that direction? If so, I'll start a new thread.
> >> > >
> >> > > Thanks!
> >> > >
> >> > > -Kyle
> >> > >
> >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <ma...@apache.org>
> wrote:
> >> > >
> >> > > > Ah, I see.  If overriding the default index name allows using the
> >> same
> >> > > > name for multiple sensors, then the goal can be achieved.
> >> > > > Thanks,
> >> > > > --Matt
> >> > > >
> >> > > >
> >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> >> > > >
> >> > > >     Oh, you could!  Let's say you have a syslog parser with data
> from
> >> > > > sources 1
> >> > > >     2 and 3.  You'd end up with one kafka queue with 3 parsers
> >> attached
> >> > > to
> >> > > > that
> >> > > >     queue, each picking apart the messages from source 1, 2 and 3.
> >> > They'd
> >> > > > go
> >> > > >     through separate enrichment and into the indexing topology.
> In
> >> the
> >> > > >     indexing topology, you could specify the same index name
> "syslog"
> >> > and
> >> > > > all
> >> > > >     of the messages go into the same index for CEP querying if so
> >> > > desired.
> >> > > >
> >> > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <ma...@apache.org
> >
> >> > > wrote:
> >> > > >
> >> > > >     > Syslog is hell on parsers – I know, I worked at LogLogic in
> a
> >> > > > previous
> >> > > >     > life.  It makes perfect sense to route different lines from
> >> > syslog
> >> > > > through
> >> > > >     > different appropriate parsers.  But a lot of what the
> parsers
> >> do
> >> > is
> >> > > >     > identify consistent subsets of metadata and annotate it –
> eg,
> >> > > > src_ip_addr,
> >> > > >     > event timestamps, etc.  Once those metadata are annotated
> and
> >> > > > available
> >> > > >     > with common field names, why doesn’t it make sense to index
> the
> >> > > > messages
> >> > > >     > together, for CEP querying?  I think Splunk has illustrated
> >> this
> >> > > > model.
> >> > > >     >
> >> > > >     > On 1/12/17, 3:00 PM, "Casey Stella" <ceste...@gmail.com>
> >> wrote:
> >> > > >     >
> >> > > >     >     yeah, I mean, honestly, I think the approach that we've
> >> taken
> >> > > for
> >> > > >     > sources
> >> > > >     >     which aggregate different types of data is to provide
> >> filters
> >> > > at
> >> > > > the
> >> > > >     > parser
> >> > > >     >     level and have multiple parser topologies (with
> different,
> >> > > > possibly
> >> > > >     >     mutually exclusive filters) running.  This would be a
> >> > > completely
> >> > > >     > separate
> >> > > >     >     sensor.  Imagine a syslog data source that aggregates
> and
> >> you
> >> > > > want to
> >> > > >     > pick
> >> > > >     >     apart certain pieces of messages.  This is why the
> initial
> >> > > > thought and
> >> > > >     >     architecture was one index per sensor.
> >> > > >     >
> >> > > >     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <
> >> > ma...@apache.org>
> >> > > > wrote:
> >> > > >     >
> >> > > >     >     > I’m thinking that CEP (Complex Event Processing) is
> >> > contrary
> >> > > > to the
> >> > > >     > idea
> >> > > >     >     > of silo-ing data per sensor.
> >> > > >     >     > Now it’s true that some of those sensors are already
> >> > > > aggregating
> >> > > >     > data from
> >> > > >     >     > multiple sources, so maybe I’m wrong here.
> >> > > >     >     > But it just seems to me that the “data lake” insights
> >> come
> >> > > from
> >> > > >     > being able
> >> > > >     >     > to make decisions over the whole mass of data rather
> than
> >> > > just
> >> > > >     > vertical
> >> > > >     >     > slices of it.
> >> > > >     >     >
> >> > > >     >     > On 1/12/17, 2:15 PM, "Casey Stella" <
> ceste...@gmail.com>
> >> > > > wrote:
> >> > > >     >     >
> >> > > >     >     >     Hey Matt,
> >> > > >     >     >
> >> > > >     >     >     Thanks for the comment!
> >> > > >     >     >     1. At the moment, we only have one index name, the
> >> > > default
> >> > > > of
> >> > > >     > which is
> >> > > >     >     > the
> >> > > >     >     >     sensor name but that's entirely up to the user.
> This
> >> > is
> >> > > > sensor
> >> > > >     >     > specific,
> >> > > >     >     >     so it'd be a separate config for each sensor.  If
> we
> >> > want
> >> > > > to
> >> > > >     > build
> >> > > >     >     > multiple
> >> > > >     >     >     indices per sensor, we'd have to think carefully
> >> about
> >> > > how
> >> > > > to do
> >> > > >     > that
> >> > > >     >     > and
> >> > > >     >     >     would be a bigger undertaking.  I guess I can see
> the
> >> > > use,
> >> > > > though
> >> > > >     >     > (redirect
> >> > > >     >     >     messages to one index vs another based on a
> predicate
> >> > for
> >> > > > a given
> >> > > >     >     > sensor).
> >> > > >     >     >     Anyway, not where I was originally thinking that
> this
> >> > > > discussion
> >> > > >     > would
> >> > > >     >     > go,
> >> > > >     >     >     but it's an interesting point.
> >> > > >     >     >
> >> > > >     >     >     2. I hadn't thought through the implementation
> quite
> >> > yet,
> >> > > > but we
> >> > > >     > don't
> >> > > >     >     >     actually have a splitter bolt in that topology,
> just
> >> a
> >> > > > spout
> >> > > >     > that goes
> >> > > >     >     > to
> >> > > >     >     >     the elasticsearch writer and also to the hdfs
> writer.
> >> > > >     >     >
> >> > > >     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <
> >> > > > ma...@apache.org>
> >> > > >     > wrote:
> >> > > >     >     >
> >> > > >     >     >     > Casey, good to have controls like this.  Couple
> >> > > > questions:
> >> > > >     >     >     >
> >> > > >     >     >     > 1. Regarding the “index” : “squid” name/value
> pair,
> >> > is
> >> > > > the
> >> > > >     > index name
> >> > > >     >     >     > expected to always be a sensor name?  Or is the
> >> given
> >> > > > json
> >> > > >     > structure
> >> > > >     >     >     > subordinate to a sensor name in zookeeper?  Or
> can
> >> we
> >> > > > build
> >> > > >     > arbitrary
> >> > > >     >     >     > indexes with this new specification,
> independent of
> >> > > > sensor?
> >> > > >     > Should
> >> > > >     >     > there
> >> > > >     >     >     > actually be a list of “indexes”, ie
> >> > > >     >     >     > { “indexes” : [
> >> > > >     >     >     >         {“index” : “name1”,
> >> > > >     >     >     >                 …
> >> > > >     >     >     >         },
> >> > > >     >     >     >         {“index” : “name2”,
> >> > > >     >     >     >                 …
> >> > > >     >     >     >         } ]
> >> > > >     >     >     > }
> >> > > >     >     >     >
> >> > > >     >     >     > 2. Would the filtering / writer selection logic
> >> take
> >> > > > place in
> >> > > >     > the
> >> > > >     >     > indexing
> >> > > >     >     >     > topology splitter bolt?  Seems like that would
> have
> >> > the
> >> > > >     > smallest
> >> > > >     >     > impact on
> >> > > >     >     >     > current implementation, no?
> >> > > >     >     >     >
> >> > > >     >     >     > Sorry if these are already answered in PR-415, I
> >> > > haven’t
> >> > > > had
> >> > > >     > time to
> >> > > >     >     >     > review that one yet.
> >> > > >     >     >     > Thanks,
> >> > > >     >     >     > --Matt
> >> > > >     >     >     >
> >> > > >     >     >     >
> >> > > >     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> >> > > >     >     > michael.miklav...@gmail.com>
> >> > > >     >     >     > wrote:
> >> > > >     >     >     >
> >> > > >     >     >     >     I like the flexibility and expressibility of
> >> the
> >> > > > first
> >> > > >     > option
> >> > > >     >     > with
> >> > > >     >     >     > Stellar
> >> > > >     >     >     >     filters.
> >> > > >     >     >     >
> >> > > >     >     >     >     M
> >> > > >     >     >     >
> >> > > >     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey
> Stella <
> >> > > >     >     > ceste...@gmail.com>
> >> > > >     >     >     > wrote:
> >> > > >     >     >     >
> >> > > >     >     >     >     > As of METRON-652 <
> https://github.com/apache/
> >> > > >     >     >     > incubator-metron/pull/415>, we
> >> > > >     >     >     >     > will have decoupled the indexing
> >> configuration
> >> > > > from the
> >> > > >     >     > enrichment
> >> > > >     >     >     >     > configuration.  As an immediate follow-up
> to
> >> > > that,
> >> > > > I'd
> >> > > >     > like to
> >> > > >     >     >     > provide the
> >> > > >     >     >     >     > ability to turn off and on writers via the
> >> > > > configs.  I'd
> >> > > >     > like
> >> > > >     >     > to get
> >> > > >     >     >     > some
> >> > > >     >     >     >     > community feedback on how the
> functionality
> >> > > should
> >> > > > work,
> >> > > >     > if
> >> > > >     >     > y'all are
> >> > > >     >     >     >     > amenable. :)
> >> > > >     >     >     >     >
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > As of now, we have 3 possible writers
> which
> >> can
> >> > > be
> >> > > > used
> >> > > >     > in the
> >> > > >     >     >     > indexing
> >> > > >     >     >     >     > topology:
> >> > > >     >     >     >     >
> >> > > >     >     >     >     >    - Solr
> >> > > >     >     >     >     >    - Elasticsearch
> >> > > >     >     >     >     >    - HDFS
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > HDFS is always used, elasticsearch or
> solr is
> >> > > used
> >> > > >     > depending
> >> > > >     >     > on how
> >> > > >     >     >     > you
> >> > > >     >     >     >     > start the indexing topology.
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > A couple of proposals come to mind
> >> immediately:
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > *Index Filtering*
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > You would be able to specify a filter as
> >> > defined
> >> > > > by a
> >> > > >     > stellar
> >> > > >     >     >     > statement
> >> > > >     >     >     >     > (likely a reuse of the StellarFilter that
> >> > exists
> >> > > > in the
> >> > > >     >     > Parsers)
> >> > > >     >     >     > which
> >> > > >     >     >     >     > would allow you to indicate on a
> >> > > > message-by-message basis
> >> > > >     >     > whether or
> >> > > >     >     >     > not to
> >> > > >     >     >     >     > write the message.
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > The semantics of this would be as follows:
> >> > > >     >     >     >     >
> >> > > >     >     >     >     >    - Default (i.e. unspecified) is to pass
> >> > > > everything
> >> > > >     > through
> >> > > >     >     > (hence
> >> > > >     >     >     >     >    backwards compatible with the current
> >> > default
> >> > > > config).
> >> > > >     >     >     >     >    - Messages which have the associated
> >> stellar
> >> > > > statement
> >> > > >     >     > evaluate
> >> > > >     >     >     > to true
> >> > > >     >     >     >     >    for the writer type will be written,
> >> > otherwise
> >> > > > not.
> >> > > >     >     >     >     >
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > Sample indexing config which would write
> out
> >> no
> >> > > > messages
> >> > > >     > to
> >> > > >     >     > HDFS and
> >> > > >     >     >     > write
> >> > > >     >     >     >     > out only messages containing a field
> called
> >> > > > "field1":
> >> > > >     >     >     >     > {
> >> > > >     >     >     >     >    "index" : "squid"
> >> > > >     >     >     >     >   ,"batchSize" : 100
> >> > > >     >     >     >     >   ,"filters" : {
> >> > > >     >     >     >     >       "HDFS" : "false"
> >> > > >     >     >     >     >      ,"ES" : "exists(field1)"
> >> > > >     >     >     >     >                  }
> >> > > >     >     >     >     > }
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > *Index On/Off Switch*
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > A simpler solution would be to just
> provide a
> >> > > list
> >> > > > of
> >> > > >     > writers
> >> > > >     >     > to
> >> > > >     >     >     > write
> >> > > >     >     >     >     > messages.  The semantics would be as
> follows:
> >> > > >     >     >     >     >
> >> > > >     >     >     >     >    - If the list is unspecified, then the
> >> > default
> >> > > > is to
> >> > > >     > write
> >> > > >     >     > all
> >> > > >     >     >     > messages
> >> > > >     >     >     >     >    for every writer in the indexing
> topology
> >> > > >     >     >     >     >    - If the list is specified, then a
> writer
> >> > will
> >> > > > write
> >> > > >     > all
> >> > > >     >     > messages
> >> > > >     >     >     > if and
> >> > > >     >     >     >     >    only if it is named in the list.
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > Sample indexing config which turns off
> HDFS
> >> and
> >> > > > keeps on
> >> > > >     >     >     > Elasticsearch:
> >> > > >     >     >     >     > {
> >> > > >     >     >     >     >    "index" : "squid"
> >> > > >     >     >     >     >   ,"batchSize" : 100
> >> > > >     >     >     >     >   ,"writers" : [ "ES" ]
> >> > > >     >     >     >     > }
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > Thanks in advance for the feedback!
> Also, if
> >> > you
> >> > > > have
> >> > > >     > any
> >> > > >     >     > other,
> >> > > >     >     >     > better
> >> > > >     >     >     >     > ideas than the ones presented here, let me
> >> know
> >> > > > too.
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > Best,
> >> > > >     >     >     >     >
> >> > > >     >     >     >     > Casey
> >> > > >     >     >     >     >
> >> > > >     >     >     >
> >> > > >     >     >     >
> >> > > >     >     >     >
> >> > > >     >     >     >
> >> > > >     >     >     >
> >> > > >     >     >
> >> > > >     >     >
> >> > > >     >     >
> >> > > >     >     >
> >> > > >     >     >
> >> > > >     >
> >> > > >     >
> >> > > >     >
> >> > > >     >
> >> > > >     >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Nick Allen <n...@nickallen.org>
> >>
>
