Re: [DISCUSS] Turning off indexing writers feature discussion

Casey Stella Fri, 13 Jan 2017 09:00:13 -0800

Let me noodle on this over the weekend.  Your syntax is looking less
onerous to me and I like the following statement from Otto: "In the end,
each write destination ‘type’ will need its own configuration.  This is an
extension point."


I may come around to your way of thinking.
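
To make the writer-keyed syntax quoted below concrete, here is a minimal
sketch of how such a config might be resolved. It is an illustration only,
not Metron code: the function name, the fallback batch size of 1, and the
treatment of 'when' as an opaque Stellar string are all assumptions.

```python
from typing import Any, Dict

# Assumption: the thread does not fix a default batch size for this schema.
DEFAULT_BATCH_SIZE = 1


def resolve_writer_keyed(
    sensor: str, config: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Fill in defaults for every writer named in a writer-keyed config.

    A writer that is not named in the config is simply absent from the
    result, which is how this schema turns a writer off.
    """
    defaults = {
        "index": sensor,            # index defaults to the sensor name
        "batchSize": DEFAULT_BATCH_SIZE,
        "when": "true",             # Stellar filter kept as a plain string
    }
    # Writer-specific settings win over defaults; extra keys such as
    # 'path' are passed through untouched.
    return {writer: {**defaults, **settings} for writer, settings in config.items()}


# "Writer-specific case with filters": only Elasticsearch is configured.
effective = resolve_writer_keyed(
    "squid",
    {"elasticsearch": {"index": "foo", "batchSize": 100, "when": "exists(field1)"}},
)
```

Here HDFS never appears in `effective`, so no `when = false` entry is needed
to switch it off.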

On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <[email protected]>
wrote:

> In the end, each write destination ‘type’ will need its own
> configuration.  This is an extension point.
> {
>   HDFS:{
>     outputAdapters:[
>       {name: avro,
>        settings:{
>          avro stuff….
>        when:{
>        },
>       {
>        name: sequence file,
>        …..
>
> or some such.
>
>
> On January 13, 2017 at 11:51:15, Nick Allen ([email protected]) wrote:
>
> I will add also that instead of global overrides, like index, we should use
> configuration key names that are more appropriate to the output.
>
> For example, does 'index' really make sense for HDFS? Or would 'path' be
> more appropriate?
>
> {
>   'elasticsearch': {
>     'index': 'foo',
>     'batchSize': 1
>   },
>   'hdfs': {
>     'path': '/foo/bar/...',
>     'batchSize': 100
>   }
> }
>
> Ok, I've said my piece. Thanks for the effort in summarizing all this,
> Casey.
>
>
> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <[email protected]> wrote:
>
> >> Nick's concerns about my suggestion were that it was overly complex and
> >> hard to grok and that we could dispense with backwards compatibility and
> >> make people do a bit more work on the default case for the benefits of a
> >> simpler advanced case. (Nick, make sure I don't misstate your position)
> >
> >
> > What I will add is that, in my mind, the majority case would be a user
> > specifying the outputs, but not things like 'batchSize' or 'when'. I think
> > that in the majority case, the user would accept whatever the default
> > batch size is.
> >
> > Here are alternative suggestions for all the examples that you provided
> > previously.
> >
> > Base Case
> >
> > - The user must always specify the 'outputs' for clarity.
> > - Uses default index name, batch size and when = true.
> >
> > {
> >   'elasticsearch': {},
> >   'hdfs': {}
> > }
> >
> >
> > Writer-non-specific Case
> > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> >
> > - There are no global overrides, unlike in Casey's proposal.
> > - Easier to grok IMO.
> >
> > {
> >   'elasticsearch': {
> >     'index': 'foo',
> >     'batchSize': 100
> >   },
> >   'hdfs': {
> >     'index': 'foo',
> >     'batchSize': 100
> >   }
> > }
> >
> >
> > Writer-specific case without filters
> > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> >
> > {
> >   'elasticsearch': {
> >     'index': 'foo',
> >     'batchSize': 1
> >   },
> >   'hdfs': {
> >     'index': 'foo',
> >     'batchSize': 100
> >   }
> > }
> >
> >
> > Writer-specific case with filters
> > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> >
> > - Instead of having to say when=false, just don't configure HDFS
> >
> > {
> >   'elasticsearch': {
> >     'index': 'foo',
> >     'batchSize': 100,
> >     'when': 'exists(field1)'
> >   }
> > }
> >
> >
> >
> >
> >
> > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <[email protected]>
> wrote:
> >
> >> Dave,
> >> For the benefit of posterity and people who might not be as deeply
> >> entangled in the system as we have been, I'll recap things and hopefully
> >> answer your question in the process.
> >>
> >> Historically the index configuration is split between the enrichment
> >> configs and the global configs.
> >>
> >> - The global configs control settings that apply to all sensors.
> >> Historically this has been stuff like index connection strings, etc.
> >> - The sensor-specific configs control things that vary by sensor.
> >>
> >> As of METRON-652 (currently in review), we moved the sensor-specific
> >> configs out of the enrichment configs. The proposal here is to increase
> >> the granularity of the sensor-specific files so that they support index
> >> writer-specific configs. Right now in the indexing topology, we have 2
> >> writers (fixed): ES/Solr and HDFS.
> >>
> >> The proposed configuration would allow you to specify a blanket
> >> sensor-level config for the index name and batchSize and/or override it
> >> at the writer level, thereby supporting a couple of use-cases:
> >>
> >> - Turning off certain index writers (e.g. HDFS)
> >> - Filtering the messages written to certain index writers
> >>
> >> The two competing configs between Nick and me are as follows:
> >>
> >> - I want to make sure we keep the old sensor-specific defaults with
> >> writer-specific overrides available
> >> - Nick thought we could simplify the permutations by making the
> >> indexing config only the writer-level configs.
> >>
> >> My concerns about Nick's suggestion were that the default and majority
> >> case, specifying the index and the batchSize for all writers (the one we
> >> support now), would require more configuration.
> >>
> >> Nick's concerns about my suggestion were that it was overly complex and
> >> hard to grok and that we could dispense with backwards compatibility and
> >> make people do a bit more work on the default case for the benefits of a
> >> simpler advanced case. (Nick, make sure I don't misstate your position).
> >>
> >> Casey
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <[email protected]>
> >> wrote:
> >>
> >> > Casey,
> >> >
> >> > Can you give me a level set of what your thinking is now? I think it's
> >> > global control of all index types + overrides on a per-type basis.
> >> > Fwiw, I'm totally for that, but I want to make sure I'm not imposing my
> >> > preconceived notions on your consensus-driven ones.
> >> >
> >> > -D....
> >> >
> >> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <[email protected]>
> >> wrote:
> >> >
> >> > > I am suggesting that, yes. The configs are essentially the same as
> >> > > yours, except there is an override specified at the top level. Without
> >> > > that, in order to specify that both HDFS and ES have batch sizes of
> >> > > 100, you have to explicitly configure each. It's less that I'm trying
> >> > > to have backwards compatibility and more that I'm trying to make the
> >> > > majority case easy: both writers write everything to a specified index
> >> > > name with a specified batch size (which is what we have now). Beyond
> >> > > that, I want to allow for specifying an override for the config on a
> >> > > writer-by-writer basis for those who need it.
> >> > >
> >> > > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <[email protected]>
> >> wrote:
> >> > >
> >> > > > Are you saying we support all of these variants? I realize you are
> >> > > > trying to have some backwards compatibility, but this also makes it
> >> > > > harder for a user to grok (for me at least).
> >> > > >
> >> > > > Personally I like my original example as there are fewer
> >> > > > sub-structures, like 'writerConfig', which makes the whole thing
> >> > > > simpler and easier to grok. But maybe others will think your proposal
> >> > > > is just as easy to grok.
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> [email protected]>
>
> >> > > wrote:
> >> > > >
> >> > > > > Ok, so here's what I'm thinking based on the discussion:
> >> > > > >
> >> > > > > - Keeping the configs that we have now (batchSize and index) as
> >> > > > > defaults for the unspecified writer-specific case
> >> > > > > - Adding the config Nick suggested
> >> > > > >
> >> > > > > *Base Case*:
> >> > > > > {
> >> > > > > }
> >> > > > >
> >> > > > > - All writers write all messages
> >> > > > > - index is named the same as the sensor for all writers
> >> > > > > - batchSize is 1 for all writers
> >> > > > >
> >> > > > > *Writer-non-specific case*:
> >> > > > > {
> >> > > > >   "index" : "foo"
> >> > > > >  ,"batchSize" : 100
> >> > > > > }
> >> > > > >
> >> > > > > - All writers write all messages
> >> > > > > - index is named "foo", different from the sensor, for all writers
> >> > > > > - batchSize is 100 for all writers
> >> > > > >
> >> > > > > *Writer-specific case without filters*
> >> > > > > {
> >> > > > >   "index" : "foo"
> >> > > > >  ,"batchSize" : 1
> >> > > > >  ,"writerConfig" :
> >> > > > >   {
> >> > > > >     "elasticsearch" : {
> >> > > > >       "batchSize" : 100
> >> > > > >     }
> >> > > > >   }
> >> > > > > }
> >> > > > >
> >> > > > > - All writers write all messages
> >> > > > > - index is named "foo", different from the sensor, for all writers
> >> > > > > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> >> > > > > - NOTE: I could override the index name too
> >> > > > >
> >> > > > > *Writer-specific case with filters*
> >> > > > > {
> >> > > > >   "index" : "foo"
> >> > > > >  ,"batchSize" : 1
> >> > > > >  ,"writerConfig" :
> >> > > > >   {
> >> > > > >     "elasticsearch" : {
> >> > > > >       "batchSize" : 100,
> >> > > > >       "when" : "exists(field1)"
> >> > > > >     },
> >> > > > >     "hdfs" : {
> >> > > > >       "when" : "false"
> >> > > > >     }
> >> > > > >   }
> >> > > > > }
> >> > > > >
> >> > > > > - ES writer writes messages which have field1; HDFS writes nothing
> >> > > > > - index is named "foo", different from the sensor, for all writers
> >> > > > > - batchSize is 100 for elasticsearch writers
> >> > > > > Thoughts?
> >> > > > >
> >> > > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> >> [email protected]
> >> > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > For larger installations you need to control what is indexed so
> >> > > > > > you don’t end up with a nasty Elasticsearch situation and so you
> >> > > > > > can mine the data later for reports and training ML models.
> >> > > > > >
> >> > > > > > Thanks
> >> > > > > > Carolyn
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On 1/13/17, 9:40 AM, "Casey Stella" <[email protected]>
> wrote:
> >> > > > > >
> >> > > > > > >OH that's a good idea!
> >> > > > > > >
> >> > > > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> >> [email protected]>
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > >> I like the "Index Filtering" option based on the flexibility that
> >> > > > > > >> it provides. Should each output (HDFS, ES, etc) have its own
> >> > > > > > >> configuration settings? For example, aren't things like batching
> >> > > > > > >> handled separately for HDFS versus Elasticsearch?
> >> > > > > > >>
> >> > > > > > >> Something along the lines of...
> >> > > > > > >>
> >> > > > > > >> {
> >> > > > > > >>   "hdfs" : {
> >> > > > > > >>     "when": "exists(field1)",
> >> > > > > > >>     "batchSize": 100
> >> > > > > > >>   },
> >> > > > > > >>
> >> > > > > > >>   "elasticsearch" : {
> >> > > > > > >>     "when": "true",
> >> > > > > > >>     "batchSize": 1000,
> >> > > > > > >>     "index": "squid"
> >> > > > > > >>   }
> >> > > > > > >> }
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
> >> > [email protected]
> >> > > >
> >> > > > > > wrote:
> >> > > > > > >>
> >> > > > > > >> > Yeah, I tend to like the first option too. Any opposition to that
> >> > > > > > >> > from anyone?
> >> > > > > > >> >
> >> > > > > > >> > The points brought up are good ones and I think that it may be
> >> > > > > > >> > worth a broader discussion of the requirements of indexing in a
> >> > > > > > >> > separate dev list thread. Maybe a list of desires with coherent
> >> > > > > > >> > use-cases justifying them so we can think about how this stuff
> >> > > > > > >> > should work and where the natural extension points should be.
> >> > > > > > >> > After all, we need to toe the line between engineering and
> >> > > > > > >> > overengineering for features nobody will want.
> >> > > > > > >> >
> >> > > > > > >> > I'm not sure about the extensions to the standard fields. I'm torn
> >> > > > > > >> > between the notions that we should have no standard fields vs we
> >> > > > > > >> > should have a boatload of standard fields (with most of them
> >> > > > > > >> > empty). I exchange positions fairly regularly on that question. ;)
> >> > > > > > >> > It may be worth a dev list discussion to lay out how you imagine
> >> > > > > > >> > an extension of standard fields and how it might look as
> >> > > > > > >> > implemented in Metron.
> >> > > > > > >> >
> >> > > > > > >> > Casey
> >> > > > > > >> >
> >> > > > > > >> >
> >> > > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> >> > > > > > >> > [email protected]>
> >> > > > > > >> > wrote:
> >> > > > > > >> >
> >> > > > > > >> > > I'll second my preference for the first option. I think the
> >> > > > > > >> > > ability to use Stellar filters to customize indexing would be a
> >> > > > > > >> > > big win.
> >> > > > > > >> > >
> >> > > > > > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> >> > > > > > >> > > think this is a really important use case that we need to
> >> > > > > > >> > > consider. Take a simple example... If I have data coming in from
> >> > > > > > >> > > 3 different firewall vendors and 2 different web proxy/url
> >> > > > > > >> > > filtering vendors and I want to be able to analyze that data
> >> > > > > > >> > > set, I need the data to be indexed all together (likely in HDFS)
> >> > > > > > >> > > and to have a normalized schema such that IP address, URL, and
> >> > > > > > >> > > user name (to take a few) can be easily queried and aggregated.
> >> > > > > > >> > > I can also envision scenarios where I would want to index data
> >> > > > > > >> > > based on attributes other than sensor, business unit or
> >> > > > > > >> > > subsidiary for example.
> >> > > > > > >> > >
> >> > > > > > >> > > I've been wanting to propose extending our 7 standard fields to
> >> > > > > > >> > > include things like URL and user. Is there community
> >> > > > > > >> > > interest/support for moving in that direction? If so, I'll start
> >> > > > > > >> > > a new thread.
> >> > > > > > >> > >
> >> > > > > > >> > > Thanks!
> >> > > > > > >> > >
> >> > > > > > >> > > -Kyle
> >> > > > > > >> > >
> >> > > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
> >> > [email protected]
> >> > > >
> >> > > > > > wrote:
> >> > > > > > >> > >
> >> > > > > > >> > > > Ah, I see. If overriding the default index name allows using
> >> > > > > > >> > > > the same name for multiple sensors, then the goal can be
> >> > > > > > >> > > > achieved.
> >> > > > > > >> > > > Thanks,
> >> > > > > > >> > > > --Matt
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <
> >> [email protected]>
> >> > > > wrote:
> >> > > > > > >> > > >
> >> > > > > > >> > > > Oh, you could! Let's say you have a syslog parser with data
> >> > > > > > >> > > > from sources 1, 2 and 3. You'd end up with one kafka queue
> >> > > > > > >> > > > with 3 parsers attached to that queue, each picking apart the
> >> > > > > > >> > > > messages from source 1, 2 and 3. They'd go through separate
> >> > > > > > >> > > > enrichment and into the indexing topology. In the indexing
> >> > > > > > >> > > > topology, you could specify the same index name "syslog" and
> >> > > > > > >> > > > all of the messages go into the same index for CEP querying
> >> > > > > > >> > > > if so desired.
> >> > > > > > >> > > >
> >> > > > > > >> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
> >> > > > > [email protected]
> >> > > > > > >
> >> > > > > > >> > > wrote:
> >> > > > > > >> > > >
> >> > > > > > >> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in
> >> > > > > > >> > > > > a previous life. It makes perfect sense to route different
> >> > > > > > >> > > > > lines from syslog through different appropriate parsers.
> >> > > > > > >> > > > > But a lot of what the parsers do is identify consistent
> >> > > > > > >> > > > > subsets of metadata and annotate it – eg, src_ip_addr,
> >> > > > > > >> > > > > event timestamps, etc. Once those metadata are annotated
> >> > > > > > >> > > > > and available with common field names, why doesn’t it make
> >> > > > > > >> > > > > sense to index the messages together, for CEP querying? I
> >> > > > > > >> > > > > think Splunk has illustrated this model.
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <
> >> > > [email protected]
> >> > > > >
> >> > > > > > >> wrote:
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > Yeah, I mean, honestly, I think the approach that we've
> >> > > > > > >> > > > > taken for sources which aggregate different types of data
> >> > > > > > >> > > > > is to provide filters at the parser level and have multiple
> >> > > > > > >> > > > > parser topologies (with different, possibly mutually
> >> > > > > > >> > > > > exclusive filters) running. This would be a completely
> >> > > > > > >> > > > > separate sensor. Imagine a syslog data source that
> >> > > > > > >> > > > > aggregates and you want to pick apart certain pieces of
> >> > > > > > >> > > > > messages. This is why the initial thought and architecture
> >> > > > > > >> > > > > was one index per sensor.
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <
> >> > > > > > >> > [email protected]>
> >> > > > > > >> > > > wrote:
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > > I’m thinking that CEP (Complex Event Processing) is
> >> > > > > > >> > > > > > contrary to the idea of silo-ing data per sensor.
> >> > > > > > >> > > > > > Now it’s true that some of those sensors are already
> >> > > > > > >> > > > > > aggregating data from multiple sources, so maybe I’m
> >> > > > > > >> > > > > > wrong here.
> >> > > > > > >> > > > > > But it just seems to me that the “data lake” insights
> >> > > > > > >> > > > > > come from being able to make decisions over the whole
> >> > > > > > >> > > > > > mass of data rather than just vertical slices of it.
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > On 1/12/17, 2:15 PM, "Casey Stella" <
> >> > > > > > [email protected]>
> >> > > > > > >> > > > wrote:
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > Hey Matt,
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > Thanks for the comment!
> >> > > > > > >> > > > > > 1. At the moment, we only have one index name, the
> >> > > > > > >> > > > > > default of which is the sensor name, but that's entirely
> >> > > > > > >> > > > > > up to the user. This is sensor specific, so it'd be a
> >> > > > > > >> > > > > > separate config for each sensor. If we want to build
> >> > > > > > >> > > > > > multiple indices per sensor, we'd have to think carefully
> >> > > > > > >> > > > > > about how to do that, and it would be a bigger
> >> > > > > > >> > > > > > undertaking. I guess I can see the use, though (redirect
> >> > > > > > >> > > > > > messages to one index vs another based on a predicate
> >> > > > > > >> > > > > > for a given sensor). Anyway, not where I was originally
> >> > > > > > >> > > > > > thinking that this discussion would go, but it's an
> >> > > > > > >> > > > > > interesting point.
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > 2. I hadn't thought through the implementation quite
> >> > > > > > >> > > > > > yet, but we don't actually have a splitter bolt in that
> >> > > > > > >> > > > > > topology, just a spout that goes to the elasticsearch
> >> > > > > > >> > > > > > writer and also to the hdfs writer.
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > On Thu, Jan 12, 2017 at 4:52 PM, Matt
> >> > Foley
> >> > > <
> >> > > > > > >> > > > [email protected]>
> >> > > > > > >> > > > > wrote:
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > > Casey, good to have controls like this. Couple
> >> > > > > > >> > > > > > > questions:
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > 1. Regarding the “index” : “squid” name/value pair, is
> >> > > > > > >> > > > > > > the index name expected to always be a sensor name? Or
> >> > > > > > >> > > > > > > is the given json structure subordinate to a sensor
> >> > > > > > >> > > > > > > name in zookeeper? Or can we build arbitrary indexes
> >> > > > > > >> > > > > > > with this new specification, independent of sensor?
> >> > > > > > >> > > > > > > Should there actually be a list of “indexes”, ie
> >> > > > > > >> > > > > > > { “indexes” : [
> >> > > > > > >> > > > > > >   {“index” : “name1”,
> >> > > > > > >> > > > > > >    …
> >> > > > > > >> > > > > > >   },
> >> > > > > > >> > > > > > >   {“index” : “name2”,
> >> > > > > > >> > > > > > >    …
> >> > > > > > >> > > > > > >   } ]
> >> > > > > > >> > > > > > > }
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > 2. Would the filtering / writer selection logic take
> >> > > > > > >> > > > > > > place in the indexing topology splitter bolt? Seems
> >> > > > > > >> > > > > > > like that would have the smallest impact on current
> >> > > > > > >> > > > > > > implementation, no?
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > Sorry if these are already answered in PR-415, I
> >> > > > > > >> > > > > > > haven’t had time to review that one yet.
> >> > > > > > >> > > > > > > Thanks,
> >> > > > > > >> > > > > > > --Matt
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > On 1/12/17, 12:55 PM, "Michael
> >> > Miklavcic"
> >> > > <
> >> > > > > > >> > > > > > [email protected]>
> >> > > > > > >> > > > > > > wrote:
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > I like the flexibility and expressibility of the first
> >> > > > > > >> > > > > > > option with Stellar filters.
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > M
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > On Thu, Jan 12, 2017 at 1:51 PM,
> >> > Casey
> >> > > > > > Stella <
> >> > > > > > >> > > > > > [email protected]>
> >> > > > > > >> > > > > > > wrote:
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > > As of METRON-652
> >> > > > > > >> > > > > > > > <https://github.com/apache/incubator-metron/pull/415>,
> >> > > > > > >> > > > > > > > we will have decoupled the indexing configuration
> >> > > > > > >> > > > > > > > from the enrichment configuration. As an immediate
> >> > > > > > >> > > > > > > > follow-up to that, I'd like to provide the ability to
> >> > > > > > >> > > > > > > > turn off and on writers via the configs. I'd like to
> >> > > > > > >> > > > > > > > get some community feedback on how the functionality
> >> > > > > > >> > > > > > > > should work, if y'all are amenable. :)
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > As of now, we have 3 possible writers which can be
> >> > > > > > >> > > > > > > > used in the indexing topology:
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > - Solr
> >> > > > > > >> > > > > > > > - Elasticsearch
> >> > > > > > >> > > > > > > > - HDFS
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > HDFS is always used; elasticsearch or solr is used
> >> > > > > > >> > > > > > > > depending on how you start the indexing topology.
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > A couple of proposals come to mind immediately:
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > *Index Filtering*
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > You would be able to specify a filter as defined by a
> >> > > > > > >> > > > > > > > stellar statement (likely a reuse of the
> >> > > > > > >> > > > > > > > StellarFilter that exists in the Parsers) which would
> >> > > > > > >> > > > > > > > allow you to indicate on a message-by-message basis
> >> > > > > > >> > > > > > > > whether or not to write the message.
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > The semantics of this would be as follows:
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > - Default (i.e. unspecified) is to pass everything
> >> > > > > > >> > > > > > > > through (hence backwards compatible with the current
> >> > > > > > >> > > > > > > > default config).
> >> > > > > > >> > > > > > > > - Messages for which the associated stellar statement
> >> > > > > > >> > > > > > > > evaluates to true for the writer type will be
> >> > > > > > >> > > > > > > > written, otherwise not.
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > Sample indexing config which would write out no
> >> > > > > > >> > > > > > > > messages to HDFS and write out to Elasticsearch only
> >> > > > > > >> > > > > > > > messages containing a field called "field1":
> >> > > > > > >> > > > > > > > {
> >> > > > > > >> > > > > > > >   "index" : "squid"
> >> > > > > > >> > > > > > > >  ,"batchSize" : 100
> >> > > > > > >> > > > > > > >  ,"filters" : {
> >> > > > > > >> > > > > > > >     "HDFS" : "false"
> >> > > > > > >> > > > > > > >    ,"ES" : "exists(field1)"
> >> > > > > > >> > > > > > > >   }
> >> > > > > > >> > > > > > > > }
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > *Index On/Off Switch*
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > A simpler solution would be to just provide a list of
> >> > > > > > >> > > > > > > > writers to write messages. The semantics would be as
> >> > > > > > >> > > > > > > > follows:
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > - If the list is unspecified, then the default is to
> >> > > > > > >> > > > > > > > write all messages for every writer in the indexing
> >> > > > > > >> > > > > > > > topology
> >> > > > > > >> > > > > > > > - If the list is specified, then a writer will write
> >> > > > > > >> > > > > > > > all messages if and only if it is named in the list.
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > Sample indexing config which turns off HDFS and keeps
> >> > > > > > >> > > > > > > > on Elasticsearch:
> >> > > > > > >> > > > > > > > {
> >> > > > > > >> > > > > > > >   "index" : "squid"
> >> > > > > > >> > > > > > > >  ,"batchSize" : 100
> >> > > > > > >> > > > > > > >  ,"writers" : [ "ES" ]
> >> > > > > > >> > > > > > > > }
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > Thanks in advance for the feedback! Also, if you have
> >> > > > > > >> > > > > > > > any other, better ideas than the ones presented here,
> >> > > > > > >> > > > > > > > let me know too.
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > Best,
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > > > Casey
> >> > > > > > >> > > > > > > >
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > >
> >> > > > > > >> > > > >
> >> > > > > > >> > > > >
> >> > > > > > >> > > > >
> >> > > > > > >> > > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > >
> >> > > > > > >> >
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >> --
> >> > > > > > >> Nick Allen <[email protected]>
> >> > > > > > >>
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Nick Allen <[email protected]>
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> > Nick Allen <[email protected]>
> >
>
>
>
> --
> Nick Allen <[email protected]>
>
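
For comparison, the other schema discussed in the quoted thread (sensor-level
defaults plus a 'writerConfig' override block) could be resolved roughly as
follows. This is a sketch under stated assumptions, not Metron code: the
function names and the fixed writer list are hypothetical, the base-case
values (index = sensor name, batchSize = 1) come from the "Base Case" bullets
in the thread, and 'when' is carried as an uninterpreted Stellar string.

```python
from typing import Any, Dict, Iterable, List


def resolve_with_overrides(
    sensor: str,
    config: Dict[str, Any],
    writers: Iterable[str] = ("elasticsearch", "hdfs"),  # assumed writer names
) -> Dict[str, Dict[str, Any]]:
    """Apply sensor-level defaults, then per-writer overrides."""
    defaults = {
        "index": config.get("index", sensor),    # base case: sensor name
        "batchSize": config.get("batchSize", 1),  # base case: 1
        "when": "true",                           # base case: write everything
    }
    overrides = config.get("writerConfig", {})
    return {w: {**defaults, **overrides.get(w, {})} for w in writers}


def enabled_writers(resolved: Dict[str, Dict[str, Any]]) -> List[str]:
    """Writers whose filter is not the literal 'false' (i.e. not switched off)."""
    return [w for w, settings in resolved.items() if settings["when"] != "false"]


# "Writer-specific case with filters" from the thread.
resolved = resolve_with_overrides(
    "squid",
    {
        "index": "foo",
        "batchSize": 1,
        "writerConfig": {
            "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
            "hdfs": {"when": "false"},
        },
    },
)
```

With this shape every writer always gets an entry, so turning HDFS off takes
an explicit `"when": "false"`, whereas an empty config reproduces the current
behaviour for all writers.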
