Re: [DISCUSS] Turning off indexing writers feature discussion

Casey Stella Mon, 16 Jan 2017 07:51:12 -0800

hahaha :)

On Mon, Jan 16, 2017 at 10:49 AM, Nick Allen <[email protected]> wrote:


> I don't quite support it for #1 and #2, but you absolutely sold me on #3.
> Good sell.  +1
>
>
> On Mon, Jan 16, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:
>
> > Well, I like it for a couple of reasons:
> >
> >    - It's explicit and clear that the writer is on or off
> >    - It enables people to keep their writer config in the file without
> >    having the writer on (so I don't have to adjust the when clause to
> >    "false")
> >    - It enables us to not have to execute a stellar statement for "off"
> >    writers.
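
As a rough illustration of the third point, the check order could look like the Python sketch below. This is only a sketch under stated assumptions: `writer_accepts` and `toy_when` are hypothetical names, and `toy_when` is a toy stand-in for a real Stellar evaluator, not Metron's API. The 'enabled' flag defaults to on when unspecified (matching current semantics) and is tested first, so a disabled writer never evaluates its 'when' expression.

```python
from typing import Any, Callable, Mapping

Message = Mapping[str, Any]


def toy_when(expr: str, message: Message) -> bool:
    """Toy stand-in for a Stellar evaluator (assumption, not Metron's API).

    Understands only 'true', 'false' and exists(<field>).
    """
    expr = expr.strip()
    if expr in ("true", "false"):
        return expr == "true"
    if expr.startswith("exists(") and expr.endswith(")"):
        return expr[len("exists("):-1].strip() in message
    raise ValueError(f"unsupported expression: {expr!r}")


def writer_accepts(
    writer_cfg: Mapping[str, Any],
    message: Message,
    eval_when: Callable[[str, Message], bool],
) -> bool:
    """Decide whether one writer should write one message."""
    # 'enabled' defaults to on when unspecified; the examples in this
    # thread write it as the strings 'true' / 'false', so accept both
    # strings and booleans.
    enabled = writer_cfg.get("enabled", True)
    if isinstance(enabled, str):
        enabled = enabled.strip().lower() == "true"
    if not enabled:
        return False  # short-circuit: 'when' is never evaluated

    # No 'when' clause means "write everything".
    when = writer_cfg.get("when")
    if when is None:
        return True
    return bool(eval_when(when, message))


if __name__ == "__main__":
    msg = {"field1": "x"}
    hdfs_off = {"enabled": "false", "path": "/foo/bar/...", "batchSize": 100}
    es_filtered = {"enabled": "true", "when": "exists(field1)"}
    print(writer_accepts(hdfs_off, msg, toy_when))
    print(writer_accepts(es_filtered, msg, toy_when))
```

Note that the disabled HDFS entry keeps its path and batchSize untouched, which is the workflow benefit from the second point.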
> >
> >
> >
> > On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen <[email protected]> wrote:
> >
> > > I'm all for a compromise here.  Sounds like we're getting close.
> > >
> > > Just one thing.  Can you lay out the reasoning for having 'enabled' and
> > > 'when'?  I don't follow the reasoning, but maybe I am missing something.
> > >
> > > On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <
> > > [email protected]
> > > > wrote:
> > >
> > > > I'm +1 on the current proposal. I like Nick's syntax and agree with
> > Jon's
> > > > enabled property. I also like the idea of a path property for HDFS.
> > > >
> > > > -Kyle
> > > >
> > > > > On Jan 14, 2017, at 10:51 AM, Casey Stella <[email protected]>
> > wrote:
> > > > >
> > > > > I'm +1 on an explicit enabled property and a filter (or when)
> > > property. I
> > > > > think we are zeroing in on a decent design, so that is good.
> > > > >
> > > > > To recap, what I am +1 on is Nick's proposed syntax with the
> > following
> > > > > modifications:
> > > > > 1. An explicit enabled field
> > > > > 2. A default on for unspecified to match current semantics
> > > > >
> > > > > Casey
> > > > >> On Sat, Jan 14, 2017 at 10:45 [email protected] <[email protected]>
> > > > wrote:
> > > > >>
> > > > >> This has the additional benefit of doing something like below when
> > you
> > > > want
> > > > >> to temporarily disable the hdfs writer, but don't want to remove
> the
> > > > >> settings.  This removes the need to store the path and batchSize
> > (and
> > > > many
> > > > >> additional settings) somewhere else so they can be brought back in
> > > when
> > > > you
> > > > >> want to re-enable it, which is a nice workflow attribute for the
> end
> > > > user:
> > > > >>
> > > > >> {
> > > > >>   'elasticsearch': {
> > > > >>      'enabled': 'true',
> > > > >>      'index': 'foo',
> > > > > >>      'batchSize': 100
> > > > >>    },
> > > > >>   'hdfs': {
> > > > >>      'enabled': 'false',
> > > > >>      'path': '/foo/bar/...',
> > > > > >>      'batchSize': 100
> > > > >>    },
> > > > >>   'solr': {
> > > > >>      'enabled': 'false'
> > > > >>    }
> > > > >> }
> > > > >>
> > > > >> Jon
> > > > >>
> > > > >>> On Sat, Jan 14, 2017 at 9:24 AM [email protected] <
> [email protected]
> > >
> > > > wrote:
> > > > >>>
> > > > >>> I similarly have a concern there because I prefer being as
> explicit
> > > as
> > > > >>> possible, which makes things easier to pick up for new users.
> > Using
> > > my
> > > > >>> example from earlier this could look like specifying
> while(false),
> > > but
> > > > an
> > > > >>> even better and more obvious approach may be to use
> enabled(false).
> > > So
> > > > >> the
> > > > >>> current simple default would be:
> > > > >>>
> > > > >>> {
> > > > >>>   'elasticsearch': { 'enabled': 'true' },
> > > > >>>   'hdfs': { 'enabled': 'true' },
> > > > > >>>   'solr': { 'enabled': 'false' }
> > > > >>> }
> > > > >>>
> > > > >>> And to use ES with some overrides but not HDFS or solr it would
> > look
> > > > >> like:
> > > > >>>
> > > > >>> {
> > > > >>>   'elasticsearch': {
> > > > >>>      'enabled': 'true',
> > > > >>>      'index': 'foo',
> > > > >>>      'batchSize': 100
> > > > >>>    },
> > > > >>>   'hdfs': {
> > > > >>>      'enabled': 'false'
> > > > >>>    },
> > > > >>>   'solr': {
> > > > >>>      'enabled': 'false'
> > > > >>>    }
> > > > >>> }
> > > > >>>
> > > > >>> Jon
> > > > >>>
> > > > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <
> [email protected]>
> > > > >> wrote:
> > > > >>>
> > > > > >>> One thing that I thought of that I very strenuously do not like in
> > > Nick's
> > > > >>> proposal is that if a writer config is not specified then it is
> > > turned
> > > > >> off
> > > > >>> (I think; if I misunderstood let me know). In the situation where
> > we
> > > > >> have a
> > > > >>> new sensor, right now if there are no index config and no
> > enrichment
> > > > >>> config, it still passes through to the index using defaults. In
> > this
> > > > new
> > > > >>> scheme it would not. This changes the default semantics for the
> > > system
> > > > >> and
> > > > >>> I think it changes it for the worse.
> > > > >>>
> > > > > >>> I would strongly prefer an on-by-default indexing config as we
> have
> > > now.
> > > > >>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <[email protected]>
> > > > wrote:
> > > > >>>>
> > > > >>>> One thing that I really like about Nick's suggestion is that it
> > > allows
> > > > >>>> writer-specific configs in a clear and simple way.  It is more
> > > complex
> > > > >>> for
> > > > >>>> the default case (all writers write to indices named the same
> > thing
> > > > >> with
> > > > >>> a
> > > > >>>> fixed batch size), which I do not like, but maybe it's worth the
> > > > >>> compromise
> > > > >>>> to make it less complex for the advanced case.
> > > > >>>>
> > > > >>>> Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> > > > beginning
> > > > >>> to
> > > > >>>> lean your way.
> > > > >>>>
> > > > >>>> On Fri, Jan 13, 2017 at 2:51 PM, [email protected] <
> > [email protected]
> > > >
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>> I like the suggestions you made, Nick.  The only thing I would
> add
> > > is
> > > > >>> that
> > > > >>>> it's also nice to see an explicit when(false), as people newer
> to
> > > the
> > > > >>>> platform may not know where to expect configs for the different
> > > > >> writers.
> > > > >>>> Being able to do it either way, which I think is already assumed
> > in
> > > > >> your
> > > > >>>> model, would make sense.  I would just suggest that, if we
> support
> > > but
> > > > >>> are
> > > > >>>> disabling a writer, that the platform inserts a default
> > when(false)
> > > to
> > > > >> be
> > > > >>>> explicit.
> > > > >>>>
> > > > >>>> Jon
> > > > >>>>
> > > > >>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <
> [email protected]
> > >
> > > > >>> wrote:
> > > > >>>>
> > > > >>>>> Let me noodle on this over the weekend.  Your syntax is looking
> > > less
> > > > >>>>> onerous to me and I like the following statement from Otto: "In
> > the
> > > > >>> end,
> > > > > >>>>> each write destination ‘type’ will need its own configuration.
> > > This
> > > > >>> is
> > > > >>>> an
> > > > >>>>> extension point."
> > > > >>>>>
> > > > >>>>> I may come around to your way of thinking.
> > > > >>>>>
> > > > >>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
> > > > >> [email protected]
> > > > >>>>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > > >>>>>> In the end, each write destination ‘type’ will need its own
> > > > >>>>>> configuration.  This is an extension point.
> > > > >>>>>> {
> > > > >>>>>> HDFS:{
> > > > >>>>>> outputAdapters:[
> > > > >>>>>> {name: avro,
> > > > >>>>>> settings:{
> > > > >>>>>> avro stuff….
> > > > >>>>>> when:{
> > > > >>>>>> },
> > > > >>>>>> {
> > > > >>>>>> name: sequence file,
> > > > >>>>>> …..
> > > > >>>>>>
> > > > >>>>>> or some such.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On January 13, 2017 at 11:51:15, Nick Allen (
> [email protected]
> > )
> > > > >>>> wrote:
> > > > >>>>>>
> > > > >>>>>> I will add also that instead of global overrides, like index,
> we
> > > > >>> should
> > > > >>>>> use
> > > > >>>>>> configuration key names that are more appropriate to the
> output.
> > > > >>>>>>
> > > > >>>>>> For example, does 'index' really make sense for HDFS? Or would
> > > > >> 'path'
> > > > >>>> be
> > > > >>>>>> more appropriate?
> > > > >>>>>>
> > > > >>>>>> {
> > > > >>>>>> 'elasticsearch': {
> > > > >>>>>> 'index': 'foo',
> > > > >>>>>> 'batchSize': 1
> > > > >>>>>> },
> > > > >>>>>> 'hdfs': {
> > > > >>>>>> 'path': '/foo/bar/...',
> > > > >>>>>> 'batchSize': 100
> > > > >>>>>> }
> > > > >>>>>> }
> > > > >>>>>>
> > > > > >>>>>> Ok, I've said my piece. Thanks for the effort in summarizing
> all
> > > > >>> this,
> > > > >>>>>> Casey.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <
> > [email protected]>
> > > > >>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Nick's concerns about my suggestion were that it was overly
> > > > >> complex
> > > > >>>> and
> > > > >>>>>>>> hard to grok and that we could dispense with backwards
> > > > >>> compatibility
> > > > >>>>> and
> > > > >>>>>>>> make people do a bit more work on the default case for the
> > > > >>> benefits
> > > > >>>>> of a
> > > > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate
> your
> > > > >>>>> position)
> > > > >>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> I will add that, in my mind, the majority case would be a
> > user
> > > > >>>>>>> specifying the outputs, but not things like 'batchSize' or
> > > > >> 'when'.
> > > > >>> I
> > > > >>>>>> think
> > > > >>>>>>> in the majority case, the user would accept whatever the
> > default
> > > > >>>> batch
> > > > >>>>>> size
> > > > >>>>>>> is.
> > > > >>>>>>>
> > > > > >>>>>>> Here are alternative suggestions for all the examples that
> you
> > > > >>>>> provided
> > > > >>>>>>> previously.
> > > > >>>>>>>
> > > > >>>>>>> Base Case
> > > > >>>>>>>
> > > > >>>>>>> - The user must always specify the 'outputs' for clarity.
> > > > >>>>>>> - Uses default index name, batch size and when = true.
> > > > >>>>>>>
> > > > >>>>>>> {
> > > > >>>>>>> 'elasticsearch': {},
> > > > >>>>>>> 'hdfs': {}
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> <
> > > > >>>>>> https://gist.github.com/nickwallen/
> > 489735b65cdb38aae6e45cec7633a0
> > > > >>>>>> a1#writer-non-specific-case>Writer-non-specific
> > > > >>>>>>
> > > > >>>>>>> Case
> > > > >>>>>>>
> > > > >>>>>>> - There are no global overrides, as in Casey's proposal.
> > > > >>>>>>> - Easier to grok IMO.
> > > > >>>>>>>
> > > > >>>>>>> {
> > > > >>>>>>> 'elasticsearch': {
> > > > >>>>>>> 'index': 'foo',
> > > > >>>>>>> 'batchSize': 100
> > > > >>>>>>> },
> > > > >>>>>>> 'hdfs': {
> > > > >>>>>>> 'index': 'foo',
> > > > >>>>>>> 'batchSize': 100
> > > > >>>>>>> }
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> <
> > > > >>>>>> https://gist.github.com/nickwallen/
> > 489735b65cdb38aae6e45cec7633a0
> > > > >>>>>> a1#writer-specific-case-without-filters>Writer-specific
> > > > >>>>>>
> > > > >>>>>>> case without filters
> > > > >>>>>>>
> > > > >>>>>>> {
> > > > >>>>>>> 'elasticsearch': {
> > > > >>>>>>> 'index': 'foo',
> > > > >>>>>>> 'batchSize': 1
> > > > >>>>>>> },
> > > > >>>>>>> 'hdfs': {
> > > > >>>>>>> 'index': 'foo',
> > > > >>>>>>> 'batchSize': 100
> > > > >>>>>>> }
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> <
> > > > >>>>>> https://gist.github.com/nickwallen/
> > 489735b65cdb38aae6e45cec7633a0
> > > > >>>>>> a1#writer-specific-case-with-filters>Writer-specific
> > > > >>>>>>
> > > > >>>>>>> case with filters
> > > > >>>>>>>
> > > > >>>>>>> - Instead of having to say when=false, just don't configure
> > HDFS
> > > > >>>>>>>
> > > > >>>>>>> {
> > > > >>>>>>> 'elasticsearch': {
> > > > >>>>>>> 'index': 'foo',
> > > > >>>>>>> 'batchSize': 100,
> > > > >>>>>>> 'when': 'exists(field1)'
> > > > >>>>>>> }
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <
> > > > >> [email protected]
> > > > >>>>
> > > > >>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Dave,
> > > > >>>>>>>> For the benefit of posterity and people who might not be as
> > > > >> deeply
> > > > >>>>>>>> entangled in the system as we have been, I'll recap things
> and
> > > > >>>>> hopefully
> > > > >>>>>>>> answer your question in the process.
> > > > >>>>>>>>
> > > > >>>>>>>> Historically the index configuration is split between the
> > > > >>> enrichment
> > > > >>>>>>>> configs and the global configs.
> > > > >>>>>>>>
> > > > > >>>>>>>> - The global configs really control configs that apply to
> all
> > > > >>>>> sensors.
> > > > >>>>>>>> Historically this has been stuff like index connection
> > strings,
> > > > >>> etc.
> > > > > >>>>>>>> - The sensor-specific configs control things that vary
> > by
> > > > >>>>> sensor.
> > > > >>>>>>>>
> > > > >>>>>>>> As of Metron-652 (in review currently), we moved the sensor
> > > > >>> specific
> > > > >>>>>>>> configs from the enrichment configs. The proposal here is to
> > > > >>>> increase
> > > > >>>>>> the
> > > > > >>>>>>>> granularity of the sensor-specific files to make them
> > > > >> support
> > > > >>>>> index
> > > > >>>>>>>> writer-specific configs. Right now in the indexing topology,
> > we
> > > > >>>> have 2
> > > > >>>>>>>> writers (fixed): ES/Solr and HDFS.
> > > > >>>>>>>>
> > > > >>>>>>>> The proposed configuration would allow you to either
> specify a
> > > > >>>> blanket
> > > > >>>>>>>> sensor-level config for the index name and batchSize and/or
> > > > >>> override
> > > > >>>>> at
> > > > >>>>>>>> the
> > > > >>>>>>>> writer level, thereby supporting a couple of use-cases:
> > > > >>>>>>>>
> > > > >>>>>>>> - Turning off certain index writers (e.g. HDFS)
> > > > >>>>>>>> - Filtering the messages written to certain index writers
> > > > >>>>>>>>
> > > > > >>>>>>>> The two competing configs between Nick and me are as follows:
> > > > >>>>>>>>
> > > > >>>>>>>> - I want to make sure we keep the old sensor-specific
> defaults
> > > > >>> with
> > > > >>>>>>>> writer-specific overrides available
> > > > >>>>>>>> - Nick thought we could simplify the permutations by making
> > the
> > > > >>>>>>>> indexing
> > > > >>>>>>>> config only the writer-level configs.
> > > > >>>>>>>>
> > > > >>>>>>>> My concerns about Nick's suggestion were that the default
> and
> > > > >>>> majority
> > > > >>>>>>>> case, specifying the index and the batchSize for all writers
> > > > > >>>>>>>> (the one we support now) would require more configuration.
> > > > >>>>>>>>
> > > > >>>>>>>> Nick's concerns about my suggestion were that it was overly
> > > > >>> complex
> > > > >>>>> and
> > > > >>>>>>>> hard to grok and that we could dispense with backwards
> > > > >>> compatibility
> > > > >>>>> and
> > > > >>>>>>>> make people do a bit more work on the default case for the
> > > > >>> benefits
> > > > >>>>> of a
> > > > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate
> your
> > > > >>>>> position).
> > > > >>>>>>>>
> > > > >>>>>>>> Casey
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <
> > > > >>> [email protected]>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Casey,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Can you give me a level set of what your thinking is now? I
> > > > >>> think
> > > > >>>>> it's
> > > > >>>>>>>>> global control of all index types + overrides on a per-type
> > > > >>> basis.
> > > > >>>>>> Fwiw,
> > > > >>>>>>>>> I'm totally for that, but I want to make sure I'm not
> > imposing
> > > > >>> my
> > > > > >>>>>>>>> preconceived notions on your consensus-driven ones.
> > > > >>>>>>>>>
> > > > >>>>>>>>> -D....
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
> > > > >>>> [email protected]>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> I am suggesting that, yes. The configs are essentially the
> > > > >>> same
> > > > >>>> as
> > > > >>>>>>>>> yours,
> > > > >>>>>>>>>> except there is an override specified at the top level.
> > > > >>> Without
> > > > >>>>>>>> that, in
> > > > >>>>>>>>>> order to specify both HDFS and ES have batch sizes of 100,
> > > > >> you
> > > > >>>>> have
> > > > >>>>>> to
> > > > >>>>>>>>>> explicitly configure each. It's less that I'm trying to
> have
> > > > >>>>>>>> backwards
> > > > >>>>>>>>>> compatibility and more that I'm trying to make the
> majority
> > > > >>> case
> > > > >>>>>> easy:
> > > > >>>>>>>>> both
> > > > >>>>>>>>>> writers write everything to a specified index name with a
> > > > >>>>> specified
> > > > >>>>>>>> batch
> > > > >>>>>>>>>> size (which is what we have now). Beyond that, I want to
> > > > >> allow
> > > > >>>> for
> > > > >>>>>>>>>> specifying an override for the config on a
> writer-by-writer
> > > > >>>> basis
> > > > >>>>>> for
> > > > >>>>>>>>> those
> > > > >>>>>>>>>> who need it.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
> > > > >>>> [email protected]>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Are you saying we support all of these variants? I
> realize
> > > > >>> you
> > > > >>>>> are
> > > > >>>>>>>>>> trying
> > > > >>>>>>>>>>> to have some backwards compatibility, but this also makes
> > > > >> it
> > > > >>>>>> harder
> > > > >>>>>>>>> for a
> > > > >>>>>>>>>>> user to grok (for me at least).
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Personally I like my original example as there are fewer
> > > > >>>>>>>>> sub-structures,
> > > > >>>>>>>>>>> like 'writerConfig', which makes the whole thing simpler
> > > > >> and
> > > > >>>>>> easier
> > > > >>>>>>>> to
> > > > >>>>>>>>>>> grok. But maybe others will think your proposal is just
> as
> > > > >>>> easy
> > > > >>>>> to
> > > > >>>>>>>>> grok.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> > > > >>>>>> [email protected]>
> > > > >>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> - Keeping the configs that we have now (batchSize and
> > > > >>> index)
> > > > >>>>> as
> > > > >>>>>>>>>>> defaults
> > > > >>>>>>>>>>>> for the unspecified writer-specific case
> > > > >>>>>>>>>>>> - Adding the config Nick suggested
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> *Base Case*:
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> - all writers write all messages
> > > > >>>>>>>>>>>> - index named the same as the sensor for all writers
> > > > >>>>>>>>>>>> - batchSize of 1 for all writers
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> *Writer-non-specific case*:
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>> "index" : "foo"
> > > > >>>>>>>>>>>> ,"batchSize" : 100
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> - All writers write all messages
> > > > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > > > >> all
> > > > >>>>>>>> writers
> > > > >>>>>>>>>>>> - batchSize is 100 for all writers
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> *Writer-specific case without filters*
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>> "index" : "foo"
> > > > >>>>>>>>>>>> ,"batchSize" : 1
> > > > >>>>>>>>>>>> , "writerConfig" :
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>> "elasticsearch" : {
> > > > >>>>>>>>>>>> "batchSize" : 100
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> - All writers write all messages
> > > > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > > > >> all
> > > > >>>>>>>> writers
> > > > >>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch
> > > > >>> writers
> > > > >>>>>>>>>>>> - NOTE: I could override the index name too
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> *Writer-specific case with filters*
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>> "index" : "foo"
> > > > >>>>>>>>>>>> ,"batchSize" : 1
> > > > >>>>>>>>>>>> , "writerConfig" :
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>> "elasticsearch" : {
> > > > >>>>>>>>>>>> "batchSize" : 100,
> > > > >>>>>>>>>>>> "when" : "exists(field1)"
> > > > >>>>>>>>>>>> },
> > > > >>>>>>>>>>>> "hdfs" : {
> > > > >>>>>>>>>>>> "when" : "false"
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>> }
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> - ES writer writes messages which have field1, HDFS
> > > > >>> doesn't
> > > > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > > > >> all
> > > > >>>>>>>> writers
> > > > > >>>>>>>>>>>> - batchSize is 100 for elasticsearch writers
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thoughts?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> > > > >>>>>>>> [email protected]
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> For larger installations you need to control what is
> > > > >>>> indexed
> > > > >>>>>> so
> > > > >>>>>>>> you
> > > > >>>>>>>>>>> don’t
> > > > >>>>>>>>>>>>> end up with a nasty elastic search situation and so
> > > > >> you
> > > > >>>> can
> > > > >>>>>> mine
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>> data
> > > > >>>>>>>>>>>>> later for reports and training ml models.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks
> > > > >>>>>>>>>>>>> Carolyn
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <
> > > > >> [email protected]
> > > > >>>>
> > > > >>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> OH that's a good idea!
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> > > > >>>>>>>> [email protected]>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> I like the "Index Filtering" option based on the
> > > > >>>>>> flexibility
> > > > >>>>>>>>> that
> > > > >>>>>>>>>> it
> > > > >>>>>>>>>>>>>>> provides. Should each output (HDFS, ES, etc) have
> > > > >> its
> > > > >>>> own
> > > > >>>>>>>>>>>> configuration
> > > > >>>>>>>>>>>>>>> settings? For example, aren't things like batching
> > > > >>>>> handled
> > > > >>>>>>>>>>> separately
> > > > >>>>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>> HDFS versus Elasticsearch?
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Something along the lines of...
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> {
> > > > >>>>>>>>>>>>>>> "hdfs" : {
> > > > >>>>>>>>>>>>>>> "when": "exists(field1)",
> > > > >>>>>>>>>>>>>>> "batchSize": 100
> > > > >>>>>>>>>>>>>>> },
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> "elasticsearch" : {
> > > > >>>>>>>>>>>>>>> "when": "true",
> > > > >>>>>>>>>>>>>>> "batchSize": 1000,
> > > > >>>>>>>>>>>>>>> "index": "squid"
> > > > >>>>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
> > > > >>>>>>>>> [email protected]
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any
> > > > >>>>> opposition
> > > > >>>>>>>> to
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>>> from
> > > > >>>>>>>>>>>>>>>> anyone?
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> The points brought up are good ones and I think
> > > > >>> that
> > > > >>>> it
> > > > >>>>>>>> may be
> > > > >>>>>>>>>>>> worth a
> > > > >>>>>>>>>>>>>>>> broader discussion of the requirements of
> > > > >> indexing
> > > > >>>> in a
> > > > >>>>>>>>> separate
> > > > >>>>>>>>>>> dev
> > > > >>>>>>>>>>>>> list
> > > > >>>>>>>>>>>>>>>> thread. Maybe a list of desires with coherent
> > > > >>>> use-cases
> > > > >>>>>>>>>>> justifying
> > > > >>>>>>>>>>>>> them
> > > > >>>>>>>>>>>>>>> so
> > > > >>>>>>>>>>>>>>>> we can think about how this stuff should work and
> > > > >>>> where
> > > > >>>>>> the
> > > > >>>>>>>>>>> natural
> > > > > >>>>>>>>>>>>>>>> extension points should be. After all, we need to
> > > > >>> toe
> > > > >>>>> the
> > > > >>>>>>>> line
> > > > >>>>>>>>>>>> between
> > > > >>>>>>>>>>>>>>>> engineering and overengineering for features
> > > > >> nobody
> > > > >>>>> will
> > > > >>>>>>>> want.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard
> > > > >>>>> fields.
> > > > >>>>>>>> I'm
> > > > >>>>>>>>>>> torn
> > > > >>>>>>>>>>>>>>> between
> > > > >>>>>>>>>>>>>>>> the notions that we should have no standard
> > > > >> fields
> > > > >>> vs
> > > > >>>>> we
> > > > >>>>>>>>> should
> > > > >>>>>>>>>>>> have a
> > > > >>>>>>>>>>>>>>>> boatload of standard fields (with most of them
> > > > >>>> empty).
> > > > >>>>> I
> > > > >>>>>>>>>> exchange
> > > > >>>>>>>>>>>>>>>> positions fairly regularly on that question. ;)
> > > > >> It
> > > > >>>> may
> > > > >>>>> be
> > > > >>>>>>>>>> worth a
> > > > >>>>>>>>>>>> dev
> > > > >>>>>>>>>>>>>>> list
> > > > >>>>>>>>>>>>>>>> discussion to lay out how you imagine an
> > > > >> extension
> > > > >>> of
> > > > >>>>>>>> standard
> > > > >>>>>>>>>>>> fields
> > > > >>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>> how it might look as implemented in Metron.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Casey
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Casey
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson
> > > > >> <
> > > > >>>>>>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> I'll second my preference for the first
> > > > >> option. I
> > > > >>>>> think
> > > > >>>>>>>> the
> > > > >>>>>>>>>>>> ability
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>> use
> > > > >>>>>>>>>>>>>>>>> Stellar filters to customize indexing would be
> > > > >> a
> > > > >>>> big
> > > > >>>>>> win.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data
> > > > >>> lake
> > > > >>>>> and
> > > > >>>>>>>> CEP.
> > > > >>>>>>>>> I
> > > > >>>>>>>>>>>> think
> > > > >>>>>>>>>>>>>>> this
> > > > >>>>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>>> a really important use case that we need to
> > > > >>>> consider.
> > > > >>>>>>>> Take a
> > > > >>>>>>>>>>>> simple
> > > > >>>>>>>>>>>>>>>>> example... If I have data coming in from 3
> > > > >>>> different
> > > > >>>>>>>>> firewall
> > > > >>>>>>>>>>>>> vendors
> > > > >>>>>>>>>>>>>>>> and 2
> > > > >>>>>>>>>>>>>>>>> different web proxy/url filtering vendors and I
> > > > >>>> want
> > > > >>>>> to
> > > > >>>>>>>> be
> > > > >>>>>>>>>> able
> > > > >>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>> analyze
> > > > >>>>>>>>>>>>>>>>> that data set, I need the data to be indexed
> > > > >> all
> > > > >>>>>> together
> > > > >>>>>>>>>>> (likely
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>> HDFS)
> > > > >>>>>>>>>>>>>>>>> and to have a normalized schema such that IP
> > > > >>>> address,
> > > > >>>>>>>> URL,
> > > > >>>>>>>>> and
> > > > >>>>>>>>>>>> user
> > > > >>>>>>>>>>>>>>> name
> > > > >>>>>>>>>>>>>>>>> (to take a few) can be easily queried and
> > > > >>>>> aggregated. I
> > > > >>>>>>>> can
> > > > >>>>>>>>>> also
> > > > >>>>>>>>>>>>>>> envision
> > > > >>>>>>>>>>>>>>>>> scenarios where I would want to index data
> > > > >> based
> > > > >>> on
> > > > >>>>>>>>> attributes
> > > > >>>>>>>>>>>> other
> > > > >>>>>>>>>>>>>>> than
> > > > >>>>>>>>>>>>>>>>> sensor, business unit or subsidiary for
> > > > >> example.
> > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> I've been wanting to propose extending our 7
> > > > >>>> standard
> > > > >>>>>>>> fields
> > > > >>>>>>>>> to
> > > > >>>>>>>>>>>>> include
> > > > >>>>>>>>>>>>>>>>> things like URL and user. Is there community
> > > > >>>>>>>>> interest/support
> > > > >>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>> moving
> > > > >>>>>>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>> that direction? If so, I'll start a new thread.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks!
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> -Kyle
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
> > > > >>>>>>>>> [email protected]
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index
> > > > >> name
> > > > >>>>>> allows
> > > > >>>>>>>>>> using
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> same
> > > > >>>>>>>>>>>>>>>>>> name for multiple sensors, then the goal can
> > > > >> be
> > > > >>>>>>>> achieved.
> > > > >>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>> --Matt
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" <
> > > > >>>>>>>> [email protected]>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog
> > > > >>> parser
> > > > >>>>>>>>> with
> > > > >>>>>>>>>>> data
> > > > >>>>>>>>>>>>> from
> > > > >>>>>>>>>>>>>>>>>> sources 1
> > > > >>>>>>>>>>>>>>>>>> 2 and 3. You'd end up with one kafka queue
> > > > >>> with 3
> > > > >>>>>>>>>> parsers
> > > > >>>>>>>>>>>>>>> attached
> > > > >>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>>>>>> queue, each picking apart the messages from
> > > > >>> source
> > > > >>>>>>>> 1, 2
> > > > >>>>>>>>>> and
> > > > >>>>>>>>>>>> 3.
> > > > >>>>>>>>>>>>>>>> They'd
> > > > >>>>>>>>>>>>>>>>>> go
> > > > >>>>>>>>>>>>>>>>>> through separate enrichment and into the
> > > > >>> indexing
> > > > >>>>>>>>>>> topology.
> > > > >>>>>>>>>>>>> In
> > > > >>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>> indexing topology, you could specify the same
> > > > >>>> index
> > > > >>>>>>>>> name
> > > > >>>>>>>>>>>>> "syslog"
> > > > >>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>> all
> > > > >>>>>>>>>>>>>>>>>> of the messages go into the same index for
> > > > >> CEP
> > > > >>>>>>>>> querying
> > > > >>>>>>>>>> if
> > > > >>>>>>>>>>>> so
> > > > >>>>>>>>>>>>>>>>> desired.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
> > > > >>>>>>>>>>>> [email protected]
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I
> > > > >> worked
> > > > >>> at
> > > > >>>>>>>>>> LogLogic
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>> previous
> > > > >>>>>>>>>>>>>>>>>>> life. It makes perfect sense to route
> > > > >>> different
> > > > >>>>>>>>> lines
> > > > >>>>>>>>>>>> from
> > > > >>>>>>>>>>>>>>>> syslog
> > > > >>>>>>>>>>>>>>>>>> through
> > > > >>>>>>>>>>>>>>>>>>> different appropriate parsers. But a lot of
> > > > >>>> what
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>> parsers
> > > > >>>>>>>>>>>>>>> do
> > > > >>>>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>>>>> identify consistent subsets of metadata and
> > > > >>>>>>>> annotate
> > > > >>>>>>>>>> it
> > > > >>>>>>>>>>> –
> > > > >>>>>>>>>>>>> eg,
> > > > >>>>>>>>>>>>>>>>>> src_ip_addr,
> > > > >>>>>>>>>>>>>>>>>>> event timestamps, etc. Once those metadata
> > > > >>> are
> > > > >>>>>>>>>>> annotated
> > > > >>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>> available
> > > > >>>>>>>>>>>>>>>>>>> with common field names, why doesn’t it
> > > > >> make
> > > > >>>>>>>> sense
> > > > >>>>>>>>> to
> > > > >>>>>>>>>>>> index
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>> messages
> > > > >>>>>>>>>>>>>>>>>>> together, for CEP querying? I think Splunk
> > > > >>> has
> > > > >>>>>>>>>>>> illustrated
> > > > >>>>>>>>>>>>>>> this
> > > > >>>>>>>>>>>>>>>>>> model.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" <
> > > > >>>>>>>>>> [email protected]
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the
> > > > >> approach
> > > > >>>>>>>>> that
> > > > >>>>>>>>>>>> we've
> > > > >>>>>>>>>>>>>>> taken
> > > > >>>>>>>>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>>>>>> sources
> > > > >>>>>>>>>>>>>>>>>>> which aggregate different types of data is
> > > > >> to
> > > > >>>>>>>>>>> provide
> > > > >>>>>>>>>>>>>>> filters
> > > > >>>>>>>>>>>>>>>>> at
> > > > >>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>> parser
> > > > >>>>>>>>>>>>>>>>>>> level and have multiple parser topologies
> > > > >>>>>>>> (with
> > > > >>>>>>>>>>>>> different,
> > > > >>>>>>>>>>>>>>>>>> possibly
> > > > >>>>>>>>>>>>>>>>>>> mutually exclusive filters) running. This
> > > > >>>>>>>> would
> > > > >>>>>>>>>> be
> > > > >>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>> completely
> > > > >>>>>>>>>>>>>>>>>>> separate
> > > > >>>>>>>>>>>>>>>>>>> sensor. Imagine a syslog data source that
> > > > >>>>>>>>>>> aggregates
> > > > >>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>> you
> > > > >>>>>>>>>>>>>>>>>> want to
> > > > >>>>>>>>>>>>>>>>>>> pick
> > > > >>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is
> > > > >>>>>>>> why
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>> initial
> > > > >>>>>>>>>>>>>>>>>> thought and
> > > > >>>>>>>>>>>>>>>>>>> architecture was one index per sensor.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt
> > > > >> Foley <
> > > > >>>>>>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event
> > > > >>>>>>>>> Processing)
> > > > >>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>> contrary
> > > > >>>>>>>>>>>>>>>>>> to the
> > > > >>>>>>>>>>>>>>>>>>> idea
> > > > >>>>>>>>>>>>>>>>>>>> of silo-ing data per sensor.
> > > > >>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors
> > > > >>>>>>>> are
> > > > >>>>>>>>>>> already
> > > > >>>>>>>>>>>>>>>>>> aggregating
> > > > >>>>>>>>>>>>>>>>>>> data from
> > > > >>>>>>>>>>>>>>>>>>>> multiple sources, so maybe I’m wrong
> > > > >> here.
> > > > >>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data
> > > > >>>>>>>> lake”
> > > > >>>>>>>>>>>> insights
> > > > >>>>>>>>>>>>>>> come
> > > > >>>>>>>>>>>>>>>>> from
> > > > >>>>>>>>>>>>>>>>>>> being able
> > > > >>>>>>>>>>>>>>>>>>>> to make decisions over the whole mass of
> > > > >>>>>>>> data
> > > > >>>>>>>>>>> rather
> > > > >>>>>>>>>>>>> than
> > > > >>>>>>>>>>>>>>>>> just
> > > > >>>>>>>>>>>>>>>>>>> vertical
> > > > >>>>>>>>>>>>>>>>>>>> slices of it.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" <
> > > > >>>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Hey Matt,
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Thanks for the comment!
> > > > >>>>>>>>>>>>>>>>>>>> 1. At the moment, we only have one
> > > > >>>>>>>> index
> > > > >>>>>>>>>> name,
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> default
> > > > >>>>>>>>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>>>>>> which is
> > > > >>>>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>> sensor name but that's entirely up to
> > > > >>>>>>>> the
> > > > >>>>>>>>>>> user.
> > > > >>>>>>>>>>>>> This
> > > > >>>>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>>>> sensor
> > > > >>>>>>>>>>>>>>>>>>>> specific,
> > > > >>>>>>>>>>>>>>>>>>>> so it'd be a separate config for each
> > > > >>>>>>>>>> sensor.
> > > > >>>>>>>>>>>> If
> > > > >>>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>>>> want
> > > > >>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>> build
> > > > >>>>>>>>>>>>>>>>>>>> multiple
> > > > >>>>>>>>>>>>>>>>>>>> indices per sensor, we'd have to think
> > > > >>>>>>>>>>> carefully
> > > > >>>>>>>>>>>>>>> about
> > > > >>>>>>>>>>>>>>>>> how
> > > > >>>>>>>>>>>>>>>>>> to do
> > > > >>>>>>>>>>>>>>>>>>> that
> > > > >>>>>>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>>>> would be a bigger undertaking. I
> > > > >>>>>>>> guess I
> > > > >>>>>>>>>> can
> > > > >>>>>>>>>>>> see
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> use,
> > > > >>>>>>>>>>>>>>>>>> though
> > > > >>>>>>>>>>>>>>>>>>>> (redirect
> > > > >>>>>>>>>>>>>>>>>>>> messages to one index vs another based
> > > > >>>>>>>> on
> > > > >>>>>>>>> a
> > > > >>>>>>>>>>>>> predicate
> > > > >>>>>>>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>>>>> a given
> > > > >>>>>>>>>>>>>>>>>>>> sensor).
> > > > >>>>>>>>>>>>>>>>>>>> Anyway, not where I was originally
> > > > >>>>>>>>> thinking
> > > > >>>>>>>>>>> that
> > > > >>>>>>>>>>>>> this
> > > > >>>>>>>>>>>>>>>>>> discussion
> > > > >>>>>>>>>>>>>>>>>>> would
> > > > >>>>>>>>>>>>>>>>>>>> go,
> > > > >>>>>>>>>>>>>>>>>>>> but it's an interesting point.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> 2. I hadn't thought through the
> > > > >>>>>>>>>> implementation
> > > > >>>>>>>>>>>>> quite
> > > > >>>>>>>>>>>>>>>> yet,
> > > > >>>>>>>>>>>>>>>>>> but we
> > > > >>>>>>>>>>>>>>>>>>> don't
> > > > >>>>>>>>>>>>>>>>>>>> actually have a splitter bolt in that
> > > > >>>>>>>>>>> topology,
> > > > >>>>>>>>>>>>> just
> > > > >>>>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>> spout
> > > > >>>>>>>>>>>>>>>>>>> that goes
> > > > >>>>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>> the elasticsearch writer and also to
> > > > >>>>>>>> the
> > > > >>>>>>>>>> hdfs
> > > > >>>>>>>>>>>>> writer.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 4:52 PM, Matt
> > > > >>>>>>>>> Foley
> > > > >>>>>>>>>> <
> > > > >>>>>>>>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Casey, good to have controls like
> > > > >>>>>>>> this.
> > > > >>>>>>>>>>>> Couple
> > > > >>>>>>>>>>>>>>>>>> questions:
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> 1. Regarding the “index” : “squid”
> > > > >>>>>>>>>>> name/value
> > > > >>>>>>>>>>>>> pair,
> > > > >>>>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>> index name
> > > > >>>>>>>>>>>>>>>>>>>>> expected to always be a sensor
> > > > >>>>>>>> name? Or
> > > > >>>>>>>>>> is
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> given
> > > > >>>>>>>>>>>>>>>>>> json
> > > > >>>>>>>>>>>>>>>>>>> structure
> > > > >>>>>>>>>>>>>>>>>>>>> subordinate to a sensor name in
> > > > >>>>>>>>> zookeeper?
> > > > >>>>>>>>>>> Or
> > > > >>>>>>>>>>>>> can
> > > > >>>>>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>>>>>> build
> > > > >>>>>>>>>>>>>>>>>>> arbitrary
> > > > >>>>>>>>>>>>>>>>>>>>> indexes with this new specification,
> > > > >>>>>>>>>>>>> independent of
> > > > >>>>>>>>>>>>>>>>>> sensor?
> > > > >>>>>>>>>>>>>>>>>>> Should
> > > > >>>>>>>>>>>>>>>>>>>> there
> > > > >>>>>>>>>>>>>>>>>>>>> actually be a list of “indexes”, ie
> > > > >>>>>>>>>>>>>>>>>>>>> { “indexes” : [
> > > > >>>>>>>>>>>>>>>>>>>>> {“index” : “name1”,
> > > > >>>>>>>>>>>>>>>>>>>>> …
> > > > >>>>>>>>>>>>>>>>>>>>> },
> > > > >>>>>>>>>>>>>>>>>>>>> {“index” : “name2”,
> > > > >>>>>>>>>>>>>>>>>>>>> …
> > > > >>>>>>>>>>>>>>>>>>>>> } ]
> > > > >>>>>>>>>>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> 2. Would the filtering / writer
> > > > >>>>>>>>> selection
> > > > >>>>>>>>>>>> logic
> > > > >>>>>>>>>>>>>>> take
> > > > >>>>>>>>>>>>>>>>>> place in
> > > > >>>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>> indexing
> > > > >>>>>>>>>>>>>>>>>>>>> topology splitter bolt? Seems like
> > > > >>>>>>>> that
> > > > >>>>>>>>>>> would
> > > > >>>>>>>>>>>>> have
> > > > >>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>> smallest
> > > > >>>>>>>>>>>>>>>>>>>> impact on
> > > > >>>>>>>>>>>>>>>>>>>>> current implementation, no?
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Sorry if these are already answered
> > > > >>>>>>>> in
> > > > >>>>>>>>>>>> PR-415, I
> > > > >>>>>>>>>>>>>>>>> haven’t
> > > > >>>>>>>>>>>>>>>>>> had
> > > > >>>>>>>>>>>>>>>>>>> time to
> > > > >>>>>>>>>>>>>>>>>>>>> review that one yet.
> > > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>>> --Matt
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> On 1/12/17, 12:55 PM, "Michael
> > > > >>>>>>>>> Miklavcic"
> > > > >>>>>>>>>> <
> > > > >>>>>>>>>>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> I like the flexibility and
> > > > >>>>>>>>>>> expressibility
> > > > >>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>> first
> > > > >>>>>>>>>>>>>>>>>>> option
> > > > >>>>>>>>>>>>>>>>>>>> with
> > > > >>>>>>>>>>>>>>>>>>>>> Stellar
> > > > >>>>>>>>>>>>>>>>>>>>> filters.
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> M
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 1:51 PM,
> > > > >>>>>>>>> Casey
> > > > >>>>>>>>>>>>> Stella <
> > > > >>>>>>>>>>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> As of METRON-652 <
> > > > >>>>>>>>>>>>> https://github.com/apache/
> > > > >>>>>>>>>>>>>>>>>>>>> incubator-metron/pull/415>, we
> > > > >>>>>>>>>>>>>>>>>>>>>> will have decoupled the
> > > > >>>>>>>> indexing
> > > > >>>>>>>>>>>>>>> configuration
> > > > >>>>>>>>>>>>>>>>>> from the
> > > > >>>>>>>>>>>>>>>>>>>> enrichment
> > > > >>>>>>>>>>>>>>>>>>>>>> configuration. As an immediate
> > > > >>>>>>>>>>>> follow-up
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>> that,
> > > > >>>>>>>>>>>>>>>>>> I'd
> > > > >>>>>>>>>>>>>>>>>>> like to
> > > > >>>>>>>>>>>>>>>>>>>>> provide the
> > > > >>>>>>>>>>>>>>>>>>>>>> ability to turn off and on
> > > > >>>>>>>> writers
> > > > >>>>>>>>>> via
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>> configs. I'd
> > > > >>>>>>>>>>>>>>>>>>> like
> > > > >>>>>>>>>>>>>>>>>>>> to get
> > > > >>>>>>>>>>>>>>>>>>>>> some
> > > > >>>>>>>>>>>>>>>>>>>>>> community feedback on how the
> > > > >>>>>>>>>>>>> functionality
> > > > >>>>>>>>>>>>>>>>> should
> > > > >>>>>>>>>>>>>>>>>> work,
> > > > >>>>>>>>>>>>>>>>>>> if
> > > > >>>>>>>>>>>>>>>>>>>> y'all are
> > > > >>>>>>>>>>>>>>>>>>>>>> amenable. :)
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> As of now, we have 3 possible
> > > > >>>>>>>>>> writers
> > > > >>>>>>>>>>>>> which
> > > > >>>>>>>>>>>>>>> can
> > > > >>>>>>>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>>> used
> > > > >>>>>>>>>>>>>>>>>>> in the
> > > > >>>>>>>>>>>>>>>>>>>>> indexing
> > > > >>>>>>>>>>>>>>>>>>>>>> topology:
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> - Solr
> > > > >>>>>>>>>>>>>>>>>>>>>> - Elasticsearch
> > > > >>>>>>>>>>>>>>>>>>>>>> - HDFS
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> HDFS is always used,
> > > > >>>>>>>> elasticsearch
> > > > >>>>>>>>>> or
> > > > >>>>>>>>>>>>> solr is
> > > > >>>>>>>>>>>>>>>>> used
> > > > >>>>>>>>>>>>>>>>>>> depending
> > > > >>>>>>>>>>>>>>>>>>>> on how
> > > > >>>>>>>>>>>>>>>>>>>>> you
> > > > >>>>>>>>>>>>>>>>>>>>>> start the indexing topology.
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> A couple of proposals come to
> > > > >>>>>>>> mind
> > > > >>>>>>>>>>>>>>> immediately:
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> *Index Filtering*
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> You would be able to specify a
> > > > >>>>>>>>>> filter
> > > > >>>>>>>>>>> as
> > > > >>>>>>>>>>>>>>>> defined
> > > > >>>>>>>>>>>>>>>>>> by a
> > > > >>>>>>>>>>>>>>>>>>> stellar
> > > > >>>>>>>>>>>>>>>>>>>>> statement
> > > > >>>>>>>>>>>>>>>>>>>>>> (likely a reuse of the
> > > > >>>>>>>>> StellarFilter
> > > > >>>>>>>>>>>> that
> > > > >>>>>>>>>>>>>>>> exists
> > > > >>>>>>>>>>>>>>>>>> in the
> > > > >>>>>>>>>>>>>>>>>>>> Parsers)
> > > > >>>>>>>>>>>>>>>>>>>>> which
> > > > >>>>>>>>>>>>>>>>>>>>>> would allow you to indicate on
> > > > >>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>> message-by-message basis
> > > > >>>>>>>>>>>>>>>>>>>> whether or
> > > > >>>>>>>>>>>>>>>>>>>>> not to
> > > > >>>>>>>>>>>>>>>>>>>>>> write the message.
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> The semantics of this would be
> > > > >>>>>>>> as
> > > > >>>>>>>>>>>> follows:
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> - Default (i.e.
> > > > >>>>>>>> unspecified) is
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>> pass
> > > > >>>>>>>>>>>>>>>>>> everything
> > > > >>>>>>>>>>>>>>>>>>> through
> > > > >>>>>>>>>>>>>>>>>>>> (hence
> > > > >>>>>>>>>>>>>>>>>>>>>> backwards compatible with
> > > > >>>>>>>> the
> > > > >>>>>>>>>>> current
> > > > >>>>>>>>>>>>>>>> default
> > > > >>>>>>>>>>>>>>>>>> config).
> > > > >>>>>>>>>>>>>>>>>>>>>> - Messages which have the
> > > > >>>>>>>>>>> associated
> > > > >>>>>>>>>>>>>>> stellar
> > > > >>>>>>>>>>>>>>>>>> statement
> > > > >>>>>>>>>>>>>>>>>>>> evaluate
> > > > >>>>>>>>>>>>>>>>>>>>> to true
> > > > >>>>>>>>>>>>>>>>>>>>>> for the writer type will be
> > > > >>>>>>>>>>> written,
> > > > >>>>>>>>>>>>>>>> otherwise
> > > > >>>>>>>>>>>>>>>>>> not.
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which
> > > > >>>>>>>> would
> > > > >>>>>>>>>>> write
> > > > >>>>>>>>>>>>> out
> > > > >>>>>>>>>>>>>>> no
> > > > >>>>>>>>>>>>>>>>>> messages
> > > > >>>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>> HDFS and
> > > > >>>>>>>>>>>>>>>>>>>>> write
> > > > >>>>>>>>>>>>>>>>>>>>>> out only messages containing a
> > > > >>>>>>>>> field
> > > > >>>>>>>>>>>>> called
> > > > >>>>>>>>>>>>>>>>>> "field1":
> > > > >>>>>>>>>>>>>>>>>>>>>> {
> > > > >>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
> > > > >>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
> > > > >>>>>>>>>>>>>>>>>>>>>> ,"filters" : {
> > > > >>>>>>>>>>>>>>>>>>>>>> "HDFS" : "false"
> > > > >>>>>>>>>>>>>>>>>>>>>> ,"ES" : "exists(field1)"
> > > > >>>>>>>>>>>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> *Index On/Off Switch*
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> A simpler solution would be to
> > > > >>>>>>>>> just
> > > > >>>>>>>>>>>>> provide a
> > > > >>>>>>>>>>>>>>>>> list
> > > > >>>>>>>>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>>>>>> writers
> > > > >>>>>>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>>>>> write
> > > > >>>>>>>>>>>>>>>>>>>>>> messages. The semantics would
> > > > >>>>>>>> be
> > > > >>>>>>>>> as
> > > > >>>>>>>>>>>>> follows:
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> - If the list is
> > > > >>>>>>>> unspecified,
> > > > >>>>>>>>>> then
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>> default
> > > > >>>>>>>>>>>>>>>>>> is to
> > > > >>>>>>>>>>>>>>>>>>> write
> > > > >>>>>>>>>>>>>>>>>>>> all
> > > > >>>>>>>>>>>>>>>>>>>>> messages
> > > > >>>>>>>>>>>>>>>>>>>>>> for every writer in the
> > > > >>>>>>>>> indexing
> > > > >>>>>>>>>>>>> topology
> > > > >>>>>>>>>>>>>>>>>>>>>> - If the list is specified,
> > > > >>>>>>>>> then
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>> writer
> > > > >>>>>>>>>>>>>>>> will
> > > > >>>>>>>>>>>>>>>>>> write
> > > > >>>>>>>>>>>>>>>>>>> all
> > > > >>>>>>>>>>>>>>>>>>>> messages
> > > > >>>>>>>>>>>>>>>>>>>>> if and
> > > > >>>>>>>>>>>>>>>>>>>>>> only if it is named in the
> > > > >>>>>>>>> list.
> > > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which
> > > > >>>>>>>> turns
> > > > >>>>>>>>>> off
> > > > >>>>>>>>>>>>> HDFS
> > > > >>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>> keeps on
> > > > >>>>>>>>>>>>>>>>>>>>> Elasticsearch:
> > > > >>>>>>>>>>>>>>>>>>>>>> {
> > > > >>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
> > > > >>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
> > > > >>>>>>>>>>>>>>>>>>>>>> ,"writers" : [ "ES" ]
> > > > >>>
> > > > >>> --
> > > > >>
> > > > >> Jon
> > > > >>
> > > > >> Sent from my mobile device
> > > > >>
> > > >
> > > >
> > >
> > >
> > > --
> > > Nick Allen <[email protected]>
> > >
> >
>
>
>
> --
> Nick Allen <[email protected]>
>
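
For reference, the two semantics proposed in the original message quoted above (per-writer filters vs. a list of enabled writers) can be sketched roughly as follows. This is only an illustration in Python and not Metron code: the function names should_write_filtered and should_write_listed are hypothetical, and plain Python callables stand in for the Stellar statements "false" and "exists(field1)" from the sample config.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Predicate = Callable[[Message], bool]


def should_write_filtered(writer: str, message: Message,
                          filters: Optional[Dict[str, Predicate]] = None) -> bool:
    """Index Filtering: a writer with no filter passes everything through."""
    if not filters or writer not in filters:
        return True  # unspecified: same behaviour as the current default
    return bool(filters[writer](message))


def should_write_listed(writer: str,
                        writers: Optional[List[str]] = None) -> bool:
    """Index On/Off Switch: an unspecified list leaves every writer on."""
    if writers is None:
        return True
    return writer in writers  # specified: on if and only if named


# Hypothetical stand-ins for the Stellar statements in the sample config.
filters = {
    "HDFS": lambda msg: False,          # "false"
    "ES": lambda msg: "field1" in msg,  # "exists(field1)"
}
```

With these stand-ins, a message containing field1 would be written by the ES writer but never by HDFS, and a writer that has no filter entry keeps writing everything.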
