Re: [DISCUSS] Turning off indexing writers feature discussion

Nick Allen Mon, 16 Jan 2017 07:41:25 -0800

I'm all for a compromise here.  Sounds like we're getting close.

Just one thing.  Can you lay out the reasoning for having 'enabled' and
'when'?  I don't follow the reasoning, but maybe I am missing something.
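
For anyone catching up, here is a minimal sketch of how I read the two properties so far, in case it helps pin down the question. It assumes 'enabled' is a static per-writer switch that defaults to on when unspecified, and 'when' is a per-message filter. The function names and the PREDICATES table are hypothetical stand-ins (the table takes the place of real Stellar evaluation); none of this is Metron's actual implementation.

```python
# Hypothetical sketch only; none of these names come from Metron.

# Stand-in for Stellar: maps a 'when' expression to a Python predicate.
PREDICATES = {
    "true": lambda message: True,
    "false": lambda message: False,
    "exists(field1)": lambda message: "field1" in message,
}


def is_enabled(writer_config):
    """Static switch; a missing 'enabled' key means on (current semantics)."""
    return str(writer_config.get("enabled", "true")).lower() == "true"


def should_write(writer, indexing_config, message, predicates=PREDICATES):
    """Return True if `writer` should index `message`."""
    writer_config = indexing_config.get(writer, {})
    if not is_enabled(writer_config):
        return False  # writer is off; its other settings are kept but unused
    when = writer_config.get("when", "true")
    return predicates[when](message)


config = {
    "elasticsearch": {"enabled": "true", "index": "foo", "when": "exists(field1)"},
    "hdfs": {"enabled": "false", "path": "/foo/bar/...", "batchSize": 100},
}
```

Under these assumptions, 'enabled' lets someone park a writer without losing its settings, while 'when' decides message by message for a writer that is on.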


On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <[email protected]
> wrote:

> I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's
> enabled property. I also like the idea of a path property for HDFS.
>
> -Kyle
>
> > On Jan 14, 2017, at 10:51 AM, Casey Stella <[email protected]> wrote:
> >
> > I'm +1 on an explicit enabled property and a filter (or when) property. I
> > think we are zeroing in on a decent design, so that is good.
> >
> > To recap, what I am +1 on is Nick's proposed syntax with the following
> > modifications:
> > 1. An explicit enabled field
> > 2. A default of on when unspecified, to match current semantics
> >
> > Casey
> >> On Sat, Jan 14, 2017 at 10:45 [email protected] <[email protected]>
> wrote:
> >>
> >> This has the additional benefit of doing something like below when you
> want
> >> to temporarily disable the hdfs writer, but don't want to remove the
> >> settings.  This removes the need to store the path and batchSize (and
> many
> >> additional settings) somewhere else so they can be brought back in when
> you
> >> want to re-enable it, which is a nice workflow attribute for the end
> user:
> >>
> >> {
> >>   'elasticsearch': {
> >>      'enabled': 'true',
> >>      'index': 'foo',
> >>      'batchSize': 100
> >>    },
> >>   'hdfs': {
> >>      'enabled': 'false',
> >>      'path': '/foo/bar/...',
> >>      'batchSize': 100
> >>    },
> >>   'solr': {
> >>      'enabled': 'false'
> >>    }
> >> }
> >>
> >> Jon
> >>
> >>> On Sat, Jan 14, 2017 at 9:24 AM [email protected] <[email protected]>
> wrote:
> >>>
> >>> I similarly have a concern there because I prefer being as explicit as
> >>> possible, which makes things easier to pick up for new users.  Using my
> >>> example from earlier this could look like specifying when(false), but
> an
> >>> even better and more obvious approach may be to use enabled(false).  So
> >> the
> >>> current simple default would be:
> >>>
> >>> {
> >>>   'elasticsearch': { 'enabled': 'true' },
> >>>   'hdfs': { 'enabled': 'true' },
> >>>   'solr': { 'enabled': 'false' }
> >>> }
> >>>
> >>> And to use ES with some overrides but not HDFS or solr it would look
> >> like:
> >>>
> >>> {
> >>>   'elasticsearch': {
> >>>      'enabled': 'true',
> >>>      'index': 'foo',
> >>>      'batchSize': 100
> >>>    },
> >>>   'hdfs': {
> >>>      'enabled': 'false'
> >>>    },
> >>>   'solr': {
> >>>      'enabled': 'false'
> >>>    }
> >>> }
> >>>
> >>> Jon
> >>>
> >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <[email protected]>
> >> wrote:
> >>>
> >>> One thing that I thought of that I very strenuously do not like in Nick's
> >>> proposal is that if a writer config is not specified then it is turned
> >> off
> >>> (I think; if I misunderstood let me know). In the situation where we
> >> have a
> >>> new sensor, right now if there are no index config and no enrichment
> >>> config, it still passes through to the index using defaults. In this
> new
> >>> scheme it would not. This changes the default semantics for the system
> >> and
> >>> I think it changes it for the worse.
> >>>
> >>> I would strongly prefer an on-by-default indexing config as we have now.
> >>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <[email protected]>
> wrote:
> >>>>
> >>>> One thing that I really like about Nick's suggestion is that it allows
> >>>> writer-specific configs in a clear and simple way.  It is more complex
> >>> for
> >>>> the default case (all writers write to indices named the same thing
> >> with
> >>> a
> >>>> fixed batch size), which I do not like, but maybe it's worth the
> >>> compromise
> >>>> to make it less complex for the advanced case.
> >>>>
> >>>> Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> beginning
> >>> to
> >>>> lean your way.
> >>>>
> >>>> On Fri, Jan 13, 2017 at 2:51 PM, [email protected] <[email protected]>
> >>>> wrote:
> >>>>
> >>>> I like the suggestions you made, Nick.  The only thing I would add is
> >>> that
> >>>> it's also nice to see an explicit when(false), as people newer to the
> >>>> platform may not know where to expect configs for the different
> >> writers.
> >>>> Being able to do it either way, which I think is already assumed in
> >> your
> >>>> model, would make sense.  I would just suggest that, if we support but
> >>> are
> >>>> disabling a writer, that the platform inserts a default when(false) to
> >> be
> >>>> explicit.
> >>>>
> >>>> Jon
> >>>>
> >>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <[email protected]>
> >>> wrote:
> >>>>
> >>>>> Let me noodle on this over the weekend.  Your syntax is looking less
> >>>>> onerous to me and I like the following statement from Otto: "In the
> >>> end,
> >>>>> each write destination ‘type’ will need its own configuration.  This
> >>> is
> >>>> an
> >>>>> extension point."
> >>>>>
> >>>>> I may come around to your way of thinking.
> >>>>>
> >>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
> >> [email protected]
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> In the end, each write destination ‘type’ will need its own
> >>>>>> configuration.  This is an extension point.
> >>>>>> {
> >>>>>> HDFS:{
> >>>>>> outputAdapters:[
> >>>>>> {name: avro,
> >>>>>> settings:{
> >>>>>> avro stuff….
> >>>>>> when:{
> >>>>>> },
> >>>>>> {
> >>>>>> name: sequence file,
> >>>>>> …..
> >>>>>>
> >>>>>> or some such.
> >>>>>>
> >>>>>>
> >>>>>> On January 13, 2017 at 11:51:15, Nick Allen ([email protected])
> >>>> wrote:
> >>>>>>
> >>>>>> I will add also that instead of global overrides, like index, we
> >>> should
> >>>>> use
> >>>>>> configuration key names that are more appropriate to the output.
> >>>>>>
> >>>>>> For example, does 'index' really make sense for HDFS? Or would
> >> 'path'
> >>>> be
> >>>>>> more appropriate?
> >>>>>>
> >>>>>> {
> >>>>>> 'elasticsearch': {
> >>>>>> 'index': 'foo',
> >>>>>> 'batchSize': 1
> >>>>>> },
> >>>>>> 'hdfs': {
> >>>>>> 'path': '/foo/bar/...',
> >>>>>> 'batchSize': 100
> >>>>>> }
> >>>>>> }
> >>>>>>
> >>>>>> Ok, I've said my piece. Thanks for the effort in summarizing all
> >>> this,
> >>>>>> Casey.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <[email protected]>
> >>>> wrote:
> >>>>>>
> >>>>>>>> Nick's concerns about my suggestion were that it was overly
> >> complex
> >>>> and
> >>>>>>>> hard to grok and that we could dispense with backwards
> >>> compatibility
> >>>>> and
> >>>>>>>> make people do a bit more work on the default case for the
> >>> benefits
> >>>>> of a
> >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your
> >>>>> position)
> >>>>>>>
> >>>>>>>
> >>>>>>> What I will add is that in my mind, the majority case would be a user
> >>>>>>> specifying the outputs, but not things like 'batchSize' or
> >> 'when'.
> >>> I
> >>>>>> think
> >>>>>>> in the majority case, the user would accept whatever the default
> >>>> batch
> >>>>>> size
> >>>>>>> is.
> >>>>>>>
> >>>>>>> Here are alternative suggestions for all the examples that you
> >>>>> provided
> >>>>>>> previously.
> >>>>>>>
> >>>>>>> Base Case
> >>>>>>>
> >>>>>>> - The user must always specify the 'outputs' for clarity.
> >>>>>>> - Uses default index name, batch size and when = true.
> >>>>>>>
> >>>>>>> {
> >>>>>>> 'elasticsearch': {},
> >>>>>>> 'hdfs': {}
> >>>>>>> }
> >>>>>>>
> >>>>>>>
> >>>>>>> Writer-non-specific Case
> >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> >>>>>>>
> >>>>>>> - There are no global overrides, as in Casey's proposal.
> >>>>>>> - Easier to grok IMO.
> >>>>>>>
> >>>>>>> {
> >>>>>>> 'elasticsearch': {
> >>>>>>> 'index': 'foo',
> >>>>>>> 'batchSize': 100
> >>>>>>> },
> >>>>>>> 'hdfs': {
> >>>>>>> 'index': 'foo',
> >>>>>>> 'batchSize': 100
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>>
> >>>>>>> Writer-specific case without filters
> >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> >>>>>>>
> >>>>>>> {
> >>>>>>> 'elasticsearch': {
> >>>>>>> 'index': 'foo',
> >>>>>>> 'batchSize': 1
> >>>>>>> },
> >>>>>>> 'hdfs': {
> >>>>>>> 'index': 'foo',
> >>>>>>> 'batchSize': 100
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>>
> >>>>>>> Writer-specific case with filters
> >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> >>>>>>>
> >>>>>>> - Instead of having to say when=false, just don't configure HDFS
> >>>>>>>
> >>>>>>> {
> >>>>>>> 'elasticsearch': {
> >>>>>>> 'index': 'foo',
> >>>>>>> 'batchSize': 100,
> >>>>>>> 'when': 'exists(field1)'
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <
> >> [email protected]
> >>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Dave,
> >>>>>>>> For the benefit of posterity and people who might not be as
> >> deeply
> >>>>>>>> entangled in the system as we have been, I'll recap things and
> >>>>> hopefully
> >>>>>>>> answer your question in the process.
> >>>>>>>>
> >>>>>>>> Historically the index configuration is split between the
> >>> enrichment
> >>>>>>>> configs and the global configs.
> >>>>>>>>
> >>>>>>>> - The global configs control settings that apply to all sensors.
> >>>>>>>> Historically this has been stuff like index connection strings, etc.
> >>>>>>>> - The sensor-specific configs control things that vary by sensor.
> >>>>>>>>
> >>>>>>>> As of Metron-652 (in review currently), we moved the sensor-specific
> >>>>>>>> configs out of the enrichment configs. The proposal here is to increase
> >>>>>>>> the granularity of the sensor-specific files to make them support index
> >>>>>>>> writer-specific configs. Right now in the indexing topology, we have 2
> >>>>>>>> writers (fixed): ES/Solr and HDFS.
> >>>>>>>>
> >>>>>>>> The proposed configuration would allow you to either specify a
> >>>> blanket
> >>>>>>>> sensor-level config for the index name and batchSize and/or
> >>> override
> >>>>> at
> >>>>>>>> the
> >>>>>>>> writer level, thereby supporting a couple of use-cases:
> >>>>>>>>
> >>>>>>>> - Turning off certain index writers (e.g. HDFS)
> >>>>>>>> - Filtering the messages written to certain index writers
> >>>>>>>>
> >>>>>>>> The two competing configs between Nick and me are as follows:
> >>>>>>>>
> >>>>>>>> - I want to make sure we keep the old sensor-specific defaults
> >>> with
> >>>>>>>> writer-specific overrides available
> >>>>>>>> - Nick thought we could simplify the permutations by making the
> >>>>>>>> indexing
> >>>>>>>> config only the writer-level configs.
> >>>>>>>>
> >>>>>>>> My concerns about Nick's suggestion were that the default and majority
> >>>>>>>> case, specifying the index and the batchSize for all writers (the one we
> >>>>>>>> support now), would require more configuration.
> >>>>>>>>
> >>>>>>>> Nick's concerns about my suggestion were that it was overly
> >>> complex
> >>>>> and
> >>>>>>>> hard to grok and that we could dispense with backwards
> >>> compatibility
> >>>>> and
> >>>>>>>> make people do a bit more work on the default case for the
> >>> benefits
> >>>>> of a
> >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your
> >>>>> position).
> >>>>>>>>
> >>>>>>>> Casey
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <
> >>> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Casey,
> >>>>>>>>>
> >>>>>>>>> Can you give me a level set of what your thinking is now? I
> >>> think
> >>>>> it's
> >>>>>>>>> global control of all index types + overrides on a per-type
> >>> basis.
> >>>>>> Fwiw,
> >>>>>>>>> I'm totally for that, but I want to make sure I'm not imposing
> >>> my
> >>>>>>>>> preconceived notions on your consensus-driven ones.
> >>>>>>>>>
> >>>>>>>>> -D....
> >>>>>>>>>
> >>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
> >>>> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I am suggesting that, yes. The configs are essentially the
> >>> same
> >>>> as
> >>>>>>>>> yours,
> >>>>>>>>>> except there is an override specified at the top level.
> >>> Without
> >>>>>>>> that, in
> >>>>>>>>>> order to specify both HDFS and ES have batch sizes of 100,
> >> you
> >>>>> have
> >>>>>> to
> >>>>>>>>>> explicitly configure each. It's less that I'm trying to have
> >>>>>>>> backwards
> >>>>>>>>>> compatibility and more that I'm trying to make the majority
> >>> case
> >>>>>> easy:
> >>>>>>>>> both
> >>>>>>>>>> writers write everything to a specified index name with a
> >>>>> specified
> >>>>>>>> batch
> >>>>>>>>>> size (which is what we have now). Beyond that, I want to
> >> allow
> >>>> for
> >>>>>>>>>> specifying an override for the config on a writer-by-writer
> >>>> basis
> >>>>>> for
> >>>>>>>>> those
> >>>>>>>>>> who need it.
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
> >>>> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Are you saying we support all of these variants? I realize
> >>> you
> >>>>> are
> >>>>>>>>>> trying
> >>>>>>>>>>> to have some backwards compatibility, but this also makes
> >> it
> >>>>>> harder
> >>>>>>>>> for a
> >>>>>>>>>>> user to grok (for me at least).
> >>>>>>>>>>>
> >>>>>>>>>>> Personally I like my original example as there are fewer
> >>>>>>>>> sub-structures,
> >>>>>>>>>>> like 'writerConfig', which makes the whole thing simpler
> >> and
> >>>>>> easier
> >>>>>>>> to
> >>>>>>>>>>> grok. But maybe others will think your proposal is just as
> >>>> easy
> >>>>> to
> >>>>>>>>> grok.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> >>>>>> [email protected]>
> >>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Keeping the configs that we have now (batchSize and
> >>> index)
> >>>>> as
> >>>>>>>>>>> defaults
> >>>>>>>>>>>> for the unspecified writer-specific case
> >>>>>>>>>>>> - Adding the config Nick suggested
> >>>>>>>>>>>>
> >>>>>>>>>>>> *Base Case*:
> >>>>>>>>>>>> {
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> - all writers write all messages
> >>>>>>>>>>>> - index named the same as the sensor for all writers
> >>>>>>>>>>>> - batchSize of 1 for all writers
> >>>>>>>>>>>>
> >>>>>>>>>>>> *Writer-non-specific case*:
> >>>>>>>>>>>> {
> >>>>>>>>>>>> "index" : "foo"
> >>>>>>>>>>>> ,"batchSize" : 100
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> - All writers write all messages
> >>>>>>>>>>>> - index is named "foo", different from the sensor for
> >> all
> >>>>>>>> writers
> >>>>>>>>>>>> - batchSize is 100 for all writers
> >>>>>>>>>>>>
> >>>>>>>>>>>> *Writer-specific case without filters*
> >>>>>>>>>>>> {
> >>>>>>>>>>>> "index" : "foo"
> >>>>>>>>>>>> ,"batchSize" : 1
> >>>>>>>>>>>> , "writerConfig" :
> >>>>>>>>>>>> {
> >>>>>>>>>>>> "elasticsearch" : {
> >>>>>>>>>>>> "batchSize" : 100
> >>>>>>>>>>>> }
> >>>>>>>>>>>> }
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> - All writers write all messages
> >>>>>>>>>>>> - index is named "foo", different from the sensor for
> >> all
> >>>>>>>> writers
> >>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch
> >>> writers
> >>>>>>>>>>>> - NOTE: I could override the index name too
> >>>>>>>>>>>>
> >>>>>>>>>>>> *Writer-specific case with filters*
> >>>>>>>>>>>> {
> >>>>>>>>>>>> "index" : "foo"
> >>>>>>>>>>>> ,"batchSize" : 1
> >>>>>>>>>>>> , "writerConfig" :
> >>>>>>>>>>>> {
> >>>>>>>>>>>> "elasticsearch" : {
> >>>>>>>>>>>> "batchSize" : 100,
> >>>>>>>>>>>> "when" : "exists(field1)"
> >>>>>>>>>>>> },
> >>>>>>>>>>>> "hdfs" : {
> >>>>>>>>>>>> "when" : "false"
> >>>>>>>>>>>> }
> >>>>>>>>>>>> }
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> - ES writer writes messages which have field1, HDFS
> >>> doesn't
> >>>>>>>>>>>> - index is named "foo", different from the sensor for
> >> all
> >>>>>>>> writers
> >>>>>>>>>>>> - batchSize is 100 for elasticsearch writers
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> >>>>>>>> [email protected]
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> For larger installations you need to control what is
> >>>> indexed
> >>>>>> so
> >>>>>>>> you
> >>>>>>>>>>> don’t
> >>>>>>>>>>>>> end up with a nasty Elasticsearch situation and so
> >> you
> >>>> can
> >>>>>> mine
> >>>>>>>>> the
> >>>>>>>>>>> data
> >>>>>>>>>>>>> later for reports and training ml models.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>> Carolyn
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <
> >> [email protected]
> >>>>
> >>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> OH that's a good idea!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> >>>>>>>> [email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I like the "Index Filtering" option based on the
> >>>>>> flexibility
> >>>>>>>>> that
> >>>>>>>>>> it
> >>>>>>>>>>>>>>> provides. Should each output (HDFS, ES, etc) have
> >> its
> >>>> own
> >>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>> settings? For example, aren't things like batching
> >>>>> handled
> >>>>>>>>>>> separately
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>> HDFS versus Elasticsearch?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Something along the lines of...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>> "hdfs" : {
> >>>>>>>>>>>>>>> "when": "exists(field1)",
> >>>>>>>>>>>>>>> "batchSize": 100
> >>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "elasticsearch" : {
> >>>>>>>>>>>>>>> "when": "true",
> >>>>>>>>>>>>>>> "batchSize": 1000,
> >>>>>>>>>>>>>>> "index": "squid"
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
> >>>>>>>>> [email protected]
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any
> >>>>> opposition
> >>>>>>>> to
> >>>>>>>>>> that
> >>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>> anyone?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The points brought up are good ones and I think
> >>> that
> >>>> it
> >>>>>>>> may be
> >>>>>>>>>>>> worth a
> >>>>>>>>>>>>>>>> broader discussion of the requirements of
> >> indexing
> >>>> in a
> >>>>>>>>> separate
> >>>>>>>>>>> dev
> >>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>> thread. Maybe a list of desires with coherent
> >>>> use-cases
> >>>>>>>>>>> justifying
> >>>>>>>>>>>>> them
> >>>>>>>>>>>>>>> so
> >>>>>>>>>>>>>>>> we can think about how this stuff should work and
> >>>> where
> >>>>>> the
> >>>>>>>>>>> natural
> >>>>>>>>>>>>>>>> extension points should be. After all, we need to
> >>> toe
> >>>>> the
> >>>>>>>> line
> >>>>>>>>>>>> between
> >>>>>>>>>>>>>>>> engineering and overengineering for features
> >> nobody
> >>>>> will
> >>>>>>>> want.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard
> >>>>> fields.
> >>>>>>>> I'm
> >>>>>>>>>>> torn
> >>>>>>>>>>>>>>> between
> >>>>>>>>>>>>>>>> the notions that we should have no standard
> >> fields
> >>> vs
> >>>>> we
> >>>>>>>>> should
> >>>>>>>>>>>> have a
> >>>>>>>>>>>>>>>> boatload of standard fields (with most of them
> >>>> empty).
> >>>>> I
> >>>>>>>>>> exchange
> >>>>>>>>>>>>>>>> positions fairly regularly on that question. ;)
> >> It
> >>>> may
> >>>>> be
> >>>>>>>>>> worth a
> >>>>>>>>>>>> dev
> >>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>> discussion to lay out how you imagine an
> >> extension
> >>> of
> >>>>>>>> standard
> >>>>>>>>>>>> fields
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> how it might look as implemented in Metron.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Casey
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson
> >> <
> >>>>>>>>>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'll second my preference for the first
> >> option. I
> >>>>> think
> >>>>>>>> the
> >>>>>>>>>>>> ability
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>> Stellar filters to customize indexing would be
> >> a
> >>>> big
> >>>>>> win.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data
> >>> lake
> >>>>> and
> >>>>>>>> CEP.
> >>>>>>>>> I
> >>>>>>>>>>>> think
> >>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> a really important use case that we need to
> >>>> consider.
> >>>>>>>> Take a
> >>>>>>>>>>>> simple
> >>>>>>>>>>>>>>>>> example... If I have data coming in from 3
> >>>> different
> >>>>>>>>> firewall
> >>>>>>>>>>>>> vendors
> >>>>>>>>>>>>>>>> and 2
> >>>>>>>>>>>>>>>>> different web proxy/url filtering vendors and I
> >>>> want
> >>>>> to
> >>>>>>>> be
> >>>>>>>>>> able
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>> analyze
> >>>>>>>>>>>>>>>>> that data set, I need the data to be indexed
> >> all
> >>>>>> together
> >>>>>>>>>>> (likely
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>>>> HDFS)
> >>>>>>>>>>>>>>>>> and to have a normalized schema such that IP
> >>>> address,
> >>>>>>>> URL,
> >>>>>>>>> and
> >>>>>>>>>>>> user
> >>>>>>>>>>>>>>> name
> >>>>>>>>>>>>>>>>> (to take a few) can be easily queried and
> >>>>> aggregated. I
> >>>>>>>> can
> >>>>>>>>>> also
> >>>>>>>>>>>>>>> envision
> >>>>>>>>>>>>>>>>> scenarios where I would want to index data
> >> based
> >>> on
> >>>>>>>>> attributes
> >>>>>>>>>>>> other
> >>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>> sensor, business unit or subsidiary for
> >> example.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I've been wanting to propose extending our 7
> >>>> standard
> >>>>>>>> fields
> >>>>>>>>> to
> >>>>>>>>>>>>> include
> >>>>>>>>>>>>>>>>> things like URL and user. Is there community
> >>>>>>>>> interest/support
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>> moving
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> that direction? If so, I'll start a new thread.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -Kyle
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
> >>>>>>>>> [email protected]
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index
> >> name
> >>>>>> allows
> >>>>>>>>>> using
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>> name for multiple sensors, then the goal can
> >> be
> >>>>>>>> achieved.
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> --Matt
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" <
> >>>>>>>> [email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog
> >>> parser
> >>>>>>>>> with
> >>>>>>>>>>> data
> >>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>> sources 1,
> >>>>>>>>>>>>>>>>>> 2 and 3. You'd end up with one kafka queue
> >>> with 3
> >>>>>>>>>> parsers
> >>>>>>>>>>>>>>> attached
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> queue, each picking apart the messages from
> >>> source
> >>>>>>>> 1, 2
> >>>>>>>>>> and
> >>>>>>>>>>>> 3.
> >>>>>>>>>>>>>>>> They'd
> >>>>>>>>>>>>>>>>>> go
> >>>>>>>>>>>>>>>>>> through separate enrichment and into the
> >>> indexing
> >>>>>>>>>>> topology.
> >>>>>>>>>>>>> In
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> indexing topology, you could specify the same
> >>>> index
> >>>>>>>>> name
> >>>>>>>>>>>>> "syslog"
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>> of the messages go into the same index for
> >> CEP
> >>>>>>>>> querying
> >>>>>>>>>> if
> >>>>>>>>>>>> so
> >>>>>>>>>>>>>>>>> desired.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
> >>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I
> >> worked
> >>> at
> >>>>>>>>>> LogLogic
> >>>>>>>>>>>> in
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> previous
> >>>>>>>>>>>>>>>>>>> life. It makes perfect sense to route
> >>> different
> >>>>>>>>> lines
> >>>>>>>>>>>> from
> >>>>>>>>>>>>>>>> syslog
> >>>>>>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>>>> different appropriate parsers. But a lot of
> >>>> what
> >>>>>>>>> the
> >>>>>>>>>>>>> parsers
> >>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> identify consistent subsets of metadata and
> >>>>>>>> annotate
> >>>>>>>>>> it
> >>>>>>>>>>> –
> >>>>>>>>>>>>> eg,
> >>>>>>>>>>>>>>>>>> src_ip_addr,
> >>>>>>>>>>>>>>>>>>> event timestamps, etc. Once those metadata
> >>> are
> >>>>>>>>>>> annotated
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>>> with common field names, why doesn’t it
> >> make
> >>>>>>>> sense
> >>>>>>>>> to
> >>>>>>>>>>>> index
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> messages
> >>>>>>>>>>>>>>>>>>> together, for CEP querying? I think Splunk
> >>> has
> >>>>>>>>>>>> illustrated
> >>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> model.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" <
> >>>>>>>>>> [email protected]
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the
> >> approach
> >>>>>>>>> that
> >>>>>>>>>>>> we've
> >>>>>>>>>>>>>>> taken
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> sources
> >>>>>>>>>>>>>>>>>>> which aggregate different types of data is
> >> to
> >>>>>>>>>>> provide
> >>>>>>>>>>>>>>> filters
> >>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> parser
> >>>>>>>>>>>>>>>>>>> level and have multiple parser topologies
> >>>>>>>> (with
> >>>>>>>>>>>>> different,
> >>>>>>>>>>>>>>>>>> possibly
> >>>>>>>>>>>>>>>>>>> mutually exclusive filters) running. This
> >>>>>>>> would
> >>>>>>>>>> be
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> completely
> >>>>>>>>>>>>>>>>>>> separate
> >>>>>>>>>>>>>>>>>>> sensor. Imagine a syslog data source that
> >>>>>>>>>>> aggregates
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>> want to
> >>>>>>>>>>>>>>>>>>> pick
> >>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is
> >>>>>>>> why
> >>>>>>>>> the
> >>>>>>>>>>>>> initial
> >>>>>>>>>>>>>>>>>> thought and
> >>>>>>>>>>>>>>>>>>> architecture was one index per sensor.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt
> >> Foley <
> >>>>>>>>>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event
> >>>>>>>>> Processing)
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>>> contrary
> >>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>> idea
> >>>>>>>>>>>>>>>>>>>> of silo-ing data per sensor.
> >>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors
> >>>>>>>> are
> >>>>>>>>>>> already
> >>>>>>>>>>>>>>>>>> aggregating
> >>>>>>>>>>>>>>>>>>> data from
> >>>>>>>>>>>>>>>>>>>> multiple sources, so maybe I’m wrong
> >> here.
> >>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data
> >>>>>>>> lake”
> >>>>>>>>>>>> insights
> >>>>>>>>>>>>>>> come
> >>>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>> being able
> >>>>>>>>>>>>>>>>>>>> to make decisions over the whole mass of
> >>>>>>>> data
> >>>>>>>>>>> rather
> >>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>> vertical
> >>>>>>>>>>>>>>>>>>>> slices of it.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" <
> >>>>>>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hey Matt,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the comment!
> >>>>>>>>>>>>>>>>>>>> 1. At the moment, we only have one
> >>>>>>>> index
> >>>>>>>>>> name,
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> which is
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> sensor name but that's entirely up to
> >>>>>>>> the
> >>>>>>>>>>> user.
> >>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> sensor
> >>>>>>>>>>>>>>>>>>>> specific,
> >>>>>>>>>>>>>>>>>>>> so it'd be a separate config for each
> >>>>>>>>>> sensor.
> >>>>>>>>>>>> If
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> build
> >>>>>>>>>>>>>>>>>>>> multiple
> >>>>>>>>>>>>>>>>>>>> indices per sensor, we'd have to think
> >>>>>>>>>>> carefully
> >>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>> how
> >>>>>>>>>>>>>>>>>> to do
> >>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> would be a bigger undertaking. I
> >>>>>>>> guess I
> >>>>>>>>>> can
> >>>>>>>>>>>> see
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> use,
> >>>>>>>>>>>>>>>>>> though
> >>>>>>>>>>>>>>>>>>>> (redirect
> >>>>>>>>>>>>>>>>>>>> messages to one index vs another based
> >>>>>>>> on
> >>>>>>>>> a
> >>>>>>>>>>>>> predicate
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> a given
> >>>>>>>>>>>>>>>>>>>> sensor).
> >>>>>>>>>>>>>>>>>>>> Anyway, not where I was originally
> >>>>>>>>> thinking
> >>>>>>>>>>> that
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> discussion
> >>>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>> go,
> >>>>>>>>>>>>>>>>>>>> but it's an interesting point.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2. I hadn't thought through the
> >>>>>>>>>> implementation
> >>>>>>>>>>>>> quite
> >>>>>>>>>>>>>>>> yet,
> >>>>>>>>>>>>>>>>>> but we
> >>>>>>>>>>>>>>>>>>> don't
> >>>>>>>>>>>>>>>>>>>> actually have a splitter bolt in that
> >>>>>>>>>>> topology,
> >>>>>>>>>>>>> just
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> spout
> >>>>>>>>>>>>>>>>>>> that goes
> >>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> the elasticsearch writer and also to
> >>>>>>>> the
> >>>>>>>>>> hdfs
> >>>>>>>>>>>>> writer.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 4:52 PM, Matt
> >>>>>>>>> Foley
> >>>>>>>>>> <
> >>>>>>>>>>>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Casey, good to have controls like
> >>>>>>>> this.
> >>>>>>>>>>>> Couple
> >>>>>>>>>>>>>>>>>> questions:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. Regarding the “index” : “squid”
> >>>>>>>>>>> name/value
> >>>>>>>>>>>>> pair,
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> index name
> >>>>>>>>>>>>>>>>>>>>> expected to always be a sensor
> >>>>>>>> name? Or
> >>>>>>>>>> is
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> given
> >>>>>>>>>>>>>>>>>> json
> >>>>>>>>>>>>>>>>>>> structure
> >>>>>>>>>>>>>>>>>>>>> subordinate to a sensor name in
> >>>>>>>>> zookeeper?
> >>>>>>>>>>> Or
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>> build
> >>>>>>>>>>>>>>>>>>> arbitrary
> >>>>>>>>>>>>>>>>>>>>> indexes with this new specification,
> >>>>>>>>>>>>> independent of
> >>>>>>>>>>>>>>>>>> sensor?
> >>>>>>>>>>>>>>>>>>> Should
> >>>>>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>>>> actually be a list of “indexes”, ie
> >>>>>>>>>>>>>>>>>>>>> { “indexes” : [
> >>>>>>>>>>>>>>>>>>>>> {“index” : “name1”,
> >>>>>>>>>>>>>>>>>>>>> …
> >>>>>>>>>>>>>>>>>>>>> },
> >>>>>>>>>>>>>>>>>>>>> {“index” : “name2”,
> >>>>>>>>>>>>>>>>>>>>> …
> >>>>>>>>>>>>>>>>>>>>> } ]
> >>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. Would the filtering / writer selection logic take
> >>>>>>>>>>>>>>>>>>>>> place in the indexing topology splitter bolt? Seems
> >>>>>>>>>>>>>>>>>>>>> like that would have the smallest impact on current
> >>>>>>>>>>>>>>>>>>>>> implementation, no?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Sorry if these are already answered in PR-415, I
> >>>>>>>>>>>>>>>>>>>>> haven’t had time to review that one yet.
> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>> --Matt
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On 1/12/17, 12:55 PM, "Michael Miklavcic"
> >>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I like the flexibility and expressibility of the first
> >>>>>>>>>>>>>>>>>>>>> option with Stellar filters.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> M
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella
> >>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> As of METRON-652
> >>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/incubator-metron/pull/415>,
> >>>>>>>>>>>>>>>>>>>>>> we will have decoupled the indexing configuration from
> >>>>>>>>>>>>>>>>>>>>>> the enrichment configuration. As an immediate follow-up
> >>>>>>>>>>>>>>>>>>>>>> to that, I'd like to provide the ability to turn off
> >>>>>>>>>>>>>>>>>>>>>> and on writers via the configs. I'd like to get some
> >>>>>>>>>>>>>>>>>>>>>> community feedback on how the functionality should
> >>>>>>>>>>>>>>>>>>>>>> work, if y'all are amenable. :)
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> As of now, we have 3 possible writers which can be used
> >>>>>>>>>>>>>>>>>>>>>> in the indexing topology:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> - Solr
> >>>>>>>>>>>>>>>>>>>>>> - Elasticsearch
> >>>>>>>>>>>>>>>>>>>>>> - HDFS
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> HDFS is always used, elasticsearch or solr is used
> >>>>>>>>>>>>>>>>>>>>>> depending on how you start the indexing topology.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> A couple of proposals come to mind immediately:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> *Index Filtering*
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> You would be able to specify a filter as defined by a
> >>>>>>>>>>>>>>>>>>>>>> stellar statement (likely a reuse of the StellarFilter
> >>>>>>>>>>>>>>>>>>>>>> that exists in the Parsers) which would allow you to
> >>>>>>>>>>>>>>>>>>>>>> indicate on a message-by-message basis whether or not
> >>>>>>>>>>>>>>>>>>>>>> to write the message.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> The semantics of this would be as follows:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> - Default (i.e. unspecified) is to pass everything
> >>>>>>>>>>>>>>>>>>>>>>   through (hence backwards compatible with the current
> >>>>>>>>>>>>>>>>>>>>>>   default config).
> >>>>>>>>>>>>>>>>>>>>>> - Messages which have the associated stellar statement
> >>>>>>>>>>>>>>>>>>>>>>   evaluate to true for the writer type will be written,
> >>>>>>>>>>>>>>>>>>>>>>   otherwise not.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which would write out no
> >>>>>>>>>>>>>>>>>>>>>> messages to HDFS and write out only messages containing
> >>>>>>>>>>>>>>>>>>>>>> a field called "field1":
> >>>>>>>>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
> >>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
> >>>>>>>>>>>>>>>>>>>>>> ,"filters" : {
> >>>>>>>>>>>>>>>>>>>>>> "HDFS" : "false"
> >>>>>>>>>>>>>>>>>>>>>> ,"ES" : "exists(field1)"
> >>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> *Index On/Off Switch*
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> A simpler solution would be to just provide a list of
> >>>>>>>>>>>>>>>>>>>>>> writers to write messages. The semantics would be as
> >>>>>>>>>>>>>>>>>>>>>> follows:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> - If the list is unspecified, then the default is to
> >>>>>>>>>>>>>>>>>>>>>>   write all messages for every writer in the indexing
> >>>>>>>>>>>>>>>>>>>>>>   topology
> >>>>>>>>>>>>>>>>>>>>>> - If the list is specified, then a writer will write
> >>>>>>>>>>>>>>>>>>>>>>   all messages if and only if it is named in the list.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which turns off HDFS and keeps
> >>>>>>>>>>>>>>>>>>>>>> on Elasticsearch:
> >>>>>>>>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
> >>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
> >>>>>>>>>>>>>>>>>>>>>> ,"writers" : [ "ES" ]
> >>>
> >>> --
> >>
> >> Jon
> >>
> >> Sent from my mobile device
> >>
>
>
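
For reference, the semantics of the two proposals quoted above (a
"writers" list versus per-writer "filters") can be sketched in a few
lines of Python. This is only an illustrative sketch under stated
assumptions, not Metron code: the function names, the ALL_WRITERS
tuple and the "SOLR" key are hypothetical, and plain Python predicates
stand in for Stellar statements, which are not evaluated here.

```python
# Illustrative sketch only; none of these names come from Metron.
ALL_WRITERS = ("HDFS", "ES", "SOLR")  # hypothetical writer keys


def writers_from_list(config, running=ALL_WRITERS):
    """On/off switch: an unspecified list means every running writer writes."""
    if "writers" not in config:
        return list(running)
    # A writer writes if and only if it is named in the list.
    return [w for w in running if w in config["writers"]]


def writers_from_filters(config, message, running=ALL_WRITERS):
    """Filtering: a writer with no filter passes every message through."""
    filters = config.get("filters", {})
    # Each filter is a Python predicate standing in for a Stellar statement.
    return [w for w in running if filters.get(w, lambda m: True)(message)]


switch_cfg = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
filter_cfg = {
    "index": "squid",
    "batchSize": 100,
    "filters": {
        "HDFS": lambda m: False,        # stands in for "false"
        "ES": lambda m: "field1" in m,  # stands in for "exists(field1)"
    },
}

print(writers_from_list(switch_cfg, running=("HDFS", "ES")))
print(writers_from_filters(filter_cfg, {"field1": 1}, running=("HDFS", "ES")))
```

In both variants an absent key falls back to writing everything, which
is the backwards-compatible default described in the proposal.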


-- 
Nick Allen <[email protected]>
