Re: [DISCUSS] Turning off indexing writers feature discussion

Casey Stella Fri, 13 Jan 2017 19:22:47 -0800

One thing that I thought of that I very strenuously do not like in Nick's
proposal is that if a writer config is not specified then it is turned off
(I think; if I misunderstood let me know). In the situation where we have a
new sensor, right now if there is no index config and no enrichment
config, it still passes through to the index using defaults. In this new
scheme it would not. This changes the default semantics for the system and
I think it changes it for the worse.
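
To make the difference concrete, here is a minimal sketch of the two default
behaviours. It is illustrative only: the function name and the
`on_by_default` flag are hypothetical and not part of Metron, and it ignores
'when' filters entirely.

```python
def writes_to_index(writer_configs, writer, on_by_default):
    """Return True if `writer` should receive messages for a sensor.

    writer_configs: dict of writer name -> settings, or None when the
    sensor has no indexing config at all (the new-sensor case).
    on_by_default: hypothetical switch between the two semantics.
    """
    if writer_configs is not None and writer in writer_configs:
        # The writer is explicitly configured: both schemes agree.
        return True
    # The writer is not mentioned: this is the only place the schemes differ.
    return on_by_default


new_sensor = None  # no indexing config has been written for this sensor yet

# Current behaviour: a brand-new sensor still reaches the index.
current = writes_to_index(new_sensor, "elasticsearch", on_by_default=True)

# Nick's proposal as I understand it: an unconfigured writer is off.
proposed = writes_to_index(new_sensor, "elasticsearch", on_by_default=False)

print(current, proposed)
```

The only branch that changes is the "not mentioned" one, which is exactly the
new-sensor case described above.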


I would strongly prefer an on-by-default indexing config as we have now.
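
As a rough sketch of what on-by-default could look like when combined with
the writer-level overrides from my earlier proposal (quoted further down),
something like the following would resolve the settings for one writer. The
key names ('index', 'batchSize', 'when', 'writerConfig') and the defaults
(index named after the sensor, batchSize of 1, write everything) come from
that proposal; the function itself is hypothetical and is not how Metron
implements it.

```python
def effective_config(sensor, config, writer):
    """Resolve the settings one writer would use for one sensor.

    Precedence, lowest to highest: built-in defaults, sensor-level
    values, then the writer-specific entry under 'writerConfig'.
    """
    # Built-in defaults: every writer is on and writes all messages.
    settings = {"index": sensor, "batchSize": 1, "when": "true"}

    config = config or {}  # a missing config behaves like an empty one

    # Sensor-level values apply to all writers.
    for key in ("index", "batchSize"):
        if key in config:
            settings[key] = config[key]

    # Writer-specific values win over everything else.
    settings.update(config.get("writerConfig", {}).get(writer, {}))
    return settings


example = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {"elasticsearch": {"batchSize": 100}},
}

print(effective_config("squid", None, "hdfs"))
print(effective_config("squid", example, "elasticsearch"))
print(effective_config("squid", example, "hdfs"))
```

Turning a writer off is then an explicit "when": "false" in its entry rather
than the absence of a config. The 'when' value is only carried through as a
string here; evaluating it as a Stellar statement is out of scope for this
sketch.
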
On Fri, Jan 13, 2017 at 17:13 Casey Stella <[email protected]> wrote:

> One thing that I really like about Nick's suggestion is that it allows
> writer-specific configs in a clear and simple way.  It is more complex for
> the default case (all writers write to indices named the same thing with a
> fixed batch size), which I do not like, but maybe it's worth the compromise
> to make it less complex for the advanced case.
>
> Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning to
> lean your way.
>
> On Fri, Jan 13, 2017 at 2:51 PM, [email protected] <[email protected]>
> wrote:
>
> I like the suggestions you made, Nick.  The only thing I would add is that
> it's also nice to see an explicit when(false), as people newer to the
> platform may not know where to expect configs for the different writers.
> Being able to do it either way, which I think is already assumed in your
> model, would make sense.  I would just suggest that, if we support but are
> disabling a writer, that the platform inserts a default when(false) to be
> explicit.
>
> Jon
>
> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <[email protected]> wrote:
>
> > Let me noodle on this over the weekend.  Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the end,
> > each write destination ‘type’ will need its own configuration.  This is
> > an extension point."
> >
> > I may come around to your way of thinking.
> >
> > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <[email protected]>
> > wrote:
> >
> > > In the end, each write destination ‘type’ will need its own
> > > configuration.  This is an extension point.
> > > {
> > > HDFS:{
> > > outputAdapters:[
> > > {name: avro,
> > > settings:{
> > > avro stuff….
> > > when:{
> > > },
> > > {
> > >  name: sequence file,
> > >  …..
> > >
> > > or some such.
> > >
> > >
> > > On January 13, 2017 at 11:51:15, Nick Allen ([email protected])
> wrote:
> > >
> > > I will add also that instead of global overrides, like index, we should
> > > use configuration key names that are more appropriate to the output.
> > >
> > > For example, does 'index' really make sense for HDFS? Or would 'path'
> > > be more appropriate?
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 1
> > > },
> > > 'hdfs': {
> > > 'path': '/foo/bar/...',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > > Ok, I've said my piece. Thanks for the effort in summarizing all this,
> > > Casey.
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <[email protected]>
> wrote:
> > >
> > > >> Nick's concerns about my suggestion were that it was overly complex
> > > >> and hard to grok and that we could dispense with backwards
> > > >> compatibility and make people do a bit more work on the default case
> > > >> for the benefits of a simpler advanced case. (Nick, make sure I don't
> > > >> misstate your position)
> > > >
> > > >
> > > > I will add that, in my mind, the majority case would be a user
> > > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > > > think in the majority case, the user would accept whatever the default
> > > > batch size is.
> > > >
> > > > Here are alternative suggestions for all the examples that you
> > > > provided previously.
> > > >
> > > > Base Case
> > > >
> > > > - The user must always specify the 'outputs' for clarity.
> > > > - Uses default index name, batch size and when = true.
> > > >
> > > > {
> > > > 'elasticsearch': {},
> > > > 'hdfs': {}
> > > > }
> > > >
> > > >
> > > > Writer-non-specific Case
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > > >
> > > > - There are no global overrides, as in Casey's proposal.
> > > > - Easier to grok IMO.
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > Writer-specific case without filters
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > Writer-specific case with filters
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> > > >
> > > > - Instead of having to say when=false, just don't configure HDFS
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100,
> > > > 'when': 'exists(field1)'
> > > > }
> > > > }
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <[email protected]>
> > > wrote:
> > > >
> > > >> Dave,
> > > >> For the benefit of posterity and people who might not be as deeply
> > > >> entangled in the system as we have been, I'll recap things and
> > hopefully
> > > >> answer your question in the process.
> > > >>
> > > >> Historically the index configuration is split between the enrichment
> > > >> configs and the global configs.
> > > >>
> > > >> - The global configs control settings that apply to all sensors.
> > > >> Historically this has been stuff like index connection strings, etc.
> > > >> - The sensor-specific configs control things that vary by sensor.
> > > >>
> > > >> As of Metron-652 (in review currently), we moved the sensor-specific
> > > >> configs out of the enrichment configs. The proposal here is to
> > > >> increase the granularity of the sensor-specific files to make them
> > > >> support index writer-specific configs. Right now in the indexing
> > > >> topology, we have 2 writers (fixed): ES/Solr and HDFS.
> > > >>
> > > >> The proposed configuration would allow you to either specify a
> > > >> blanket sensor-level config for the index name and batchSize and/or
> > > >> override at the writer level, thereby supporting a couple of
> > > >> use-cases:
> > > >>
> > > >> - Turning off certain index writers (e.g. HDFS)
> > > >> - Filtering the messages written to certain index writers
> > > >>
> > > >> The two competing configs between Nick and me are as follows:
> > > >>
> > > >> - I want to make sure we keep the old sensor-specific defaults with
> > > >> writer-specific overrides available
> > > >> - Nick thought we could simplify the permutations by making the
> > > >> indexing config only the writer-level configs.
> > > >>
> > > >> My concerns about Nick's suggestion were that the default and
> > > >> majority case, specifying the index and the batchSize for all writers
> > > >> (the one we support now), would require more configuration.
> > > >>
> > > >> Nick's concerns about my suggestion were that it was overly complex
> > > >> and hard to grok and that we could dispense with backwards
> > > >> compatibility and make people do a bit more work on the default case
> > > >> for the benefits of a simpler advanced case. (Nick, make sure I don't
> > > >> misstate your position).
> > > >>
> > > >> Casey
> > > >>
> > > >>
> > > >> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <[email protected]>
> > > >> wrote:
> > > >>
> > > >> > Casey,
> > > >> >
> > > >> > Can you give me a level set of what your thinking is now? I think
> > > >> > it's global control of all index types + overrides on a per-type
> > > >> > basis. FWIW, I'm totally for that, but I want to make sure I'm not
> > > >> > imposing my preconceived notions on your consensus-driven ones.
> > > >> >
> > > >> > -D....
> > > >> >
> > > >> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
> [email protected]>
> > > >> wrote:
> > > >> >
> > > >> > > I am suggesting that, yes. The configs are essentially the same as
> > > >> > > yours, except there is an override specified at the top level.
> > > >> > > Without that, in order to specify both HDFS and ES have batch sizes
> > > >> > > of 100, you have to explicitly configure each. It's less that I'm
> > > >> > > trying to have backwards compatibility and more that I'm trying to
> > > >> > > make the majority case easy: both writers write everything to a
> > > >> > > specified index name with a specified batch size (which is what we
> > > >> > > have now). Beyond that, I want to allow for specifying an override
> > > >> > > for the config on a writer-by-writer basis for those who need it.
> > > >> > >
> > > >> > > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
> [email protected]>
> > > >> wrote:
> > > >> > >
> > > >> > > > Are you saying we support all of these variants? I realize you
> > > >> > > > are trying to have some backwards compatibility, but this also
> > > >> > > > makes it harder for a user to grok (for me at least).
> > > >> > > >
> > > >> > > > Personally I like my original example as there are fewer
> > > >> > > > sub-structures, like 'writerConfig', which makes the whole thing
> > > >> > > > simpler and easier to grok. But maybe others will think your
> > > >> > > > proposal is just as easy to grok.
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> > > [email protected]>
> > >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > > Ok, so here's what I'm thinking based on the discussion:
> > > >> > > > >
> > > >> > > > > - Keeping the configs that we have now (batchSize and index) as
> > > >> > > > > defaults for the unspecified writer-specific case
> > > >> > > > > - Adding the config Nick suggested
> > > >> > > > >
> > > >> > > > > *Base Case*:
> > > >> > > > > {
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - all writers write all messages
> > > >> > > > > - index named the same as the sensor for all writers
> > > >> > > > > - batchSize of 1 for all writers
> > > >> > > > >
> > > >> > > > > *Writer-non-specific case*:
> > > >> > > > > {
> > > >> > > > > "index" : "foo"
> > > >> > > > > ,"batchSize" : 100
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - All writers write all messages
> > > >> > > > > - index is named "foo", different from the sensor, for all writers
> > > >> > > > > - batchSize is 100 for all writers
> > > >> > > > >
> > > >> > > > > *Writer-specific case without filters*
> > > >> > > > > {
> > > >> > > > > "index" : "foo"
> > > >> > > > > ,"batchSize" : 1
> > > >> > > > > , "writerConfig" :
> > > >> > > > > {
> > > >> > > > > "elasticsearch" : {
> > > >> > > > > "batchSize" : 100
> > > >> > > > > }
> > > >> > > > > }
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - All writers write all messages
> > > >> > > > > - index is named "foo", different from the sensor, for all writers
> > > >> > > > > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > >> > > > > - NOTE: I could override the index name too
> > > >> > > > >
> > > >> > > > > *Writer-specific case with filters*
> > > >> > > > > {
> > > >> > > > > "index" : "foo"
> > > >> > > > > ,"batchSize" : 1
> > > >> > > > > , "writerConfig" :
> > > >> > > > > {
> > > >> > > > > "elasticsearch" : {
> > > >> > > > > "batchSize" : 100,
> > > >> > > > > "when" : "exists(field1)"
> > > >> > > > > },
> > > >> > > > > "hdfs" : {
> > > >> > > > > "when" : "false"
> > > >> > > > > }
> > > >> > > > > }
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - ES writer writes messages which have field1, HDFS writes nothing
> > > >> > > > > - index is named "foo", different from the sensor, for all writers
> > > >> > > > > - batchSize is 100 for the elasticsearch writer
> > > >> > > > >
> > > >> > > > > Thoughts?
> > > >> > > > >
> > > >> > > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> > > >> [email protected]
> > > >> > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > For larger installations you need to control what is indexed so
> > > >> > > > > > you don’t end up with a nasty Elasticsearch situation and so you
> > > >> > > > > > can mine the data later for reports and training ML models.
> > > >> > > > > >
> > > >> > > > > > Thanks
> > > >> > > > > > Carolyn
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On 1/13/17, 9:40 AM, "Casey Stella" <[email protected]>
> > > wrote:
> > > >> > > > > >
> > > >> > > > > > >OH that's a good idea!
> > > >> > > > > > >
> > > >> > > > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> > > >> [email protected]>
> > > >> > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > >> I like the "Index Filtering" option based on the flexibility
> > > >> > > > > > >> that it provides. Should each output (HDFS, ES, etc) have its
> > > >> > > > > > >> own configuration settings? For example, aren't things like
> > > >> > > > > > >> batching handled separately for HDFS versus Elasticsearch?
> > > >> > > > > > >>
> > > >> > > > > > >> Something along the lines of...
> > > >> > > > > > >>
> > > >> > > > > > >> {
> > > >> > > > > > >> "hdfs" : {
> > > >> > > > > > >> "when": "exists(field1)",
> > > >> > > > > > >> "batchSize": 100
> > > >> > > > > > >> },
> > > >> > > > > > >>
> > > >> > > > > > >> "elasticsearch" : {
> > > >> > > > > > >> "when": "true",
> > > >> > > > > > >> "batchSize": 1000,
> > > >> > > > > > >> "index": "squid"
> > > >> > > > > > >> }
> > > >> > > > > > >> }
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
> > > >> > [email protected]
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > > >>
> > > >> > > > > > >> > Yeah, I tend to like the first option too. Any opposition to
> > > >> > > > > > >> > that from anyone?
> > > >> > > > > > >> >
> > > >> > > > > > >> > The points brought up are good ones and I think that it may be
> > > >> > > > > > >> > worth a broader discussion of the requirements of indexing in a
> > > >> > > > > > >> > separate dev list thread. Maybe a list of desires with coherent
> > > >> > > > > > >> > use-cases justifying them so we can think about how this stuff
> > > >> > > > > > >> > should work and where the natural extension points should be.
> > > >> > > > > > >> > After all, we need to toe the line between engineering and
> > > >> > > > > > >> > overengineering for features nobody will want.
> > > >> > > > > > >> >
> > > >> > > > > > >> > I'm not sure about the extensions to the standard fields. I'm
> > > >> > > > > > >> > torn between the notions that we should have no standard fields
> > > >> > > > > > >> > vs we should have a boatload of standard fields (with most of
> > > >> > > > > > >> > them empty). I exchange positions fairly regularly on that
> > > >> > > > > > >> > question. ;) It may be worth a dev list discussion to lay out
> > > >> > > > > > >> > how you imagine an extension of standard fields and how it might
> > > >> > > > > > >> > look as implemented in Metron.
> > > >> > > > > > >> >
> > > >> > > > > > >> > Casey
> > > >> > > > > > >> >
> > > >> > > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > > >> > > > > > >> > [email protected]>
> > > >> > > > > > >> > wrote:
> > > >> > > > > > >> >
> > > >> > > > > > >> > > I'll second my preference for the first option. I think the
> > > >> > > > > > >> > > ability to use Stellar filters to customize indexing would be a
> > > >> > > > > > >> > > big win.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> > > >> > > > > > >> > > think this is a really important use case that we need to
> > > >> > > > > > >> > > consider. Take a simple example... If I have data coming in
> > > >> > > > > > >> > > from 3 different firewall vendors and 2 different web proxy/url
> > > >> > > > > > >> > > filtering vendors and I want to be able to analyze that data
> > > >> > > > > > >> > > set, I need the data to be indexed all together (likely in
> > > >> > > > > > >> > > HDFS) and to have a normalized schema such that IP address,
> > > >> > > > > > >> > > URL, and user name (to take a few) can be easily queried and
> > > >> > > > > > >> > > aggregated. I can also envision scenarios where I would want to
> > > >> > > > > > >> > > index data based on attributes other than sensor, business unit
> > > >> > > > > > >> > > or subsidiary for example.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > I've been wanting to propose extending our 7 standard fields to
> > > >> > > > > > >> > > include things like URL and user. Is there community
> > > >> > > > > > >> > > interest/support for moving in that direction? If so, I'll
> > > >> > > > > > >> > > start a new thread.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > Thanks!
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > -Kyle
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
> > > >> > [email protected]
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > > Ah, I see. If overriding the default index name
> > > allows
> > > >> > > using
> > > >> > > > > the
> > > >> > > > > > >> same
> > > >> > > > > > >> > > > name for multiple sensors, then the goal can be
> > > >> achieved.
> > > >> > > > > > >> > > > Thanks,
> > > >> > > > > > >> > > > --Matt
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <
> > > >> [email protected]>
> > > >> > > > wrote:
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > Oh, you could! Let's say you have a syslog parser with data
> > > >> > > > > > >> > > > from sources 1, 2 and 3. You'd end up with one kafka queue
> > > >> > > > > > >> > > > with 3 parsers attached to that queue, each picking apart the
> > > >> > > > > > >> > > > messages from source 1, 2 and 3. They'd go through separate
> > > >> > > > > > >> > > > enrichment and into the indexing topology. In the indexing
> > > >> > > > > > >> > > > topology, you could specify the same index name "syslog" and
> > > >> > > > > > >> > > > all of the messages go into the same index for CEP querying
> > > >> > > > > > >> > > > if so desired.
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
> > > >> > > > > [email protected]
> > > >> > > > > > >
> > > >> > > > > > >> > > wrote:
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > > Syslog is hell on parsers – I know, I worked at
> > > >> > > LogLogic
> > > >> > > > > in
> > > >> > > > > > a
> > > >> > > > > > >> > > > previous
> > > >> > > > > > >> > > > > life. It makes perfect sense to route different
> > > >> > lines
> > > >> > > > > from
> > > >> > > > > > >> > syslog
> > > >> > > > > > >> > > > through
> > > >> > > > > > >> > > > > different appropriate parsers. But a lot of
> what
> > > >> > the
> > > >> > > > > > parsers
> > > >> > > > > > >> do
> > > >> > > > > > >> > is
> > > >> > > > > > >> > > > > identify consistent subsets of metadata and
> > > >> annotate
> > > >> > > it
> > > >> > > > –
> > > >> > > > > > eg,
> > > >> > > > > > >> > > > src_ip_addr,
> > > >> > > > > > >> > > > > event timestamps, etc. Once those metadata are
> > > >> > > > annotated
> > > >> > > > > > and
> > > >> > > > > > >> > > > available
> > > >> > > > > > >> > > > > with common field names, why doesn’t it make
> > > >> sense
> > > >> > to
> > > >> > > > > index
> > > >> > > > > > the
> > > >> > > > > > >> > > > messages
> > > >> > > > > > >> > > > > together, for CEP querying? I think Splunk has
> > > >> > > > > illustrated
> > > >> > > > > > >> this
> > > >> > > > > > >> > > > model.
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <
> > > >> > > [email protected]
> > > >> > > > >
> > > >> > > > > > >> wrote:
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > yeah, I mean, honestly, I think the approach
> > > >> > that
> > > >> > > > > we've
> > > >> > > > > > >> taken
> > > >> > > > > > >> > > for
> > > >> > > > > > >> > > > > sources
> > > >> > > > > > >> > > > > which aggregate different types of data is to
> > > >> > > > provide
> > > >> > > > > > >> filters
> > > >> > > > > > >> > > at
> > > >> > > > > > >> > > > the
> > > >> > > > > > >> > > > > parser
> > > >> > > > > > >> > > > > level and have multiple parser topologies
> > > >> (with
> > > >> > > > > > different,
> > > >> > > > > > >> > > > possibly
> > > >> > > > > > >> > > > > mutually exclusive filters) running. This
> > > >> would
> > > >> > > be
> > > >> > > > a
> > > >> > > > > > >> > > completely
> > > >> > > > > > >> > > > > separate
> > > >> > > > > > >> > > > > sensor. Imagine a syslog data source that
> > > >> > > > aggregates
> > > >> > > > > > and
> > > >> > > > > > >> you
> > > >> > > > > > >> > > > want to
> > > >> > > > > > >> > > > > pick
> > > >> > > > > > >> > > > > apart certain pieces of messages. This is
> > > >> why
> > > >> > the
> > > >> > > > > > initial
> > > >> > > > > > >> > > > thought and
> > > >> > > > > > >> > > > > architecture was one index per sensor.
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <
> > > >> > > > > > >> > [email protected]>
> > > >> > > > > > >> > > > wrote:
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > > I’m thinking that CEP (Complex Event
> > > >> > Processing)
> > > >> > > > is
> > > >> > > > > > >> > contrary
> > > >> > > > > > >> > > > to the
> > > >> > > > > > >> > > > > idea
> > > >> > > > > > >> > > > > > of silo-ing data per sensor.
> > > >> > > > > > >> > > > > > Now it’s true that some of those sensors
> > > >> are
> > > >> > > > already
> > > >> > > > > > >> > > > aggregating
> > > >> > > > > > >> > > > > data from
> > > >> > > > > > >> > > > > > multiple sources, so maybe I’m wrong here.
> > > >> > > > > > >> > > > > > But it just seems to me that the “data
> > > >> lake”
> > > >> > > > > insights
> > > >> > > > > > >> come
> > > >> > > > > > >> > > from
> > > >> > > > > > >> > > > > being able
> > > >> > > > > > >> > > > > > to make decisions over the whole mass of
> > > >> data
> > > >> > > > rather
> > > >> > > > > > than
> > > >> > > > > > >> > > just
> > > >> > > > > > >> > > > > vertical
> > > >> > > > > > >> > > > > > slices of it.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > On 1/12/17, 2:15 PM, "Casey Stella" <
> > > >> > > > > > [email protected]>
> > > >> > > > > > >> > > > wrote:
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > Hey Matt,
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > Thanks for the comment!
> > > >> > > > > > >> > > > > > 1. At the moment, we only have one
> > > >> index
> > > >> > > name,
> > > >> > > > > the
> > > >> > > > > > >> > > default
> > > >> > > > > > >> > > > of
> > > >> > > > > > >> > > > > which is
> > > >> > > > > > >> > > > > > the
> > > >> > > > > > >> > > > > > sensor name but that's entirely up to
> > > >> the
> > > >> > > > user.
> > > >> > > > > > This
> > > >> > > > > > >> > is
> > > >> > > > > > >> > > > sensor
> > > >> > > > > > >> > > > > > specific,
> > > >> > > > > > >> > > > > > so it'd be a separate config for each
> > > >> > > sensor.
> > > >> > > > > If
> > > >> > > > > > we
> > > >> > > > > > >> > want
> > > >> > > > > > >> > > > to
> > > >> > > > > > >> > > > > build
> > > >> > > > > > >> > > > > > multiple
> > > >> > > > > > >> > > > > > indices per sensor, we'd have to think
> > > >> > > > carefully
> > > >> > > > > > >> about
> > > >> > > > > > >> > > how
> > > >> > > > > > >> > > > to do
> > > >> > > > > > >> > > > > that
> > > >> > > > > > >> > > > > > and
> > > >> > > > > > >> > > > > > would be a bigger undertaking. I
> > > >> guess I
> > > >> > > can
> > > >> > > > > see
> > > >> > > > > > the
> > > >> > > > > > >> > > use,
> > > >> > > > > > >> > > > though
> > > >> > > > > > >> > > > > > (redirect
> > > >> > > > > > >> > > > > > messages to one index vs another based
> > > >> on
> > > >> > a
> > > >> > > > > > predicate
> > > >> > > > > > >> > for
> > > >> > > > > > >> > > > a given
> > > >> > > > > > >> > > > > > sensor).
> > > >> > > > > > >> > > > > > Anyway, not where I was originally
> > > >> > thinking
> > > >> > > > that
> > > >> > > > > > this
> > > >> > > > > > >> > > > discussion
> > > >> > > > > > >> > > > > would
> > > >> > > > > > >> > > > > > go,
> > > >> > > > > > >> > > > > > but it's an interesting point.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > 2. I hadn't thought through the
> > > >> > > implementation
> > > >> > > > > > quite
> > > >> > > > > > >> > yet,
> > > >> > > > > > >> > > > but we
> > > >> > > > > > >> > > > > don't
> > > >> > > > > > >> > > > > > actually have a splitter bolt in that
> > > >> > > > topology,
> > > >> > > > > > just
> > > >> > > > > > >> a
> > > >> > > > > > >> > > > spout
> > > >> > > > > > >> > > > > that goes
> > > >> > > > > > >> > > > > > to
> > > >> > > > > > >> > > > > > the elasticsearch writer and also to
> > > >> the
> > > >> > > hdfs
> > > >> > > > > > writer.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > On Thu, Jan 12, 2017 at 4:52 PM, Matt
> > > >> > Foley
> > > >> > > <
> > > >> > > > > > >> > > > [email protected]>
> > > >> > > > > > >> > > > > wrote:
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > > Casey, good to have controls like
> > > >> this.
> > > >> > > > > Couple
> > > >> > > > > > >> > > > questions:
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > 1. Regarding the “index” : “squid”
> > > >> > > > name/value
> > > >> > > > > > pair,
> > > >> > > > > > >> > is
> > > >> > > > > > >> > > > the
> > > >> > > > > > >> > > > > index name
> > > >> > > > > > >> > > > > > > expected to always be a sensor
> > > >> name? Or
> > > >> > > is
> > > >> > > > > the
> > > >> > > > > > >> given
> > > >> > > > > > >> > > > json
> > > >> > > > > > >> > > > > structure
> > > >> > > > > > >> > > > > > > subordinate to a sensor name in
> > > >> > zookeeper?
> > > >> > > > Or
> > > >> > > > > > can
> > > >> > > > > > >> we
> > > >> > > > > > >> > > > build
> > > >> > > > > > >> > > > > arbitrary
> > > >> > > > > > >> > > > > > > indexes with this new specification,
> > > >> > > > > > independent of
> > > >> > > > > > >> > > > sensor?
> > > >> > > > > > >> > > > > Should
> > > >> > > > > > >> > > > > > there
> > > >> > > > > > >> > > > > > > actually be a list of “indexes”, ie
> > > >> > > > > > >> > > > > > > { “indexes” : [
> > > >> > > > > > >> > > > > > > {“index” : “name1”,
> > > >> > > > > > >> > > > > > > …
> > > >> > > > > > >> > > > > > > },
> > > >> > > > > > >> > > > > > > {“index” : “name2”,
> > > >> > > > > > >> > > > > > > …
> > > >> > > > > > >> > > > > > > } ]
> > > >> > > > > > >> > > > > > > }
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > 2. Would the filtering / writer
> > > >> > selection
> > > >> > > > > logic
> > > >> > > > > > >> take
> > > >> > > > > > >> > > > place in
> > > >> > > > > > >> > > > > the
> > > >> > > > > > >> > > > > > indexing
> > > >> > > > > > >> > > > > > > topology splitter bolt? Seems like
> > > >> that
> > > >> > > > would
> > > >> > > > > > have
> > > >> > > > > > >> > the
> > > >> > > > > > >> > > > > smallest
> > > >> > > > > > >> > > > > > impact on
> > > >> > > > > > >> > > > > > > current implementation, no?
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > Sorry if these are already answered
> > > >> in
> > > >> > > > > PR-415, I
> > > >> > > > > > >> > > haven’t
> > > >> > > > > > >> > > > had
> > > >> > > > > > >> > > > > time to
> > > >> > > > > > >> > > > > > > review that one yet.
> > > >> > > > > > >> > > > > > > Thanks,
> > > >> > > > > > >> > > > > > > --Matt
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > On 1/12/17, 12:55 PM, "Michael
> > > >> > Miklavcic"
> > > >> > > <
> > > >> > > > > > >> > > > > > [email protected]>
> > > >> > > > > > >> > > > > > > wrote:
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > I like the flexibility and
> > > >> > > > expressibility
> > > >> > > > > of
> > > >> > > > > > >> the
> > > >> > > > > > >> > > > first
> > > >> > > > > > >> > > > > option
> > > >> > > > > > >> > > > > > with
> > > >> > > > > > >> > > > > > > Stellar
> > > >> > > > > > >> > > > > > > filters.
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > M
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > On Thu, Jan 12, 2017 at 1:51 PM,
> > > >> > Casey
> > > >> > > > > > Stella <
> > > >> > > > > > >> > > > > > [email protected]>
> > > >> > > > > > >> > > > > > > wrote:
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > > As of METRON-652
> > > >> > > > > > >> > > > > > > > <https://github.com/apache/incubator-metron/pull/415>,
> > > >> > > > > > >> > > > > > > > we will have decoupled the indexing
> > > >> > > > > > >> > > > > > > > configuration from the enrichment
> > > >> > > > > > >> > > > > > > > configuration. As an immediate follow-up to
> > > >> > > > > > >> > > > > > > > that, I'd like to provide the ability to turn
> > > >> > > > > > >> > > > > > > > off and on writers via the configs. I'd like to
> > > >> > > > > > >> > > > > > > > get some community feedback on how the
> > > >> > > > > > >> > > > > > > > functionality should work, if y'all are
> > > >> > > > > > >> > > > > > > > amenable. :)
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > As of now, we have 3 possible writers which can be used in
> > > >> > > > > > >> > > > > > > > the indexing topology:
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > - Solr
> > > >> > > > > > >> > > > > > > > - Elasticsearch
> > > >> > > > > > >> > > > > > > > - HDFS
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > HDFS is always used, elasticsearch or solr is used depending
> > > >> > > > > > >> > > > > > > > on how you start the indexing topology.
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > A couple of proposals come to mind immediately:
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > *Index Filtering*
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > You would be able to specify a filter as defined by a stellar
> > > >> > > > > > >> > > > > > > > statement (likely a reuse of the StellarFilter that exists in
> > > >> > > > > > >> > > > > > > > the Parsers) which would allow you to indicate on a
> > > >> > > > > > >> > > > > > > > message-by-message basis whether or not to write the message.
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > The semantics of this would be as follows:
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > - Default (i.e. unspecified) is to pass everything through
> > > >> > > > > > >> > > > > > > >   (hence backwards compatible with the current default
> > > >> > > > > > >> > > > > > > >   config).
> > > >> > > > > > >> > > > > > > > - Messages which have the associated stellar statement
> > > >> > > > > > >> > > > > > > >   evaluate to true for the writer type will be written,
> > > >> > > > > > >> > > > > > > >   otherwise not.
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > Sample indexing config which would write out no messages to
> > > >> > > > > > >> > > > > > > > HDFS and write out only messages containing a field called
> > > >> > > > > > >> > > > > > > > "field1":
> > > >> > > > > > >> > > > > > > > {
> > > >> > > > > > >> > > > > > > >   "index" : "squid"
> > > >> > > > > > >> > > > > > > >  ,"batchSize" : 100
> > > >> > > > > > >> > > > > > > >  ,"filters" : {
> > > >> > > > > > >> > > > > > > >     "HDFS" : "false"
> > > >> > > > > > >> > > > > > > >    ,"ES" : "exists(field1)"
> > > >> > > > > > >> > > > > > > >   }
> > > >> > > > > > >> > > > > > > > }
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > *Index On/Off Switch*
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > A simpler solution would be to just provide a list of writers
> > > >> > > > > > >> > > > > > > > to write messages. The semantics would be as follows:
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > - If the list is unspecified, then the default is to write
> > > >> > > > > > >> > > > > > > >   all messages for every writer in the indexing topology
> > > >> > > > > > >> > > > > > > > - If the list is specified, then a writer will write all
> > > >> > > > > > >> > > > > > > >   messages if and only if it is named in the list.
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > Sample indexing config which turns off HDFS and keeps on
> > > >> > > > > > >> > > > > > > > Elasticsearch:
> > > >> > > > > > >> > > > > > > > {
> > > >> > > > > > >> > > > > > > >   "index" : "squid"
> > > >> > > > > > >> > > > > > > >  ,"batchSize" : 100
> > > >> > > > > > >> > > > > > > >  ,"writers" : [ "ES" ]
> > > >> > > > > > >> > > > > > > > }
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > Thanks in advance for the feedback! Also, if you have any
> > > >> > > > > > >> > > > > > > > other, better ideas than the ones presented here, let me know
> > > >> > > > > > >> > > > > > > > too.
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > Best,
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > Casey
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > >
> > > >> > > > > > >> >
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >> --
> > > >> > > > > > >> Nick Allen <[email protected]>
> > > >> > > > > > >>
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Nick Allen <[email protected]>
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Nick Allen <[email protected]>
> > > >
> > >
> > >
> > >
> > > --
> > > Nick Allen <[email protected]>
> > >
> >
> --
>
> Jon
>
> Sent from my mobile device
>
>
>
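
For anyone skimming the thread, the semantics of the two proposals quoted
above (per-writer Stellar filters vs. a plain list of writers) can be
sketched roughly as below. This is only an illustration in Python, not
Metron code: the function names, the `evaluate` callback standing in for a
Stellar evaluator, and the toy evaluator that understands just the
expressions used in the sample configs are all assumptions made for the
example.

```python
from typing import Any, Callable, Dict

Message = Dict[str, Any]
# Hypothetical stand-in for a Stellar evaluator: (expression, message) -> bool.
Evaluator = Callable[[str, Message], bool]


def should_write_filtered(config: dict, writer: str, message: Message,
                          evaluate: Evaluator) -> bool:
    """Index Filtering: a writer with no filter passes everything through."""
    expression = config.get("filters", {}).get(writer)
    if expression is None:
        return True  # unspecified filter keeps the current default behaviour
    return evaluate(expression, message)


def should_write_listed(config: dict, writer: str) -> bool:
    """Index On/Off Switch: no list means every writer stays on."""
    writers = config.get("writers")
    if writers is None:
        return True
    return writer in writers


def toy_evaluate(expression: str, message: Message) -> bool:
    """Toy evaluator (assumption): handles only 'true', 'false', exists(x)."""
    if expression == "false":
        return False
    if expression == "true":
        return True
    if expression.startswith("exists(") and expression.endswith(")"):
        field = expression[len("exists("):-1]
        return field in message
    raise ValueError(f"unsupported expression: {expression}")
```

With the first sample config, `should_write_filtered(cfg, "HDFS", msg,
toy_evaluate)` is false for every message and the "ES" writer only accepts
messages carrying "field1". With the second, `should_write_listed(cfg,
"HDFS")` is false and "ES" is true. In both sketches an empty config leaves
every writer on, which is the on-by-default behaviour argued for at the top
of this message.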
