Re: [DISCUSS] Turning off indexing writers feature discussion

Casey Stella Mon, 16 Jan 2017 13:24:40 -0800

The problem with throwing up a warning is that this is a sensor-specific
configuration and the Indexing topology does not know at topology start
time all of the sensors.  Furthermore, you can start a new sensor in the
middle of a running topology.


I'd suggest a compromise: have the indexing topology use the defaults for
writing, log a warning to the Storm UI that the sensor's writer config is
not explicitly specified, and recommend remedying that immediately.
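
To make the compromise concrete, here is a minimal sketch in Python, not
the actual Metron code. The function name, the logger, and the shape of
the config dictionary are all assumptions made for illustration. The
default values (index named after the sensor, batchSize of 1, writer
enabled) follow the base case described further down in this thread. The
lookup happens per sensor at write time, so a sensor started in the middle
of a running topology is handled the same way. The warning goes to an
ordinary Python logger here; surfacing it in the Storm UI is left out.

```python
import logging

logger = logging.getLogger("indexing")  # hypothetical logger name

# Assumed default, taken from the base case in this thread.
DEFAULT_BATCH_SIZE = 1


def resolve_writer_config(sensor, writer, sensor_configs):
    """Return the effective config for one writer of one sensor.

    sensor_configs is a hypothetical dict: sensor name -> writer name -> settings.
    A missing entry falls back to the defaults and logs a warning.
    """
    writer_config = (sensor_configs.get(sensor) or {}).get(writer)
    if writer_config is None:
        logger.warning(
            "No explicit %s writer config for sensor '%s'; using defaults. "
            "Please add an explicit config.", writer, sensor)
        writer_config = {}

    # On-by-default semantics: unspecified means enabled.
    effective = {"enabled": True, "index": sensor, "batchSize": DEFAULT_BATCH_SIZE}
    effective.update(writer_config)

    # The examples in this thread write 'enabled' as the strings 'true'/'false'.
    effective["enabled"] = str(effective["enabled"]).lower() == "true"
    return effective
```

With this shape, temporarily disabling a writer is just a matter of setting
'enabled' to 'false' while leaving the other settings in place.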

On Mon, Jan 16, 2017 at 4:01 PM, James Sirota <[email protected]> wrote:

> The explicit on/off seems like a good option to have.  This way I don't
> have to completely remove the config block in order for me to test
> something.  I think if the config for the writer is unspecified we should
> throw up a warning.
>
> 16.01.2017, 08:54, "Nick Allen" <[email protected]>:
> >>  To recap, what I am +1 on is Nick's proposed syntax with the following
> >>  modifications:
> >>  1. An explicit enabled field
> >>  2. A default on for unspecified to match current semantics
> >
> > I'm +1 on all of this.
> >
> > On Sat, Jan 14, 2017 at 10:51 AM, Casey Stella <[email protected]>
> wrote:
> >
> >>  I'm +1 on an explicit enabled property and a filter (or when)
> property. I
> >>  think we are zeroing in on a decent design, so that is good.
> >>
> >>  To recap, what I am +1 on is Nick's proposed syntax with the following
> >>  modifications:
> >>  1. An explicit enabled field
> >>  2. A default on for unspecified to match current semantics
> >>
> >>  Casey
> >>  On Sat, Jan 14, 2017 at 10:45 [email protected] <[email protected]>
> wrote:
> >>
> >>  > This has the additional benefit of doing something like below when
> you
> >>  want
> >>  > to temporarily disable the hdfs writer, but don't want to remove the
> >>  > settings. This removes the need to store the path and batchSize (and
> >>  many
> >>  > additional settings) somewhere else so they can be brought back in
> when
> >>  you
> >>  > want to re-enable it, which is a nice workflow attribute for the end
> >>  user:
> >>  >
> >>  > {
> >>  > 'elasticsearch': {
> >>  > 'enabled': 'true',
> >>  > 'index': 'foo',
> >>  > 'batchSize': 100
> >>  > },
> >>  > 'hdfs': {
> >>  > 'enabled': 'false',
> >>  > 'path': '/foo/bar/...',
> >>  > 'batchSize': 100
> >>  > },
> >>  > 'solr': {
> >>  > 'enabled': 'false'
> >>  > }
> >>  > }
> >>  >
> >>  > Jon
> >>  >
> >>  > On Sat, Jan 14, 2017 at 9:24 AM [email protected] <[email protected]>
> >>  wrote:
> >>  >
> >>  > > I similarly have a concern there because I prefer being as
> explicit as
> >>  > > possible, which makes things easier to pick up for new users.
> Using my
> >>  > > example from earlier this could look like specifying while(false),
> but
> >>  an
> >>  > > even better and more obvious approach may be to use
> enabled(false). So
> >>  > the
> >>  > > current simple default would be:
> >>  > >
> >>  > > {
> >>  > > 'elasticsearch': { 'enabled': 'true' },
> >>  > > 'hdfs': { 'enabled': 'true' },
> >>  > > 'solr': { 'enabled': 'false' }
> >>  > > }
> >>  > >
> >>  > > And to use ES with some overrides but not HDFS or solr it would
> look
> >>  > like:
> >>  > >
> >>  > > {
> >>  > > 'elasticsearch': {
> >>  > > 'enabled': 'true',
> >>  > > 'index': 'foo',
> >>  > > 'batchSize': 100
> >>  > > },
> >>  > > 'hdfs': {
> >>  > > 'enabled': 'false'
> >>  > > },
> >>  > > 'solr': {
> >>  > > 'enabled': 'false'
> >>  > > }
> >>  > > }
> >>  > >
> >>  > > Jon
> >>  > >
> >>  > > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <[email protected]>
> >>  > wrote:
> >>  > >
> >>  > > One thing that I thought of that I very strenuously do not like in
> Nick's
> >>  > > proposal is that if a writer config is not specified then it is
> turned
> >>  > off
> >>  > > (I think; if I misunderstood let me know). In the situation where
> we
> >>  > have a
> >>  > > new sensor, right now if there are no index config and no
> enrichment
> >>  > > config, it still passes through to the index using defaults. In
> this
> >>  new
> >>  > > scheme it would not. This changes the default semantics for the
> system
> >>  > and
> >>  > > I think it changes it for the worse.
> >>  > >
> >>  > > I would strongly prefer an on-by-default indexing config as we have
> now.
> >>  > > On Fri, Jan 13, 2017 at 17:13 Casey Stella <[email protected]>
> wrote:
> >>  > >
> >>  > > > One thing that I really like about Nick's suggestion is that it
> >>  allows
> >>  > > > writer-specific configs in a clear and simple way. It is more
> >>  complex
> >>  > > for
> >>  > > > the default case (all writers write to indices named the same
> thing
> >>  > with
> >>  > > a
> >>  > > > fixed batch size), which I do not like, but maybe it's worth the
> >>  > > compromise
> >>  > > > to make it less complex for the advanced case.
> >>  > > >
> >>  > > > Thanks a lot for the suggestion, Nick, it's interesting; I'm
> >>  beginning
> >>  > > to
> >>  > > > lean your way.
> >>  > > >
> >>  > > > On Fri, Jan 13, 2017 at 2:51 PM, [email protected] <
> [email protected]>
> >>  > > > wrote:
> >>  > > >
> >>  > > > I like the suggestions you made, Nick. The only thing I would
> add is
> >>  > > that
> >>  > > > it's also nice to see an explicit when(false), as people newer
> to the
> >>  > > > platform may not know where to expect configs for the different
> >>  > writers.
> >>  > > > Being able to do it either way, which I think is already assumed
> in
> >>  > your
> >>  > > > model, would make sense. I would just suggest that, if we support
> >>  but
> >>  > > are
> >>  > > > disabling a writer, that the platform inserts a default
> when(false)
> >>  to
> >>  > be
> >>  > > > explicit.
> >>  > > >
> >>  > > > Jon
> >>  > > >
> >>  > > > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <
> [email protected]>
> >>  > > wrote:
> >>  > > >
> >>  > > > > Let me noodle on this over the weekend. Your syntax is looking
> >>  less
> >>  > > > > onerous to me and I like the following statement from Otto:
> "In the
> >>  > > end,
> >>  > > > > each write destination ‘type’ will need its own configuration.
> >>  This
> >>  > > is
> >>  > > > an
> >>  > > > > extension point."
> >>  > > > >
> >>  > > > > I may come around to your way of thinking.
> >>  > > > >
> >>  > > > > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
> >>  > [email protected]
> >>  > > >
> >>  > > > > wrote:
> >>  > > > >
> >>  > > > > > In the end, each write destination ‘type’ will need its own
> >>  > > > > > configuration. This is an extension point.
> >>  > > > > > {
> >>  > > > > > HDFS:{
> >>  > > > > > outputAdapters:[
> >>  > > > > > {name: avro,
> >>  > > > > > settings:{
> >>  > > > > > avro stuff….
> >>  > > > > > when:{
> >>  > > > > > },
> >>  > > > > > {
> >>  > > > > > name: sequence file,
> >>  > > > > > …..
> >>  > > > > >
> >>  > > > > > or some such.
> >>  > > > > >
> >>  > > > > >
> >>  > > > > > On January 13, 2017 at 11:51:15, Nick Allen (
> [email protected])
> >>  > > > wrote:
> >>  > > > > >
> >>  > > > > > I will add also that instead of global overrides, like
> index, we
> >>  > > should
> >>  > > > > use
> >>  > > > > > configuration key names that are more appropriate to the
> output.
> >>  > > > > >
> >>  > > > > > For example, does 'index' really make sense for HDFS? Or
> would
> >>  > 'path'
> >>  > > > be
> >>  > > > > > more appropriate?
> >>  > > > > >
> >>  > > > > > {
> >>  > > > > > 'elasticsearch': {
> >>  > > > > > 'index': 'foo',
> >>  > > > > > 'batchSize': 1
> >>  > > > > > },
> >>  > > > > > 'hdfs': {
> >>  > > > > > 'path': '/foo/bar/...',
> >>  > > > > > 'batchSize': 100
> >>  > > > > > }
> >>  > > > > > }
> >>  > > > > >
> >>  > > > > > Ok, I've said my piece. Thanks for the effort in summarizing
> all
> >>  > > this,
> >>  > > > > > Casey.
> >>  > > > > >
> >>  > > > > >
> >>  > > > > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <
> [email protected]
> >>  >
> >>  > > > wrote:
> >>  > > > > >
> >>  > > > > > > Nick's concerns about my suggestion were that it was overly
> >>  > complex
> >>  > > > and
> >>  > > > > > >> hard to grok and that we could dispense with backwards
> >>  > > compatibility
> >>  > > > > and
> >>  > > > > > >> make people do a bit more work on the default case for the
> >>  > > benefits
> >>  > > > > of a
> >>  > > > > > >> simpler advanced case. (Nick, make sure I don't misstate
> your
> >>  > > > > position)
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > > I will add that in my mind, the majority case would be a
> >>  user
> >>  > > > > > > specifying the outputs, but not things like 'batchSize' or
> >>  > 'when'.
> >>  > > I
> >>  > > > > > think
> >>  > > > > > > in the majority case, the user would accept whatever the
> >>  default
> >>  > > > batch
> >>  > > > > > size
> >>  > > > > > > is.
> >>  > > > > > >
> >>  > > > > > > Here are alternative suggestions for all the examples
> that you
> >>  > > > > provided
> >>  > > > > > > previously.
> >>  > > > > > >
> >>  > > > > > > Base Case
> >>  > > > > > >
> >>  > > > > > > - The user must always specify the 'outputs' for clarity.
> >>  > > > > > > - Uses default index name, batch size and when = true.
> >>  > > > > > >
> >>  > > > > > > {
> >>  > > > > > > 'elasticsearch': {},
> >>  > > > > > > 'hdfs': {}
> >>  > > > > > > }
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > > Writer-non-specific Case
> >>  > > > > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> >>  > > > > > >
> >>  > > > > > > - There are no global overrides, as in Casey's proposal.
> >>  > > > > > > - Easier to grok IMO.
> >>  > > > > > >
> >>  > > > > > > {
> >>  > > > > > > 'elasticsearch': {
> >>  > > > > > > 'index': 'foo',
> >>  > > > > > > 'batchSize': 100
> >>  > > > > > > },
> >>  > > > > > > 'hdfs': {
> >>  > > > > > > 'index': 'foo',
> >>  > > > > > > 'batchSize': 100
> >>  > > > > > > }
> >>  > > > > > > }
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > > Writer-specific case without filters
> >>  > > > > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> >>  > > > > > >
> >>  > > > > > > {
> >>  > > > > > > 'elasticsearch': {
> >>  > > > > > > 'index': 'foo',
> >>  > > > > > > 'batchSize': 1
> >>  > > > > > > },
> >>  > > > > > > 'hdfs': {
> >>  > > > > > > 'index': 'foo',
> >>  > > > > > > 'batchSize': 100
> >>  > > > > > > }
> >>  > > > > > > }
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > > Writer-specific case with filters
> >>  > > > > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> >>  > > > > > >
> >>  > > > > > > - Instead of having to say when=false, just don't configure
> >>  HDFS
> >>  > > > > > >
> >>  > > > > > > {
> >>  > > > > > > 'elasticsearch': {
> >>  > > > > > > 'index': 'foo',
> >>  > > > > > > 'batchSize': 100,
> >>  > > > > > > 'when': 'exists(field1)'
> >>  > > > > > > }
> >>  > > > > > > }
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > >
> >>  > > > > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <
> >>  > [email protected]
> >>  > > >
> >>  > > > > > wrote:
> >>  > > > > > >
> >>  > > > > > >> Dave,
> >>  > > > > > >> For the benefit of posterity and people who might not be
> as
> >>  > deeply
> >>  > > > > > >> entangled in the system as we have been, I'll recap
> things and
> >>  > > > > hopefully
> >>  > > > > > >> answer your question in the process.
> >>  > > > > > >>
> >>  > > > > > >> Historically the index configuration is split between the
> >>  > > enrichment
> >>  > > > > > >> configs and the global configs.
> >>  > > > > > >>
> >>  > > > > > >> - The global configs really control configs that apply
> to all
> >>  > > > > sensors.
> >>  > > > > > >> Historically this has been stuff like index connection
> >>  strings,
> >>  > > etc.
> >>  > > > > > >> - The sensor-specific configs control things that
> vary
> >>  by
> >>  > > > > sensor.
> >>  > > > > > >>
> >>  > > > > > >> As of Metron-652 (in review currently), we moved the
> sensor
> >>  > > specific
> >>  > > > > > >> configs from the enrichment configs. The proposal here is
> to
> >>  > > > increase
> >>  > > > > > the
> >>  > > > > > >> granularity of the sensor-specific files to make them
> >>  > support
> >>  > > > > index
> >>  > > > > > >> writer-specific configs. Right now in the indexing
> topology,
> >>  we
> >>  > > > have 2
> >>  > > > > > >> writers (fixed): ES/Solr and HDFS.
> >>  > > > > > >>
> >>  > > > > > >> The proposed configuration would allow you to either
> specify a
> >>  > > > blanket
> >>  > > > > > >> sensor-level config for the index name and batchSize
> and/or
> >>  > > override
> >>  > > > > at
> >>  > > > > > >> the
> >>  > > > > > >> writer level, thereby supporting a couple of use-cases:
> >>  > > > > > >>
> >>  > > > > > >> - Turning off certain index writers (e.g. HDFS)
> >>  > > > > > >> - Filtering the messages written to certain index writers
> >>  > > > > > >>
> >>  > > > > > >> The two competing configs between Nick and me are as
> follows:
> >>  > > > > > >>
> >>  > > > > > >> - I want to make sure we keep the old sensor-specific
> defaults
> >>  > > with
> >>  > > > > > >> writer-specific overrides available
> >>  > > > > > >> - Nick thought we could simplify the permutations by
> making
> >>  the
> >>  > > > > > >> indexing
> >>  > > > > > >> config only the writer-level configs.
> >>  > > > > > >>
> >>  > > > > > >> My concerns about Nick's suggestion were that the default
> and
> >>  > > > majority
> >>  > > > > > >> case, specifying the index and the batchSize for all
> writers
> >>  (the
> >>  > > > one
> >>  > > > > we
> >>  > > > > > >> support now) would require more configuration.
> >>  > > > > > >>
> >>  > > > > > >> Nick's concerns about my suggestion were that it was
> overly
> >>  > > complex
> >>  > > > > and
> >>  > > > > > >> hard to grok and that we could dispense with backwards
> >>  > > compatibility
> >>  > > > > and
> >>  > > > > > >> make people do a bit more work on the default case for the
> >>  > > benefits
> >>  > > > > of a
> >>  > > > > > >> simpler advanced case. (Nick, make sure I don't misstate
> your
> >>  > > > > position).
> >>  > > > > > >>
> >>  > > > > > >> Casey
> >>  > > > > > >>
> >>  > > > > > >>
> >>  > > > > > >> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <
> >>  > > [email protected]>
> >>  > > > > > >> wrote:
> >>  > > > > > >>
> >>  > > > > > >> > Casey,
> >>  > > > > > >> >
> >>  > > > > > >> > Can you give me a level set of what your thinking is
> now? I
> >>  > > think
> >>  > > > > it's
> >>  > > > > > >> > global control of all index types + overrides on a
> per-type
> >>  > > basis.
> >>  > > > > > Fwiw,
> >>  > > > > > >> > I'm totally for that, but I want to make sure I'm not
> >>  imposing
> >>  > > my
> >>  > > > > > >> > preconceived notions on your consensus-driven ones.
> >>  > > > > > >> >
> >>  > > > > > >> > -D....
> >>  > > > > > >> >
> >>  > > > > > >> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
> >>  > > > [email protected]>
> >>  > > > > > >> wrote:
> >>  > > > > > >> >
> >>  > > > > > >> > > I am suggesting that, yes. The configs are
> essentially the
> >>  > > same
> >>  > > > as
> >>  > > > > > >> > yours,
> >>  > > > > > >> > > except there is an override specified at the top
> level.
> >>  > > Without
> >>  > > > > > >> that, in
> >>  > > > > > >> > > order to specify both HDFS and ES have batch sizes of
> 100,
> >>  > you
> >>  > > > > have
> >>  > > > > > to
> >>  > > > > > >> > > explicitly configure each. It's less that I'm trying
> to
> >>  have
> >>  > > > > > >> backwards
> >>  > > > > > >> > > compatibility and more that I'm trying to make the
> >>  majority
> >>  > > case
> >>  > > > > > easy:
> >>  > > > > > >> > both
> >>  > > > > > >> > > writers write everything to a specified index name
> with a
> >>  > > > > specified
> >>  > > > > > >> batch
> >>  > > > > > >> > > size (which is what we have now). Beyond that, I want
> to
> >>  > allow
> >>  > > > for
> >>  > > > > > >> > > specifying an override for the config on a
> >>  writer-by-writer
> >>  > > > basis
> >>  > > > > > for
> >>  > > > > > >> > those
> >>  > > > > > >> > > who need it.
> >>  > > > > > >> > >
> >>  > > > > > >> > > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
> >>  > > > [email protected]>
> >>  > > > > > >> wrote:
> >>  > > > > > >> > >
> >>  > > > > > >> > > > Are you saying we support all of these variants? I
> >>  realize
> >>  > > you
> >>  > > > > are
> >>  > > > > > >> > > trying
> >>  > > > > > >> > > > to have some backwards compatibility, but this also
> >>  makes
> >>  > it
> >>  > > > > > harder
> >>  > > > > > >> > for a
> >>  > > > > > >> > > > user to grok (for me at least).
> >>  > > > > > >> > > >
> >>  > > > > > >> > > > Personally I like my original example as there are
> fewer
> >>  > > > > > >> > sub-structures,
> >>  > > > > > >> > > > like 'writerConfig', which makes the whole thing
> simpler
> >>  > and
> >>  > > > > > easier
> >>  > > > > > >> to
> >>  > > > > > >> > > > grok. But maybe others will think your proposal is
> just
> >>  as
> >>  > > > easy
> >>  > > > > to
> >>  > > > > > >> > grok.
> >>  > > > > > >> > > >
> >>  > > > > > >> > > >
> >>  > > > > > >> > > >
> >>  > > > > > >> > > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> >>  > > > > > [email protected]>
> >>  > > > > >
> >>  > > > > > >> > > wrote:
> >>  > > > > > >> > > >
> >>  > > > > > >> > > > > Ok, so here's what I'm thinking based on the
> >>  discussion:
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > - Keeping the configs that we have now (batchSize
> and
> >>  > > index)
> >>  > > > > as
> >>  > > > > > >> > > > defaults
> >>  > > > > > >> > > > > for the unspecified writer-specific case
> >>  > > > > > >> > > > > - Adding the config Nick suggested
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > *Base Case*:
> >>  > > > > > >> > > > > {
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > - all writers write all messages
> >>  > > > > > >> > > > > - index named the same as the sensor for all
> writers
> >>  > > > > > >> > > > > - batchSize of 1 for all writers
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > *Writer-non-specific case*:
> >>  > > > > > >> > > > > {
> >>  > > > > > >> > > > > "index" : "foo"
> >>  > > > > > >> > > > > ,"batchSize" : 100
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > - All writers write all messages
> >>  > > > > > >> > > > > - index is named "foo", different from the sensor
> for
> >>  > all
> >>  > > > > > >> writers
> >>  > > > > > >> > > > > - batchSize is 100 for all writers
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > *Writer-specific case without filters*
> >>  > > > > > >> > > > > {
> >>  > > > > > >> > > > > "index" : "foo"
> >>  > > > > > >> > > > > ,"batchSize" : 1
> >>  > > > > > >> > > > > , "writerConfig" :
> >>  > > > > > >> > > > > {
> >>  > > > > > >> > > > > "elasticsearch" : {
> >>  > > > > > >> > > > > "batchSize" : 100
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > - All writers write all messages
> >>  > > > > > >> > > > > - index is named "foo", different from the sensor
> for
> >>  > all
> >>  > > > > > >> writers
> >>  > > > > > >> > > > > - batchSize is 1 for HDFS and 100 for
> elasticsearch
> >>  > > writers
> >>  > > > > > >> > > > > - NOTE: I could override the index name too
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > *Writer-specific case with filters*
> >>  > > > > > >> > > > > {
> >>  > > > > > >> > > > > "index" : "foo"
> >>  > > > > > >> > > > > ,"batchSize" : 1
> >>  > > > > > >> > > > > , "writerConfig" :
> >>  > > > > > >> > > > > {
> >>  > > > > > >> > > > > "elasticsearch" : {
> >>  > > > > > >> > > > > "batchSize" : 100,
> >>  > > > > > >> > > > > "when" : "exists(field1)"
> >>  > > > > > >> > > > > },
> >>  > > > > > >> > > > > "hdfs" : {
> >>  > > > > > >> > > > > "when" : "false"
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > > }
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > - ES writer writes messages which have field1,
> HDFS
> >>  > > doesn't
> >>  > > > > > >> > > > > - index is named "foo", different from the sensor
> for
> >>  > all
> >>  > > > > > >> writers
> >>  > > > > > >> > > > > - batchSize is 100 for elasticsearch writers
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > Thoughts?
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> >>  > > > > > >> [email protected]
> >>  > > > > > >> > >
> >>  > > > > > >> > > > > wrote:
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > > For larger installations you need to control
> what is
> >>  > > > indexed
> >>  > > > > > so
> >>  > > > > > >> you
> >>  > > > > > >> > > > don’t
> >>  > > > > > >> > > > > > end up with a nasty Elasticsearch situation
> and so
> >>  > you
> >>  > > > can
> >>  > > > > > mine
> >>  > > > > > >> > the
> >>  > > > > > >> > > > data
> >>  > > > > > >> > > > > > later for reports and training ml models.
> >>  > > > > > >> > > > > >
> >>  > > > > > >> > > > > > Thanks
> >>  > > > > > >> > > > > > Carolyn
> >>  > > > > > >> > > > > >
> >>  > > > > > >> > > > > >
> >>  > > > > > >> > > > > >
> >>  > > > > > >> > > > > >
> >>  > > > > > >> > > > > > On 1/13/17, 9:40 AM, "Casey Stella" <
> >>  > [email protected]
> >>  > > >
> >>  > > > > > wrote:
> >>  > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >OH that's a good idea!
> >>  > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> >>  > > > > > >> [email protected]>
> >>  > > > > > >> > > > wrote:
> >>  > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> I like the "Index Filtering" option based on
> the
> >>  > > > > > flexibility
> >>  > > > > > >> > that
> >>  > > > > > >> > > it
> >>  > > > > > >> > > > > > >> provides. Should each output (HDFS, ES, etc)
> have
> >>  > its
> >>  > > > own
> >>  > > > > > >> > > > > configuration
> >>  > > > > > >> > > > > > >> settings? For example, aren't things like
> >>  batching
> >>  > > > > handled
> >>  > > > > > >> > > > separately
> >>  > > > > > >> > > > > > for
> >>  > > > > > >> > > > > > >> HDFS versus Elasticsearch?
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >> Something along the lines of...
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >> {
> >>  > > > > > >> > > > > > >> "hdfs" : {
> >>  > > > > > >> > > > > > >> "when": "exists(field1)",
> >>  > > > > > >> > > > > > >> "batchSize": 100
> >>  > > > > > >> > > > > > >> },
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >> "elasticsearch" : {
> >>  > > > > > >> > > > > > >> "when": "true",
> >>  > > > > > >> > > > > > >> "batchSize": 1000,
> >>  > > > > > >> > > > > > >> "index": "squid"
> >>  > > > > > >> > > > > > >> }
> >>  > > > > > >> > > > > > >> }
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey
> Stella <
> >>  > > > > > >> > [email protected]
> >>  > > > > > >> > > >
> >>  > > > > > >> > > > > > wrote:
> >>  > > > > > >> > > > > > >>
> >>  > > > > > >> > > > > > >> > Yeah, I tend to like the first option too.
> Any
> >>  > > > > opposition
> >>  > > > > > >> to
> >>  > > > > > >> > > that
> >>  > > > > > >> > > > > > from
> >>  > > > > > >> > > > > > >> > anyone?
> >>  > > > > > >> > > > > > >> >
> >>  > > > > > >> > > > > > >> > The points brought up are good ones and I
> think
> >>  > > that
> >>  > > > it
> >>  > > > > > >> may be
> >>  > > > > > >> > > > > worth a
> >>  > > > > > >> > > > > > >> > broader discussion of the requirements of
> >>  > indexing
> >>  > > > in a
> >>  > > > > > >> > separate
> >>  > > > > > >> > > > dev
> >>  > > > > > >> > > > > > list
> >>  > > > > > >> > > > > > >> > thread. Maybe a list of desires with
> coherent
> >>  > > > use-cases
> >>  > > > > > >> > > > justifying
> >>  > > > > > >> > > > > > them
> >>  > > > > > >> > > > > > >> so
> >>  > > > > > >> > > > > > >> > we can think about how this stuff should
> work
> >>  and
> >>  > > > where
> >>  > > > > > the
> >>  > > > > > >> > > > natural
> >>  > > > > > >> > > > > > >> > extension points should be. After all, we
> need
> >>  to
> >>  > > toe
> >>  > > > > the
> >>  > > > > > >> line
> >>  > > > > > >> > > > > between
> >>  > > > > > >> > > > > > >> > engineering and overengineering for
> features
> >>  > nobody
> >>  > > > > will
> >>  > > > > > >> want.
> >>  > > > > > >> > > > > > >> >
> >>  > > > > > >> > > > > > >> > I'm not sure about the extensions to the
> >>  standard
> >>  > > > > fields.
> >>  > > > > > >> I'm
> >>  > > > > > >> > > > torn
> >>  > > > > > >> > > > > > >> between
> >>  > > > > > >> > > > > > >> > the notions that we should have no standard
> >>  > fields
> >>  > > vs
> >>  > > > > we
> >>  > > > > > >> > should
> >>  > > > > > >> > > > > have a
> >>  > > > > > >> > > > > > >> > boatload of standard fields (with most of
> them
> >>  > > > empty).
> >>  > > > > I
> >>  > > > > > >> > > exchange
> >>  > > > > > >> > > > > > >> > positions fairly regularly on that
> question. ;)
> >>  > It
> >>  > > > may
> >>  > > > > be
> >>  > > > > > >> > > worth a
> >>  > > > > > >> > > > > dev
> >>  > > > > > >> > > > > > >> list
> >>  > > > > > >> > > > > > >> > discussion to lay out how you imagine an
> >>  > extension
> >>  > > of
> >>  > > > > > >> standard
> >>  > > > > > >> > > > > fields
> >>  > > > > > >> > > > > > and
> >>  > > > > > >> > > > > > >> > how it might look as implemented in Metron.
> >>  > > > > > >> > > > > > >> >
> >>  > > > > > >> > > > > > >> > Casey
> >>  > > > > > >> > > > > > >> >
> >>  > > > > > >> > > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle
> >>  Richardson
> >>  > <
> >>  > > > > > >> > > > > > >> > [email protected]>
> >>  > > > > > >> > > > > > >> > wrote:
> >>  > > > > > >> > > > > > >> >
> >>  > > > > > >> > > > > > >> > > I'll second my preference for the first
> >>  > option. I
> >>  > > > > think
> >>  > > > > > >> the
> >>  > > > > > >> > > > > ability
> >>  > > > > > >> > > > > > to
> >>  > > > > > >> > > > > > >> > use
> >>  > > > > > >> > > > > > >> > > Stellar filters to customize indexing
> would
> >>  be
> >>  > a
> >>  > > > big
> >>  > > > > > win.
> >>  > > > > > >> > > > > > >> > >
> >>  > > > > > >> > > > > > >> > > I'm glad Matt brought up the point about
> data
> >>  > > lake
> >>  > > > > and
> >>  > > > > > >> CEP.
> >>  > > > > > >> > I
> >>  > > > > > >> > > > > think
> >>  > > > > > >> > > > > > >> this
> >>  > > > > > >> > > > > > >> > is
> >>  > > > > > >> > > > > > >> > > a really important use case that we need
> to
> >>  > > > consider.
> >>  > > > > > >> Take a
> >>  > > > > > >> > > > > simple
> >>  > > > > > >> > > > > > >> > > example... If I have data coming in from
> 3
> >>  > > > different
> >>  > > > > > >> > firewall
> >>  > > > > > >> > > > > > vendors
> >>  > > > > > >> > > > > > >> > and 2
> >>  > > > > > >> > > > > > >> > > different web proxy/url filtering vendors
> >>  and I
> >>  > > > want
> >>  > > > > to
> >>  > > > > > >> be
> >>  > > > > > >> > > able
> >>  > > > > > >> > > > to
> >>  > > > > > >> > > > > > >> > analyze
> >>  > > > > > >> > > > > > >> > > that data set, I need the data to be
> indexed
> >>  > all
> >>  > > > > > together
> >>  > > > > > >> > > > (likely
> >>  > > > > > >> > > > > in
> >>  > > > > > >> > > > > > >> > HDFS)
> >>  > > > > > >> > > > > > >> > > and to have a normalized schema such
> that IP
> >>  > > > address,
> >>  > > > > > >> URL,
> >>  > > > > > >> > and
> >>  > > > > > >> > > > > user
> >>  > > > > > >> > > > > > >> name
> >>  > > > > > >> > > > > > >> > > (to take a few) can be easily queried and
> >>  > > > > aggregated. I
> >>  > > > > > >> can
> >>  > > > > > >> > > also
> >>  > > > > > >> > > > > > >> envision
> >>  > > > > > >> > > > > > >> > > scenarios where I would want to index
> data
> >>  > based
> >>  > > on
> >>  > > > > > >> > attributes
> >>  > > > > > >> > > > > other
> >>  > > > > > >> > > > > > >> than
> >>  > > > > > >> > > > > > >> > > sensor, business unit or subsidiary for
> >>  > example.
> >>  > > > > > >> > > > > > >> > >
> >>  > > > > > >> > > > > > >> > > I've been wanting to propose extending
> our 7
> >>  > > > standard
> >>  > > > > > >> fields
> >>  > > > > > >> > to
> >>  > > > > > >> > > > > > include
> >>  > > > > > >> > > > > > >> > > things like URL and user. Is there
> community
> >>  > > > > > >> > interest/support
> >>  > > > > > >> > > > for
> >>  > > > > > >> > > > > > >> moving
> >>  > > > > > >> > > > > > >> > in
> >>  > > > > > >> > > > > > >> > > that direction? If so, I'll start a new
> >>  thread.
> >>  > > > > > >> > > > > > >> > >
> >>  > > > > > >> > > > > > >> > > Thanks!
> >>  > > > > > >> > > > > > >> > >
> >>  > > > > > >> > > > > > >> > > -Kyle
> >>  > > > > > >> > > > > > >> > >
> >>  > > > > > >> > > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt
> Foley <
> >>  > > > > > >> > [email protected]
> >>  > > > > > >> > > >
> >>  > > > > > >> > > > > > wrote:
> >>  > > > > > >> > > > > > >> > >
> >>  > > > > > >> > > > > > >> > > > Ah, I see. If overriding the default
> index
> >>  > name
> >>  > > > > > allows
> >>  > > > > > >> > > using
> >>  > > > > > >> > > > > the
> >>  > > > > > >> > > > > > >> same
> >>  > > > > > >> > > > > > >> > > > name for multiple sensors, then the
> goal
> >>  can
> >>  > be
> >>  > > > > > >> achieved.
> >>  > > > > > >> > > > > > >> > > > Thanks,
> >>  > > > > > >> > > > > > >> > > > --Matt
> >>  > > > > > >> > > > > > >> > > >
> >>  > > > > > >> > > > > > >> > > >
> >>  > > > > > >> > > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <
> >>  > > > > > >> [email protected]>
> >>  > > > > > >> > > > wrote:
> >>  > > > > > >> > > > > > >> > > >
> >>  > > > > > >> > > > > > >> > > > Oh, you could! Let's say you have a
> syslog
> >>  > > parser
> >>  > > > > > >> > with
> >>  > > > > > >> > > > data
> >>  > > > > > >> > > > > > from
> >>  > > > > > >> > > > > > >> > > > sources 1
> >>  > > > > > >> > > > > > >> > > > 2 and 3. You'd end up with one kafka
> queue
> >>  > > with 3
> >>  > > > > > >> > > parsers
> >>  > > > > > >> > > > > > >> attached
> >>  > > > > > >> > > > > > >> > > to
> >>  > > > > > >> > > > > > >> > > > that
> >>  > > > > > >> > > > > > >> > > > queue, each picking apart the messages
> from
> >>  > > source
> >>  > > > > > >> 1, 2
> >>  > > > > > >> > > and
> >>  > > > > > >> > > > > 3.
> >>  > > > > > >> > > > > > >> > They'd
> >>  > > > > > >> > > > > > >> > > > go
> >>  > > > > > >> > > > > > >> > > > through separate enrichment and into
> the
> >>  > > indexing
> >>  > > > > > >> > > > topology.
> >>  > > > > > >> > > > > > In
> >>  > > > > > >> > > > > > >> the
> >>  > > > > > >> > > > > > >> > > > indexing topology, you could specify
> the
> >>  same
> >>  > > > index
> >>  > > > > > >> > name
> >>  > > > > > >> > > > > > "syslog"
> >>  > > > > > >> > > > > > >> > and
> >>  > > > > > >> > > > > > >> > > > all
> >>  > > > > > >> > > > > > >> > > > of the messages go into the same index
> for
> >>  > CEP
> >>  > > > > > >> > querying
> >>  > > > > > >> > > if
> >>  > > > > > >> > > > > so
> >>  > > > > > >> > > > > > >> > > desired.
> >>  > > > > > >> > > > > > >> > > >
> >>  > > > > > >> > > > > > >> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt
> >>  Foley <
> >>  > > > > > >> > > > > [email protected]
> >>  > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > wrote:
> >>  > > > > > >> > > > > > >> > > >
> >>  > > > > > >> > > > > > >> > > > > Syslog is hell on parsers – I know, I
> >>  > worked
> >>  > > at
> >>  > > > > > >> > > LogLogic
> >>  > > > > > >> > > > > in
> >>  > > > > > >> > > > > > a
> >>  > > > > > >> > > > > > >> > > > previous
> >>  > > > > > >> > > > > > >> > > > > life. It makes perfect sense to route
> >>  > > different
> >>  > > > > > >> > lines
> >>  > > > > > >> > > > > from
> >>  > > > > > >> > > > > > >> > syslog
> >>  > > > > > >> > > > > > >> > > > through
> >>  > > > > > >> > > > > > >> > > > > different appropriate parsers. But a
> lot
> >>  of
> >>  > > > what
> >>  > > > > > >> > the
> >>  > > > > > >> > > > > > parsers
> >>  > > > > > >> > > > > > >> do
> >>  > > > > > >> > > > > > >> > is
> >>  > > > > > >> > > > > > >> > > > > identify consistent subsets of
> metadata
> >>  and
> >>  > > > > > >> annotate
> >>  > > > > > >> > > it
> >>  > > > > > >> > > > –
> >>  > > > > > >> > > > > > eg,
> >>  > > > > > >> > > > > > >> > > > src_ip_addr,
> >>  > > > > > >> > > > > > >> > > > > event timestamps, etc. Once those
> >>  metadata
> >>  > > are
> >>  > > > > > >> > > > annotated
> >>  > > > > > >> > > > > > and
> >>  > > > > > >> > > > > > >> > > > available
> >>  > > > > > >> > > > > > >> > > > > with common field names, why doesn’t
> it
> >>  > make
> >>  > > > > > >> sense
> >>  > > > > > >> > to
> >>  > > > > > >> > > > > index
> >>  > > > > > >> > > > > > the
> >>  > > > > > >> > > > > > >> > > > messages
> >>  > > > > > >> > > > > > >> > > > > together, for CEP querying? I think
> >>  Splunk
> >>  > > has
> >>  > > > > > >> > > > > illustrated
> >>  > > > > > >> > > > > > >> this
> >>  > > > > > >> > > > > > >> > > > model.
> >>  > > > > > >> > > > > > >> > > > >
> >>  > > > > > >> > > > > > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <
> >>  > > > > > >> > > [email protected]
> >>  > > > > > >> > > > >
> >>  > > > > > >> > > > > > >> wrote:
> >>  > > > > > >> > > > > > >> > > > >
> >>  > > > > > >> > > > > > >> > > > > yeah, I mean, honestly, I think the
> >>  > approach
> >>  > > > > > >> > that
> >>  > > > > > >> > > > > we've
> >>  > > > > > >> > > > > > >> taken
> >>  > > > > > >> > > > > > >> > > for
> >>  > > > > > >> > > > > > >> > > > > sources
> >>  > > > > > >> > > > > > >> > > > > which aggregate different types of
> data
> >>  is
> >>  > to
> >>  > > > > > >> > > > provide
> >>  > > > > > >> > > > > > >> filters
> >>  > > > > > >> > > > > > >> > > at
> >>  > > > > > >> > > > > > >> > > > the
> >>  > > > > > >> > > > > > >> > > > > parser
> >>  > > > > > >> > > > > > >> > > > > level and have multiple parser
> topologies
> >>  > > > > > >> (with
> >>  > > > > > >> > > > > > different,
> >>  > > > > > >> > > > > > >> > > > possibly
> >>  > > > > > >> > > > > > >> > > > > mutually exclusive filters) running.
> This
> >>  > > > > > >> would
> >>  > > > > > >> > > be
> >>  > > > > > >> > > > a
> >>  > > > > > >> > > > > > >> > > completely
> >>  > > > > > >> > > > > > >> > > > > separate
> >>  > > > > > >> > > > > > >> > > > > sensor. Imagine a syslog data source
> that
> >>  > > > > > >> > > > aggregates
> >>  > > > > > >> > > > > > and
> >>  > > > > > >> > > > > > >> you
> >>  > > > > > >> > > > > > >> > > > want to
> >>  > > > > > >> > > > > > >> > > > > pick
> >>  > > > > > >> > > > > > >> > > > > apart certain pieces of messages.
> This is
> >>  > > > > > >> why
> >>  > > > > > >> > the
> >>  > > > > > >> > > > > > initial
> >>  > > > > > >> > > > > > >> > > > thought and
> >>  > > > > > >> > > > > > >> > > > > architecture was one index per
> sensor.
> >>  > > > > > >> > > > > > >> > > > >
> >>  > > > > > >> > > > > > >> > > > > On Thu, Jan 12, 2017 at 5:55 PM, Matt
> >>  > Foley <
> >>  > > > > > >> > > > > > >> > [email protected]>
> >>  > > > > > >> > > > > > >> > > > wrote:
> >>  > > > > > >> > > > > > >> > > > >
> >>  > > > > > >> > > > > > >> > > > > > I’m thinking that CEP (Complex
> Event
> >>  > > > > > >> > Processing)
> >>  > > > > > >> > > > is
> >>  > > > > > >> > > > > > >> > contrary
> >>  > > > > > >> > > > > > >> > > > to the
> >>  > > > > > >> > > > > > >> > > > > idea
> >>  > > > > > >> > > > > > >> > > > > > of silo-ing data per sensor.
> >>  > > > > > >> > > > > > >> > > > > > Now it’s true that some of those
> >>  sensors
> >>  > > > > > >> are
> >>  > > > > > >> > > > already
> >>  > > > > > >> > > > > > >> > > > aggregating
> >>  > > > > > >> > > > > > >> > > > > data from
> >>  > > > > > >> > > > > > >> > > > > > multiple sources, so maybe I’m
> wrong
> >>  > here.
> >>  > > > > > >> > > > > > >> > > > > > But it just seems to me that the
> “data
> >>  > > > > > >> lake”
> >>  > > > > > >> > > > > insights
> >>  > > > > > >> > > > > > >> come
> >>  > > > > > >> > > > > > >> > > from
> >>  > > > > > >> > > > > > >> > > > > being able
> >>  > > > > > >> > > > > > >> > > > > > to make decisions over the whole
> mass
> >>  of
> >>  > > > > > >> data
> >>  > > > > > >> > > > rather
> >>  > > > > > >> > > > > > than
> >>  > > > > > >> > > > > > >> > > just
> >>  > > > > > >> > > > > > >> > > > > vertical
> >>  > > > > > >> > > > > > >> > > > > > slices of it.
> >>  > > > > > >> > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >> > > > > > On 1/12/17, 2:15 PM, "Casey
> Stella" <
> >>  > > > > > >> > > > > > [email protected]>
> >>  > > > > > >> > > > > > >> > > > wrote:
> >>  > > > > > >> > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >> > > > > > Hey Matt,
> >>  > > > > > >> > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >> > > > > > Thanks for the comment!
> >>  > > > > > >> > > > > > >> > > > > > 1. At the moment, we only have one
> >>  > > > > > >> index
> >>  > > > > > >> > > name,
> >>  > > > > > >> > > > > the
> >>  > > > > > >> > > > > > >> > > default
> >>  > > > > > >> > > > > > >> > > > of
> >>  > > > > > >> > > > > > >> > > > > which is
> >>  > > > > > >> > > > > > >> > > > > > the
> >>  > > > > > >> > > > > > >> > > > > > sensor name but that's entirely up
> to
> >>  > > > > > >> the
> >>  > > > > > >> > > > user.
> >>  > > > > > >> > > > > > This
> >>  > > > > > >> > > > > > >> > is
> >>  > > > > > >> > > > > > >> > > > sensor
> >>  > > > > > >> > > > > > >> > > > > > specific,
> >>  > > > > > >> > > > > > >> > > > > > so it'd be a separate config for
> each
> >>  > > > > > >> > > sensor.
> >>  > > > > > >> > > > > If
> >>  > > > > > >> > > > > > we
> >>  > > > > > >> > > > > > >> > want
> >>  > > > > > >> > > > > > >> > > > to
> >>  > > > > > >> > > > > > >> > > > > build
> >>  > > > > > >> > > > > > >> > > > > > multiple
> >>  > > > > > >> > > > > > >> > > > > > indices per sensor, we'd have to
> think
> >>  > > > > > >> > > > carefully
> >>  > > > > > >> > > > > > >> about
> >>  > > > > > >> > > > > > >> > > how
> >>  > > > > > >> > > > > > >> > > > to do
> >>  > > > > > >> > > > > > >> > > > > that
> >>  > > > > > >> > > > > > >> > > > > > and
> >>  > > > > > >> > > > > > >> > > > > > would be a bigger undertaking. I
> >>  > > > > > >> guess I
> >>  > > > > > >> > > can
> >>  > > > > > >> > > > > see
> >>  > > > > > >> > > > > > the
> >>  > > > > > >> > > > > > >> > > use,
> >>  > > > > > >> > > > > > >> > > > though
> >>  > > > > > >> > > > > > >> > > > > > (redirect
> >>  > > > > > >> > > > > > >> > > > > > messages to one index vs another
> based
> >>  > > > > > >> on
> >>  > > > > > >> > a
> >>  > > > > > >> > > > > > predicate
> >>  > > > > > >> > > > > > >> > for
> >>  > > > > > >> > > > > > >> > > > a given
> >>  > > > > > >> > > > > > >> > > > > > sensor).
> >>  > > > > > >> > > > > > >> > > > > > Anyway, not where I was originally
> >>  > > > > > >> > thinking
> >>  > > > > > >> > > > that
> >>  > > > > > >> > > > > > this
> >>  > > > > > >> > > > > > >> > > > discussion
> >>  > > > > > >> > > > > > >> > > > > would
> >>  > > > > > >> > > > > > >> > > > > > go,
> >>  > > > > > >> > > > > > >> > > > > > but it's an interesting point.
> >>  > > > > > >> > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >> > > > > > 2. I hadn't thought through the
> >>  > > > > > >> > > implementation
> >>  > > > > > >> > > > > > quite
> >>  > > > > > >> > > > > > >> > yet,
> >>  > > > > > >> > > > > > >> > > > but we
> >>  > > > > > >> > > > > > >> > > > > don't
> >>  > > > > > >> > > > > > >> > > > > > actually have a splitter bolt in
> that
> >>  > > > > > >> > > > topology,
> >>  > > > > > >> > > > > > just
> >>  > > > > > >> > > > > > >> a
> >>  > > > > > >> > > > > > >> > > > spout
> >>  > > > > > >> > > > > > >> > > > > that goes
> >>  > > > > > >> > > > > > >> > > > > > to
> >>  > > > > > >> > > > > > >> > > > > > the elasticsearch writer and also
> to
> >>  > > > > > >> the
> >>  > > > > > >> > > hdfs
> >>  > > > > > >> > > > > > writer.
> >>  > > > > > >> > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >> > > > > > On Thu, Jan 12, 2017 at 4:52 PM,
> Matt
> >>  > > > > > >> > Foley
> >>  > > > > > >> > > <
> >>  > > > > > >> > > > > > >> > > > [email protected]>
> >>  > > > > > >> > > > > > >> > > > > wrote:
> >>  > > > > > >> > > > > > >> > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > Casey, good to have controls like
> >>  > > > > > >> this.
> >>  > > > > > >> > > > > Couple
> >>  > > > > > >> > > > > > >> > > > questions:
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > 1. Regarding the “index” :
> “squid”
> >>  > > > > > >> > > > name/value
> >>  > > > > > >> > > > > > pair,
> >>  > > > > > >> > > > > > >> > is
> >>  > > > > > >> > > > > > >> > > > the
> >>  > > > > > >> > > > > > >> > > > > index name
> >>  > > > > > >> > > > > > >> > > > > > > expected to always be a sensor
> >>  > > > > > >> name? Or
> >>  > > > > > >> > > is
> >>  > > > > > >> > > > > the
> >>  > > > > > >> > > > > > >> given
> >>  > > > > > >> > > > > > >> > > > json
> >>  > > > > > >> > > > > > >> > > > > structure
> >>  > > > > > >> > > > > > >> > > > > > > subordinate to a sensor name in
> >>  > > > > > >> > zookeeper?
> >>  > > > > > >> > > > Or
> >>  > > > > > >> > > > > > can
> >>  > > > > > >> > > > > > >> we
> >>  > > > > > >> > > > > > >> > > > build
> >>  > > > > > >> > > > > > >> > > > > arbitrary
> >>  > > > > > >> > > > > > >> > > > > > > indexes with this new
> specification,
> >>  > > > > > >> > > > > > independent of
> >>  > > > > > >> > > > > > >> > > > sensor?
> >>  > > > > > >> > > > > > >> > > > > Should
> >>  > > > > > >> > > > > > >> > > > > > there
> >>  > > > > > >> > > > > > >> > > > > > > actually be a list of “indexes”,
> ie
> >>  > > > > > >> > > > > > >> > > > > > > { "indexes" : [
> >>  > > > > > >> > > > > > >> > > > > > > {"index" : "name1",
> >>  > > > > > >> > > > > > >> > > > > > > …
> >>  > > > > > >> > > > > > >> > > > > > > },
> >>  > > > > > >> > > > > > >> > > > > > > {"index" : "name2",
> >>  > > > > > >> > > > > > >> > > > > > > …
> >>  > > > > > >> > > > > > >> > > > > > > } ]
> >>  > > > > > >> > > > > > >> > > > > > > }
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > 2. Would the filtering / writer
> >>  > > > > > >> > selection
> >>  > > > > > >> > > > > logic
> >>  > > > > > >> > > > > > >> take
> >>  > > > > > >> > > > > > >> > > > place in
> >>  > > > > > >> > > > > > >> > > > > the
> >>  > > > > > >> > > > > > >> > > > > > indexing
> >>  > > > > > >> > > > > > >> > > > > > > topology splitter bolt? Seems
> like
> >>  > > > > > >> that
> >>  > > > > > >> > > > would
> >>  > > > > > >> > > > > > have
> >>  > > > > > >> > > > > > >> > the
> >>  > > > > > >> > > > > > >> > > > > smallest
> >>  > > > > > >> > > > > > >> > > > > > impact on
> >>  > > > > > >> > > > > > >> > > > > > > current implementation, no?
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > Sorry if these are already
> answered
> >>  > > > > > >> in
> >>  > > > > > >> > > > > PR-415, I
> >>  > > > > > >> > > > > > >> > > haven’t
> >>  > > > > > >> > > > > > >> > > > had
> >>  > > > > > >> > > > > > >> > > > > time to
> >>  > > > > > >> > > > > > >> > > > > > > review that one yet.
> >>  > > > > > >> > > > > > >> > > > > > > Thanks,
> >>  > > > > > >> > > > > > >> > > > > > > --Matt
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > On 1/12/17, 12:55 PM, "Michael
> >>  > > > > > >> > Miklavcic"
> >>  > > > > > >> > > <
> >>  > > > > > >> > > > > > >> > > > > > [email protected]>
> >>  > > > > > >> > > > > > >> > > > > > > wrote:
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > I like the flexibility and
> >>  > > > > > >> > > > expressibility
> >>  > > > > > >> > > > > of
> >>  > > > > > >> > > > > > >> the
> >>  > > > > > >> > > > > > >> > > > first
> >>  > > > > > >> > > > > > >> > > > > option
> >>  > > > > > >> > > > > > >> > > > > > with
> >>  > > > > > >> > > > > > >> > > > > > > Stellar
> >>  > > > > > >> > > > > > >> > > > > > > filters.
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > M
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > On Thu, Jan 12, 2017 at 1:51 PM,
> >>  > > > > > >> > Casey
> >>  > > > > > >> > > > > > Stella <
> >>  > > > > > >> > > > > > >> > > > > > [email protected]>
> >>  > > > > > >> > > > > > >> > > > > > > wrote:
> >>  > > > > > >> > > > > > >> > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > As of METRON-652 <
> >>  > > > > > >> > > > > > https://github.com/apache/
> >>  > > > > > >> > > > > > >> > > > > > > incubator-metron/pull/415>, we
> >>  > > > > > >> > > > > > >> > > > > > > > will have decoupled the
> >>  > > > > > >> indexing
> >>  > > > > > >> > > > > > >> configuration
> >>  > > > > > >> > > > > > >> > > > from the
> >>  > > > > > >> > > > > > >> > > > > > enrichment
> >>  > > > > > >> > > > > > >> > > > > > > > configuration. As an immediate
> >>  > > > > > >> > > > > follow-up
> >>  > > > > > >> > > > > > to
> >>  > > > > > >> > > > > > >> > > that,
> >>  > > > > > >> > > > > > >> > > > I'd
> >>  > > > > > >> > > > > > >> > > > > like to
> >>  > > > > > >> > > > > > >> > > > > > > provide the
> >>  > > > > > >> > > > > > >> > > > > > > > ability to turn off and on
> >>  > > > > > >> writers
> >>  > > > > > >> > > via
> >>  > > > > > >> > > > > the
> >>  > > > > > >> > > > > > >> > > > configs. I'd
> >>  > > > > > >> > > > > > >> > > > > like
> >>  > > > > > >> > > > > > >> > > > > > to get
> >>  > > > > > >> > > > > > >> > > > > > > some
> >>  > > > > > >> > > > > > >> > > > > > > > community feedback on how the
> >>  > > > > > >> > > > > > functionality
> >>  > > > > > >> > > > > > >> > > should
> >>  > > > > > >> > > > > > >> > > > work,
> >>  > > > > > >> > > > > > >> > > > > if
> >>  > > > > > >> > > > > > >> > > > > > y'all are
> >>  > > > > > >> > > > > > >> > > > > > > > amenable. :)
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > As of now, we have 3 possible
> >>  > > > > > >> > > writers
> >>  > > > > > >> > > > > > which
> >>  > > > > > >> > > > > > >> can
> >>  > > > > > >> > > > > > >> > > be
> >>  > > > > > >> > > > > > >> > > > used
> >>  > > > > > >> > > > > > >> > > > > in the
> >>  > > > > > >> > > > > > >> > > > > > > indexing
> >>  > > > > > >> > > > > > >> > > > > > > > topology:
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > - Solr
> >>  > > > > > >> > > > > > >> > > > > > > > - Elasticsearch
> >>  > > > > > >> > > > > > >> > > > > > > > - HDFS
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > HDFS is always used,
> >>  > > > > > >> elasticsearch
> >>  > > > > > >> > > or
> >>  > > > > > >> > > > > > solr is
> >>  > > > > > >> > > > > > >> > > used
> >>  > > > > > >> > > > > > >> > > > > depending
> >>  > > > > > >> > > > > > >> > > > > > on how
> >>  > > > > > >> > > > > > >> > > > > > > you
> >>  > > > > > >> > > > > > >> > > > > > > > start the indexing topology.
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > A couple of proposals come to
> >>  > > > > > >> mind
> >>  > > > > > >> > > > > > >> immediately:
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > *Index Filtering*
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > You would be able to specify a
> >>  > > > > > >> > > filter
> >>  > > > > > >> > > > as
> >>  > > > > > >> > > > > > >> > defined
> >>  > > > > > >> > > > > > >> > > > by a
> >>  > > > > > >> > > > > > >> > > > > stellar
> >>  > > > > > >> > > > > > >> > > > > > > statement
> >>  > > > > > >> > > > > > >> > > > > > > > (likely a reuse of the
> >>  > > > > > >> > StellarFilter
> >>  > > > > > >> > > > > that
> >>  > > > > > >> > > > > > >> > exists
> >>  > > > > > >> > > > > > >> > > > in the
> >>  > > > > > >> > > > > > >> > > > > > Parsers)
> >>  > > > > > >> > > > > > >> > > > > > > which
> >>  > > > > > >> > > > > > >> > > > > > > > would allow you to indicate on
> >>  > > > > > >> a
> >>  > > > > > >> > > > > > >> > > > message-by-message basis
> >>  > > > > > >> > > > > > >> > > > > > whether or
> >>  > > > > > >> > > > > > >> > > > > > > not to
> >>  > > > > > >> > > > > > >> > > > > > > > write the message.
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > The semantics of this would be
> >>  > > > > > >> as
> >>  > > > > > >> > > > > follows:
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > - Default (i.e.
> >>  > > > > > >> unspecified) is
> >>  > > > > > >> > > to
> >>  > > > > > >> > > > > pass
> >>  > > > > > >> > > > > > >> > > > everything
> >>  > > > > > >> > > > > > >> > > > > through
> >>  > > > > > >> > > > > > >> > > > > > (hence
> >>  > > > > > >> > > > > > >> > > > > > > > backwards compatible with
> >>  > > > > > >> the
> >>  > > > > > >> > > > current
> >>  > > > > > >> > > > > > >> > default
> >>  > > > > > >> > > > > > >> > > > config).
> >>  > > > > > >> > > > > > >> > > > > > > > - Messages which have the
> >>  > > > > > >> > > > associated
> >>  > > > > > >> > > > > > >> stellar
> >>  > > > > > >> > > > > > >> > > > statement
> >>  > > > > > >> > > > > > >> > > > > > evaluate
> >>  > > > > > >> > > > > > >> > > > > > > to true
> >>  > > > > > >> > > > > > >> > > > > > > > for the writer type will be
> >>  > > > > > >> > > > written,
> >>  > > > > > >> > > > > > >> > otherwise
> >>  > > > > > >> > > > > > >> > > > not.
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > Sample indexing config which
> >>  > > > > > >> would
> >>  > > > > > >> > > > write
> >>  > > > > > >> > > > > > out
> >>  > > > > > >> > > > > > >> no
> >>  > > > > > >> > > > > > >> > > > messages
> >>  > > > > > >> > > > > > >> > > > > to
> >>  > > > > > >> > > > > > >> > > > > > HDFS and
> >>  > > > > > >> > > > > > >> > > > > > > write
> >>  > > > > > >> > > > > > >> > > > > > > > out only messages containing a
> >>  > > > > > >> > field
> >>  > > > > > >> > > > > > called
> >>  > > > > > >> > > > > > >> > > > "field1":
> >>  > > > > > >> > > > > > >> > > > > > > > {
> >>  > > > > > >> > > > > > >> > > > > > > > "index" : "squid"
> >>  > > > > > >> > > > > > >> > > > > > > > ,"batchSize" : 100
> >>  > > > > > >> > > > > > >> > > > > > > > ,"filters" : {
> >>  > > > > > >> > > > > > >> > > > > > > > "HDFS" : "false"
> >>  > > > > > >> > > > > > >> > > > > > > > ,"ES" : "exists(field1)"
> >>  > > > > > >> > > > > > >> > > > > > > > }
> >>  > > > > > >> > > > > > >> > > > > > > > }
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > *Index On/Off Switch*
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > A simpler solution would be to
> >>  > > > > > >> > just
> >>  > > > > > >> > > > > > provide a
> >>  > > > > > >> > > > > > >> > > list
> >>  > > > > > >> > > > > > >> > > > of
> >>  > > > > > >> > > > > > >> > > > > writers
> >>  > > > > > >> > > > > > >> > > > > > to
> >>  > > > > > >> > > > > > >> > > > > > > write
> >>  > > > > > >> > > > > > >> > > > > > > > messages. The semantics would
> >>  > > > > > >> be
> >>  > > > > > >> > as
> >>  > > > > > >> > > > > > follows:
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > - If the list is
> >>  > > > > > >> unspecified,
> >>  > > > > > >> > > then
> >>  > > > > > >> > > > > the
> >>  > > > > > >> > > > > > >> > default
> >>  > > > > > >> > > > > > >> > > > is to
> >>  > > > > > >> > > > > > >> > > > > write
> >>  > > > > > >> > > > > > >> > > > > > all
> >>  > > > > > >> > > > > > >> > > > > > > messages
> >>  > > > > > >> > > > > > >> > > > > > > > for every writer in the
> >>  > > > > > >> > indexing
> >>  > > > > > >> > > > > > topology
> >>  > > > > > >> > > > > > >> > > > > > > > - If the list is specified,
> >>  > > > > > >> > then
> >>  > > > > > >> > > a
> >>  > > > > > >> > > > > > writer
> >>  > > > > > >> > > > > > >> > will
> >>  > > > > > >> > > > > > >> > > > write
> >>  > > > > > >> > > > > > >> > > > > all
> >>  > > > > > >> > > > > > >> > > > > > messages
> >>  > > > > > >> > > > > > >> > > > > > > if and
> >>  > > > > > >> > > > > > >> > > > > > > > only if it is named in the
> >>  > > > > > >> > list.
> >>  > > > > > >> > > > > > >> > > > > > > >
> >>  > > > > > >> > > > > > >> > > > > > > > Sample indexing config which
> >>  > > > > > >> turns
> >>  > > > > > >> > > off
> >>  > > > > > >> > > > > > HDFS
> >>  > > > > > >> > > > > > >> and
> >>  > > > > > >> > > > > > >> > > > keeps on
> >>  > > > > > >> > > > > > >> > > > > > > Elasticsearch:
> >>  > > > > > >> > > > > > >> > > > > > > > {
> >>  > > > > > >> > > > > > >> > > > > > > > "index" : "squid"
> >>  > > > > > >> > > > > > >> > > > > > > > ,"batchSize" : 100
> >>  > > > > > >> > > > > > >> > > > > > > > ,"writers" : [ "ES" ]
> >>  > >
> >>  > > --
> >>  >
> >>  > Jon
> >>  >
> >>  > Sent from my mobile device
> >>  >
> >
> > --
> > Nick Allen <[email protected]>
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
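
For reference, the semantics of the two proposals quoted above ("Index Filtering" and "Index On/Off Switch") can be sketched roughly as below. This is only an illustration under stated assumptions, not Metron code: `eval_filter` is a hypothetical toy stand-in for a Stellar evaluator that understands only `true`, `false` and `exists(field)`, and `should_write` is a made-up helper name. Honoring both a "writers" list and a "filters" map in the same config is also an assumption made for the sketch; the thread treats them as alternatives.

```python
import re


def eval_filter(expr, message):
    """Toy stand-in for a Stellar evaluator (assumption, not the real thing).

    Supports only the literals 'true' / 'false' and 'exists(<field>)'.
    """
    expr = expr.strip()
    if expr == "true":
        return True
    if expr == "false":
        return False
    match = re.fullmatch(r"exists\((\w+)\)", expr)
    if match:
        return match.group(1) in message
    raise ValueError(f"unsupported filter expression: {expr!r}")


def should_write(config, writer, message):
    """Decide whether `writer` should write `message` for one sensor config."""
    # On/off switch: an unspecified list means every writer is on;
    # a specified list turns on only the writers it names.
    writers = config.get("writers")
    if writers is not None and writer not in writers:
        return False

    # Filtering: an unspecified filter passes everything through;
    # otherwise the message is written only if the filter evaluates to true.
    expr = config.get("filters", {}).get(writer)
    if expr is None:
        return True
    return eval_filter(expr, message)


# Configs modeled on the samples in the quoted proposal.
filter_config = {
    "index": "squid",
    "batchSize": 100,
    "filters": {"HDFS": "false", "ES": "exists(field1)"},
}
switch_config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
```

With `filter_config`, nothing goes to HDFS and only messages carrying `field1` go to ES; with `switch_config`, HDFS is off and ES writes everything. A config with neither key keeps the current behavior of writing to every writer.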
