The problem with throwing up a warning is that this is a sensor-specific configuration, and the indexing topology does not know all of the sensors at topology start time. Furthermore, you can start a new sensor while the topology is already running.
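To illustrate why a start-up check can't cover this: the set of sensors is only discovered as messages arrive, so any warning has to be raised lazily, the first time an unconfigured sensor is seen. A rough Python sketch of that idea follows. The names (`LazySensorCheck`, `on_message`, `sensor_configs`) are made up for illustration and are not Metron's actual API.

```python
import logging

log = logging.getLogger("indexing")


class LazySensorCheck:
    """Hypothetical helper: warns the first time an unconfigured sensor appears."""

    def __init__(self, sensor_configs):
        # sensor name -> writer config; may be updated while the topology runs
        self.sensor_configs = sensor_configs
        self._warned = set()

    def on_message(self, sensor):
        """Return True if the sensor has an explicit config, warning once if not."""
        if sensor in self.sensor_configs:
            return True
        if sensor not in self._warned:
            self._warned.add(sensor)
            log.warning("no explicit indexing config for sensor '%s'", sensor)
        return False


check = LazySensorCheck({"bro": {"elasticsearch": {}}})
check.on_message("bro")    # known when the topology started
check.on_message("squid")  # new sensor started mid-run: warned once
check.on_message("squid")  # already warned, stays quiet
```

If a config for `squid` were pushed later, it would show up in `sensor_configs` and the check would start returning True without a restart.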
I'd suggest a compromise: have the indexing topology use the defaults for writing, and log a warning to the Storm UI that the sensor has not been explicitly configured, recommending that this be remedied immediately.

On Mon, Jan 16, 2017 at 4:01 PM, James Sirota <[email protected]> wrote:

> The explicit on/off seems like a good option to have. This way I don't have to completely remove the config block in order for me to test something. I think if the config for the writer is unspecified we should throw up a warning.
>
> 16.01.2017, 08:54, "Nick Allen" <[email protected]>:
> >> To recap, what I am +1 on is Nick's proposed syntax with the following modifications:
> >> 1. An explicit enabled field
> >> 2. A default on for unspecified to match current semantics
> >
> > I'm +1 on all of this.
> >
> > On Sat, Jan 14, 2017 at 10:51 AM, Casey Stella <[email protected]> wrote:
> >
> >> I'm +1 on an explicit enabled property and a filter (or when) property. I think we are zeroing in on a decent design, so that is good.
> >>
> >> To recap, what I am +1 on is Nick's proposed syntax with the following modifications:
> >> 1. An explicit enabled field
> >> 2. A default on for unspecified to match current semantics
> >>
> >> Casey
> >> On Sat, Jan 14, 2017 at 10:45 [email protected] <[email protected]> wrote:
> >>
> >> > This has the additional benefit of doing something like below when you want to temporarily disable the hdfs writer, but don't want to remove the settings.
> >> > This removes the need to store the path and batchSize (and many additional settings) somewhere else so they can be brought back in when you want to re-enable it, which is a nice workflow attribute for the end user:
> >> >
> >> > {
> >> >   'elasticsearch': {
> >> >     'enabled': 'true',
> >> >     'index': 'foo',
> >> >     'batchSize': 100
> >> >   },
> >> >   'hdfs': {
> >> >     'enabled': 'false',
> >> >     'path': '/foo/bar/...',
> >> >     'batchSize': 100
> >> >   },
> >> >   'solr': {
> >> >     'enabled': 'false'
> >> >   }
> >> > }
> >> >
> >> > Jon
> >> >
> >> > On Sat, Jan 14, 2017 at 9:24 AM [email protected] <[email protected]> wrote:
> >> >
> >> > > I similarly have a concern there because I prefer being as explicit as possible, which makes things easier to pick up for new users. Using my example from earlier this could look like specifying when(false), but an even better and more obvious approach may be to use enabled(false). So the current simple default would be:
> >> > >
> >> > > {
> >> > >   'elasticsearch': { 'enabled': 'true' },
> >> > >   'hdfs': { 'enabled': 'true' },
> >> > >   'solr': { 'enabled': 'false' }
> >> > > }
> >> > >
> >> > > And to use ES with some overrides but not HDFS or solr it would look like:
> >> > >
> >> > > {
> >> > >   'elasticsearch': {
> >> > >     'enabled': 'true',
> >> > >     'index': 'foo',
> >> > >     'batchSize': 100
> >> > >   },
> >> > >   'hdfs': {
> >> > >     'enabled': 'false'
> >> > >   },
> >> > >   'solr': {
> >> > >     'enabled': 'false'
> >> > >   }
> >> > > }
> >> > >
> >> > > Jon
> >> > >
> >> > > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <[email protected]> wrote:
> >> > >
> >> > > One thing that I thought of that I very strenuously do not like in Nick's proposal is that if a writer config is not specified then it is turned off (I think; if I misunderstood let me know).
> >> > > In the situation where we have a new sensor, right now if there is no index config and no enrichment config, it still passes through to the index using defaults. In this new scheme it would not. This changes the default semantics for the system and I think it changes it for the worse.
> >> > >
> >> > > I would strongly prefer an on-by-default indexing config as we have now.
> >> > > On Fri, Jan 13, 2017 at 17:13 Casey Stella <[email protected]> wrote:
> >> > >
> >> > > > One thing that I really like about Nick's suggestion is that it allows writer-specific configs in a clear and simple way. It is more complex for the default case (all writers write to indices named the same thing with a fixed batch size), which I do not like, but maybe it's worth the compromise to make it less complex for the advanced case.
> >> > > >
> >> > > > Thanks a lot for the suggestion, Nick, it's interesting; I'm beginning to lean your way.
> >> > > >
> >> > > > On Fri, Jan 13, 2017 at 2:51 PM, [email protected] <[email protected]> wrote:
> >> > > >
> >> > > > I like the suggestions you made, Nick. The only thing I would add is that it's also nice to see an explicit when(false), as people newer to the platform may not know where to expect configs for the different writers. Being able to do it either way, which I think is already assumed in your model, would make sense. I would just suggest that, if we support a writer but are disabling it, the platform inserts a default when(false) to be explicit.
> >> > > >
> >> > > > Jon
> >> > > >
> >> > > > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <[email protected]> wrote:
> >> > > >
> >> > > > > Let me noodle on this over the weekend. Your syntax is looking less onerous to me and I like the following statement from Otto: "In the end, each write destination ‘type’ will need its own configuration. This is an extension point."
> >> > > > >
> >> > > > > I may come around to your way of thinking.
> >> > > > >
> >> > > > > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <[email protected]> wrote:
> >> > > > >
> >> > > > > > In the end, each write destination ‘type’ will need its own configuration. This is an extension point.
> >> > > > > > {
> >> > > > > >   HDFS:{
> >> > > > > >     outputAdapters:[
> >> > > > > >       {name: avro,
> >> > > > > >        settings:{
> >> > > > > >          avro stuff….
> >> > > > > >        when:{
> >> > > > > >        },
> >> > > > > >       {
> >> > > > > >        name: sequence file,
> >> > > > > >        …..
> >> > > > > >
> >> > > > > > or some such.
> >> > > > > >
> >> > > > > > On January 13, 2017 at 11:51:15, Nick Allen ([email protected]) wrote:
> >> > > > > >
> >> > > > > > I will add also that instead of global overrides, like index, we should use configuration key names that are more appropriate to the output.
> >> > > > > >
> >> > > > > > For example, does 'index' really make sense for HDFS? Or would 'path' be more appropriate?
> >> > > > > >
> >> > > > > > {
> >> > > > > >   'elasticsearch': {
> >> > > > > >     'index': 'foo',
> >> > > > > >     'batchSize': 1
> >> > > > > >   },
> >> > > > > >   'hdfs': {
> >> > > > > >     'path': '/foo/bar/...',
> >> > > > > >     'batchSize': 100
> >> > > > > >   }
> >> > > > > > }
> >> > > > > >
> >> > > > > > Ok, I've said my piece.
> >> > > > > > Thanks for the effort in summarizing all this, Casey.
> >> > > > > >
> >> > > > > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <[email protected]> wrote:
> >> > > > > >
> >> > > > > > >> Nick's concerns about my suggestion were that it was overly complex and hard to grok and that we could dispense with backwards compatibility and make people do a bit more work on the default case for the benefits of a simpler advanced case. (Nick, make sure I don't misstate your position)
> >> > > > > > >
> >> > > > > > > What I will add is that, in my mind, the majority case would be a user specifying the outputs, but not things like 'batchSize' or 'when'. I think in the majority case, the user would accept whatever the default batch size is.
> >> > > > > > >
> >> > > > > > > Here are alternative suggestions for all the examples that you provided previously.
> >> > > > > > >
> >> > > > > > > Base Case
> >> > > > > > >
> >> > > > > > > - The user must always specify the 'outputs' for clarity.
> >> > > > > > > - Uses default index name, batch size and when = true.
> >> > > > > > >
> >> > > > > > > {
> >> > > > > > >   'elasticsearch': {},
> >> > > > > > >   'hdfs': {}
> >> > > > > > > }
> >> > > > > > >
> >> > > > > > > Writer-non-specific Case <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> >> > > > > > >
> >> > > > > > > - There are no global overrides, as in Casey's proposal.
> >> > > > > > > - Easier to grok IMO.
> >> > > > > > >
> >> > > > > > > {
> >> > > > > > >   'elasticsearch': {
> >> > > > > > >     'index': 'foo',
> >> > > > > > >     'batchSize': 100
> >> > > > > > >   },
> >> > > > > > >   'hdfs': {
> >> > > > > > >     'index': 'foo',
> >> > > > > > >     'batchSize': 100
> >> > > > > > >   }
> >> > > > > > > }
> >> > > > > > >
> >> > > > > > > Writer-specific case without filters <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> >> > > > > > >
> >> > > > > > > {
> >> > > > > > >   'elasticsearch': {
> >> > > > > > >     'index': 'foo',
> >> > > > > > >     'batchSize': 1
> >> > > > > > >   },
> >> > > > > > >   'hdfs': {
> >> > > > > > >     'index': 'foo',
> >> > > > > > >     'batchSize': 100
> >> > > > > > >   }
> >> > > > > > > }
> >> > > > > > >
> >> > > > > > > Writer-specific case with filters <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> >> > > > > > >
> >> > > > > > > - Instead of having to say when=false, just don't configure HDFS
> >> > > > > > >
> >> > > > > > > {
> >> > > > > > >   'elasticsearch': {
> >> > > > > > >     'index': 'foo',
> >> > > > > > >     'batchSize': 100,
> >> > > > > > >     'when': 'exists(field1)'
> >> > > > > > >   }
> >> > > > > > > }
> >> > > > > > >
> >> > > > > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <[email protected]> wrote:
> >> > > > > > >
> >> > > > > > >> Dave,
> >> > > > > > >> For the benefit of posterity and people who might not be as deeply entangled in the system as we have been, I'll recap things and hopefully answer your question in the process.
> >> > > > > > >>
> >> > > > > > >> Historically the index configuration is split between the enrichment configs and the global configs.
> >> > > > > > >>
> >> > > > > > >> - The global configs control settings that apply to all sensors. Historically this has been stuff like index connection strings, etc.
> >> > > > > > >> - The sensor-specific configs control things that vary by sensor.
> >> > > > > > >>
> >> > > > > > >> As of Metron-652 (in review currently), we moved the sensor-specific configs out of the enrichment configs. The proposal here is to increase the granularity of the sensor-specific files to make them support index writer-specific configs. Right now in the indexing topology, we have 2 writers (fixed): ES/Solr and HDFS.
> >> > > > > > >>
> >> > > > > > >> The proposed configuration would allow you to either specify a blanket sensor-level config for the index name and batchSize and/or override at the writer level, thereby supporting a couple of use-cases:
> >> > > > > > >>
> >> > > > > > >> - Turning off certain index writers (e.g. HDFS)
> >> > > > > > >> - Filtering the messages written to certain index writers
> >> > > > > > >>
> >> > > > > > >> The two competing configs between Nick and me are as follows:
> >> > > > > > >>
> >> > > > > > >> - I want to make sure we keep the old sensor-specific defaults with writer-specific overrides available
> >> > > > > > >> - Nick thought we could simplify the permutations by making the indexing config only the writer-level configs.
> >> > > > > > >>
> >> > > > > > >> My concerns about Nick's suggestion were that the default and majority case, specifying the index and the batchSize for all writers (the one we support now), would require more configuration.
> >> > > > > > >>
> >> > > > > > >> Nick's concerns about my suggestion were that it was overly complex and hard to grok and that we could dispense with backwards compatibility and make people do a bit more work on the default case for the benefits of a simpler advanced case. (Nick, make sure I don't misstate your position).
> >> > > > > > >>
> >> > > > > > >> Casey
> >> > > > > > >>
> >> > > > > > >> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <[email protected]> wrote:
> >> > > > > > >>
> >> > > > > > >> > Casey,
> >> > > > > > >> >
> >> > > > > > >> > Can you give me a level set of what your thinking is now? I think it's global control of all index types + overrides on a per-type basis. Fwiw, I'm totally for that, but I want to make sure I'm not imposing my preconceived notions on your consensus-driven ones.
> >> > > > > > >> >
> >> > > > > > >> > -D....
> >> > > > > > >> >
> >> > > > > > >> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <[email protected]> wrote:
> >> > > > > > >> >
> >> > > > > > >> > > I am suggesting that, yes. The configs are essentially the same as yours, except there is an override specified at the top level.
> >> > > > > > >> > > Without that, in order to specify that both HDFS and ES have batch sizes of 100, you have to explicitly configure each. It's less that I'm trying to have backwards compatibility and more that I'm trying to make the majority case easy: both writers write everything to a specified index name with a specified batch size (which is what we have now). Beyond that, I want to allow for specifying an override for the config on a writer-by-writer basis for those who need it.
> >> > > > > > >> > >
> >> > > > > > >> > > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <[email protected]> wrote:
> >> > > > > > >> > >
> >> > > > > > >> > > > Are you saying we support all of these variants? I realize you are trying to have some backwards compatibility, but this also makes it harder for a user to grok (for me at least).
> >> > > > > > >> > > >
> >> > > > > > >> > > > Personally I like my original example as there are fewer sub-structures, like 'writerConfig', which makes the whole thing simpler and easier to grok. But maybe others will think your proposal is just as easy to grok.
> >> > > > > > >> > > >
> >> > > > > > >> > > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <[email protected]> wrote:
> >> > > > > > >> > > >
> >> > > > > > >> > > > > Ok, so here's what I'm thinking based on the discussion:
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > - Keeping the configs that we have now (batchSize and index) as defaults for the unspecified writer-specific case
> >> > > > > > >> > > > > - Adding the config Nick suggested
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > *Base Case*:
> >> > > > > > >> > > > > {
> >> > > > > > >> > > > > }
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > - all writers write all messages
> >> > > > > > >> > > > > - index named the same as the sensor for all writers
> >> > > > > > >> > > > > - batchSize of 1 for all writers
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > *Writer-non-specific case*:
> >> > > > > > >> > > > > {
> >> > > > > > >> > > > >   "index" : "foo",
> >> > > > > > >> > > > >   "batchSize" : 100
> >> > > > > > >> > > > > }
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > - All writers write all messages
> >> > > > > > >> > > > > - index is named "foo", different from the sensor for all writers
> >> > > > > > >> > > > > - batchSize is 100 for all writers
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > *Writer-specific case without filters*
> >> > > > > > >> > > > > {
> >> > > > > > >> > > > >   "index" : "foo",
> >> > > > > > >> > > > >   "batchSize" : 1,
> >> > > > > > >> > > > >   "writerConfig" :
> >> > > > > > >> > > > >   {
> >> > > > > > >> > > > >     "elasticsearch" : {
> >> > > > > > >> > > > >       "batchSize" : 100
> >> > > > > > >> > > > >     }
> >> > > > > > >> > > > >   }
> >> > > > > > >> > > > > }
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > - All writers write all messages
> >> > > > > > >> > > > > - index is named "foo", different from the sensor for all writers
> >> > > > > > >> > > > > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> >> > > > > > >> > > > > - NOTE: I could override the index name too
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > *Writer-specific case with filters*
> >> > > > > > >> > > > > {
> >> > > > > > >> > > > >   "index" : "foo",
> >> > > > > > >> > > > >   "batchSize" : 1,
> >> > > > > > >> > > > >   "writerConfig" :
> >> > > > > > >> > > > >   {
> >> > > > > > >> > > > >     "elasticsearch" : {
> >> > > > > > >> > > > >       "batchSize" : 100,
> >> > > > > > >> > > > >       "when" : "exists(field1)"
> >> > > > > > >> > > > >     },
> >> > > > > > >> > > > >     "hdfs" : {
> >> > > > > > >> > > > >       "when" : "false"
> >> > > > > > >> > > > >     }
> >> > > > > > >> > > > >   }
> >> > > > > > >> > > > > }
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > - ES writer writes messages which have field1, HDFS doesn't
> >> > > > > > >> > > > > - index is named "foo", different from the sensor for all writers
> >> > > > > > >> > > > > - batchSize is 100 for elasticsearch writers
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > Thoughts?
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <[email protected]> wrote:
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > > For larger installations you need to control what is indexed so you don’t end up with a nasty elastic search situation and so you can mine the data later for reports and training ml models.
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > Thanks
> >> > > > > > >> > > > > > Carolyn
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > On 1/13/17, 9:40 AM, "Casey Stella" <[email protected]> wrote:
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > > OH that's a good idea!
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > > On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <[email protected]> wrote:
> >> > > > > > >> > > > > > >
> >> > > > > > >> > > > > > >> I like the "Index Filtering" option based on the flexibility that it provides. Should each output (HDFS, ES, etc) have its own configuration settings? For example, aren't things like batching handled separately for HDFS versus Elasticsearch?
> >> > > > > > >> > > > > > >>
> >> > > > > > >> > > > > > >> Something along the lines of...
> >> > > > > > >> > > > > > >>
> >> > > > > > >> > > > > > >> {
> >> > > > > > >> > > > > > >>   "hdfs" : {
> >> > > > > > >> > > > > > >>     "when": "exists(field1)",
> >> > > > > > >> > > > > > >>     "batchSize": 100
> >> > > > > > >> > > > > > >>   },
> >> > > > > > >> > > > > > >>
> >> > > > > > >> > > > > > >>   "elasticsearch" : {
> >> > > > > > >> > > > > > >>     "when": "true",
> >> > > > > > >> > > > > > >>     "batchSize": 1000,
> >> > > > > > >> > > > > > >>     "index": "squid"
> >> > > > > > >> > > > > > >>   }
> >> > > > > > >> > > > > > >> }
> >> > > > > > >> > > > > > >>
> >> > > > > > >> > > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <[email protected]> wrote:
> >> > > > > > >> > > > > > >>
> >> > > > > > >> > > > > > >> > Yeah, I tend to like the first option too. Any opposition to that from anyone?
> >> > > > > > >> > > > > > >> >
> >> > > > > > >> > > > > > >> > The points brought up are good ones and I think that it may be worth a broader discussion of the requirements of indexing in a separate dev list thread.
> >> > > > > > >> > > > > > >> > Maybe a list of desires with coherent use-cases justifying them so we can think about how this stuff should work and where the natural extension points should be. After all, we need to toe the line between engineering and overengineering for features nobody will want.
> >> > > > > > >> > > > > > >> >
> >> > > > > > >> > > > > > >> > I'm not sure about the extensions to the standard fields. I'm torn between the notions that we should have no standard fields vs we should have a boatload of standard fields (with most of them empty). I exchange positions fairly regularly on that question. ;) It may be worth a dev list discussion to lay out how you imagine an extension of standard fields and how it might look as implemented in Metron.
> >> > > > > > >> > > > > > >> >
> >> > > > > > >> > > > > > >> > Casey
> >> > > > > > >> > > > > > >> >
> >> > > > > > >> > > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <[email protected]> wrote:
> >> > > > > > >> > > > > > >> >
> >> > > > > > >> > > > > > >> > > I'll second my preference for the first option. I think the ability to use Stellar filters to customize indexing would be a big win.
> >> > > > > > >> > > > > > >> > >
> >> > > > > > >> > > > > > >> > > I'm glad Matt brought up the point about data lake and CEP. I think this is a really important use case that we need to consider. Take a simple example...
> >> > > > > > >> > > > > > >> > > If I have data coming in from 3 different firewall vendors and 2 different web proxy/url filtering vendors and I want to be able to analyze that data set, I need the data to be indexed all together (likely in HDFS) and to have a normalized schema such that IP address, URL, and user name (to take a few) can be easily queried and aggregated. I can also envision scenarios where I would want to index data based on attributes other than sensor, business unit or subsidiary for example.
> >> > > > > > >> > > > > > >> > >
> >> > > > > > >> > > > > > >> > > I've been wanting to propose extending our 7 standard fields to include things like URL and user. Is there community interest/support for moving in that direction? If so, I'll start a new thread.
> >> > > > > > >> > > > > > >> > >
> >> > > > > > >> > > > > > >> > > Thanks!
> >> > > > > > >> > > > > > >> > >
> >> > > > > > >> > > > > > >> > > -Kyle
> >> > > > > > >> > > > > > >> > >
> >> > > > > > >> > > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <[email protected]> wrote:
> >> > > > > > >> > > > > > >> > >
> >> > > > > > >> > > > > > >> > > > Ah, I see. If overriding the default index name allows using the same name for multiple sensors, then the goal can be achieved.
> >> > > > > > >> > > > > > >> > > > Thanks,
> >> > > > > > >> > > > > > >> > > > --Matt
> >> > > > > > >> > > > > > >> > > >
> >> > > > > > >> > > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <[email protected]> wrote:
> >> > > > > > >> > > > > > >> > > >
> >> > > > > > >> > > > > > >> > > > Oh, you could! Let's say you have a syslog parser with data from sources 1, 2 and 3. You'd end up with one kafka queue with 3 parsers attached to that queue, each picking apart the messages from source 1, 2 and 3. They'd go through separate enrichment and into the indexing topology.
> >> > > > > > >> > > > > > >> > > > In the indexing topology, you could specify the same index name "syslog" and all of the messages go into the same index for CEP querying if so desired.
> >> > > > > > >> > > > > > >> > > >
> >> > > > > > >> > > > > > >> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <[email protected]> wrote:
> >> > > > > > >> > > > > > >> > > >
> >> > > > > > >> > > > > > >> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a previous life. It makes perfect sense to route different lines from syslog through different appropriate parsers. But a lot of what the parsers do is identify consistent subsets of metadata and annotate it – eg, src_ip_addr, event timestamps, etc.
Once those > >> metadata > >> > > are > >> > > > > > >> > > > annotated > >> > > > > > >> > > > > > and > >> > > > > > >> > > > > > >> > > > available > >> > > > > > >> > > > > > >> > > > > with common field names, why doesn’t > it > >> > make > >> > > > > > >> sense > >> > > > > > >> > to > >> > > > > > >> > > > > index > >> > > > > > >> > > > > > the > >> > > > > > >> > > > > > >> > > > messages > >> > > > > > >> > > > > > >> > > > > together, for CEP querying? I think > >> Splunk > >> > > has > >> > > > > > >> > > > > illustrated > >> > > > > > >> > > > > > >> this > >> > > > > > >> > > > > > >> > > > model. > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" < > >> > > > > > >> > > [email protected] > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> wrote: > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > yeah, I mean, honestly, I think the > >> > approach > >> > > > > > >> > that > >> > > > > > >> > > > > we've > >> > > > > > >> > > > > > >> taken > >> > > > > > >> > > > > > >> > > for > >> > > > > > >> > > > > > >> > > > > sources > >> > > > > > >> > > > > > >> > > > > which aggregate different types of > data > >> is > >> > to > >> > > > > > >> > > > provide > >> > > > > > >> > > > > > >> filters > >> > > > > > >> > > > > > >> > > at > >> > > > > > >> > > > > > >> > > > the > >> > > > > > >> > > > > > >> > > > > parser > >> > > > > > >> > > > > > >> > > > > level and have multiple parser > topologies > >> > > > > > >> (with > >> > > > > > >> > > > > > different, > >> > > > > > >> > > > > > >> > > > possibly > >> > > > > > >> > > > > > >> > > > > mutually exclusive filters) running. > This > >> > > > > > >> would > >> > > > > > >> > > be > >> > > > > > >> > > > a > >> > > > > > >> > > > > > >> > > completely > >> > > > > > >> > > > > > >> > > > > separate > >> > > > > > >> > > > > > >> > > > > sensor. 
Imagine a syslog data source > that > >> > > > > > >> > > > aggregates > >> > > > > > >> > > > > > and > >> > > > > > >> > > > > > >> you > >> > > > > > >> > > > > > >> > > > want to > >> > > > > > >> > > > > > >> > > > > pick > >> > > > > > >> > > > > > >> > > > > apart certain pieces of messages. > This is > >> > > > > > >> why > >> > > > > > >> > the > >> > > > > > >> > > > > > initial > >> > > > > > >> > > > > > >> > > > thought and > >> > > > > > >> > > > > > >> > > > > architecture was one index per > sensor. > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > On Thu, Jan 12, 2017 at 5:55 PM, Matt > >> > Foley < > >> > > > > > >> > > > > > >> > [email protected]> > >> > > > > > >> > > > > > >> > > > wrote: > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > I’m thinking that CEP (Complex > Event > >> > > > > > >> > Processing) > >> > > > > > >> > > > is > >> > > > > > >> > > > > > >> > contrary > >> > > > > > >> > > > > > >> > > > to the > >> > > > > > >> > > > > > >> > > > > idea > >> > > > > > >> > > > > > >> > > > > > of silo-ing data per sensor. > >> > > > > > >> > > > > > >> > > > > > Now it’s true that some of those > >> sensors > >> > > > > > >> are > >> > > > > > >> > > > already > >> > > > > > >> > > > > > >> > > > aggregating > >> > > > > > >> > > > > > >> > > > > data from > >> > > > > > >> > > > > > >> > > > > > multiple sources, so maybe I’m > wrong > >> > here. 
> >> > > > > > >> > > > > > >> > > > > > But it just seems to me that the > “data > >> > > > > > >> lake” > >> > > > > > >> > > > > insights > >> > > > > > >> > > > > > >> come > >> > > > > > >> > > > > > >> > > from > >> > > > > > >> > > > > > >> > > > > being able > >> > > > > > >> > > > > > >> > > > > > to make decisions over the whole > mass > >> of > >> > > > > > >> data > >> > > > > > >> > > > rather > >> > > > > > >> > > > > > than > >> > > > > > >> > > > > > >> > > just > >> > > > > > >> > > > > > >> > > > > vertical > >> > > > > > >> > > > > > >> > > > > > slices of it. > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > On 1/12/17, 2:15 PM, "Casey > Stella" < > >> > > > > > >> > > > > > [email protected]> > >> > > > > > >> > > > > > >> > > > wrote: > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > Hey Matt, > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > Thanks for the comment! > >> > > > > > >> > > > > > >> > > > > > 1. At the moment, we only have one > >> > > > > > >> index > >> > > > > > >> > > name, > >> > > > > > >> > > > > the > >> > > > > > >> > > > > > >> > > default > >> > > > > > >> > > > > > >> > > > of > >> > > > > > >> > > > > > >> > > > > which is > >> > > > > > >> > > > > > >> > > > > > the > >> > > > > > >> > > > > > >> > > > > > sensor name but that's entirely up > to > >> > > > > > >> the > >> > > > > > >> > > > user. > >> > > > > > >> > > > > > This > >> > > > > > >> > > > > > >> > is > >> > > > > > >> > > > > > >> > > > sensor > >> > > > > > >> > > > > > >> > > > > > specific, > >> > > > > > >> > > > > > >> > > > > > so it'd be a separate config for > each > >> > > > > > >> > > sensor. 
> >> > > > > > >> > > > > If > >> > > > > > >> > > > > > we > >> > > > > > >> > > > > > >> > want > >> > > > > > >> > > > > > >> > > > to > >> > > > > > >> > > > > > >> > > > > build > >> > > > > > >> > > > > > >> > > > > > multiple > >> > > > > > >> > > > > > >> > > > > > indices per sensor, we'd have to > think > >> > > > > > >> > > > carefully > >> > > > > > >> > > > > > >> about > >> > > > > > >> > > > > > >> > > how > >> > > > > > >> > > > > > >> > > > to do > >> > > > > > >> > > > > > >> > > > > that > >> > > > > > >> > > > > > >> > > > > > and > >> > > > > > >> > > > > > >> > > > > > would be a bigger undertaking. I > >> > > > > > >> guess I > >> > > > > > >> > > can > >> > > > > > >> > > > > see > >> > > > > > >> > > > > > the > >> > > > > > >> > > > > > >> > > use, > >> > > > > > >> > > > > > >> > > > though > >> > > > > > >> > > > > > >> > > > > > (redirect > >> > > > > > >> > > > > > >> > > > > > messages to one index vs another > based > >> > > > > > >> on > >> > > > > > >> > a > >> > > > > > >> > > > > > predicate > >> > > > > > >> > > > > > >> > for > >> > > > > > >> > > > > > >> > > > a given > >> > > > > > >> > > > > > >> > > > > > sensor). > >> > > > > > >> > > > > > >> > > > > > Anyway, not where I was originally > >> > > > > > >> > thinking > >> > > > > > >> > > > that > >> > > > > > >> > > > > > this > >> > > > > > >> > > > > > >> > > > discussion > >> > > > > > >> > > > > > >> > > > > would > >> > > > > > >> > > > > > >> > > > > > go, > >> > > > > > >> > > > > > >> > > > > > but it's an interesting point. > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > 2. 
I hadn't thought through the > >> > > > > > >> > > implementation > >> > > > > > >> > > > > > quite > >> > > > > > >> > > > > > >> > yet, > >> > > > > > >> > > > > > >> > > > but we > >> > > > > > >> > > > > > >> > > > > don't > >> > > > > > >> > > > > > >> > > > > > actually have a splitter bolt in > that > >> > > > > > >> > > > topology, > >> > > > > > >> > > > > > just > >> > > > > > >> > > > > > >> a > >> > > > > > >> > > > > > >> > > > spout > >> > > > > > >> > > > > > >> > > > > that goes > >> > > > > > >> > > > > > >> > > > > > to > >> > > > > > >> > > > > > >> > > > > > the elasticsearch writer and also > to > >> > > > > > >> the > >> > > > > > >> > > hdfs > >> > > > > > >> > > > > > writer. > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > On Thu, Jan 12, 2017 at 4:52 PM, > Matt > >> > > > > > >> > Foley > >> > > > > > >> > > < > >> > > > > > >> > > > > > >> > > > [email protected]> > >> > > > > > >> > > > > > >> > > > > wrote: > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > > Casey, good to have controls like > >> > > > > > >> this. > >> > > > > > >> > > > > Couple > >> > > > > > >> > > > > > >> > > > questions: > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > 1. Regarding the “index” : > “squid” > >> > > > > > >> > > > name/value > >> > > > > > >> > > > > > pair, > >> > > > > > >> > > > > > >> > is > >> > > > > > >> > > > > > >> > > > the > >> > > > > > >> > > > > > >> > > > > index name > >> > > > > > >> > > > > > >> > > > > > > expected to always be a sensor > >> > > > > > >> name? Or > >> > > > > > >> > > is > >> > > > > > >> > > > > the > >> > > > > > >> > > > > > >> given > >> > > > > > >> > > > > > >> > > > json > >> > > > > > >> > > > > > >> > > > > structure > >> > > > > > >> > > > > > >> > > > > > > subordinate to a sensor name in > >> > > > > > >> > zookeeper? 
> >> > > > > > >> > > > Or > >> > > > > > >> > > > > > can > >> > > > > > >> > > > > > >> we > >> > > > > > >> > > > > > >> > > > build > >> > > > > > >> > > > > > >> > > > > arbitrary > >> > > > > > >> > > > > > >> > > > > > > indexes with this new > specification, > >> > > > > > >> > > > > > independent of > >> > > > > > >> > > > > > >> > > > sensor? > >> > > > > > >> > > > > > >> > > > > Should > >> > > > > > >> > > > > > >> > > > > > there > >> > > > > > >> > > > > > >> > > > > > > actually be a list of “indexes”, > ie > >> > > > > > >> > > > > > >> > > > > > > { “indexes” : [ > >> > > > > > >> > > > > > >> > > > > > > {“index” : “name1”, > >> > > > > > >> > > > > > >> > > > > > > … > >> > > > > > >> > > > > > >> > > > > > > }, > >> > > > > > >> > > > > > >> > > > > > > {“index” : “name2”, > >> > > > > > >> > > > > > >> > > > > > > … > >> > > > > > >> > > > > > >> > > > > > > } ] > >> > > > > > >> > > > > > >> > > > > > > } > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > 2. Would the filtering / writer > >> > > > > > >> > selection > >> > > > > > >> > > > > logic > >> > > > > > >> > > > > > >> take > >> > > > > > >> > > > > > >> > > > place in > >> > > > > > >> > > > > > >> > > > > the > >> > > > > > >> > > > > > >> > > > > > indexing > >> > > > > > >> > > > > > >> > > > > > > topology splitter bolt? Seems > like > >> > > > > > >> that > >> > > > > > >> > > > would > >> > > > > > >> > > > > > have > >> > > > > > >> > > > > > >> > the > >> > > > > > >> > > > > > >> > > > > smallest > >> > > > > > >> > > > > > >> > > > > > impact on > >> > > > > > >> > > > > > >> > > > > > > current implementation, no? 
> >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > Sorry if these are already > answered > >> > > > > > >> in > >> > > > > > >> > > > > PR-415, I > >> > > > > > >> > > > > > >> > > haven’t > >> > > > > > >> > > > > > >> > > > had > >> > > > > > >> > > > > > >> > > > > time to > >> > > > > > >> > > > > > >> > > > > > > review that one yet. > >> > > > > > >> > > > > > >> > > > > > > Thanks, > >> > > > > > >> > > > > > >> > > > > > > --Matt > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > On 1/12/17, 12:55 PM, "Michael > >> > > > > > >> > Miklavcic" > >> > > > > > >> > > < > >> > > > > > >> > > > > > >> > > > > > [email protected]> > >> > > > > > >> > > > > > >> > > > > > > wrote: > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > I like the flexibility and > >> > > > > > >> > > > expressibility > >> > > > > > >> > > > > of > >> > > > > > >> > > > > > >> the > >> > > > > > >> > > > > > >> > > > first > >> > > > > > >> > > > > > >> > > > > option > >> > > > > > >> > > > > > >> > > > > > with > >> > > > > > >> > > > > > >> > > > > > > Stellar > >> > > > > > >> > > > > > >> > > > > > > filters. 
> >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > M > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > On Thu, Jan 12, 2017 at 1:51 PM, > >> > > > > > >> > Casey > >> > > > > > >> > > > > > Stella < > >> > > > > > >> > > > > > >> > > > > > [email protected]> > >> > > > > > >> > > > > > >> > > > > > > wrote: > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > As of METRON-652 < > >> > > > > > >> > > > > > https://github.com/apache/ > >> > > > > > >> > > > > > >> > > > > > > incubator-metron/pull/415>, we > >> > > > > > >> > > > > > >> > > > > > > > will have decoupled the > >> > > > > > >> indexing > >> > > > > > >> > > > > > >> configuration > >> > > > > > >> > > > > > >> > > > from the > >> > > > > > >> > > > > > >> > > > > > enrichment > >> > > > > > >> > > > > > >> > > > > > > > configuration. As an immediate > >> > > > > > >> > > > > follow-up > >> > > > > > >> > > > > > to > >> > > > > > >> > > > > > >> > > that, > >> > > > > > >> > > > > > >> > > > I'd > >> > > > > > >> > > > > > >> > > > > like to > >> > > > > > >> > > > > > >> > > > > > > provide the > >> > > > > > >> > > > > > >> > > > > > > > ability to turn off and on > >> > > > > > >> writers > >> > > > > > >> > > via > >> > > > > > >> > > > > the > >> > > > > > >> > > > > > >> > > > configs. I'd > >> > > > > > >> > > > > > >> > > > > like > >> > > > > > >> > > > > > >> > > > > > to get > >> > > > > > >> > > > > > >> > > > > > > some > >> > > > > > >> > > > > > >> > > > > > > > community feedback on how the > >> > > > > > >> > > > > > functionality > >> > > > > > >> > > > > > >> > > should > >> > > > > > >> > > > > > >> > > > work, > >> > > > > > >> > > > > > >> > > > > if > >> > > > > > >> > > > > > >> > > > > > y'all are > >> > > > > > >> > > > > > >> > > > > > > > amenable. 
:) > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > As of now, we have 3 possible > >> > > > > > >> > > writers > >> > > > > > >> > > > > > which > >> > > > > > >> > > > > > >> can > >> > > > > > >> > > > > > >> > > be > >> > > > > > >> > > > > > >> > > > used > >> > > > > > >> > > > > > >> > > > > in the > >> > > > > > >> > > > > > >> > > > > > > indexing > >> > > > > > >> > > > > > >> > > > > > > > topology: > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > - Solr > >> > > > > > >> > > > > > >> > > > > > > > - Elasticsearch > >> > > > > > >> > > > > > >> > > > > > > > - HDFS > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > HDFS is always used, > >> > > > > > >> elasticsearch > >> > > > > > >> > > or > >> > > > > > >> > > > > > solr is > >> > > > > > >> > > > > > >> > > used > >> > > > > > >> > > > > > >> > > > > depending > >> > > > > > >> > > > > > >> > > > > > on how > >> > > > > > >> > > > > > >> > > > > > > you > >> > > > > > >> > > > > > >> > > > > > > > start the indexing topology. 
> >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > A couple of proposals come to > >> > > > > > >> mind > >> > > > > > >> > > > > > >> immediately: > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > *Index Filtering* > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > You would be able to specify a > >> > > > > > >> > > filter > >> > > > > > >> > > > as > >> > > > > > >> > > > > > >> > defined > >> > > > > > >> > > > > > >> > > > by a > >> > > > > > >> > > > > > >> > > > > stellar > >> > > > > > >> > > > > > >> > > > > > > statement > >> > > > > > >> > > > > > >> > > > > > > > (likely a reuse of the > >> > > > > > >> > StellarFilter > >> > > > > > >> > > > > that > >> > > > > > >> > > > > > >> > exists > >> > > > > > >> > > > > > >> > > > in the > >> > > > > > >> > > > > > >> > > > > > Parsers) > >> > > > > > >> > > > > > >> > > > > > > which > >> > > > > > >> > > > > > >> > > > > > > > would allow you to indicate on > >> > > > > > >> a > >> > > > > > >> > > > > > >> > > > message-by-message basis > >> > > > > > >> > > > > > >> > > > > > whether or > >> > > > > > >> > > > > > >> > > > > > > not to > >> > > > > > >> > > > > > >> > > > > > > > write the message. > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > The semantics of this would be > >> > > > > > >> as > >> > > > > > >> > > > > follows: > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > - Default (i.e. 
> >> > > > > > >> unspecified) is > >> > > > > > >> > > to > >> > > > > > >> > > > > pass > >> > > > > > >> > > > > > >> > > > everything > >> > > > > > >> > > > > > >> > > > > through > >> > > > > > >> > > > > > >> > > > > > (hence > >> > > > > > >> > > > > > >> > > > > > > > backwards compatible with > >> > > > > > >> the > >> > > > > > >> > > > current > >> > > > > > >> > > > > > >> > default > >> > > > > > >> > > > > > >> > > > config). > >> > > > > > >> > > > > > >> > > > > > > > - Messages which have the > >> > > > > > >> > > > associated > >> > > > > > >> > > > > > >> stellar > >> > > > > > >> > > > > > >> > > > statement > >> > > > > > >> > > > > > >> > > > > > evaluate > >> > > > > > >> > > > > > >> > > > > > > to true > >> > > > > > >> > > > > > >> > > > > > > > for the writer type will be > >> > > > > > >> > > > written, > >> > > > > > >> > > > > > >> > otherwise > >> > > > > > >> > > > > > >> > > > not. > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > Sample indexing config which > >> > > > > > >> would > >> > > > > > >> > > > write > >> > > > > > >> > > > > > out > >> > > > > > >> > > > > > >> no > >> > > > > > >> > > > > > >> > > > messages > >> > > > > > >> > > > > > >> > > > > to > >> > > > > > >> > > > > > >> > > > > > HDFS and > >> > > > > > >> > > > > > >> > > > > > > write > >> > > > > > >> > > > > > >> > > > > > > > out only messages containing a > >> > > > > > >> > field > >> > > > > > >> > > > > > called > >> > > > > > >> > > > > > >> > > > "field1": > >> > > > > > >> > > > > > >> > > > > > > > { > >> > > > > > >> > > > > > >> > > > > > > > "index" : "squid" > >> > > > > > >> > > > > > >> > > > > > > > ,"batchSize" : 100 > >> > > > > > >> > > > > > >> > > > > > > > ,"filters" : { > >> > > > > > >> > > > > > >> > > > > > > > "HDFS" : "false" > >> > > > > > >> > > > > > >> > > > > > > > ,"ES" : "exists(field1)" > >> > > > > > >> > > > > > >> > > > > > > > } > 
>> > > > > > >> > > > > > >> > > > > > > > } > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > *Index On/Off Switch* > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > A simpler solution would be to > >> > > > > > >> > just > >> > > > > > >> > > > > > provide a > >> > > > > > >> > > > > > >> > > list > >> > > > > > >> > > > > > >> > > > of > >> > > > > > >> > > > > > >> > > > > writers > >> > > > > > >> > > > > > >> > > > > > to > >> > > > > > >> > > > > > >> > > > > > > write > >> > > > > > >> > > > > > >> > > > > > > > messages. The semantics would > >> > > > > > >> be > >> > > > > > >> > as > >> > > > > > >> > > > > > follows: > >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > - If the list is > >> > > > > > >> unspecified, > >> > > > > > >> > > then > >> > > > > > >> > > > > the > >> > > > > > >> > > > > > >> > default > >> > > > > > >> > > > > > >> > > > is to > >> > > > > > >> > > > > > >> > > > > write > >> > > > > > >> > > > > > >> > > > > > all > >> > > > > > >> > > > > > >> > > > > > > messages > >> > > > > > >> > > > > > >> > > > > > > > for every writer in the > >> > > > > > >> > indexing > >> > > > > > >> > > > > > topology > >> > > > > > >> > > > > > >> > > > > > > > - If the list is specified, > >> > > > > > >> > then > >> > > > > > >> > > a > >> > > > > > >> > > > > > writer > >> > > > > > >> > > > > > >> > will > >> > > > > > >> > > > > > >> > > > write > >> > > > > > >> > > > > > >> > > > > all > >> > > > > > >> > > > > > >> > > > > > messages > >> > > > > > >> > > > > > >> > > > > > > if and > >> > > > > > >> > > > > > >> > > > > > > > only if it is named in the > >> > > > > > >> > list. 
> >> > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > Sample indexing config which > >> > > > > > >> turns > >> > > > > > >> > > off > >> > > > > > >> > > > > > HDFS > >> > > > > > >> > > > > > >> and > >> > > > > > >> > > > > > >> > > > keeps on > >> > > > > > >> > > > > > >> > > > > > > Elasticsearch: > >> > > > > > >> > > > > > >> > > > > > > > { > >> > > > > > >> > > > > > >> > > > > > > > "index" : "squid" > >> > > > > > >> > > > > > >> > > > > > > > ,"batchSize" : 100 > >> > > > > > >> > > > > > >> > > > > > > > ,"writers" : [ "ES" ] > >> > > > >> > > -- > >> > > >> > Jon > >> > > >> > Sent from my mobile device > >> > > > > > -- > > Nick Allen <[email protected]> > > ------------------- > Thank you, > > James Sirota > PPMC- Apache Metron (Incubating) > jsirota AT apache DOT org >
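
For anyone comparing the two proposals quoted above, here is a minimal sketch of the decision each one implies for a single writer and a single message. It is not Metron code: the function names are hypothetical, and plain Python predicates stand in for the Stellar statements "false" and "exists(field1)", since evaluating real Stellar is out of scope here. The writer keys ("HDFS", "ES") are taken from the sample configs in the thread.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Predicate = Callable[[Message], bool]


def should_write_filtered(writer: str, message: Message,
                          filters: Optional[Dict[str, Predicate]]) -> bool:
    """Index Filtering: write when the writer's filter evaluates to true."""
    if not filters or writer not in filters:
        # No filter given for this writer: pass everything through,
        # which keeps the current default behaviour.
        return True
    return filters[writer](message)


def should_write_listed(writer: str, writers: Optional[List[str]]) -> bool:
    """Index On/Off Switch: write if and only if the writer is listed."""
    if writers is None:
        # List unspecified: every writer in the topology writes.
        return True
    return writer in writers


# Hypothetical stand-ins for the Stellar statements in the sample config.
filters = {
    "HDFS": lambda msg: False,           # "false"
    "ES": lambda msg: "field1" in msg,   # "exists(field1)"
}

msg_with_field = {"field1": "a", "ip_src_addr": "10.0.0.1"}
msg_without_field = {"ip_src_addr": "10.0.0.1"}
```

Under these assumptions, the on/off list behaves like a filter that is constantly true for listed writers and constantly false for the rest, so the filtering proposal can express everything the switch can, at the cost of evaluating a predicate per message.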
