Well, I like it for a couple of reasons:

- It's explicit and clear whether the writer is on or off.
- It enables people to keep their writer config in the file without having the writer on (so I don't have to adjust the when clause to "false").
- It enables us to not have to execute a Stellar statement for "off" writers.

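To make the distinction concrete, here is a minimal sketch of how the two properties could interact. It is an illustration only, not Metron's implementation: `should_write` is a hypothetical helper, a plain Python callable stands in for a Stellar expression, and `enabled` is modelled as a boolean rather than the quoted strings used in the examples in this thread. The one rule taken from the proposal is that `enabled` defaults to on when unspecified.

```python
from typing import Any, Callable, Dict, Optional

Message = Dict[str, Any]
Predicate = Callable[[Message], bool]


def should_write(writer_config: Optional[Dict[str, Any]], message: Message) -> bool:
    """Decide whether one writer should handle one message (hypothetical helper)."""
    config = writer_config or {}
    # 'enabled' defaults to on when unspecified, matching current semantics.
    if not config.get("enabled", True):
        # Disabled writers short-circuit: the 'when' filter is never evaluated.
        return False
    when: Optional[Predicate] = config.get("when")
    # No filter means every message is written.
    return True if when is None else bool(when(message))


calls = []


def has_field1(message: Message) -> bool:
    """Stand-in for a Stellar filter such as exists(field1)."""
    calls.append(message)  # record evaluations to show the short-circuit
    return "field1" in message


configs = {
    "elasticsearch": {"enabled": True, "index": "foo", "when": has_field1},
    # The HDFS settings stay in the file even though the writer is off.
    "hdfs": {"enabled": False, "path": "/foo/bar/...", "batchSize": 100, "when": has_field1},
}
```

With this shape, turning HDFS back on is a one-word change to `enabled`; the path, batch size and filter never have to be stored anywhere else, and the filter is not run at all while the writer is off.
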
On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen <n...@nickallen.org> wrote:

> I'm all for a compromise here. Sounds like we're getting close.
>
> Just one thing. Can you lay out the reasoning for having 'enabled' and 'when'? I don't follow the reasoning, but maybe I am missing something.
>
> On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <kylerichards...@gmail.com> wrote:
>
> > I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's enabled property. I also like the idea of a path property for HDFS.
> >
> > -Kyle
> >
> > > On Jan 14, 2017, at 10:51 AM, Casey Stella <ceste...@gmail.com> wrote:
> > >
> > > I'm +1 on an explicit enabled property and a filter (or when) property. I think we are zeroing in on a decent design, so that is good.
> > >
> > > To recap, what I am +1 on is Nick's proposed syntax with the following modifications:
> > > 1. An explicit enabled field
> > > 2. A default on for unspecified to match current semantics
> > >
> > > Casey
> > >> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com <zeo...@gmail.com> wrote:
> > >>
> > >> This has the additional benefit of letting you do something like the below when you want to temporarily disable the hdfs writer, but don't want to remove the settings.
> > >> This removes the need to store the path and batchSize (and many additional settings) somewhere else so they can be brought back in when you want to re-enable it, which is a nice workflow attribute for the end user:
> > >>
> > >> {
> > >>   'elasticsearch': {
> > >>     'enabled': 'true',
> > >>     'index': 'foo',
> > >>     'batchSize': 100
> > >>   },
> > >>   'hdfs': {
> > >>     'enabled': 'false',
> > >>     'path': '/foo/bar/...',
> > >>     'batchSize': 100
> > >>   },
> > >>   'solr': {
> > >>     'enabled': 'false'
> > >>   }
> > >> }
> > >>
> > >> Jon
> > >>
> > >>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com <zeo...@gmail.com> wrote:
> > >>>
> > >>> I similarly have a concern there because I prefer being as explicit as possible, which makes things easier to pick up for new users. Using my example from earlier this could look like specifying when(false), but an even better and more obvious approach may be to use enabled(false). So the current simple default would be:
> > >>>
> > >>> {
> > >>>   'elasticsearch': { 'enabled': 'true' },
> > >>>   'hdfs': { 'enabled': 'true' },
> > >>>   'solr': { 'enabled': 'false' }
> > >>> }
> > >>>
> > >>> And to use ES with some overrides but not HDFS or solr it would look like:
> > >>>
> > >>> {
> > >>>   'elasticsearch': {
> > >>>     'enabled': 'true',
> > >>>     'index': 'foo',
> > >>>     'batchSize': 100
> > >>>   },
> > >>>   'hdfs': {
> > >>>     'enabled': 'false'
> > >>>   },
> > >>>   'solr': {
> > >>>     'enabled': 'false'
> > >>>   }
> > >>> }
> > >>>
> > >>> Jon
> > >>>
> > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <ceste...@gmail.com> wrote:
> > >>>
> > >>> One thing that I thought of that I very strenuously do not like in Nick's proposal is that if a writer config is not specified then it is turned off (I think; if I misunderstood let me know).
> > >>> In the situation where we have a new sensor, right now if there is no index config and no enrichment config, it still passes through to the index using defaults. In this new scheme it would not. This changes the default semantics for the system and I think it changes it for the worse.
> > >>>
> > >>> I would strongly prefer an on-by-default indexing config as we have now.
> > >>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <ceste...@gmail.com> wrote:
> > >>>>
> > >>>> One thing that I really like about Nick's suggestion is that it allows writer-specific configs in a clear and simple way. It is more complex for the default case (all writers write to indices named the same thing with a fixed batch size), which I do not like, but maybe it's worth the compromise to make it less complex for the advanced case.
> > >>>>
> > >>>> Thanks a lot for the suggestion, Nick, it's interesting; I'm beginning to lean your way.
> > >>>>
> > >>>> On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:
> > >>>>
> > >>>> I like the suggestions you made, Nick. The only thing I would add is that it's also nice to see an explicit when(false), as people newer to the platform may not know where to expect configs for the different writers. Being able to do it either way, which I think is already assumed in your model, would make sense. I would just suggest that, if we support but are disabling a writer, the platform inserts a default when(false) to be explicit.
> > >>>>
> > >>>> Jon
> > >>>>
> > >>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <ceste...@gmail.com> wrote:
> > >>>>
> > >>>>> Let me noodle on this over the weekend.
> > >>>>> Your syntax is looking less onerous to me and I like the following statement from Otto: "In the end, each write destination ‘type’ will need its own configuration. This is an extension point."
> > >>>>>
> > >>>>> I may come around to your way of thinking.
> > >>>>>
> > >>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
> > >>>>>
> > >>>>>> In the end, each write destination ‘type’ will need its own configuration. This is an extension point.
> > >>>>>> {
> > >>>>>> HDFS:{
> > >>>>>> outputAdapters:[
> > >>>>>> {name: avro,
> > >>>>>> settings:{
> > >>>>>> avro stuff….
> > >>>>>> when:{
> > >>>>>> },
> > >>>>>> {
> > >>>>>> name: sequence file,
> > >>>>>> …..
> > >>>>>>
> > >>>>>> or some such.
> > >>>>>>
> > >>>>>> On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org) wrote:
> > >>>>>>
> > >>>>>> I will add also that instead of global overrides, like index, we should use configuration key names that are more appropriate to the output.
> > >>>>>>
> > >>>>>> For example, does 'index' really make sense for HDFS? Or would 'path' be more appropriate?
> > >>>>>>
> > >>>>>> {
> > >>>>>>   'elasticsearch': {
> > >>>>>>     'index': 'foo',
> > >>>>>>     'batchSize': 1
> > >>>>>>   },
> > >>>>>>   'hdfs': {
> > >>>>>>     'path': '/foo/bar/...',
> > >>>>>>     'batchSize': 100
> > >>>>>>   }
> > >>>>>> }
> > >>>>>>
> > >>>>>> Ok, I've said my piece. Thanks for the effort in summarizing all this, Casey.
> > >>>>>>
> > >>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <n...@nickallen.org> wrote:
> > >>>>>>
> > >>>>>>>> Nick's concerns about my suggestion were that it was overly complex and hard to grok and that we could dispense with backwards compatibility and make people do a bit more work on the default case for the benefits of a simpler advanced case. (Nick, make sure I don't misstate your position)
> > >>>>>>>
> > >>>>>>> I will add that in my mind, the majority case would be a user specifying the outputs, but not things like 'batchSize' or 'when'. I think in the majority case, the user would accept whatever the default batch size is.
> > >>>>>>>
> > >>>>>>> Here are alternative suggestions for all the examples that you provided previously.
> > >>>>>>>
> > >>>>>>> Base Case
> > >>>>>>>
> > >>>>>>> - The user must always specify the 'outputs' for clarity.
> > >>>>>>> - Uses default index name, batch size and when = true.
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>>   'elasticsearch': {},
> > >>>>>>>   'hdfs': {}
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> Writer-non-specific Case <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > >>>>>>>
> > >>>>>>> - Unlike Casey's proposal, there are no global overrides.
> > >>>>>>> - Easier to grok IMO.
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>>   'elasticsearch': {
> > >>>>>>>     'index': 'foo',
> > >>>>>>>     'batchSize': 100
> > >>>>>>>   },
> > >>>>>>>   'hdfs': {
> > >>>>>>>     'index': 'foo',
> > >>>>>>>     'batchSize': 100
> > >>>>>>>   }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> Writer-specific case without filters <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>>   'elasticsearch': {
> > >>>>>>>     'index': 'foo',
> > >>>>>>>     'batchSize': 1
> > >>>>>>>   },
> > >>>>>>>   'hdfs': {
> > >>>>>>>     'index': 'foo',
> > >>>>>>>     'batchSize': 100
> > >>>>>>>   }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> Writer-specific case with filters <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> > >>>>>>>
> > >>>>>>> - Instead of having to say when=false, just don't configure HDFS
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>>   'elasticsearch': {
> > >>>>>>>     'index': 'foo',
> > >>>>>>>     'batchSize': 100,
> > >>>>>>>     'when': 'exists(field1)'
> > >>>>>>>   }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <ceste...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Dave,
> > >>>>>>>> For the benefit of posterity and people who might not be as deeply entangled in the system as we have been, I'll recap things and hopefully answer your question in the process.
> > >>>>>>>>
> > >>>>>>>> Historically the index configuration is split between the enrichment configs and the global configs.
> > >>>>>>>>
> > >>>>>>>> - The global configs really control configs that apply to all sensors. Historically this has been stuff like index connection strings, etc.
> > >>>>>>>> - The sensor-specific configs control things that vary by sensor.
> > >>>>>>>>
> > >>>>>>>> As of Metron-652 (in review currently), we moved the sensor specific configs from the enrichment configs. The proposal here is to increase the granularity of the sensor specific files to make them support index writer-specific configs. Right now in the indexing topology, we have 2 writers (fixed): ES/Solr and HDFS.
> > >>>>>>>>
> > >>>>>>>> The proposed configuration would allow you to either specify a blanket sensor-level config for the index name and batchSize and/or override at the writer level, thereby supporting a couple of use-cases:
> > >>>>>>>>
> > >>>>>>>> - Turning off certain index writers (e.g. HDFS)
> > >>>>>>>> - Filtering the messages written to certain index writers
> > >>>>>>>>
> > >>>>>>>> The two competing configs between Nick and me are as follows:
> > >>>>>>>>
> > >>>>>>>> - I want to make sure we keep the old sensor-specific defaults with writer-specific overrides available
> > >>>>>>>> - Nick thought we could simplify the permutations by making the indexing config only the writer-level configs.
> > >>>>>>>>
> > >>>>>>>> My concerns about Nick's suggestion were that the default and majority case, specifying the index and the batchSize for all writers (the one we support now), would require more configuration.
> > >>>>>>>>
> > >>>>>>>> Nick's concerns about my suggestion were that it was overly complex and hard to grok and that we could dispense with backwards compatibility and make people do a bit more work on the default case for the benefits of a simpler advanced case. (Nick, make sure I don't misstate your position).
> > >>>>>>>>
> > >>>>>>>> Casey
> > >>>>>>>>
> > >>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <dlyle65...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Casey,
> > >>>>>>>>>
> > >>>>>>>>> Can you give me a level set of what your thinking is now? I think it's global control of all index types + overrides on a per-type basis. Fwiw, I'm totally for that, but I want to make sure I'm not imposing my preconceived notions on your consensus-driven ones.
> > >>>>>>>>>
> > >>>>>>>>> -D....
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <ceste...@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I am suggesting that, yes. The configs are essentially the same as yours, except there is an override specified at the top level. Without that, in order to specify that both HDFS and ES have batch sizes of 100, you have to explicitly configure each. It's less that I'm trying to have backwards compatibility and more that I'm trying to make the majority case easy: both writers write everything to a specified index name with a specified batch size (which is what we have now).
> > >>>>>>>>>> Beyond that, I want to allow for specifying an override for the config on a writer-by-writer basis for those who need it.
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <n...@nickallen.org> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Are you saying we support all of these variants? I realize you are trying to have some backwards compatibility, but this also makes it harder for a user to grok (for me at least).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Personally I like my original example as there are fewer sub-structures, like 'writerConfig', which makes the whole thing simpler and easier to grok. But maybe others will think your proposal is just as easy to grok.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <ceste...@gmail.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - Keeping the configs that we have now (batchSize and index) as defaults for the unspecified writer-specific case
> > >>>>>>>>>>>> - Adding the config Nick suggested
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Base Case*:
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - all writers write all messages
> > >>>>>>>>>>>> - index named the same as the sensor for all writers
> > >>>>>>>>>>>> - batchSize of 1 for all writers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Writer-non-specific case*:
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>>   "index" : "foo"
> > >>>>>>>>>>>>   ,"batchSize" : 100
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - All writers write all messages
> > >>>>>>>>>>>> - index is named "foo", different from the sensor for all writers
> > >>>>>>>>>>>> - batchSize is 100 for all writers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Writer-specific case without filters*
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>>   "index" : "foo"
> > >>>>>>>>>>>>   ,"batchSize" : 1
> > >>>>>>>>>>>>   , "writerConfig" :
> > >>>>>>>>>>>>   {
> > >>>>>>>>>>>>     "elasticsearch" : {
> > >>>>>>>>>>>>       "batchSize" : 100
> > >>>>>>>>>>>>     }
> > >>>>>>>>>>>>   }
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - All writers write all messages
> > >>>>>>>>>>>> - index is named "foo", different from the sensor for all writers
> > >>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > >>>>>>>>>>>> - NOTE: I could override the index name too
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Writer-specific case with filters*
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>>   "index" : "foo"
> > >>>>>>>>>>>>   ,"batchSize" : 1
> > >>>>>>>>>>>>   , "writerConfig" :
> > >>>>>>>>>>>>   {
> > >>>>>>>>>>>>     "elasticsearch" : {
> > >>>>>>>>>>>>       "batchSize" : 100,
> > >>>>>>>>>>>>       "when" : "exists(field1)"
> > >>>>>>>>>>>>     },
> > >>>>>>>>>>>>     "hdfs" : {
> > >>>>>>>>>>>>       "when" : "false"
> > >>>>>>>>>>>>     }
> > >>>>>>>>>>>>   }
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - ES writer writes messages which have field1, HDFS doesn't
> > >>>>>>>>>>>> - index is named "foo", different from the sensor for all writers
> > >>>>>>>>>>>> - batchSize is 100 for elasticsearch writers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thoughts?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cd...@hortonworks.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> For larger installations you need to control what is indexed so you don’t end up with a nasty Elasticsearch situation and so you can mine the data later for reports and training ML models.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>> Carolyn
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <ceste...@gmail.com> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> OH that's a good idea!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <n...@nickallen.org> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I like the "Index Filtering" option based on the flexibility that it provides. Should each output (HDFS, ES, etc) have its own configuration settings? For example, aren't things like batching handled separately for HDFS versus Elasticsearch?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Something along the lines of...
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>   "hdfs" : {
> > >>>>>>>>>>>>>>>     "when": "exists(field1)",
> > >>>>>>>>>>>>>>>     "batchSize": 100
> > >>>>>>>>>>>>>>>   },
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>   "elasticsearch" : {
> > >>>>>>>>>>>>>>>     "when": "true",
> > >>>>>>>>>>>>>>>     "batchSize": 1000,
> > >>>>>>>>>>>>>>>     "index": "squid"
> > >>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <ceste...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any opposition to that from anyone?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The points brought up are good ones and I think that it may be worth a broader discussion of the requirements of indexing in a separate dev list thread. Maybe a list of desires with coherent use-cases justifying them so we can think about how this stuff should work and where the natural extension points should be. After all, we need to toe the line between engineering and overengineering for features nobody will want.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard fields. I'm torn between the notions that we should have no standard fields vs we should have a boatload of standard fields (with most of them empty). I exchange positions fairly regularly on that question. ;) It may be worth a dev list discussion to lay out how you imagine an extension of standard fields and how it might look as implemented in Metron.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Casey
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <kylerichards...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I'll second my preference for the first option. I think the ability to use Stellar filters to customize indexing would be a big win.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data lake and CEP. I think this is a really important use case that we need to consider. Take a simple example...
> > >>>>>>>>>>>>>>>>> If I have data coming in from 3 different firewall vendors and 2 different web proxy/url filtering vendors and I want to be able to analyze that data set, I need the data to be indexed all together (likely in HDFS) and to have a normalized schema such that IP address, URL, and user name (to take a few) can be easily queried and aggregated. I can also envision scenarios where I would want to index data based on attributes other than sensor: business unit or subsidiary, for example.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I've been wanting to propose extending our 7 standard fields to include things like URL and user. Is there community interest/support for moving in that direction? If so, I'll start a new thread.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> -Kyle
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <ma...@apache.org> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index name allows using the same name for multiple sensors, then the goal can be achieved.
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>> --Matt
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog parser with data from sources 1, 2 and 3. You'd end up with one Kafka queue with 3 parsers attached to that queue, each picking apart the messages from source 1, 2 and 3. They'd go through separate enrichment and into the indexing topology. In the indexing topology, you could specify the same index name "syslog" and all of the messages go into the same index for CEP querying if so desired.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <ma...@apache.org> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I worked at LogLogic in a previous life. It makes perfect sense to route different lines from syslog through different appropriate parsers.
> > >>>>>>>>>>>>>>>>>>> But a lot of what the parsers do is identify consistent subsets of metadata and annotate it – eg, src_ip_addr, event timestamps, etc. Once those metadata are annotated and available with common field names, why doesn’t it make sense to index the messages together, for CEP querying? I think Splunk has illustrated this model.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the approach that we've taken for sources which aggregate different types of data is to provide filters at the parser level and have multiple parser topologies (with different, possibly mutually exclusive filters) running. This would be a completely separate sensor.
Imagine a syslog data source that > > >>>>>>>>>>> aggregates > > >>>>>>>>>>>>> and > > >>>>>>>>>>>>>>> you > > >>>>>>>>>>>>>>>>>> want to > > >>>>>>>>>>>>>>>>>>> pick > > >>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is > > >>>>>>>> why > > >>>>>>>>> the > > >>>>>>>>>>>>> initial > > >>>>>>>>>>>>>>>>>> thought and > > >>>>>>>>>>>>>>>>>>> architecture was one index per sensor. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt > > >> Foley < > > >>>>>>>>>>>>>>>> ma...@apache.org> > > >>>>>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event > > >>>>>>>>> Processing) > > >>>>>>>>>>> is > > >>>>>>>>>>>>>>>> contrary > > >>>>>>>>>>>>>>>>>> to the > > >>>>>>>>>>>>>>>>>>> idea > > >>>>>>>>>>>>>>>>>>>> of silo-ing data per sensor. > > >>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors > > >>>>>>>> are > > >>>>>>>>>>> already > > >>>>>>>>>>>>>>>>>> aggregating > > >>>>>>>>>>>>>>>>>>> data from > > >>>>>>>>>>>>>>>>>>>> multiple sources, so maybe I’m wrong > > >> here. > > >>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data > > >>>>>>>> lake” > > >>>>>>>>>>>> insights > > >>>>>>>>>>>>>>> come > > >>>>>>>>>>>>>>>>> from > > >>>>>>>>>>>>>>>>>>> being able > > >>>>>>>>>>>>>>>>>>>> to make decisions over the whole mass of > > >>>>>>>> data > > >>>>>>>>>>> rather > > >>>>>>>>>>>>> than > > >>>>>>>>>>>>>>>>> just > > >>>>>>>>>>>>>>>>>>> vertical > > >>>>>>>>>>>>>>>>>>>> slices of it. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" < > > >>>>>>>>>>>>> ceste...@gmail.com> > > >>>>>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Hey Matt, > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for the comment! > > >>>>>>>>>>>>>>>>>>>> 1. 
At the moment, we only have one > > >>>>>>>> index > > >>>>>>>>>> name, > > >>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>> default > > >>>>>>>>>>>>>>>>>> of > > >>>>>>>>>>>>>>>>>>> which is > > >>>>>>>>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>>>>> sensor name but that's entirely up to > > >>>>>>>> the > > >>>>>>>>>>> user. > > >>>>>>>>>>>>> This > > >>>>>>>>>>>>>>>> is > > >>>>>>>>>>>>>>>>>> sensor > > >>>>>>>>>>>>>>>>>>>> specific, > > >>>>>>>>>>>>>>>>>>>> so it'd be a separate config for each > > >>>>>>>>>> sensor. > > >>>>>>>>>>>> If > > >>>>>>>>>>>>> we > > >>>>>>>>>>>>>>>> want > > >>>>>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>>>> build > > >>>>>>>>>>>>>>>>>>>> multiple > > >>>>>>>>>>>>>>>>>>>> indices per sensor, we'd have to think > > >>>>>>>>>>> carefully > > >>>>>>>>>>>>>>> about > > >>>>>>>>>>>>>>>>> how > > >>>>>>>>>>>>>>>>>> to do > > >>>>>>>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>>>>>> would be a bigger undertaking. I > > >>>>>>>> guess I > > >>>>>>>>>> can > > >>>>>>>>>>>> see > > >>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>> use, > > >>>>>>>>>>>>>>>>>> though > > >>>>>>>>>>>>>>>>>>>> (redirect > > >>>>>>>>>>>>>>>>>>>> messages to one index vs another based > > >>>>>>>> on > > >>>>>>>>> a > > >>>>>>>>>>>>> predicate > > >>>>>>>>>>>>>>>> for > > >>>>>>>>>>>>>>>>>> a given > > >>>>>>>>>>>>>>>>>>>> sensor). > > >>>>>>>>>>>>>>>>>>>> Anyway, not where I was originally > > >>>>>>>>> thinking > > >>>>>>>>>>> that > > >>>>>>>>>>>>> this > > >>>>>>>>>>>>>>>>>> discussion > > >>>>>>>>>>>>>>>>>>> would > > >>>>>>>>>>>>>>>>>>>> go, > > >>>>>>>>>>>>>>>>>>>> but it's an interesting point. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> 2. 
   I hadn't thought through the implementation quite yet, but we don't actually have a splitter bolt in that topology, just a spout that goes to the elasticsearch writer and also to the hdfs writer.

On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <ma...@apache.org> wrote:

Casey, good to have controls like this. Couple questions:

1. Regarding the “index” : “squid” name/value pair, is the index name expected to always be a sensor name? Or is the given json structure subordinate to a sensor name in zookeeper? Or can we build arbitrary indexes with this new specification, independent of sensor?
   Should there actually be a list of “indexes”, ie
   { “indexes” : [
       {“index” : “name1”,
        …
       },
       {“index” : “name2”,
        …
       } ]
   }

2. Would the filtering / writer selection logic take place in the indexing topology splitter bolt? Seems like that would have the smallest impact on current implementation, no?

Sorry if these are already answered in PR-415, I haven’t had time to review that one yet.
Thanks,
--Matt

On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklav...@gmail.com> wrote:

I like the flexibility and expressibility of the first option with Stellar filters.

M

On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <ceste...@gmail.com> wrote:

As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>, we will have decoupled the indexing configuration from the enrichment configuration. As an immediate follow-up to that, I'd like to provide the ability to turn off and on writers via the configs. I'd like to get some community feedback on how the functionality should work, if y'all are amenable.
:)

As of now, we have 3 possible writers which can be used in the indexing topology:

- Solr
- Elasticsearch
- HDFS

HDFS is always used, elasticsearch or solr is used depending on how you start the indexing topology.

A couple of proposals come to mind immediately:

*Index Filtering*

You would be able to specify a filter as defined by a stellar statement (likely a reuse of the StellarFilter that exists in the Parsers) which would allow you to indicate on a message-by-message basis whether or not to write the message.

The semantics of this would be as follows:

- Default (i.e.
  unspecified) is to pass everything through (hence backwards compatible with the current default config).
- Messages which have the associated stellar statement evaluate to true for the writer type will be written, otherwise not.

Sample indexing config which would write out no messages to HDFS and write out only messages containing a field called "field1":
{
  "index" : "squid"
  ,"batchSize" : 100
  ,"filters" : {
    "HDFS" : "false"
    ,"ES" : "exists(field1)"
  }
}

*Index On/Off Switch*

A simpler solution would be to just provide a list of writers to write messages.
The semantics would be as follows:

- If the list is unspecified, then the default is to write all messages for every writer in the indexing topology
- If the list is specified, then a writer will write all messages if and only if it is named in the list.

Sample indexing config which turns off HDFS and keeps on Elasticsearch:
{
  "index" : "squid"
  ,"batchSize" : 100
  ,"writers" : [ "ES" ]

--

Jon

Sent from my mobile device

--
Nick Allen <n...@nickallen.org>
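
For anyone following the two proposals from Casey's original message (Index Filtering and the Index On/Off Switch), the semantics as stated there could be sketched roughly as below. This is only an illustration in Python, not Metron code. The function names `passes_filter` and `writer_enabled` are hypothetical, and the Stellar statements "false" and "exists(field1)" are stood in for by plain Python predicates, since no Stellar evaluation is attempted here. It also assumes that a writer with no filter entry is treated the same as an unspecified filter, which the thread does not spell out.

```python
from typing import Any, Callable, Dict, List, Optional

Message = Dict[str, Any]
Predicate = Callable[[Message], bool]


def passes_filter(writer: str,
                  message: Message,
                  filters: Optional[Dict[str, Predicate]]) -> bool:
    """Index Filtering proposal (hypothetical helper).

    An unspecified filter passes everything through; otherwise the
    message is written only if the writer's predicate evaluates to true.
    """
    if not filters or writer not in filters:
        # Assumption: a missing per-writer entry also counts as "unspecified".
        return True
    return bool(filters[writer](message))


def writer_enabled(writer: str, writers: Optional[List[str]]) -> bool:
    """Index On/Off Switch proposal (hypothetical helper).

    An unspecified list means every writer writes all messages; a
    specified list means a writer writes if and only if it is named.
    """
    if writers is None:
        return True
    return writer in writers


# Python stand-ins for the Stellar statements in the sample config:
# "HDFS" : "false" and "ES" : "exists(field1)".
sample_filters: Dict[str, Predicate] = {
    "HDFS": lambda msg: False,
    "ES": lambda msg: "field1" in msg,
}

# Stand-in for "writers" : [ "ES" ] in the second sample config.
sample_writers: List[str] = ["ES"]
```

With these stand-ins, a message carrying `field1` would go to ES but not HDFS under the filtering proposal, while under the on/off proposal every message would go to ES and none to HDFS.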