How does it look with 50 whens?
On January 13, 2017 at 10:02:02, Casey Stella ([email protected]) wrote:

Ok, so here's what I'm thinking based on the discussion:

- Keeping the configs that we have now (batchSize and index) as defaults
  for the unspecified writer-specific case
- Adding the config Nick suggested

*Base case*:

{ }

- All writers write all messages
- The index is named the same as the sensor for all writers
- batchSize of 1 for all writers

*Writer-non-specific case*:

{
  "index" : "foo"
 ,"batchSize" : 100
}

- All writers write all messages
- The index is named "foo", different from the sensor name, for all writers
- batchSize is 100 for all writers

*Writer-specific case without filters*:

{
  "index" : "foo"
 ,"batchSize" : 1
 ,"writerConfig" : {
    "elasticsearch" : { "batchSize" : 100 }
  }
}

- All writers write all messages
- The index is named "foo", different from the sensor name, for all writers
- batchSize is 1 for HDFS and 100 for the Elasticsearch writer
- NOTE: I could override the index name here too

*Writer-specific case with filters*:

{
  "index" : "foo"
 ,"batchSize" : 1
 ,"writerConfig" : {
    "elasticsearch" : { "batchSize" : 100, "when" : "exists(field1)" },
    "hdfs" : { "when" : "false" }
  }
}

- The ES writer writes only messages which have field1; the HDFS writer writes nothing
- The index is named "foo", different from the sensor name, for all writers
- batchSize is 100 for the Elasticsearch writer

Thoughts?
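To make the fallback semantics above concrete, here is a minimal sketch in Java of how a writer-specific setting might resolve against the top-level defaults. The class and method names are hypothetical, not Metron's actual API; it assumes the sensor's indexing config has already been parsed into a Map.

import java.util.Collections;
import java.util.Map;

public class WriterConfigResolver {

  private final Map<String, Object> config; // parsed indexing config for one sensor

  public WriterConfigResolver(Map<String, Object> config) {
    this.config = config;
  }

  // Resolve a setting for a writer (e.g. "elasticsearch"): prefer the
  // writer-specific override, then the top-level value, then a hard default.
  @SuppressWarnings("unchecked")
  public Object get(String writer, String key, Object hardDefault) {
    Map<String, Object> writerConfigs =
        (Map<String, Object>) config.getOrDefault("writerConfig", Collections.emptyMap());
    Map<String, Object> specific =
        (Map<String, Object>) writerConfigs.getOrDefault(writer, Collections.emptyMap());
    if (specific.containsKey(key)) {
      return specific.get(key);
    }
    return config.getOrDefault(key, hardDefault);
  }
}

With the "writer-specific case without filters" config above, get("elasticsearch", "batchSize", 1) would return 100, while get("hdfs", "batchSize", 1) would fall through to the top-level value of 1.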
On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <[email protected]> wrote:

For larger installations you need to control what is indexed, so you don't
end up with a nasty Elasticsearch situation and so you can mine the data
later for reports and training ML models.

Thanks
Carolyn

On 1/13/17, 9:40 AM, "Casey Stella" <[email protected]> wrote:

OH that's a good idea!

On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <[email protected]> wrote:

I like the "Index Filtering" option based on the flexibility that it
provides. Should each output (HDFS, ES, etc.) have its own configuration
settings? For example, aren't things like batching handled separately for
HDFS versus Elasticsearch?

Something along the lines of...

{
  "hdfs" : {
    "when": "exists(field1)",
    "batchSize": 100
  },
  "elasticsearch" : {
    "when": "true",
    "batchSize": 1000,
    "index": "squid"
  }
}

On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <[email protected]> wrote:

Yeah, I tend to like the first option too. Any opposition to that from
anyone?

The points brought up are good ones, and I think it may be worth a broader
discussion of the requirements of indexing in a separate dev list thread.
Maybe a list of desires with coherent use cases justifying them, so we can
think about how this stuff should work and where the natural extension
points should be. After all, we need to toe the line between engineering
and overengineering for features nobody will want.

I'm not sure about the extensions to the standard fields. I'm torn between
the notion that we should have no standard fields and the notion that we
should have a boatload of standard fields (with most of them empty). I
exchange positions fairly regularly on that question. ;) It may be worth a
dev list discussion to lay out how you imagine an extension of standard
fields and how it might look as implemented in Metron.

Casey

On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <[email protected]> wrote:

I'll second my preference for the first option. I think the ability to use
Stellar filters to customize indexing would be a big win.

I'm glad Matt brought up the point about data lake and CEP. I think this is
a really important use case that we need to consider. Take a simple
example... If I have data coming in from 3 different firewall vendors and 2
different web proxy/URL filtering vendors, and I want to be able to analyze
that data set, I need the data to be indexed all together (likely in HDFS)
and to have a normalized schema such that IP address, URL, and user name
(to take a few) can be easily queried and aggregated. I can also envision
scenarios where I would want to index data based on attributes other than
sensor; business unit or subsidiary, for example.

I've been wanting to propose extending our 7 standard fields to include
things like URL and user. Is there community interest/support for moving in
that direction? If so, I'll start a new thread.

Thanks!

-Kyle

On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <[email protected]> wrote:

Ah, I see. If overriding the default index name allows using the same name
for multiple sensors, then the goal can be achieved.
Thanks,
--Matt

On 1/12/17, 3:30 PM, "Casey Stella" <[email protected]> wrote:

Oh, you could! Let's say you have a syslog parser with data from sources 1,
2, and 3. You'd end up with one Kafka queue with 3 parsers attached to that
queue, each picking apart the messages from sources 1, 2, and 3. They'd go
through separate enrichment and into the indexing topology. In the indexing
topology, you could specify the same index name, "syslog", and all of the
messages would go into the same index for CEP querying if so desired.

On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <[email protected]> wrote:

Syslog is hell on parsers – I know, I worked at LogLogic in a previous
life. It makes perfect sense to route different lines from syslog through
different appropriate parsers. But a lot of what the parsers do is identify
consistent subsets of metadata and annotate it – e.g., src_ip_addr, event
timestamps, etc. Once those metadata are annotated and available with
common field names, why doesn't it make sense to index the messages
together, for CEP querying? I think Splunk has illustrated this model.
On 1/12/17, 3:00 PM, "Casey Stella" <[email protected]> wrote:

Yeah, I mean, honestly, I think the approach that we've taken for sources
which aggregate different types of data is to provide filters at the parser
level and have multiple parser topologies (with different, possibly
mutually exclusive filters) running. This would be a completely separate
sensor. Imagine a syslog data source that aggregates and you want to pick
apart certain pieces of messages. This is why the initial thought and
architecture was one index per sensor.

On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <[email protected]> wrote:

I'm thinking that CEP (Complex Event Processing) is contrary to the idea of
silo-ing data per sensor.
Now it's true that some of those sensors are already aggregating data from
multiple sources, so maybe I'm wrong here.
But it just seems to me that the "data lake" insights come from being able
to make decisions over the whole mass of data rather than just vertical
slices of it.

On 1/12/17, 2:15 PM, "Casey Stella" <[email protected]> wrote:

Hey Matt,

Thanks for the comment!

1. At the moment, we only have one index name, the default of which is the
sensor name, but that's entirely up to the user. This is sensor specific,
so it'd be a separate config for each sensor. If we want to build multiple
indices per sensor, we'd have to think carefully about how to do that, and
it would be a bigger undertaking. I guess I can see the use, though
(redirect messages to one index vs. another based on a predicate for a
given sensor). Anyway, not where I was originally thinking that this
discussion would go, but it's an interesting point.

2. I hadn't thought through the implementation quite yet, but we don't
actually have a splitter bolt in that topology, just a spout that goes to
the elasticsearch writer and also to the hdfs writer.

On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <[email protected]> wrote:

Casey, good to have controls like this. Couple questions:

1. Regarding the "index" : "squid" name/value pair, is the index name
expected to always be a sensor name? Or is the given JSON structure
subordinate to a sensor name in ZooKeeper? Or can we build arbitrary
indexes with this new specification, independent of sensor? Should there
actually be a list of "indexes", i.e.

{ "indexes" : [
    { "index" : "name1",
      ...
    },
    { "index" : "name2",
      ...
    } ]
}

2. Would the filtering / writer selection logic take place in the indexing
topology splitter bolt? Seems like that would have the smallest impact on
the current implementation, no?

Sorry if these are already answered in PR-415, I haven't had time to review
that one yet.
Thanks,
--Matt
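For reference, the wiring Casey describes (one spout feeding both writers directly, with no splitter bolt) might look roughly like the following. This is a sketch assuming Storm 1.x package names; the component IDs are placeholders, not Metron's actual classes, and the concrete spout and bolts are passed in rather than invented here.

import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.TopologyBuilder;

public class IndexingWiringSketch {

  // One spout feeds both writer bolts directly; there is no splitter bolt.
  public static TopologyBuilder wire(IRichSpout spout,
                                     IRichBolt hdfsWriter,
                                     IRichBolt elasticsearchWriter) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("indexingSpout", spout, 1);
    // Both writers subscribe to the same stream, so any per-writer
    // filtering has to happen inside the writer, message by message.
    builder.setBolt("hdfsWriter", hdfsWriter, 1)
           .shuffleGrouping("indexingSpout");
    builder.setBolt("elasticsearchWriter", elasticsearchWriter, 1)
           .shuffleGrouping("indexingSpout");
    return builder;
  }
}

In this shape there is no splitter to hang the selection logic on, which is part of why a per-writer filter in the config is attractive.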
On 1/12/17, 12:55 PM, "Michael Miklavcic" <[email protected]> wrote:

I like the flexibility and expressibility of the first option with Stellar
filters.

M

On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <[email protected]> wrote:

As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>, we
will have decoupled the indexing configuration from the enrichment
configuration. As an immediate follow-up to that, I'd like to provide the
ability to turn writers off and on via the configs. I'd like to get some
community feedback on how the functionality should work, if y'all are
amenable. :)

As of now, we have 3 possible writers which can be used in the indexing
topology:

- Solr
- Elasticsearch
- HDFS

HDFS is always used; Elasticsearch or Solr is used depending on how you
start the indexing topology.

A couple of proposals come to mind immediately:

*Index Filtering*

You would be able to specify a filter as defined by a Stellar statement
(likely a reuse of the StellarFilter that exists in the parsers) which
would allow you to indicate on a message-by-message basis whether or not to
write the message.

The semantics of this would be as follows:

- Default (i.e. unspecified) is to pass everything through (hence backwards
  compatible with the current default config).
- Messages for which the associated Stellar statement evaluates to true for
  the writer type will be written; otherwise they will not.

Sample indexing config which would write out no messages to HDFS and write
out to Elasticsearch only messages containing a field called "field1":

{
  "index" : "squid"
 ,"batchSize" : 100
 ,"filters" : {
    "HDFS" : "false"
   ,"ES" : "exists(field1)"
  }
}
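A minimal sketch of these proposed semantics, with the Stellar evaluation stubbed out behind a hypothetical interface (the real implementation would presumably reuse the parsers' StellarFilter rather than anything defined here):

import java.util.Map;

public class IndexFilterDecision {

  // Stand-in for Stellar evaluation; not a real Metron interface.
  public interface StellarPredicate {
    boolean evaluate(String expression, Map<String, Object> message);
  }

  private final Map<String, String> filters; // writer name -> Stellar expression
  private final StellarPredicate stellar;

  public IndexFilterDecision(Map<String, String> filters, StellarPredicate stellar) {
    this.filters = filters;
    this.stellar = stellar;
  }

  // A writer writes a message iff its filter is unspecified or evaluates to true.
  public boolean shouldWrite(String writer, Map<String, Object> message) {
    String expression = (filters == null) ? null : filters.get(writer);
    if (expression == null) {
      return true; // unspecified: pass everything through (backwards compatible)
    }
    return stellar.evaluate(expression, message);
  }
}

With the sample config above, shouldWrite("HDFS", msg) is always false, and shouldWrite("ES", msg) is true exactly when the message contains field1.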
*Index On/Off Switch*

A simpler solution would be to just provide a list of writers which should
write messages. The semantics would be as follows:

- If the list is unspecified, then the default is to write all messages for
  every writer in the indexing topology.
- If the list is specified, then a writer will write all messages if and
  only if it is named in the list.

Sample indexing config which turns off HDFS and keeps on Elasticsearch:

{
  "index" : "squid"
 ,"batchSize" : 100
 ,"writers" : [ "ES" ]
}

Thanks in advance for the feedback! Also, if you have any other, better
ideas than the ones presented here, let me know too.

Best,

Casey
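The on/off semantics reduce to a membership check; a sketch (hypothetical helper, not Metron code):

import java.util.List;

public final class WriterSwitch {

  // A writer writes all messages iff the "writers" list is absent or names it.
  public static boolean enabled(List<String> writers, String writerName) {
    return writers == null || writers.contains(writerName);
  }
}

With the sample config above, enabled(Arrays.asList("ES"), "ES") is true and enabled(Arrays.asList("ES"), "HDFS") is false; a missing list leaves every writer on.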
