I'll second my preference for the first option. I think the ability to use Stellar filters to customize indexing would be a big win.
I'm glad Matt brought up the point about data lake and CEP. I think this is a really important use case that we need to consider. Take a simple example: if I have data coming in from 3 different firewall vendors and 2 different web proxy/URL filtering vendors and I want to be able to analyze that data set, I need the data to be indexed all together (likely in HDFS) and to have a normalized schema such that IP address, URL, and user name (to take a few) can be easily queried and aggregated. I can also envision scenarios where I would want to index data based on attributes other than sensor, such as business unit or subsidiary.

I've been wanting to propose extending our 7 standard fields to include things like URL and user. Is there community interest/support for moving in that direction? If so, I'll start a new thread.

Thanks!

-Kyle

On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <ma...@apache.org> wrote:

> Ah, I see. If overriding the default index name allows using the same
> name for multiple sensors, then the goal can be achieved.
> Thanks,
> --Matt
>
>
> On 1/12/17, 3:30 PM, "Casey Stella" <ceste...@gmail.com> wrote:
>
>     Oh, you could! Let's say you have a syslog parser with data from
>     sources 1, 2 and 3. You'd end up with one kafka queue with 3 parsers
>     attached to that queue, each picking apart the messages from source
>     1, 2 and 3. They'd go through separate enrichment and into the
>     indexing topology. In the indexing topology, you could specify the
>     same index name "syslog" and all of the messages go into the same
>     index for CEP querying if so desired.
>
> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <ma...@apache.org> wrote:
>
> > Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> > life. It makes perfect sense to route different lines from syslog
> > through different appropriate parsers.
> > But a lot of what the parsers do is identify consistent subsets of
> > metadata and annotate it – eg, src_ip_addr, event timestamps, etc.
> > Once those metadata are annotated and available with common field
> > names, why doesn’t it make sense to index the messages together, for
> > CEP querying? I think Splunk has illustrated this model.
> >
> > On 1/12/17, 3:00 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> >
> >     yeah, I mean, honestly, I think the approach that we've taken for
> >     sources which aggregate different types of data is to provide
> >     filters at the parser level and have multiple parser topologies
> >     (with different, possibly mutually exclusive filters) running.
> >     This would be a completely separate sensor. Imagine a syslog data
> >     source that aggregates and you want to pick apart certain pieces
> >     of messages. This is why the initial thought and architecture was
> >     one index per sensor.
> >
> > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <ma...@apache.org> wrote:
> >
> > > I’m thinking that CEP (Complex Event Processing) is contrary to the
> > > idea of silo-ing data per sensor.
> > > Now it’s true that some of those sensors are already aggregating
> > > data from multiple sources, so maybe I’m wrong here.
> > > But it just seems to me that the “data lake” insights come from
> > > being able to make decisions over the whole mass of data rather than
> > > just vertical slices of it.
> > >
> > > On 1/12/17, 2:15 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> > >
> > >     Hey Matt,
> > >
> > >     Thanks for the comment!
> > >     1. At the moment, we only have one index name, the default of
> > >     which is the sensor name but that's entirely up to the user. This
> > >     is sensor specific, so it'd be a separate config for each sensor.
> > >     If we want to build multiple indices per sensor, we'd have to
> > >     think carefully about how to do that, and it would be a bigger
> > >     undertaking. I guess I can see the use, though (redirect messages
> > >     to one index vs another based on a predicate for a given sensor).
> > >     Anyway, not where I was originally thinking that this discussion
> > >     would go, but it's an interesting point.
> > >
> > >     2. I hadn't thought through the implementation quite yet, but we
> > >     don't actually have a splitter bolt in that topology, just a
> > >     spout that goes to the elasticsearch writer and also to the hdfs
> > >     writer.
> > >
> > > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <ma...@apache.org> wrote:
> > >
> > > > Casey, good to have controls like this. Couple questions:
> > > >
> > > > 1. Regarding the “index” : “squid” name/value pair, is the index
> > > > name expected to always be a sensor name? Or is the given json
> > > > structure subordinate to a sensor name in zookeeper? Or can we
> > > > build arbitrary indexes with this new specification, independent of
> > > > sensor? Should there actually be a list of “indexes”, ie
> > > > { “indexes” : [
> > > >     {“index” : “name1”,
> > > >      …
> > > >     },
> > > >     {“index” : “name2”,
> > > >      …
> > > >     } ]
> > > > }
> > > >
> > > > 2. Would the filtering / writer selection logic take place in the
> > > > indexing topology splitter bolt? Seems like that would have the
> > > > smallest impact on current implementation, no?
> > > >
> > > > Sorry if these are already answered in PR-415, I haven’t had time
> > > > to review that one yet.
> > > > Thanks,
> > > > --Matt
> > > >
> > > >
> > > > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklav...@gmail.com> wrote:
> > > >
> > > >     I like the flexibility and expressibility of the first option
> > > >     with Stellar filters.
> > > >
> > > >     M
> > > >
> > > > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > >
> > > > > As of METRON-652
> > > > > <https://github.com/apache/incubator-metron/pull/415>, we will have
> > > > > decoupled the indexing configuration from the enrichment
> > > > > configuration. As an immediate follow-up to that, I'd like to
> > > > > provide the ability to turn off and on writers via the configs. I'd
> > > > > like to get some community feedback on how the functionality should
> > > > > work, if y'all are amenable. :)
> > > > >
> > > > >
> > > > > As of now, we have 3 possible writers which can be used in the
> > > > > indexing topology:
> > > > >
> > > > >    - Solr
> > > > >    - Elasticsearch
> > > > >    - HDFS
> > > > >
> > > > > HDFS is always used; Elasticsearch or Solr is used depending on how
> > > > > you start the indexing topology.
> > > > >
> > > > > A couple of proposals come to mind immediately:
> > > > >
> > > > > *Index Filtering*
> > > > >
> > > > > You would be able to specify a filter as defined by a Stellar
> > > > > statement (likely a reuse of the StellarFilter that exists in the
> > > > > Parsers) which would allow you to indicate on a message-by-message
> > > > > basis whether or not to write the message.
> > > > >
> > > > > The semantics of this would be as follows:
> > > > >
> > > > >    - Default (i.e. unspecified) is to pass everything through (hence
> > > > >    backwards compatible with the current default config).
> > > > >    - Messages for which the associated Stellar statement evaluates
> > > > >    to true for the writer type will be written, otherwise not.
> > > > >
> > > > >
> > > > > Sample indexing config which would write out no messages to HDFS
> > > > > and write out to Elasticsearch only messages containing a field
> > > > > called "field1":
> > > > > {
> > > > >    "index" : "squid"
> > > > >   ,"batchSize" : 100
> > > > >   ,"filters" : {
> > > > >        "HDFS" : "false"
> > > > >       ,"ES" : "exists(field1)"
> > > > >                }
> > > > > }
> > > > >
> > > > > *Index On/Off Switch*
> > > > >
> > > > > A simpler solution would be to just provide a list of writers to
> > > > > write messages. The semantics would be as follows:
> > > > >
> > > > >    - If the list is unspecified, then the default is to write all
> > > > >    messages for every writer in the indexing topology
> > > > >    - If the list is specified, then a writer will write all messages
> > > > >    if and only if it is named in the list.
> > > > >
> > > > > Sample indexing config which turns off HDFS and keeps on
> > > > > Elasticsearch:
> > > > > {
> > > > >    "index" : "squid"
> > > > >   ,"batchSize" : 100
> > > > >   ,"writers" : [ "ES" ]
> > > > > }
> > > > >
> > > > > Thanks in advance for the feedback! Also, if you have any other,
> > > > > better ideas than the ones presented here, let me know too.
> > > > >
> > > > > Best,
> > > > >
> > > > > Casey
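For anyone comparing the two proposals quoted above, here is a rough Python sketch of how their semantics differ. This is not Metron code and does not use Stellar: evaluate_filter is a hypothetical stand-in that only understands "true", "false" and "exists(field)", and the function names and the AVAILABLE writer list are assumptions made for illustration. The two configs are the samples from the original proposal.

```python
import re

# Hypothetical stand-in for a Stellar evaluator (assumption, not Metron's API).
# It only understands "true", "false" and "exists(<field>)".
_EXISTS = re.compile(r"^exists\((\w+)\)$")


def evaluate_filter(expression, message):
    """Return True if the message passes the (tiny subset of a) filter."""
    expr = expression.strip()
    if expr == "true":
        return True
    if expr == "false":
        return False
    match = _EXISTS.match(expr)
    if match:
        return match.group(1) in message
    raise ValueError(f"unsupported filter expression: {expression!r}")


def writers_for_filters(config, message, available_writers):
    """Index Filtering: a writer with no filter passes everything through."""
    filters = config.get("filters", {})
    return [
        writer
        for writer in available_writers
        if evaluate_filter(filters.get(writer, "true"), message)
    ]


def writers_for_switch(config, available_writers):
    """Index On/Off Switch: no list means every writer stays on."""
    if "writers" not in config:
        return list(available_writers)
    enabled = set(config["writers"])
    return [writer for writer in available_writers if writer in enabled]


# Writers assumed to be running in the indexing topology for this sketch.
AVAILABLE = ["HDFS", "ES"]

# The two sample configs from the proposal.
filter_config = {
    "index": "squid",
    "batchSize": 100,
    "filters": {"HDFS": "false", "ES": "exists(field1)"},
}
switch_config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}

if __name__ == "__main__":
    print(writers_for_filters(filter_config, {"field1": "x"}, AVAILABLE))
    print(writers_for_filters(filter_config, {"other": "y"}, AVAILABLE))
    print(writers_for_switch(switch_config, AVAILABLE))
```

The difference that matters for this thread is that the filter option decides per message and per writer, while the switch decides once per writer, so only the first could route a subset of a sensor's messages to a given index.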