It seems like parser chaining is becoming a hot topic on the repo too, with https://github.com/apache/metron/pull/969#partial-pull-merging

I would like to discuss the option of configuring parsers to operate on the output of other parsers, and how we might architect it. This may also give us the opportunity to be more efficient in scenarios where people have large numbers of sources, and so use up a lot of slots for lower volume parsers, for example.

I have a bunch of ideas around this, but am more keen to hear what everyone else thinks at this stage. How should we go about fixing parser config so that it’s clearer (removing the need for people to reinvent the parser wheel, as we’ve seen in a few places) and also more concise and powerful (consolidating the parsing of transports such as syslog and content such as application logs, or types of device logs)?

If this can lead to a more efficient way of handling both the syslog problem and the kind of problem that leads to switching between grok statements in something like our ASA parser, then all the better.

I suspect that there might also be a case for multi-level chaining here, since some things are embedded in multiple transports, or might have complex fields that want ‘sub-parsing’.

Of course one of the key values of Metron is its speed, so formalising some of the microbenchmarking approaches a few of us have been working on might help here too. I’ve got a few bits of micro-benching infrastructure around CEF and ASA, and I believe there’s also been some work to load and perf test things like enrichment that might be leveraged.

Thoughts on a dev board?

Simon

> On 20 Mar 2018, at 21:47, Otto Fowler <ottobackwa...@gmail.com> wrote:
>
> I entered METRON-1453 <https://issues.apache.org/jira/browse/METRON-1453> a
> little while ago while working on PR#579
> <https://github.com/apache/metron/pull/579>.
>
> "We have several parsers now, with many imaginable that are based on
> syslog, where the format is SYSLOG HEADER MESSAGE.
>
> With message being in a different format. It would be great if we had a way
> to generically handle syslog headers, such that ANY parser data could come
> over syslog.
>
> Either you could have a custom parser, or configure CSV or JSON such that
> they could be the payload, such that you can handle JSON over syslog by
> configuration only."
>
> The idea would be that the parser bolt would use the configuration to
> trigger parsing the incoming message as syslog formatted, pass the
> message part to the parser, and put the syslog parts in the message(s)
> after parsing.
>
> As part of this I did some work on parsing syslog, using both grok and a
> DSL that I did from the spec: https://github.com/ottobackwards/grok-v-antlr
>
> The DSL is slower, but grok cannot handle multiple structured data entries,
> and the DSL can. I’m not good enough at grok to fix it so that it is
> functionally equivalent. Another option would be to write a third parser…
> It is also possible that the DSL could be improved for speed of course.
>
> Thoughts?
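
To make the chaining idea in the quoted message a little more concrete, here is a minimal sketch in Python rather than Metron's Java, purely for illustration. It is not Metron code: the function names, the `payloadParser` and `parserConfig` keys, the `syslog_` field prefix and the simplified RFC 3164-style header regex are all assumptions made for this example. It shows one possible flow: strip the syslog header, hand the remaining payload to whichever parser the configuration names, then add the syslog fields to the parsed message.

```python
import csv
import json
import re

# Simplified RFC 3164-style header: "<PRI>Mmm dd hh:mm:ss host payload".
# Assumption for illustration only; real syslog (and RFC 5424 structured
# data) needs a fuller grammar than this regex.
SYSLOG_HEADER = re.compile(
    r"^<(?P<priority>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<payload>.*)$",
    re.DOTALL,
)


def parse_syslog(raw):
    """Split a raw line into (syslog header fields, payload string)."""
    match = SYSLOG_HEADER.match(raw)
    if match is None:
        raise ValueError("not a syslog-framed message")
    priority = int(match.group("priority"))
    header = {
        # PRI encodes facility * 8 + severity.
        "syslog_facility": priority // 8,
        "syslog_severity": priority % 8,
        "syslog_timestamp": match.group("timestamp"),
        "syslog_host": match.group("host"),
    }
    return header, match.group("payload")


def parse_json(payload):
    """Hypothetical payload parser: the payload is a JSON object."""
    return json.loads(payload)


def parse_csv(payload, columns):
    """Hypothetical payload parser: the payload is a single CSV row."""
    row = next(csv.reader([payload]))
    return dict(zip(columns, row))


# Hypothetical registry mapping a config value to a payload parser.
PAYLOAD_PARSERS = {"json": parse_json, "csv": parse_csv}


def parse_chained(raw, config):
    """Parse the syslog header, then chain into the configured parser."""
    header, payload = parse_syslog(raw)
    parser = PAYLOAD_PARSERS[config["payloadParser"]]
    message = parser(payload, **config.get("parserConfig", {}))
    # Put the syslog parts into the message after payload parsing.
    message.update(header)
    return message
```

With this shape, "JSON over syslog" versus "CSV over syslog" is a configuration change only, for example `{"payloadParser": "csv", "parserConfig": {"columns": ["user", "action", "result"]}}`. Multi-level chaining would amount to letting a payload parser itself be another chained parser.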