https://issues.apache.org/jira/browse/METRON-2112 has been committed to master.

On Tue, May 14, 2019 at 2:38 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> Parser chaining uses the original_string populated by the origin routing
> parser unless you explicitly change it.
>
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/ParserChaining.md#example
>
> For example, the logs here -
> http://www.monitorware.com/en/logsamples/cisco-pix-61(2).php
> Would result in a sample enveloped message with:
> {
>   "original_string" : "Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
>   "payload" : "Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
>   etc.
> }
>
> On Fri, May 10, 2019 at 6:11 PM Otto Fowler <ottobackwa...@gmail.com> wrote:
>
>> The original string would be the string specified as the message body, thus
>> each message in the chain produced would just be the bytes passed in, from
>> a specific field in the incoming message.
>>
>> On May 10, 2019 at 19:55:28, Simon Elliston Ball (si...@simonellistonball.com) wrote:
>>
>> My understanding is that chaining preserves (correctly to my mind) the
>> original original string.
>>
>> In other words: unless the message strategy is raw message, the original
>> string is just passed through. Original string therefore comes from outside
>> Metron, and is preserved throughout Metron processes, allowing for
>> recreation of original form for forensics and evidentiary purposes.
>>
>> Simon
>>
>> > On 11 May 2019, at 00:10, Otto Fowler <ottobackwa...@gmail.com> wrote:
>> >
>> > What about parser chaining? Should the original string be from kafka, or
>> > the last parsed?
>> >
>> > On May 10, 2019 at 19:03:39, Simon Elliston Ball (si...@simonellistonball.com) wrote:
>> >
>> > The only scenario I can think of where a parser might treat original string
>> > differently, or even need to know about it would be different encoding
>> > locales. For example, if the string were to be encoded in a locale specific
>> > to the device and choose the encoding based on metadata or parsed content,
>> > then that could merit pushing it down. The other edge might be when you
>> > have binary data that does not go down to an original string well (eg a
>> > netflow parser).
>> >
>> > That said, that’s a highly unlikely edge case that could be handled by
>> > workarounds.
>> >
>> > I’m a definitely +1 on Nick’s idea of pulling original string up to the
>> > runner. Right now we’re pretty inconsistent in how it’s done, so that would
>> > help.
>> >
>> > Simon
>> >
>> > Sent from my iPhone
>> >
>> > On 10 May 2019, at 23:10, Nick Allen <n...@nickallen.org> wrote:
>> >
>> >>> I suppose we could always allow this to be overridden, also.
>> >>
>> >> I like an on/off switch for the "original string" functionality. If on,
>> >> you get the original string in pristine condition. If off, no original
>> >> string is appended for those who care more about storage space.
>> >>
>> >> I can't think of a reason where one kind of parser would have a different
>> >> original string mechanism than the others. If something like that does
>> >> come up, the parser can create its own original string by just naming it
>> >> something different and then turning "off" the switch that you described.
>> >>
>> >> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>> >>
>> >>> I think that's an excellent idea. Can anyone think of a situation where we
>> >>> wouldn't want to add this the same way for all parsers? I suppose we could
>> >>> always allow this to be overridden, also.
>> >>>
>> >>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen <n...@nickallen.org> wrote:
>> >>>>
>> >>>> I think maintaining the integrity of the original data makes a lot of sense
>> >>>> for any parser. And ideally the original string should be what came out of
>> >>>> Kafka with only the minimally necessary processing.
>> >>>>
>> >>>> With that in mind, we could solve this one level up. Instead of relying on
>> >>>> each parser to do this right, we could have the ParserRunner and
>> >>>> specifically the ParserRunnerImpl [1] handle this round-abouts here
>> >>>> <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
>> >>>> [1].
>> >>>> It has the raw message data and can append the original string to each
>> >>>> message it gets back from the parsers.
>> >>>>
>> >>>> Just another approach to consider.
>> >>>>
>> >>>> --
>> >>>> [1]
>> >>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>> >>>>
>> >>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com> wrote:
>> >>>>
>> >>>>> +1
>> >>>>>
>> >>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (michael.miklav...@gmail.com) wrote:
>> >>>>>
>> >>>>> When adding the capability for parsing messages in the JsonMapParser using
>> >>>>> JSON Path expressions the original behavior for managing original strings
>> >>>>> was changed.
>> >>>>>
>> >>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>> >>>>>
>> >>>>> A couple issues have been reported recently regarding this change:
>> >>>>>
>> >>>>> 1. We're losing the actual original string, which is a legal issue for
>> >>>>>    data lineage for some customers
>> >>>>> 2. Even for the degenerate case with no sub-messages created, the
>> >>>>>    original sub-message string is modified because of the
>> >>>>>    serialization/deserialization process with Jackson/JsonSimple. The fields
>> >>>>>    are reordered bc the content is normalized.
>> >>>>>
>> >>>>> I looked at options for preserving formatting, but am unable to find a
>> >>>>> method that allows you to both parse, then query the original message and
>> >>>>> then also obtain the raw string matches without the normalizing from
>> >>>>> ser/deserialization.
>> >>>>>
>> >>>>> I'd like to propose that we add a configuration option for this parser that
>> >>>>> allows the user to toggle which approach they'd like to use. My personal
>> >>>>> preference based on feedback I've gotten from multiple customers is that
>> >>>>> the default should be the older approach which takes the raw original
>> >>>>> string. It's arguable that this change in contract is a regression, so the
>> >>>>> default should be the earlier behavior. Any sub-messages would then have a
>> >>>>> copy of that raw original string, not just the sub-message original string.
>> >>>>> Enabling the flag would enable the current sub-message original string
>> >>>>> functionality.
>> >>>>>
>> >>>>> Mike
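
For anyone skimming the thread, here is a rough Java sketch of the behavior discussed above: the raw bytes are turned into an original string once, outside the parser, and a flag decides whether a sub-message keeps the original string its parser produced. This is only an illustration under assumptions, not the code committed for METRON-2112. The class name `OriginalStringSketch`, the method `attachOriginalString`, and the flag `keepParserOriginalString` are all made up for this example, and the UTF-8 decoding is an assumption as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Metron code base.
final class OriginalStringSketch {

    static final String ORIGINAL_STRING = "original_string";

    /**
     * Returns copies of the parsed messages with an original_string attached.
     *
     * keepParserOriginalString == false (the proposed default): every message
     * gets the untouched raw input, overwriting whatever the parser set.
     * keepParserOriginalString == true: a parser-provided original_string is
     * kept, and the raw input is used only when the parser set none.
     */
    static List<Map<String, Object>> attachOriginalString(
            byte[] rawMessage,
            List<Map<String, Object>> parsed,
            boolean keepParserOriginalString) {
        // Assumption: the raw message is UTF-8 text.
        String raw = new String(rawMessage, StandardCharsets.UTF_8);
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> message : parsed) {
            // Copy so the caller's maps are not mutated.
            Map<String, Object> copy = new LinkedHashMap<>(message);
            if (!keepParserOriginalString || !copy.containsKey(ORIGINAL_STRING)) {
                copy.put(ORIGINAL_STRING, raw);
            }
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = "{\"b\": 2, \"a\": 1}".getBytes(StandardCharsets.UTF_8);

        // Stand-in for a sub-message whose original string was normalized
        // (reordered, whitespace dropped) by a serialize/deserialize round trip.
        Map<String, Object> sub = new LinkedHashMap<>();
        sub.put("a", 1);
        sub.put(ORIGINAL_STRING, "{\"a\":1}");
        List<Map<String, Object>> parsed = Collections.singletonList(sub);

        // Flag off: prints the raw input, {"b": 2, "a": 1}
        System.out.println(
            attachOriginalString(raw, parsed, false).get(0).get(ORIGINAL_STRING));
        // Flag on: prints the parser's sub-message string, {"a":1}
        System.out.println(
            attachOriginalString(raw, parsed, true).get(0).get(ORIGINAL_STRING));
    }
}
```

With the flag off, the lineage concern in point 1 goes away because the stored string is exactly what came in. With the flag on, you get the current per-sub-message behavior, including the normalization described in point 2.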