https://issues.apache.org/jira/browse/METRON-2112 has been committed to
master.

On Tue, May 14, 2019 at 2:38 PM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Parser chaining uses the original_string populated by the origin routing
> parser unless you explicitly change it.
>
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/ParserChaining.md#example
>
> For example, the logs here:
> http://www.monitorware.com/en/logsamples/cisco-pix-61(2).php
> would result in a sample enveloped message with:
> {
>   "original_string" : "Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
>   "payload" : "Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
>   etc.
> }
>
>
> On Fri, May 10, 2019 at 6:11 PM Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
>> The original string would be the string specified as the message body;
>> thus each message produced in the chain would just be the bytes passed
>> in from a specific field in the incoming message.
>>
>>
>>
>> On May 10, 2019 at 19:55:28, Simon Elliston Ball (
>> si...@simonellistonball.com) wrote:
>>
>> My understanding is that chaining preserves (correctly to my mind) the
>> original original string.
>>
>> In other words: unless the message strategy is raw message, the original
>> string is just passed through. Original string therefore comes from
>> outside Metron, and is preserved throughout Metron processes, allowing for
>> recreation of original form for forensics and evidentiary purposes.
>>
>> Simon
>>
>> > On 11 May 2019, at 00:10, Otto Fowler <ottobackwa...@gmail.com> wrote:
>> >
>> > What about parser chaining? Should the original string be from Kafka, or
>> > the last parsed?
>> >
>> >
>> > On May 10, 2019 at 19:03:39, Simon Elliston Ball (
>> > si...@simonellistonball.com) wrote:
>> >
>> > The only scenario I can think of where a parser might treat original
>> > string differently, or even need to know about it, would be different
>> > encoding locales. For example, if the string were encoded in a locale
>> > specific to the device, and the encoding had to be chosen based on
>> > metadata or parsed content, then that could merit pushing it down. The
>> > other edge might be when you have binary data that does not go down to
>> > an original string well (e.g. a netflow parser).
>> >
>> > That said, that’s a highly unlikely edge case that could be handled by
>> > workarounds.
>> >
>> > I’m definitely +1 on Nick’s idea of pulling original string up to the
>> > runner. Right now we’re pretty inconsistent in how it’s done, so that
>> > would help.
>> >
>> > Simon
>> >
>> > On 10 May 2019, at 23:10, Nick Allen <n...@nickallen.org> wrote:
>> >
>> >>> I suppose we could always allow this to be overridden, also.
>> >>
>> >> I like an on/off switch for the "original string" functionality. If on,
>> >> you get the original string in pristine condition. If off, no original
>> >> string is appended for those who care more about storage space.
>> >>
>> >> I can't think of a reason why one kind of parser would have a different
>> >> original string mechanism than the others. If something like that does
>> >> come up, the parser can create its own original string by just naming it
>> >> something different and then turning "off" the switch that you described.
>> >>
>> >>
>> >>
>> >> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
>> >> michael.miklav...@gmail.com> wrote:
>> >>
>> >>> I think that's an excellent idea. Can anyone think of a situation where
>> >>> we wouldn't want to add this the same way for all parsers? I suppose we
>> >>> could always allow this to be overridden, also.
>> >>>
>> >>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen <n...@nickallen.org> wrote:
>> >>>>
>> >>>> I think maintaining the integrity of the original data makes a lot of
>> >>>> sense for any parser. And ideally the original string should be what
>> >>>> came out of Kafka with only the minimally necessary processing.
>> >>>>
>> >>>> With that in mind, we could solve this one level up. Instead of relying
>> >>>> on each parser to do this right, we could have the ParserRunner, and
>> >>>> specifically the ParserRunnerImpl, handle this roughly here [1]. It has
>> >>>> the raw message data and can append the original string to each message
>> >>>> it gets back from the parsers.
>> >>>>
>> >>>> Just another approach to consider.
>> >>>>
>> >>>> --
>> >>>> [1]
>> >>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>> >>>>
>> >>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com> wrote:
>> >>>>
>> >>>>> +1
>> >>>>>
>> >>>>>
>> >>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>> >>>>> michael.miklav...@gmail.com)
>> >>>>> wrote:
>> >>>>>
>> >>>>> When adding the capability for parsing messages in the JSONMapParser
>> >>>>> using JSON Path expressions, the original behavior for managing original
>> >>>>> strings was changed.
>> >>>>>
>> >>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>> >>>>>
>> >>>>> A couple of issues have been reported recently regarding this change:
>> >>>>>
>> >>>>> 1. We're losing the actual original string, which is a legal issue for
>> >>>>> data lineage for some customers.
>> >>>>> 2. Even for the degenerate case with no sub-messages created, the
>> >>>>> original sub-message string is modified because of the
>> >>>>> serialization/deserialization process with Jackson/JsonSimple. The
>> >>>>> fields are reordered because the content is normalized.
>> >>>>>
>> >>>>> I looked at options for preserving formatting, but am unable to find a
>> >>>>> method that allows you to parse and then query the original message
>> >>>>> while also obtaining the raw string matches without the normalization
>> >>>>> from ser/deserialization.
>> >>>>>
>> >>>>> I'd like to propose that we add a configuration option for this parser
>> >>>>> that allows the user to toggle which approach they'd like to use. My
>> >>>>> personal preference, based on feedback I've gotten from multiple
>> >>>>> customers, is that the default should be the older approach, which takes
>> >>>>> the raw original string. It's arguable that this change in contract is a
>> >>>>> regression, so the default should be the earlier behavior. Any
>> >>>>> sub-messages would then have a copy of that raw original string, not just
>> >>>>> the sub-message original string. Enabling the flag would enable the
>> >>>>> current sub-message original string functionality.
>> >>>>>
>> >>>>> Mike
>> >>>>>
>> >>>>
>> >>>
>>
>
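
For anyone skimming the thread, here is a rough sketch of the "solve it one level up" idea discussed above: the runner, not each parser, stamps the untouched raw input onto every message, with an on/off switch. This is not the METRON-2112 implementation and not Metron's API. The names run_parser, attach_original and split_json_array are made up for illustration, and Python is used only to keep the sketch short. Only the field name original_string comes from the sample message in the thread.

```python
import json
from typing import Callable, Dict, List

# Field name taken from the sample enveloped message in the thread.
ORIGINAL_STRING = "original_string"


def run_parser(raw: bytes,
               parse: Callable[[bytes], List[Dict]],
               attach_original: bool = True,
               encoding: str = "utf-8") -> List[Dict]:
    """Hypothetical runner step: parse, then stamp each message with the raw input."""
    messages = parse(raw)
    if attach_original:
        # Decode the bytes exactly as received; no re-serialization, so
        # whitespace and field order of the input are preserved.
        original = raw.decode(encoding)
        for msg in messages:
            # Assumption: setdefault keeps an original_string that was already
            # populated upstream (e.g. by a routing parser in a chain).
            msg.setdefault(ORIGINAL_STRING, original)
    return messages


def split_json_array(raw: bytes) -> List[Dict]:
    """Toy parser (hypothetical): one message per element of the "events" array."""
    return [dict(event) for event in json.loads(raw)["events"]]


# Irregular spacing and field order on purpose: the stamped string must match
# the input byte for byte, which a parse/serialize round trip would not guarantee.
raw = b'{"events": [{"b": 2,   "a": 1}, {"c": 3}]}'
with_original = run_parser(raw, split_json_array)
without_original = run_parser(raw, split_json_array, attach_original=False)
```

With the switch on, both sub-messages carry a copy of the full raw input. With it off, no original_string is added at all, which is the storage-saving case Nick described.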
