Re: [DISCUSS] JsonMapParser original string functionality

2019-05-30 Thread Michael Miklavcic
https://issues.apache.org/jira/browse/METRON-2112 has been committed to
master.

On Tue, May 14, 2019 at 2:38 PM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Parser chaining uses the original_string populated by the origin routing
> parser unless you explicitly change it.
>
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/ParserChaining.md#example
>
> For example, the logs here -
> http://www.monitorware.com/en/logsamples/cisco-pix-61(2).php
> Would result in a sample enveloped message with:
> {
> "original_string" : "Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
> "payload" : "Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
> etc.
> }

Re: [DISCUSS] JsonMapParser original string functionality

2019-05-14 Thread Michael Miklavcic
Parser chaining uses the original_string populated by the origin routing
parser unless you explicitly change it.
https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/ParserChaining.md#example

For example, the logs here -
http://www.monitorware.com/en/logsamples/cisco-pix-61(2).php
Would result in a sample enveloped message with:
{
"original_string" : "Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
"payload" : "Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53",
etc.
}
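
As a rough illustration of that pass-through behavior, here is a minimal sketch. It is not Metron's code: the class and method names are hypothetical, and the "protocol" field is made up for the example. Only the field names original_string and payload come from the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of parser chaining; none of these names are Metron APIs.
class ChainingSketch {

  // Stand-in for the routing (envelope) parser: keeps the raw line as
  // original_string and puts the text after the "%PIX-...: " tag in payload.
  // Assumes the line contains such a tag followed by ": ".
  static Map<String, Object> route(String rawLine) {
    Map<String, Object> envelope = new LinkedHashMap<>();
    envelope.put("original_string", rawLine);
    int tagStart = rawLine.indexOf('%');
    int tagEnd = rawLine.indexOf(": ", tagStart);
    envelope.put("payload", rawLine.substring(tagEnd + 2));
    return envelope;
  }

  // Stand-in for the downstream parser: it parses only the payload and passes
  // the envelope's original_string through unchanged.
  static Map<String, Object> parsePayload(Map<String, Object> envelope) {
    Map<String, Object> message = new LinkedHashMap<>();
    message.put("original_string", envelope.get("original_string"));
    String payload = (String) envelope.get("payload");
    message.put("protocol", payload.contains("UDP") ? "UDP" : "unknown");
    return message;
  }
}
```

Calling route on the PIX line and then parsePayload on the result yields a message whose original_string is still the full PIX line, even though the second parser only ever looked at the payload.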


On Fri, May 10, 2019 at 6:11 PM Otto Fowler  wrote:

> The original string would be the string specified as the message body, thus
> each message in the chain produced would just be the bytes passed in, from
> a specific field in the incoming message.

Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Otto Fowler
The original string would be the string specified as the message body, thus
each message in the chain produced would just be the bytes passed in, from
a specific field in the incoming message.



On May 10, 2019 at 19:55:28, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

My understanding is that chaining preserves (correctly to my mind) the
original original string.

In other words: unless the message strategy is raw message, the original
string is just passed through. Original string therefore comes from outside
Metron, and is preserved throughout Metron processes, allowing for
recreation of original form for forensics and evidentiary purposes.

Simon


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Simon Elliston Ball
My understanding is that chaining preserves (correctly to my mind) the original 
original string.

In other words: unless the message strategy is raw message, the original string 
is just passed through. Original string therefore comes from outside Metron, 
and is preserved throughout Metron processes, allowing for recreation of 
original form for forensics and evidentiary purposes.

Simon

> On 11 May 2019, at 00:10, Otto Fowler  wrote:
> 
> What about parser chaining?   Should the original string be from kafka, or
> the last parsed?

Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Otto Fowler
What about parser chaining?   Should the original string be from kafka, or
the last parsed?


On May 10, 2019 at 19:03:39, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

The only scenario I can think of where a parser might treat original string
differently, or even need to know about it, would be different encoding
locales. For example, if the string were encoded in a locale specific
to the device, and the parser had to choose the encoding based on metadata
or parsed content, then that could merit pushing it down. The other edge
might be when you have binary data that does not map to an original string
well (e.g. a netflow parser).

That said, that’s a highly unlikely edge case that could be handled by
workarounds.

I’m definitely +1 on Nick’s idea of pulling original string up to the
runner. Right now we’re pretty inconsistent in how it’s done, so that would
help.

Simon

Sent from my iPhone


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Nick Allen
>  I suppose we could always allow this to be overridden, also.

I like an on/off switch for the "original string" functionality.  If on,
you get the original string in pristine condition.  If off, no original
string is appended for those who care more about storage space.

I can't think of a reason where one kind of parser would have a different
original string mechanism than the others.  If something like that does
come up, the parser can create its own original string by just naming it
something different and then turning "off" the switch that you described.
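
To make the switch concrete, here is a minimal sketch. All names are hypothetical (this is not an existing Metron setting or API), and it assumes the switch is a plain boolean that the runner consults after parsing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a runner-level on/off switch; not an actual Metron API.
class OriginalStringSwitch {

  // Returns a copy of the parsed message, adding original_string only when the
  // switch is on. With the switch off, the parser's own fields are left as-is,
  // so a parser can still carry its own copy under a different field name.
  static Map<String, Object> finish(Map<String, Object> parsed, String raw,
      boolean addOriginalString) {
    Map<String, Object> out = new LinkedHashMap<>(parsed);
    if (addOriginalString) {
      out.put("original_string", raw);
    }
    return out;
  }
}
```

With the switch off, a parser that wants its own mechanism would simply emit a differently named field, which finish leaves untouched.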



On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> I think that's an excellent idea. Can anyone think of a situation where we
> wouldn't want to add this the same way for all parsers? I suppose we could
> always allow this to be overridden, also.


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Michael Miklavcic
I think that's an excellent idea. Can anyone think of a situation where we
wouldn't want to add this the same way for all parsers? I suppose we could
always allow this to be overridden, also.

On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:

> I think maintaining the integrity of the original data makes a lot of sense
> for any parser. And ideally the original string should be what came out of
> Kafka with only the minimally necessary processing.
>
> With that in mind, we could solve this one level up.  Instead of relying on
> each parser to do this right, we could have the ParserRunner, and
> specifically the ParserRunnerImpl, handle this round about here [1].
> It has the raw message data and can append the original string to each
> message it gets back from the parsers.
>
> Just another approach to consider.
>
> --
> [1]
>
> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>
> On Fri, May 10, 2019 at 4:11 PM Otto Fowler 
> wrote:
>
> > +1
> >
> >
> > On May 10, 2019 at 13:57:55, Michael Miklavcic (
> > michael.miklav...@gmail.com)
> > wrote:
> >
> > When adding the capability for parsing messages in the JsonMapParser using
> > JSON Path expressions, the original behavior for managing original strings
> > was changed.
> >
> > https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
> >
> > A couple issues have been reported recently regarding this change:
> >
> > 1. We're losing the actual original string, which is a legal issue for
> > data lineage for some customers
> > 2. Even for the degenerate case with no sub-messages created, the
> > original sub-message string is modified because of the
> > serialization/deserialization process with Jackson/JsonSimple. The fields
> > are reordered bc the content is normalized.
> >
> > I looked at options for preserving formatting, but am unable to find a
> > method that allows you to both parse, then query the original message and
> > then also obtain the raw string matches without the normalizing from
> > ser/deserialization.
> >
> > I'd like to propose that we add a configuration option for this parser that
> > allows the user to toggle which approach they'd like to use. My personal
> > preference based on feedback I've gotten from multiple customers is that
> > the default should be the older approach which takes the raw original
> > string. It's arguable that this change in contract is a regression, so the
> > default should be the earlier behavior. Any sub-messages would then have a
> > copy of that raw original string, not just the sub-message original string.
> > Enabling the flag would enable the current sub-message original string
> > functionality.
> >
> > Mike
> >
>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Nick Allen
I think maintaining the integrity of the original data makes a lot of sense
for any parser. And ideally the original string should be what came out of
Kafka with only the minimally necessary processing.

With that in mind, we could solve this one level up.  Instead of relying on
each parser to do this right, we could have the ParserRunner, and
specifically the ParserRunnerImpl, handle this roughly here [1]. It has the
raw message data and can append the original string to each message it gets
back from the parsers.
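
To make the idea concrete, here is a rough sketch of what stamping the
original string at the runner level might look like. This is not the actual
ParserRunnerImpl code: the class name OriginalStringStamper, the stamp method,
and the use of plain Maps for parsed messages are all assumptions made for
illustration. Only the field name original_string comes from this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the real ParserRunnerImpl.
class OriginalStringStamper {
  static final String ORIGINAL_STRING = "original_string";

  // Returns copies of the parsed messages, each carrying the raw input
  // (decoded as UTF-8, an assumption) as its original_string.
  static List<Map<String, Object>> stamp(byte[] rawMessage,
                                         List<Map<String, Object>> parsed) {
    String original = new String(rawMessage, StandardCharsets.UTF_8);
    List<Map<String, Object>> out = new ArrayList<>();
    for (Map<String, Object> message : parsed) {
      Map<String, Object> copy = new LinkedHashMap<>(message);
      // Overwrites whatever the parser set, so every message sees the
      // same untouched input string.
      copy.put(ORIGINAL_STRING, original);
      out.add(copy);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] raw = "{\"b\":1, \"a\":2}".getBytes(StandardCharsets.UTF_8);
    Map<String, Object> parsedMessage = new LinkedHashMap<>();
    parsedMessage.put("a", 2);
    parsedMessage.put("b", 1);
    List<Map<String, Object>> parsed = new ArrayList<>();
    parsed.add(parsedMessage);
    System.out.println(stamp(raw, parsed));
  }
}
```

The sketch overwrites unconditionally; a real change would have to decide
what to do when a message already carries an original string.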

Just another approach to consider.

--
[1]
https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158

On Fri, May 10, 2019 at 4:11 PM Otto Fowler  wrote:

> +1
>
>
> On May 10, 2019 at 13:57:55, Michael Miklavcic (
> michael.miklav...@gmail.com)
> wrote:
>
> When adding the capability for parsing messages in the JsonMapParser using
> JSON Path expressions the original behavior for managing original strings
> was changed.
>
>
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>
> A couple of issues have been reported recently regarding this change:
>
> 1. We're losing the actual original string, which is a legal issue for
> data lineage for some customers
> 2. Even for the degenerate case with no sub-messages created, the
> original sub-message string is modified because of the
> serialization/deserialization process with Jackson/JsonSimple. The fields
> are reordered because the content is normalized.
>
> I looked at options for preserving formatting, but am unable to find a
> method that allows you to both parse, then query the original message and
> then also obtain the raw string matches without the normalizing from
> ser/deserialization.
>
> I'd like to propose that we add a configuration option for this parser that
> allows the user to toggle which approach they'd like to use. My personal
> preference based on feedback I've gotten from multiple customers is that
> the default should be the older approach which takes the raw original
> string. It's arguable that this change in contract is a regression, so the
> default should be the earlier behavior. Any sub-messages would then have a
> copy of that raw original string, not just the sub-message original string.
> Enabling the flag would enable the current sub-message original string
> functionality.
>
> Mike
>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Otto Fowler
+1


On May 10, 2019 at 13:57:55, Michael Miklavcic (michael.miklav...@gmail.com)
wrote:

When adding the capability for parsing messages in the JsonMapParser using
JSON Path expressions the original behavior for managing original strings
was changed.

https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192

A couple of issues have been reported recently regarding this change:

1. We're losing the actual original string, which is a legal issue for
data lineage for some customers
2. Even for the degenerate case with no sub-messages created, the
original sub-message string is modified because of the
serialization/deserialization process with Jackson/JsonSimple. The fields
are reordered because the content is normalized.

I looked at options for preserving formatting, but am unable to find a
method that allows you to both parse, then query the original message and
then also obtain the raw string matches without the normalizing from
ser/deserialization.

I'd like to propose that we add a configuration option for this parser that
allows the user to toggle which approach they'd like to use. My personal
preference based on feedback I've gotten from multiple customers is that
the default should be the older approach which takes the raw original
string. It's arguable that this change in contract is a regression, so the
default should be the earlier behavior. Any sub-messages would then have a
copy of that raw original string, not just the sub-message original string.
Enabling the flag would enable the current sub-message original string
functionality.

Mike


[DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Michael Miklavcic
When adding the capability for parsing messages in the JsonMapParser using
JSON Path expressions the original behavior for managing original strings
was changed.

https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192

A couple of issues have been reported recently regarding this change:

   1. We're losing the actual original string, which is a legal issue for
   data lineage for some customers
   2. Even for the degenerate case with no sub-messages created, the
   original sub-message string is modified because of the
   serialization/deserialization process with Jackson/JsonSimple. The fields
   are reordered because the content is normalized.

I looked at options for preserving formatting, but am unable to find a
method that allows you to both parse, then query the original message and
then also obtain the raw string matches without the normalizing from
ser/deserialization.

I'd like to propose that we add a configuration option for this parser that
allows the user to toggle which approach they'd like to use. My personal
preference based on feedback I've gotten from multiple customers is that
the default should be the older approach which takes the raw original
string. It's arguable that this change in contract is a regression, so the
default should be the earlier behavior. Any sub-messages would then have a
copy of that raw original string, not just the sub-message original string.
Enabling the flag would enable the current sub-message original string
functionality.
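
As a minimal sketch of the proposed toggle, assuming the parser config arrives
as a Map: the flag name useSubMessageOriginalString, the class
OriginalStringToggle, and the method chooseOriginalString are all hypothetical
and only illustrate the default-to-raw behavior described above, not an actual
JSONMapParser option.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed option; names are illustrative only.
class OriginalStringToggle {
  // Assumed flag name; not an existing Metron parser config key.
  static final String USE_SUB_MESSAGE_FLAG = "useSubMessageOriginalString";

  // Picks the value for original_string. With the flag absent or false the
  // raw original string wins (the earlier behavior); with the flag set the
  // sub-message string is used (the current behavior).
  static String chooseOriginalString(Map<String, Object> parserConfig,
                                     String rawOriginal,
                                     String subMessage) {
    Object flag = parserConfig.get(USE_SUB_MESSAGE_FLAG);
    boolean useSubMessage = flag != null && Boolean.parseBoolean(flag.toString());
    return useSubMessage ? subMessage : rawOriginal;
  }

  public static void main(String[] args) {
    String raw = "{\"messages\":[{\"b\":1, \"a\":2}]}";
    String sub = "{\"a\":2,\"b\":1}";
    Map<String, Object> config = new LinkedHashMap<>();
    System.out.println(chooseOriginalString(config, raw, sub));
    config.put(USE_SUB_MESSAGE_FLAG, true);
    System.out.println(chooseOriginalString(config, raw, sub));
  }
}
```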

Mike