The original string would be the string specified as the message body, so
each message produced in the chain would just carry the bytes passed in,
taken from a specific field in the incoming message.



On May 10, 2019 at 19:55:28, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

My understanding is that chaining preserves (correctly to my mind) the
original original string.

In other words: unless the message strategy is raw message, the original
string is just passed through. Original string therefore comes from outside
Metron, and is preserved throughout Metron processes, allowing for
recreation of original form for forensics and evidentiary purposes.

Simon
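
To make the pass-through described above concrete, here is a minimal Python
sketch. It is an illustration only, not Metron code: the function names, the
`pass_through` flag, and the key=value payload format are all hypothetical,
and `original_string` is simply the field name assumed for the sketch.

```python
ORIGINAL_STRING = "original_string"  # field name assumed for this sketch


def parse_key_values(payload):
    """Hypothetical second-stage parser: 'k1=v1 k2=v2' -> dict."""
    return dict(pair.split("=", 1) for pair in payload.split())


def chained_parse(incoming, field, parse, pass_through=True):
    """Parse one field of an already-parsed message.

    With pass_through=True the original string that entered the chain is
    carried forward untouched. With pass_through=False (standing in for a
    "raw message" strategy) the field being parsed becomes the new
    original string.
    """
    payload = incoming[field]
    parsed = parse(payload)
    if pass_through and ORIGINAL_STRING in incoming:
        parsed[ORIGINAL_STRING] = incoming[ORIGINAL_STRING]
    else:
        parsed[ORIGINAL_STRING] = payload
    return parsed


# The first stage has already wrapped what arrived from outside the system.
raw = '{"host": "fw1", "data": "user=alice action=login"}'
first_stage = {ORIGINAL_STRING: raw, "host": "fw1",
               "data": "user=alice action=login"}

second_stage = chained_parse(first_stage, "data", parse_key_values)
print(second_stage[ORIGINAL_STRING] == raw)
```

In this sketch the second parser never sees or alters the original string; it
only copies it forward, which is the property that matters for forensics.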

> On 11 May 2019, at 00:10, Otto Fowler <ottobackwa...@gmail.com> wrote:
>
> What about parser chaining? Should the original string be the one from
> Kafka, or the last one parsed?
>
>
> On May 10, 2019 at 19:03:39, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> The only scenario I can think of where a parser might treat the original
> string differently, or even need to know about it, would be different
> encoding locales. For example, if the string were encoded in a locale
> specific to the device, and the parser had to choose the encoding based on
> metadata or parsed content, then that could merit pushing it down. The
> other edge case might be binary data that does not translate well into an
> original string (e.g. a netflow parser).
>
> That said, that’s a highly unlikely edge case that could be handled by
> workarounds.
>
> I’m definitely +1 on Nick’s idea of pulling original string up to the
> runner. Right now we’re pretty inconsistent in how it’s done, so that
> would help.
>
> Simon
>
> On 10 May 2019, at 23:10, Nick Allen <n...@nickallen.org> wrote:
>
>>> I suppose we could always allow this to be overridden, also.
>>
>> I like an on/off switch for the "original string" functionality. If on,
>> you get the original string in pristine condition. If off, no original
>> string is appended for those who care more about storage space.
>>
>> I can't think of a reason why one kind of parser would have a different
>> original string mechanism than the others. If something like that does
>> come up, the parser can create its own original string by just naming it
>> something different and then turning "off" the switch that you described.
>>
>>
>>
>> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
>> michael.miklav...@gmail.com> wrote:
>>
>>> I think that's an excellent idea. Can anyone think of a situation where
>>> we wouldn't want to add this the same way for all parsers? I suppose we
>>> could always allow this to be overridden, also.
>>>
>>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen <n...@nickallen.org> wrote:
>>>>
>>>> I think maintaining the integrity of the original data makes a lot of
>>>> sense for any parser. And ideally the original string should be what
>>>> came out of Kafka with only the minimally necessary processing.
>>>>
>>>> With that in mind, we could solve this one level up. Instead of relying
>>>> on each parser to do this right, we could have the ParserRunner, and
>>>> specifically the ParserRunnerImpl, handle this at roughly this point [1].
>>>> It has the raw message data and can append the original string to each
>>>> message it gets back from the parsers.
>>>>
>>>> Just another approach to consider.
>>>>
>>>> --
>>>> [1]
>>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>>
>>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>>>>> michael.miklav...@gmail.com)
>>>>> wrote:
>>>>>
>>>>> When the capability for parsing messages with JSON Path expressions
>>>>> was added to the JSONMapParser, the original behavior for managing
>>>>> original strings was changed:
>>>>>
>>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>>>>>
>>>>> A couple of issues have been reported recently regarding this change:
>>>>>
>>>>> 1. We're losing the actual original string, which is a legal issue for
>>>>> data lineage for some customers.
>>>>> 2. Even in the degenerate case with no sub-messages created, the
>>>>> original sub-message string is modified by the
>>>>> serialization/deserialization process with Jackson/JsonSimple. The
>>>>> fields are reordered because the content is normalized.
>>>>>
>>>>> I looked at options for preserving formatting, but I am unable to find
>>>>> a method that lets you parse and query the original message and also
>>>>> obtain the raw string matches without the normalization introduced by
>>>>> serialization/deserialization.
>>>>>
>>>>> I'd like to propose that we add a configuration option for this parser
>>>>> that allows the user to toggle which approach they'd like to use. My
>>>>> personal preference, based on feedback I've gotten from multiple
>>>>> customers, is that the default should be the older approach, which
>>>>> takes the raw original string. It's arguable that this change in
>>>>> contract is a regression, so the default should be the earlier
>>>>> behavior. Any sub-messages would then have a copy of that raw original
>>>>> string, not just the sub-message original string. Enabling the flag
>>>>> would enable the current sub-message original string functionality.
>>>>>
>>>>> Mike
>>>>>
>>>>
>>>
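
The two ideas in this thread, attaching the original string one level up in
the runner and guarding it with a switch, can be sketched together in a few
lines of Python. This is a rough illustration under stated assumptions, not
the ParserRunnerImpl code: `split_json_parser`, `run_parser`, the
`attach_raw_original` flag, and the `events` array layout are hypothetical,
and `json.dumps(..., sort_keys=True)` merely stands in for the normalization
that a Jackson/JsonSimple round trip introduces.

```python
import json

ORIGINAL_STRING = "original_string"  # field name assumed for this sketch


def split_json_parser(raw):
    """Hypothetical parser: one message per element of the "events" array.

    Each sub-message gets a re-serialized original string, so key order and
    whitespace no longer match the bytes that were received.
    """
    doc = json.loads(raw.decode("utf-8"))
    messages = []
    for event in doc["events"]:
        msg = dict(event)
        msg[ORIGINAL_STRING] = json.dumps(event, sort_keys=True)
        messages.append(msg)
    return messages


def run_parser(raw, parser, attach_raw_original=True):
    """Hypothetical runner that owns the original string.

    Switch on: every message returned by the parser carries the untouched
    raw input. Switch off: whatever the parser produced is left alone.
    """
    messages = parser(raw)
    if attach_raw_original:
        raw_text = raw.decode("utf-8")
        for msg in messages:
            msg[ORIGINAL_STRING] = raw_text
    return messages


raw = b'{"events": [{"b": 1,  "a": 2}, {"z": 0}]}'
for message in run_parser(raw, split_json_parser):
    print(message[ORIGINAL_STRING])
```

With the switch on, both sub-messages carry the exact input string. With it
off, the first sub-message's original string has its keys reordered and its
spacing changed, which is the data-lineage problem reported at the start of
the thread.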
