I think maintaining the integrity of the original data makes a lot of sense
for any parser. And ideally the original string should be what came out of
Kafka with only the minimally necessary processing.

With that in mind, we could solve this one level up. Instead of relying on
each parser to do this right, we could have the ParserRunner, and
specifically the ParserRunnerImpl, handle this, roughly around the lines
linked in [1].
It has the raw message data and can append the original string to each
message it gets back from the parsers.
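
To make the idea concrete, here is a minimal sketch of what the runner-level
step might look like. It is not the actual ParserRunnerImpl code: the class
name, the method name, and the use of plain Maps in place of Metron's message
type are all assumptions for illustration, and the "original_string" field
name is assumed as well. It also assumes the raw Kafka bytes are UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Metron. It only illustrates the
// "attach the original string one level up" idea.
class OriginalStringSketch {

    // Assumed field name for the original string.
    static final String ORIGINAL_STRING = "original_string";

    // Takes the raw bytes as received and the messages a parser returned,
    // and stamps the untouched raw string onto every message. Whatever
    // value a parser put in the field is overwritten, so sub-messages all
    // carry the same, unmodified original.
    public static List<Map<String, Object>> attachOriginalString(
            byte[] rawMessage, List<Map<String, Object>> parsedMessages) {
        // Decode once; no JSON round trip, so field order and whitespace
        // are preserved exactly as they arrived.
        String original = new String(rawMessage, StandardCharsets.UTF_8);

        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> message : parsedMessages) {
            // Copy so the parser's own objects are left unmodified.
            Map<String, Object> copy = new HashMap<>(message);
            copy.put(ORIGINAL_STRING, original);
            result.add(copy);
        }
        return result;
    }
}
```

With something like this in the runner, individual parsers would not need to
set the field at all, and a parser that wants the sub-message string instead
could write it to a separate field.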

Just another approach to consider.

--
[1]
https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158

On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com> wrote:

> +1
>
>
> On May 10, 2019 at 13:57:55, Michael Miklavcic (
> michael.miklav...@gmail.com)
> wrote:
>
> When the capability for parsing messages in the JsonMapParser using JSON
> Path expressions was added, the original behavior for managing original
> strings was changed.
>
>
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>
> A couple issues have been reported recently regarding this change:
>
> 1. We're losing the actual original string, which is a legal issue for
> data lineage for some customers
> 2. Even for the degenerate case with no sub-messages created, the
> original sub-message string is modified because of the
> serialization/deserialization process with Jackson/JsonSimple. The fields
> are reordered because the content is normalized.
>
> I looked at options for preserving formatting, but am unable to find a
> method that allows you to both parse, then query the original message and
> then also obtain the raw string matches without the normalizing from
> ser/deserialization.
>
> I'd like to propose that we add a configuration option for this parser that
> allows the user to toggle which approach they'd like to use. My personal
> preference based on feedback I've gotten from multiple customers is that
> the default should be the older approach which takes the raw original
> string. It's arguable that this change in contract is a regression, so the
> default should be the earlier behavior. Any sub-messages would then have a
> copy of that raw original string, not just the sub-message original string.
> Enabling the flag would enable the current sub-message original string
> functionality.
>
> Mike
>
