> I suppose we could always allow this to be overridden, also.

I like an on/off switch for the "original string" functionality. If on, you get the original string in pristine condition. If off, no original string is appended for those who care more about storage space.
I can't think of a reason why one kind of parser would need a different original string mechanism than the others. If something like that does come up, the parser can create its own original string by just naming it something different and then turning "off" the switch that you described.

On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> I think that's an excellent idea. Can anyone think of a situation where we
> wouldn't want to add this the same way for all parsers? I suppose we could
> always allow this to be overridden, also.
>
> On Fri, May 10, 2019 at 3:43 PM Nick Allen <n...@nickallen.org> wrote:
>
> > I think maintaining the integrity of the original data makes a lot of sense
> > for any parser. And ideally the original string should be what came out of
> > Kafka with only the minimally necessary processing.
> >
> > With that in mind, we could solve this one level up. Instead of relying on
> > each parser to do this right, we could have the ParserRunner and
> > specifically the ParserRunnerImpl [1] handle this round-abouts here
> > <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
> > [1].
> > It has the raw message data and can append the original string to each
> > message it gets back from the parsers.
> >
> > Just another approach to consider.
> >
> > --
> > [1]
> > https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
> >
> > On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com>
> > wrote:
> >
> > > +1
> > >
> > > On May 10, 2019 at 13:57:55, Michael Miklavcic (michael.miklav...@gmail.com)
> > > wrote:
> > >
> > > When adding the capability for parsing messages in the JsonMapParser using
> > > JSON Path expressions the original behavior for managing original strings
> > > was changed.
> > >
> > > https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
> > >
> > > A couple issues have been reported recently regarding this change:
> > >
> > > 1. We're losing the actual original string, which is a legal issue for
> > > data lineage for some customers
> > > 2. Even for the degenerate case with no sub-messages created, the
> > > original sub-message string is modified because of the
> > > serialization/deserialization process with Jackson/JsonSimple. The fields
> > > are reordered bc the content is normalized.
> > >
> > > I looked at options for preserving formatting, but am unable to find a
> > > method that allows you to both parse, then query the original message and
> > > then also obtain the raw string matches without the normalizing from
> > > ser/deserialization.
> > >
> > > I'd like to propose that we add a configuration option for this parser that
> > > allows the user to toggle which approach they'd like to use. My personal
> > > preference based on feedback I've gotten from multiple customers is that
> > > the default should be the older approach which takes the raw original
> > > string. It's arguable that this change in contract is a regression, so the
> > > default should be the earlier behavior. Any sub-messages would then have a
> > > copy of that raw original string, not just the sub-message original string.
> > > Enabling the flag would enable the current sub-message original string
> > > functionality.
> > >
> > > Mike
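
For illustration only, the "one level up" idea with an on/off switch might look roughly like the sketch below. This is not Metron code: the class name OriginalStringAppender, its constructor flag, the apply method, and the "original_string" field name are all assumptions made for this example, and the real ParserRunnerImpl works with different types. The sketch copies each parsed message and, when the switch is on, attaches the raw input decoded as UTF-8 without any JSON round trip, so field order and whitespace are preserved. When the switch is off it leaves messages untouched, so a parser is free to set its own field.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Metron. */
final class OriginalStringAppender {

  // Assumed field name, chosen for this sketch.
  static final String ORIGINAL_STRING_FIELD = "original_string";

  // The on/off switch discussed in the thread.
  private final boolean enabled;

  OriginalStringAppender(boolean enabled) {
    this.enabled = enabled;
  }

  /**
   * Returns copies of the parsed messages. When enabled, every copy carries
   * the raw message bytes decoded as UTF-8, overwriting any value a parser
   * stored under the same field. When disabled, copies are returned unchanged.
   */
  List<Map<String, Object>> apply(byte[] rawMessage, List<Map<String, Object>> parsed) {
    List<Map<String, Object>> out = new ArrayList<>(parsed.size());
    // Decode once; the same raw string is shared by all sub-messages.
    String original = enabled ? new String(rawMessage, StandardCharsets.UTF_8) : null;
    for (Map<String, Object> message : parsed) {
      Map<String, Object> copy = new HashMap<>(message);
      if (enabled) {
        copy.put(ORIGINAL_STRING_FIELD, original);
      }
      out.add(copy);
    }
    return out;
  }
}
```

With the switch on, every sub-message produced from one Kafka record would carry the same raw string, which matches the earlier behavior proposed as the default above. With it off, nothing is appended at this level.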