Yeah, we definitely don't want to rewrite parsing in Stellar.  I would
expect the job of the parser, however, to handle structural issues.  In my
mind, parsing is about transforming structures into fields, and the role of
the field transformations is to transform values.  There's obvious overlap
there, in that parsers may do some normalization/transformation (e.g. look at
how Grok handles timestamps), but it almost always gets us into trouble
when parsers do even moderately complex value transformations.

As I type this, though, I think I see your point.  What you really want is
to chain parsers: have a pre-parser bring you 80% of the way there and
hammer out all the structural issues, so you can use a more generic
parser further down the chain.  I have often thought that maybe we should
expose parsers as Stellar functions which take raw data and emit whole
messages.  That would let us compose parsers: imagine the above example
where you've written a Stellar function to normalize the input and are
then passing the result to a CSV parser; you could run
"CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a parser.

As for speed, the Stellar expression would get compiled into a Java object,
so it shouldn't add appreciable overhead, since we'd no longer lex and parse
the expression for every message.

Is this kinda how you were seeing it?

On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> The challenge there I suspect is going to be that you essentially end up
> with the actual parser doing very little of value, and then effectively
> trying to write a parser in Stellar against a few broad strings, which
> would likely give you all sorts of performance problems.
>
> One solution is to write a very defensive and flexible parser, but that
> would tend to be time consuming.
>
> There is also something to be said for doing some basic transformation
> before the data hits the parser's Kafka topic, in something like NiFi, but
> again, performance can be an issue there.
>
> If the noise is about broken structure, for example, maybe a simple
> pre-process step as part of your parser would make sense, e.g. stripping
> syslog headers, converting character sets, or removing very broken bits
> as part of the parse method.
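>
> Just to illustrate the kind of thing I mean, a minimal sketch (the class
> name, the header regex, and the fields emitted are all made up, and it
> assumes the usual BasicParser / parse(byte[]) shape):
>
>     import java.nio.charset.StandardCharsets;
>     import java.util.Collections;
>     import java.util.List;
>     import java.util.Map;
>     import org.apache.metron.parsers.BasicParser;
>     import org.json.simple.JSONObject;
>
>     // Sketch only: strips a syslog header before the real parsing runs.
>     public class PreCleaningParser extends BasicParser {
>
>       @Override
>       public void configure(Map<String, Object> parserConfig) { }
>
>       @Override
>       public void init() { }
>
>       @Override
>       @SuppressWarnings("unchecked")
>       public List<JSONObject> parse(byte[] rawMessage) {
>         String original = new String(rawMessage, StandardCharsets.UTF_8);
>         // Pre-process: drop a leading header like "<13>Apr 26 09:51:02 host1 "
>         // so the field positions line up again for the real extraction.
>         String cleaned = original.replaceFirst(
>             "^<\\d+>\\w{3}\\s+\\d+\\s+[\\d:]+\\s+\\S+\\s+", "");
>         JSONObject message = new JSONObject();
>         message.put("original_string", original);
>         message.put("timestamp", System.currentTimeMillis());
>         message.put("message", cleaned);  // real field extraction goes here
>         return Collections.singletonList(message);
>       }
>     }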
>
> In terms of normalisation post-parse, I agree, that's 100% a job for
> Stellar, and the fieldTransformations capability. Something I would like to
> see would be a means to use that transformation step to map to a well known
> (though loosely enforced) schema provided by a governance framework, but
> that is a much bigger topic of conversation.
>
> Note, of course, that not everything has to be parsed just because it’s in
> the message. A relatively loose-fitting parser which pulls out the relevant
> data for the use case would be fine, and likely a lot more tolerant of
> noise than something that felt the need to parse every field. We do after
> all store the original_string for you if you really absolutely have to have
> everything, so a more schema-on-read philosophy certainly applies and will
> likely side-step a lot of your issues.
>
> Simon
>
> > On 26 Apr 2017, at 14:37, Casey Stella <ceste...@gmail.com> wrote:
> >
> > Ok, that's another story.  Hmmm, we don't generally pre-parse because we
> > try not to assume any particular format there (i.e. it could be strings,
> > could be byte arrays).  Maybe the right answer is to pass the raw,
> > non-normalized data (best-effort type of thing) through the parser and do
> > the normalization post-parse... or is there a problem with that?
> >
> > On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <alinazem...@gmail.com>
> > wrote:
> >
> >> Hi Casey,
> >>
> >> It is actually a pre-parse process, not a post-parse one. These types of
> >> noise affect the position of an attribute, for example, and give us a
> >> parsing exception. The timestamp example was not a good one because that
> >> is actually a post-parse exception.
> >>
> >> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <ceste...@gmail.com>
> >> wrote:
> >>
> >>> So, further transformation post-parse was one of the motivating reasons
> >>> for Stellar (to do that transformation post-parse).  Is there a
> >>> capability that it's lacking that we can add to fit your use case?
> >>>
> >>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <alinazem...@gmail.com>
> >>> wrote:
> >>>
> >>>> I've created a Jira ticket regarding this feature.
> >>>>
> >>>> https://issues.apache.org/jira/browse/METRON-893
> >>>>
> >>>>
> >>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <alinazem...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Currently, we are using plain regexes in the Java source code to
> >>>>> handle those situations. However, it would be nice to have a separate
> >>>>> bolt and deal with them separately. Yeah, I can create a Jira issue
> >>>>> regarding that. The main reason I am asking for such a feature is that
> >>>>> the lack of it makes the process of contributing parsers to the
> >>>>> community a little painful for us. We need to maintain two different
> >>>>> versions, one for the community and another for our internal use case.
> >>>>> Clearly, noise is an inevitable part of real-world use cases.
> >>>>>
> >>>>> Cheers,
> >>>>> Ali
> >>>>>
> >>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <ottobackwa...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Are you doing this cleansing all in the parser or are you using any
> >>>>>> Stellar to do it?
> >>>>>> Can you create a Jira?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazem...@gmail.com)
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>>
> >>>>>> We are facing certain use cases in Metron production that happen to
> >>>>>> be related to noisy streams: for example, a wrong timestamp, a
> >>>>>> duplicate hostname/IP address, etc. To deal with the normalization we
> >>>>>> have added an additional step to the corresponding parsers to do the
> >>>>>> data cleaning. Clearly, parsing is a standard process which is mostly
> >>>>>> related to the device that is generating the data and can be reused
> >>>>>> for the same type of device everywhere, but normalization is very
> >>>>>> production-dependent and there is no point in mixing normalization
> >>>>>> with parsing. It would be nice to have a separate bolt in the parsing
> >>>>>> topologies dedicated to the production-related cleaning process. In
> >>>>>> that case, everybody can easily contribute additional parsers to the
> >>>>>> Metron community without being worried about mixing parsers and the
> >>>>>> data-cleaning process.
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Ali
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> A.Nazemian
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> A.Nazemian
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> A.Nazemian
> >>
>
>
