Re: Normalization topology or separate normalization bolt for parsing topology

Casey Stella Wed, 26 Apr 2017 07:00:00 -0700

Ok, this may be easier with a couple of examples:

*Simple Example : Downstream Processing is Independent of Normalization*


Pretend we have a data format that is CSV and the first field, let's call
it 'input_dname' is supposed to be a domain name, but sometimes you get IP
addresses.  In the situation where you get IP addresses, let's say you want
to remove the field.  Rather than doing that in the parser, you could just
emit the raw data for that field, IP address or domain name, and then apply
a field transformation:

'input_dname' : "if IS_IP(input_dname) then null else input_dname"
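
To make the intent of that expression concrete, here is a minimal Python
sketch of the same logic. It is not Metron or Stellar code: `drop_if_ip` is a
hypothetical helper, and Python's `ipaddress` module stands in for IS_IP,
whose exact matching rules may differ.

```python
import ipaddress


def is_ip(value):
    """Rough stand-in for IS_IP (assumption: any IPv4 or IPv6 literal)."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def drop_if_ip(message):
    """Return a copy of a parsed message with input_dname nulled when it is an IP."""
    out = dict(message)
    value = out.get("input_dname")
    if value is not None and is_ip(value):
        out["input_dname"] = None
    return out
```

The parser stays dumb here: it emits whatever was in the column, and the
decision to keep or drop the value lives entirely in the post-parse step.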

*Intermediate Example:* *Downstream Processing is Independent of
Normalization*

Same situation, but now we have a new field called "input_tld" in which you
pull out the TLD of input_dname.  BUT you can't, because it may or may not
be a proper domain name and, furthermore, it may have spaces around it.  In
that situation, I'd suggest just *not* adding the field in the parser and
instead creating it with the following field transformations:
'input_dname' : "if IS_IP(input_dname) then null else TRIM(input_dname)"
'input_tld' : "DOMAIN_TO_TLD(input_dname)"
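
The point of the second example is ordering: input_tld is computed from the
already-cleaned input_dname. A rough Python sketch of that chaining follows.
It assumes transformations are applied in the order listed and that later
ones see earlier results. `naive_tld` is a hypothetical simplification that
only takes the last label, so it will not match DOMAIN_TO_TLD for multi-part
suffixes.

```python
import ipaddress


def is_ip(value):
    """Rough stand-in for IS_IP (assumption: any IPv4 or IPv6 literal)."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def naive_tld(domain):
    """Hypothetical simplification of DOMAIN_TO_TLD: the last dot-separated label."""
    if not domain or "." not in domain:
        return None
    return domain.rsplit(".", 1)[1]


def clean_dname(message):
    """Mirror of: if IS_IP(input_dname) then null else TRIM(input_dname)."""
    value = message.get("input_dname")
    if value is None or is_ip(value):
        return None
    return value.strip()


# Ordered list: each step reads the message as updated by the steps before it.
TRANSFORMS = [
    ("input_dname", clean_dname),
    ("input_tld", lambda m: naive_tld(m.get("input_dname"))),
]


def apply_transforms(message):
    """Apply each transformation in order and return the updated copy."""
    out = dict(message)
    for field, fn in TRANSFORMS:
        out[field] = fn(out)
    return out
```

If the two steps ran in the opposite order, the TLD would be taken from the
raw, untrimmed value, which is exactly the problem the example is avoiding.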

If your situation doesn't fit there, could you give us an example like
above?

On Wed, Apr 26, 2017 at 9:43 AM, Ali Nazemian <alinazem...@gmail.com> wrote:

> Having Stellar functions for normalization is very cool, actually.
>
> Casey, how are you going to deal with normalization after the parsing if
> that noise affects the parsing? For some reason, the incoming data does
> not look the way it is supposed to.
>
> On Wed, Apr 26, 2017 at 11:37 PM, Casey Stella <ceste...@gmail.com> wrote:
>
> > Ok, that's another story.  Hmmmm, we don't generally pre-parse because we
> > try to not assume any particular format there (i.e. it could be strings,
> > could be byte arrays).  Maybe the right answer is to pass the raw,
> > non-normalized data (best effort type of thing) through the parser and do
> > the normalization post-parse... or is there a problem with that?
> >
> > On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <alinazem...@gmail.com>
> > wrote:
> >
> > > Hi Casey,
> > >
> > > It is actually a pre-parse process, not a post-parse one. These types
> > > of noise affect the position of an attribute, for example, and give us
> > > a parsing exception. The timestamp example was not a good one because
> > > that is actually a post-parse exception.
> > >
> > > On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <ceste...@gmail.com>
> > wrote:
> > >
> > > > So, further transformation post-parse was one of the motivating
> > > > reasons for Stellar (to do that transformation post-parse).  Is there
> > > > a capability that it's lacking that we can add to fit your use case?
> > > >
> > > > On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <alinazem...@gmail.com
> >
> > > > wrote:
> > > >
> > > > > I've created a Jira ticket regarding this feature.
> > > > >
> > > > > https://issues.apache.org/jira/browse/METRON-893
> > > > >
> > > > >
> > > > > On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <
> > alinazem...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > > Currently, we are using plain regexes in the Java source code to
> > > > > > > handle those situations. However, it would be nice to have a
> > > > > > > separate bolt and deal with them separately. Yeah, I can create a
> > > > > > > Jira issue regarding that. The main reason I am asking for such a
> > > > > > > feature is that the lack of it makes the process of creating
> > > > > > > parsers for the community a little painful for us. We need to
> > > > > > > maintain two different versions, one for the community and another
> > > > > > > for the internal use case. Clearly, noise is an inevitable part of
> > > > > > > real-world use cases.
> > > > > >
> > > > > > Cheers,
> > > > > > Ali
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
> > > ottobackwa...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > >> Are you doing this cleansing all in the parser or are you using
> > any
> > > > > >> Stellar to do it?
> > > > > >> Can you create a jira?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On April 26, 2017 at 08:59:16, Ali Nazemian (
> > alinazem...@gmail.com)
> > > > > >> wrote:
> > > > > >>
> > > > > >> Hi all,
> > > > > >>
> > > > > >>
> > > > > > >> We are facing certain use cases in Metron production that happen
> > > > > > >> to be related to noisy streams: for example, a wrong timestamp, a
> > > > > > >> duplicate hostname/IP address, etc. To deal with the normalization
> > > > > > >> we have added an additional step to the corresponding parsers to
> > > > > > >> do the data cleaning. Clearly, parsing is a standard factor which
> > > > > > >> is mostly related to the device that is generating the data and
> > > > > > >> can be used for the same type of device everywhere, but
> > > > > > >> normalization is very production dependent and there is no point
> > > > > > >> in mixing normalization with parsing. It would be nice to have a
> > > > > > >> separate bolt in the parsing topologies dedicated to the
> > > > > > >> production-related cleaning process. In that case, everybody can
> > > > > > >> easily contribute additional parsers to the Metron community
> > > > > > >> without being worried about mixing parsers and the data cleaning
> > > > > > >> process.
> > > > > >>
> > > > > >>
> > > > > >> Regards,
> > > > > >>
> > > > > >> Ali
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > A.Nazemian
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > A.Nazemian
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > A.Nazemian
> > >
> >
>
>
>
> --
> A.Nazemian
>
