Re: Normalization topology or separate normalization bolt for parsing topology

Nick Allen Wed, 03 May 2017 05:03:38 -0700

> Clearly, a generic parser would be useful for the community not a type of
> parser that is highly customised for our noisy environment.


Increasing the number of generic parsers for the community is definitely a
good goal.  I agree with you there.

Could we achieve the same goal by making our parsers more configurable?  As
a simple example, maybe a user could configure particular fields to be
either required or optional.

   - For my use of Parser X, I am going to configure the "timestamp" field
   to be "required".  I want the parser to fail the message if the timestamp
   field is invalid.


   - But when you are using Parser X, you would configure the "timestamp"
   field as "optional".  When a malformed timestamp arrives, it ignores the
   timestamp (maybe stamps its own valid timestamp) and allows the message to
   continue on.

In ways like this we can provide some flexibility to users of Parser X to
achieve the very important goal that you outlined, but without an
architectural change.
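
To make the idea concrete, here is a minimal sketch in Python rather than actual Metron code. Every name in it (REQUIRED, OPTIONAL, ParseError, parse_timestamp, extract_timestamp) is hypothetical and only illustrates how one parser could treat the same field as either "required" or "optional" depending on configuration. It also assumes a syslog-style timestamp with no year, so the caller supplies one.

```python
from datetime import datetime, timezone
from typing import Callable

# Hypothetical policy values; not part of any Metron configuration schema.
REQUIRED = "required"
OPTIONAL = "optional"


class ParseError(Exception):
    """Raised when a field configured as required cannot be parsed."""


def parse_timestamp(raw: str, year: int) -> int:
    """Parse a syslog-style 'Apr 12 03:19:12' into epoch milliseconds (UTC).

    Timestamps of this shape carry no year, so the caller supplies one.
    """
    parsed = datetime.strptime(f"{year} {raw}", "%Y %b %d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)


def extract_timestamp(raw: str, policy: str, year: int,
                      now_ms: Callable[[], int]) -> int:
    """Apply the configured policy to the timestamp field.

    required: a malformed value fails the whole message.
    optional: a malformed value is replaced by the parser's own timestamp.
    """
    try:
        return parse_timestamp(raw, year)
    except ValueError as err:
        if policy == REQUIRED:
            raise ParseError(f"invalid timestamp: {raw!r}") from err
        return now_ms()
```

With the policy set to REQUIRED, a value that does not match the expected layout raises ParseError and the message fails. With OPTIONAL, the same value is replaced by whatever now_ms() returns and the message continues on.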




On May 2, 2017 9:05 PM, "Ali Nazemian" <[email protected]> wrote:

Hi Nick,

I am happy to continue the development using the current architecture and
embed the pre-parsing steps in the parser code. However, this would run
counter to the policy of contributing to the Metron community to expand the
range of supported devices. Clearly, a generic parser would be useful for
the community not a type of parser that is highly customised for our noisy
environment. I was looking to decouple Parsing and Normalisation in order to
implement a generic parser which can be used by others as well.

I think this is more of a strategic decision, one which can increase the
number of generic parsers that will be contributed back to the community in
future. Ideally, it would be better for official Metron developers to focus
on Metron features instead of developing generic parsers.
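
A rough Python sketch of that decoupling follows. It is an illustration only, not Metron code: generic_asa_parser, drop_repeated_host, and with_normalizer are hypothetical names, and the regular expressions are assumptions based on the duplicated-hostname ASA line quoted further down the thread. Whether that repeated token is really noise was disputed in the thread; it is used here only as a stand-in for a site-specific cleaning step.

```python
import re
from typing import Callable, Dict

Normalizer = Callable[[str], str]
Parser = Callable[[str], Dict[str, str]]

# Hypothetical generic parser: expects exactly one host token before %ASA-.
ASA_LINE = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<syslog_host>\S+)\s+"
    r"%ASA-(?P<severity>\d)-(?P<message_id>\d+):\s*"
    r"(?P<message>.*)$"
)

# Site-specific rule: a whole token repeated immediately before the %ASA- tag.
REPEATED_HOST = re.compile(r"(?<!\S)(\S+)\s+\1(?=\s+%ASA-)")


def generic_asa_parser(line: str) -> Dict[str, str]:
    """Parse a well-formed line into fields, or raise ValueError."""
    match = ASA_LINE.match(line)
    if match is None:
        raise ValueError(f"unparseable line: {line!r}")
    return match.groupdict()


def drop_repeated_host(line: str) -> str:
    """Collapse 'hostname hostname' into 'hostname' ahead of the %ASA- tag."""
    return REPEATED_HOST.sub(r"\1", line)


def with_normalizer(normalize: Normalizer, parse: Parser) -> Parser:
    """Compose a user-supplied normalizer with an unchanged generic parser."""
    def parse_normalized(line: str) -> Dict[str, str]:
        return parse(normalize(line))
    return parse_normalized
```

The generic parser stays free of site-specific rules, and each deployment passes its own normalizer to with_normalizer. Called on the duplicated-hostname line, generic_asa_parser alone raises ValueError, while with_normalizer(drop_repeated_host, generic_asa_parser) returns the parsed fields.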

Thanks,
Ali

On Wed, May 3, 2017 at 3:03 AM, Nick Allen <[email protected]> wrote:

> Yes, and currently that normalization step is the Parsers.
>
> I am not saying the message has to be entirely clear and well-defined.  But
> there is a minimum set of expectations that you must have of any data that
> you're ingesting.  Once it meets that "minimum set", the parser should be
> able to ingest and normalize the message.  Any oddities beyond that
> "minimum set" can be handled with Stellar either post-Parsing or in
> Enrichment.
>
> It is, of course, a judgement call as to what that minimum set is for you.
> You would just need a Parser that matches your definition of "minimum set".
>
> My main point here is that I am not seeing a need to re-architect
> anything.  I think we have the right tools, IMHO.
>
>
>
>
>
>
>
>
>
> On Tue, May 2, 2017 at 10:33 AM, Ali Nazemian <[email protected]>
> wrote:
>
> > Hi Nick,
> >
> > The date could be corrupted for any number of reasons, and sometimes we
> > have no control over the device. Obviously, it is not a big deal if we
> > lose a <166> severity message, but it could be a different situation for
> > a <161> severity message or an actual critical threat. However, I
> > mentioned those defects as an example to point out the importance of
> > having a normalisation step in the Metron processing chain.
> >
> > I still think there is no guarantee of an entirely clear and
> > well-defined message in real-world use cases. If we recognise this
> > situation as a problem, then finding a high-performance and flexible
> > solution is not very hard.
> >
> > Cheers,
> > Ali
> >
> > On Tue, May 2, 2017 at 11:24 PM, Nick Allen <[email protected]> wrote:
> >
> > > Before worrying about how to ingest this 'noisy' data, I would want to
> > > better understand root cause.  If you cannot even get a valid date
> > > format, are you sure the data can be trusted?
> > >
> > > Rather than bending over backwards to try to ingest it, I would first
> > > make sure the telemetry is not totally bogus to begin with.  Maybe it is
> > > better that the data is dropped in cases like this.
> > >
> > > IMHO, that is how I would tackle a problem like this.  Not all data can
> > > be trusted.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <[email protected]>
> > > wrote:
> > >
> > > > Are you sure? The syslog_host name is way more complicated than
> > > > something that can be a coincidence. I need to double check with one
> > > > of the security device experts, but I thought it was some kind of
> > > > noise.
> > > >
> > > > Yes, we do have more use cases that seem to be corrupted. For example,
> > > > having duplicate IP addresses or a corrupted date format. Please have
> > > > a look at the following message. At least I am sure the date format is
> > > > corrupted in this one.
> > > >
> > > > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP
> > > > connection 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to
> > > > inside:*y.y.y.y/p2* *y.y.y.y/p2*
> > > >
> > > > Cheers,
> > > > Ali
> > > >
> > > > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > > > [email protected]> wrote:
> > > >
> > > > > In that instance, you're looking at valid syslog which should be
> > > > > parsed as such. The repeat host is not really a host in syslog
> > > > > terms, it's an application name header which happens to be the
> > > > > same. This is definitely a parser bug which should be handled, esp
> > > > > since the header is perfectly RFC compliant.
> > > > >
> > > > > Do you have any other such cases? My view is that parsers should be
> > > > > written with more any case, so should extract all the fields they
> > > > > can from malformed logs, rather than throwing exceptions, but that's
> > > > > more about the way we write parsers than having some kind of
> > > > > pre-clean.
> > > > >
> > > > > Simon
> > > > >
> > > > > Sent from my iPad
> > > > >
> > > > > > On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]>
> > > wrote:
> > > > > >
> > > > > > I do agree there is a fair amount of overhead in using another
> > > > > > bolt for this purpose. I am not pointing to the way of
> > > > > > implementation. There might be a way of implementing it that
> > > > > > segregates the two extension points without adding overhead; I
> > > > > > haven't thought about it yet. However, the main issue is that
> > > > > > sometimes the type of noise is something that generates an
> > > > > > exception on the parsing side. For example, have a look at the
> > > > > > following log:
> > > > > >
> > > > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown
> > > > > > ICMP connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr
> > > > > > k.k.k.k/0 (ryanmar)
> > > > > >
> > > > > > Clearly the duplicate syslog_host throws an exception on parsing,
> > > > > > so how are we going to deal with that in a post-parse
> > > > > > transformation? It cannot pass the parsing. This is only a single
> > > > > > example of cases that might affect the production data. Unless
> > > > > > Stellar transformation is something that can be done at pre-parse
> > > > > > and for the entire message.
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Ali,
> > > > > >>
> > > > > > >> Sounds very much like what you’re talking about when you say
> > > > > > >> normalization, and what I would understand it as, is the
> > > > > > >> process fulfilled by stellar field transformation in the parser
> > > > > > >> config. Agreed that some of these will be general, based on
> > > > > > >> common metron standard schema, but others will be organisation
> > > > > > >> specific (custom fields overloaded with different meanings for
> > > > > > >> instance in CEF, for example). These are very much one of the
> > > > > > >> reasons we have the stellar transformation step. I don’t think
> > > > > > >> that should be moved to a separate bolt to be honest, because
> > > > > > >> that comes with a fair amount of overhead, but logically it is
> > > > > > >> in the parser config rather than the parser, so seems to serve
> > > > > > >> this purpose in the post-parse transform, no?
> > > > > >>
> > > > > >> Simon
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]>
> > > > wrote:
> > > > > >>>
> > > > > >>> Hi Simon,
> > > > > >>>
> > > > > > >>> The reason I am asking for a specific normalisation step is
> > > > > > >>> that normalisation is not a general use case which can be used
> > > > > > >>> by other users. It is completely bound to our application. The
> > > > > > >>> way we have fixed it, for now, is to add a normalisation step
> > > > > > >>> to the parser and clean the incoming data so the parser step
> > > > > > >>> can work on that, but I don't like it. There is no point in
> > > > > > >>> creating a parser that can handle all of the possible noise
> > > > > > >>> that can exist in the production data. Even if it is possible
> > > > > > >>> to predict every kind of noise in production data, there is no
> > > > > > >>> point in the Metron community focusing on building a general
> > > > > > >>> purpose parser for a specific device while they can spend that
> > > > > > >>> time on developing a cool feature. Even if it is possible to
> > > > > > >>> predict noise and it is acceptable for the community to spend
> > > > > > >>> their time on creating that kind of parser, why would every
> > > > > > >>> Metron user need that extra normalisation? A user's data might
> > > > > > >>> be clean in the first place, and obviously the extra step only
> > > > > > >>> decreases the total throughput without any use for that
> > > > > > >>> specific user.
> > > > > > >>>
> > > > > > >>> Imagine there is an additional bolt for normalisation and there
> > > > > > >>> is a mechanism to customise the normalisation without changing
> > > > > > >>> the general parser for a specific device. We can have a general
> > > > > > >>> parser as a common parser for that device and leave the
> > > > > > >>> normalisation development to users. However, it is very
> > > > > > >>> important to make the normalisation step as fast as possible.
> > > > > >>>
> > > > > >>> Cheers,
> > > > > >>> Ali
> > > > > >>>
> > > > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <
> > [email protected]
> > > >
> > > > > >> wrote:
> > > > > >>>
> > > > > > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar.
> > > > > > >>>> I would expect the job of the parser, however, to handle
> > > > > > >>>> structural issues.  In my mind, parsing is about transforming
> > > > > > >>>> structures into fields and the role of the field
> > > > > > >>>> transformations is to transform values.  There's obvious
> > > > > > >>>> overlap there wherein parsers may do some
> > > > > > >>>> normalizations/transformations (i.e. look how grok handles
> > > > > > >>>> timestamps), but it almost always gets us into trouble when
> > > > > > >>>> parsers do even moderately complex value transformations.
> > > > > > >>>>
> > > > > > >>>> As I type this, though, I think I see your point.  What you
> > > > > > >>>> really want is to chain parsers, have a pre-parser to bring
> > > > > > >>>> you 80% of the way there and hammer out all the structural
> > > > > > >>>> issues so you might be able to use a more generic parser down
> > > > > > >>>> the chain.  I have often thought that maybe we should expose
> > > > > > >>>> parsers as Stellar functions which take raw data and emit
> > > > > > >>>> whole messages.  This would allow us to compose parsers, so
> > > > > > >>>> imagine the above example where you've written a stellar
> > > > > > >>>> function to normalize the input and you're then passing it to
> > > > > > >>>> a CSV parser, you could run
> > > > > > >>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise
> > > > > > >>>> specify a parser.
> > > > > > >>>>
> > > > > > >>>> As for speed, the stellar expression would get compiled into
> > > > > > >>>> a java object, so it shouldn't be appreciable overhead since
> > > > > > >>>> we no longer lex and parse for every message.
> > > > > >>>>
> > > > > >>>> Is this kinda how you were seeing it?
> > > > > >>>>
> > > > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > > > >>>> [email protected]> wrote:
> > > > > >>>>
> > > > > > >>>>> The challenge there I suspect is going to be that you
> > > > > > >>>>> essentially end up with the actual parser doing very little
> > > > > > >>>>> of value, and then effectively trying to write a parser in
> > > > > > >>>>> stellar against a few broad strings, which would likely give
> > > > > > >>>>> you all sorts of performance problems.
> > > > > > >>>>>
> > > > > > >>>>> One solution is to write a very defensive and flexible
> > > > > > >>>>> parser, but that would tend to be time consuming.
> > > > > > >>>>>
> > > > > > >>>>> There is also something to be said for doing some basic
> > > > > > >>>>> transformation before the parser kafka topic in something
> > > > > > >>>>> like nifi, but again, performance can be an issue there.
> > > > > > >>>>>
> > > > > > >>>>> If the noise is about broken structure for example, maybe a
> > > > > > >>>>> simple pre-process step as part of your parser would make
> > > > > > >>>>> sense, e.g. stripping syslog headers, or character set
> > > > > > >>>>> conversion, removing very broken bits as part of the parse
> > > > > > >>>>> method.
> > > > > > >>>>>
> > > > > > >>>>> In terms of normalisation post-parse, I agree, that's 100% a
> > > > > > >>>>> job for Stellar, and the fieldTransformations capability.
> > > > > > >>>>> Something I would like to see would be a means to use that
> > > > > > >>>>> transformation step to map to a well known (though loosely
> > > > > > >>>>> enforced) schema provided by a governance framework, but
> > > > > > >>>>> that is a much bigger topic of conversation.
> > > > > >>>>>
> > > > > > >>>>> Note of course that not everything has to be parsed just
> > > > > > >>>>> because it’s in the message. A relatively loose fitting
> > > > > > >>>>> parser which pulls out the relevant data for the use case
> > > > > > >>>>> would be fine, and likely a lot more tolerant of noise than
> > > > > > >>>>> something that felt the need for every field. We do after
> > > > > > >>>>> all store the original_string for you if you really
> > > > > > >>>>> absolutely have to have everything, so a more schema-on-read
> > > > > > >>>>> philosophy certainly applies and will likely side-step a lot
> > > > > > >>>>> of your issues.
> > > > > >>>>>
> > > > > >>>>> Simon
> > > > > >>>>>
> > > > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]>
> > > > wrote:
> > > > > >>>>>>
> > > > > > >>>>>> Ok, that's another story.  hmmmm, we don't generally
> > > > > > >>>>>> pre-parse because we try to not assume any particular
> > > > > > >>>>>> format there (i.e. it could be strings, could be byte
> > > > > > >>>>>> arrays).  Maybe the right answer is to pass the raw,
> > > > > > >>>>>> non-normalized data (best effort type of thing) through the
> > > > > > >>>>>> parser and do the normalization post-parse... or is there a
> > > > > > >>>>>> problem with that?
> > > > > >>>>>>
> > > > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <
> > > > > [email protected]>
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hi Casey,
> > > > > >>>>>>>
> > > > > > >>>>>>> It is actually a pre-parse process, not a post-parse one.
> > > > > > >>>>>>> These types of noise affect the position of an attribute,
> > > > > > >>>>>>> for example, and give us a parsing exception. The
> > > > > > >>>>>>> timestamp example was not a good one because that is
> > > > > > >>>>>>> actually a post-parse exception.
> > > > > >>>>>>>
> > > > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <
> > > > [email protected]
> > > > > >
> > > > > >>>>> wrote:
> > > > > >>>>>>>
> > > > > > >>>>>>>> So, further transformation post-parse was one of the
> > > > > > >>>>>>>> motivating reasons for Stellar (to do that transformation
> > > > > > >>>>>>>> post-parse).  Is there a capability that it's lacking
> > > > > > >>>>>>>> that we can add to fit your usecase?
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <
> > > > > >> [email protected]
> > > > > >>>>>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <
> > > > > >>>> [email protected]
> > > > > >>>>>>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > > >>>>>>>>>> Currently, we are using plain regex in the Java source
> > > > > > >>>>>>>>>> code to handle those situations. However, it would be
> > > > > > >>>>>>>>>> nice to have a separate bolt and deal with them
> > > > > > >>>>>>>>>> separately. Yeah, I can create a Jira issue regarding
> > > > > > >>>>>>>>>> that. The main reason I am asking for such a feature is
> > > > > > >>>>>>>>>> that the lack of it makes the process of creating a
> > > > > > >>>>>>>>>> parser for the community a little painful for us. We
> > > > > > >>>>>>>>>> need to maintain two different versions, one for the
> > > > > > >>>>>>>>>> community and another for the internal use case.
> > > > > > >>>>>>>>>> Clearly, noise is an inevitable part of real world use
> > > > > > >>>>>>>>>> cases.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Cheers,
> > > > > >>>>>>>>>> Ali
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
> > > > > >>>>>>> [email protected]
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Are you doing this cleansing all in the parser or are
> > > > > > >>>>>>>>>>> you using any Stellar to do it?
> > > > > > >>>>>>>>>>> Can you create a jira?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (
> > > > > >>>> [email protected])
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> We are facing certain use cases in Metron production
> > > > > > >>>>>>>>>>> that happen to be related to noisy streams. For
> > > > > > >>>>>>>>>>> example, a wrong timestamp, duplicate hostname/IP
> > > > > > >>>>>>>>>>> address, etc. To deal with the normalization we have
> > > > > > >>>>>>>>>>> added an additional step to the corresponding parsers
> > > > > > >>>>>>>>>>> to do the data cleaning. Clearly, parsing is a
> > > > > > >>>>>>>>>>> standard factor which is mostly related to the device
> > > > > > >>>>>>>>>>> that is generating the data and can be used for the
> > > > > > >>>>>>>>>>> same type of device everywhere, but normalization is
> > > > > > >>>>>>>>>>> very production dependent and there is no point in
> > > > > > >>>>>>>>>>> mixing normalization with parsing. It would be nice to
> > > > > > >>>>>>>>>>> have a separate bolt in the parsing topologies
> > > > > > >>>>>>>>>>> dedicated to the production-related cleaning process.
> > > > > > >>>>>>>>>>> In that case, everybody can easily contribute
> > > > > > >>>>>>>>>>> additional parsers to the Metron community without
> > > > > > >>>>>>>>>>> being worried about mixing parsers and the data
> > > > > > >>>>>>>>>>> cleaning process.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Ali
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> A.Nazemian
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> --
> > > > > >>>>>>>>> A.Nazemian
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> A.Nazemian
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> A.Nazemian
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > A.Nazemian
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



--
A.Nazemian
