Re: Normalization topology or separate normalization bolt for parsing topology

Simon Elliston Ball Thu, 27 Apr 2017 05:01:07 -0700

In that instance, you're looking at valid syslog which should be parsed as
such. The repeat host is not really a host in syslog terms, it's an application 
name header which happens to be the same. This is definitely a parser bug which 
should be handled, esp since the header is perfectly RFC compliant.
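
A minimal sketch of that reading, assuming a BSD-style header layout
(priority, timestamp, host, optional application name, message). The pattern
and the names SYSLOG_LINE and split_syslog are illustrative assumptions, not
Metron's parser, and the optional-name rule relies on the message starting
with '%', as it does in the ASA sample quoted below.

```python
import re

# Hypothetical pattern: <PRI>timestamp host [app-name] message.
# Assumption: the message body starts with '%', so a bare word between the
# host and the message is treated as an application name header.
SYSLOG_LINE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?:(?P<app>[^\s%:]+) )?"
    r"(?P<message>.*)$",
    re.DOTALL,
)


def split_syslog(line):
    """Split a syslog line into header fields, or return None if it does not match."""
    m = SYSLOG_LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # PRI encodes facility * 8 + severity.
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields


sample = (
    "<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP "
    "connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0 (ryanmar)"
)
fields = split_syslog(sample)
```

For the sample line, host and app both come out as "hostname", with
facility 20 and severity 6 (166 = 20 * 8 + 6).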


Do you have any other such cases? My view is that parsers should be written
to tolerate more cases, so they extract all the fields they can from malformed
logs rather than throwing exceptions, but that's more about the way we write
parsers than having some kind of pre-clean.
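
A rough sketch of that approach, assuming regex-based extraction. The
extractor table, parse_best_effort and the unparsed_fields key are
hypothetical names, not part of Metron. Each field is looked up
independently, so a malformed log yields a partial record instead of an
exception.

```python
import re

# Hypothetical extractors for the ASA teardown message; each one is tried on
# its own so that one missing field does not discard the others.
EXTRACTORS = {
    "asa_tag": re.compile(r"%(ASA-\d-\d+):"),
    "protocol": re.compile(r"Teardown (\w+) connection"),
    "faddr": re.compile(r"faddr (\S+)"),
    "gaddr": re.compile(r"gaddr (\S+)"),
    "laddr": re.compile(r"laddr (\S+)"),
}


def parse_best_effort(raw):
    """Return every field that can be found, plus the names of those that were not."""
    record = {"original_string": raw}
    missing = []
    for name, pattern in EXTRACTORS.items():
        match = pattern.search(raw)
        if match:
            record[name] = match.group(1)
        else:
            missing.append(name)
    # Assumption: downstream steps can decide what to do with partial records.
    record["unparsed_fields"] = missing
    return record
```

A line with a broken header but an intact laddr would still produce a record
carrying laddr and original_string, with the rest listed under
unparsed_fields.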

Simon


> On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]> wrote:
> 
> I do agree there is a fair amount of overhead in using another bolt for
> this purpose. I am not prescribing how it should be implemented. There
> might be a way to separate the two extension points without adding
> overhead; I haven't thought about it yet. However, the main issue is that
> sometimes the noise is of a kind that generates an exception on the
> parsing side. For example, have a look at the following log:
> 
> <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> (ryanmar)
> 
> Clearly the duplicate syslog_host throws an exception during parsing, so
> how are we going to deal with that in a post-parse transformation? It
> never gets past parsing. This is only a single example of the cases that
> might affect production data. Unless Stellar transformation is something
> that can be done at pre-parse and for the entire message.
> 
> 
> On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> [email protected]> wrote:
> 
>> Ali,
>> 
>> Sounds very much like what you’re talking about when you say
>> normalization, and what I would understand it as, is the process fulfilled
>> by Stellar field transformation in the parser config. Agreed that some of
>> these will be general, based on a common Metron standard schema, but others
>> will be organisation specific (custom fields overloaded with different
>> meanings, for instance in CEF). These are very much one of the
>> reasons we have the Stellar transformation step. I don’t think that should
>> be moved to a separate bolt to be honest, because that comes with a fair
>> amount of overhead, but logically it is in the parser config rather than
>> the parser, so seems to serve this purpose in the post-parse transform, no?
>> 
>> Simon
>> 
>> 
>> 
>>> On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
>>> 
>>> Hi Simon,
>>> 
>>> The reason I am asking for a specific normalisation step is that
>>> normalisation is not a general use case which can be reused by other
>>> users. It is completely bound to our application. The way we have fixed
>>> it, for now, is to add a normalisation step to the parser and clean the
>>> incoming data so the parser step can work on it, but I don't like it.
>>> There is no point in creating a parser that can handle all of the
>>> possible noise that can exist in production data. Even if it is possible
>>> to predict every kind of noise in production data, there is no point in
>>> the Metron community focusing on building a general-purpose parser for a
>>> specific device when they could spend that time developing a cool
>>> feature. Even if it is possible to predict the noise, and it is
>>> acceptable for the community to spend their time on creating that kind
>>> of parser, why would every Metron user need that extra normalisation? A
>>> user's data might already be clean at the first step, in which case it
>>> only decreases the total throughput without any benefit for that
>>> specific user.
>>> 
>>> Imagine there is an additional bolt for normalisation and there is a
>>> mechanism to customise the normalisation without changing the general
>>> parser for a specific device. We can have a general parser as a common
>>> parser for that device and leave the normalisation development to users.
>>> However, it is very important to provide the normalisation step as fast
>>> as possible.
>>> 
>>> Cheers,
>>> Ali
>>> 
>>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]> wrote:
>>> 
>>>> Yeah, we definitely don't want to rewrite parsing in Stellar.  I would
>>>> expect the job of the parser, however, to be handling structural issues.
>>>> In my mind, parsing is about transforming structures into fields and the
>>>> role of the field transformations is to transform values.  There's
>>>> obvious overlap there wherein parsers may do some
>>>> normalizations/transformations (e.g. look how grok handles timestamps),
>>>> but it almost always gets us into trouble when parsers do even
>>>> moderately complex value transformations.
>>>> 
>>>> As I type this, though, I think I see your point.  What you really want
>>>> is to chain parsers: have a pre-parser bring you 80% of the way there
>>>> and hammer out all the structural issues so you might be able to use a
>>>> more generic parser down the chain.  I have often thought that maybe we
>>>> should expose parsers as Stellar functions which take raw data and emit
>>>> whole messages.  This would allow us to compose parsers. So imagine the
>>>> above example, where you've written a Stellar function to normalize the
>>>> input and you're then passing it to a CSV parser: you could run
>>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a
>>>> parser.
>>>> 
>>>> As for speed, the Stellar expression would get compiled into a Java
>>>> object, so there shouldn't be appreciable overhead since we no longer
>>>> lex and parse for every message.
>>>> 
>>>> Is this kinda how you were seeing it?
>>>> 
>>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
>>>> [email protected]> wrote:
>>>> 
>>>>> The challenge there I suspect is going to be that you essentially end
>>>>> up with the actual parser doing very little of value, and then
>>>>> effectively trying to write a parser in Stellar against a few broad
>>>>> strings, which would likely give you all sorts of performance problems.
>>>>> 
>>>>> One solution is to write a very defensive and flexible parser, but that
>>>>> would tend to be time consuming.
>>>>> 
>>>>> There is also something to be said for doing some basic transformation
>>>>> before the parser's Kafka topic, in something like NiFi, but again,
>>>>> performance can be an issue there.
>>>>> 
>>>>> If the noise is about broken structure, for example, maybe a simple
>>>>> pre-process step as part of your parser would make sense, e.g.
>>>>> stripping syslog headers, character set conversion, or removing very
>>>>> broken bits as part of the parse method.
>>>>> 
>>>>> In terms of normalisation post-parse, I agree, that's 100% a job for
>>>>> Stellar and the fieldTransformations capability. Something I would
>>>>> like to see would be a means to use that transformation step to map to
>>>>> a well-known (though loosely enforced) schema provided by a governance
>>>>> framework, but that is a much bigger topic of conversation.
>>>>> 
>>>>> Note of course that not everything has to be parsed just because it’s
>>>>> in the message. A relatively loose-fitting parser which pulls out the
>>>>> relevant data for the use case would be fine, and likely a lot more
>>>>> tolerant of noise than something that felt the need for every field. We
>>>>> do after all store the original_string for you if you really absolutely
>>>>> have to have everything, so a more schema-on-read philosophy certainly
>>>>> applies and will likely side-step a lot of your issues.
>>>>> 
>>>>> Simon
>>>>> 
>>>>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
>>>>>> 
>>>>>> Ok, that's another story.  Hmmmm, we don't generally pre-parse because
>>>>>> we try to not assume any particular format there (i.e. it could be
>>>>>> strings, could be byte arrays).  Maybe the right answer is to pass the
>>>>>> raw, non-normalized data (best effort type of thing) through the parser
>>>>>> and do the normalization post-parse... or is there a problem with that?
>>>>>> 
>>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Casey,
>>>>>>> 
>>>>>>> It is actually a pre-parse process, not a post-parse one. These types
>>>>>>> of noise affect the position of an attribute, for example, and give us
>>>>>>> a parsing exception. The timestamp example was not a good one because
>>>>>>> that is actually a post-parse exception.
>>>>>>> 
>>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]> wrote:
>>>>>>> 
>>>>>>>> So, further transformation post-parse was one of the motivating
>>>>>>>> reasons for Stellar (to do that transformation post-parse).  Is there
>>>>>>>> a capability that it's lacking that we can add to fit your use case?
>>>>>>>> 
>>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I've created a Jira ticket regarding this feature.
>>>>>>>>> 
>>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Currently, we are using plain regex in the Java source code to
>>>>>>>>>> handle those situations. However, it would be nice to have a
>>>>>>>>>> separate bolt and deal with them separately. Yeah, I can create a
>>>>>>>>>> Jira issue regarding that. The main reason I am asking for such a
>>>>>>>>>> feature is that the lack of it makes the process of creating
>>>>>>>>>> parsers for the community a little painful for us. We need to
>>>>>>>>>> maintain two different versions, one for the community and another
>>>>>>>>>> for the internal use case. Clearly, noise is an inevitable part of
>>>>>>>>>> real-world use cases.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Ali
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> Are you doing this cleansing all in the parser or are you using any
>>>>>>>>>>> Stellar to do it?
>>>>>>>>>>> Can you create a jira?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected])
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> We are facing certain use cases in Metron production that happen to
>>>>>>>>>>> be related to noisy streams: for example, a wrong timestamp, a
>>>>>>>>>>> duplicate hostname/IP address, etc. To deal with the normalization
>>>>>>>>>>> we have added an additional step to the corresponding parsers to do
>>>>>>>>>>> the data cleaning. Clearly, parsing is a standard concern which is
>>>>>>>>>>> mostly tied to the device that is generating the data and can be
>>>>>>>>>>> reused for the same type of device everywhere, but normalization is
>>>>>>>>>>> very production dependent and there is no point in mixing
>>>>>>>>>>> normalization with parsing. It would be nice to have a separate bolt
>>>>>>>>>>> in the parsing topologies dedicated to the production-related
>>>>>>>>>>> cleaning process. In that case, everybody can easily contribute
>>>>>>>>>>> additional parsers to the Metron community without being worried
>>>>>>>>>>> about mixing parsers and the data cleaning process.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> 
>>>>>>>>>>> Ali
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> A.Nazemian
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> A.Nazemian
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> A.Nazemian
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> A.Nazemian
>> 
>> 
> 
> 
> -- 
> A.Nazemian
