The challenge there, I suspect, is that the actual parser ends up doing very 
little of value, and you are then effectively trying to write a parser in 
Stellar against a few broad strings, which would likely give you all sorts of 
performance problems. 

One solution is to write a very defensive and flexible parser, but that would 
tend to be time consuming. 

There is also something to be said for doing some basic transformation before 
the data reaches the parser's Kafka topic, in something like NiFi, but again, 
performance can be an issue there. 

If the noise is about broken structure, for example, maybe a simple pre-process 
step as part of your parser would make sense, e.g. stripping syslog headers, 
converting character sets, or removing very broken bits as part of the parse 
method. 
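
As a rough sketch of that kind of pre-process step: the class below is purely 
hypothetical (RawMessageCleaner and its patterns are not part of Metron). It 
assumes the raw message arrives as a byte array, that UTF-8 is the target 
character set, and that the noise is an RFC 3164-style syslog prefix plus 
stray control characters. Your real noise will differ, so treat the patterns 
as placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Metron: cleans a raw message before the real parse.
final class RawMessageCleaner {
    // Assumed header shape: optional <PRI>, "Mmm dd hh:mm:ss", then a host token.
    private static final Pattern SYSLOG_HEADER = Pattern.compile(
        "^(?:<\\d{1,3}>)?[A-Z][a-z]{2}\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2}\\s+\\S+\\s+");

    // Control characters other than tab, plus the U+FFFD replacement
    // character that the decoder substitutes for malformed bytes.
    private static final Pattern BROKEN_CHARS =
        Pattern.compile("[\\p{Cntrl}&&[^\\t]]|\\uFFFD");

    static String clean(byte[] rawMessage) {
        // Decoding as UTF-8 turns malformed byte sequences into U+FFFD.
        String text = new String(rawMessage, StandardCharsets.UTF_8);
        // Drop the syslog prefix if present; messages without one pass through.
        text = SYSLOG_HEADER.matcher(text).replaceFirst("");
        // Remove the characters that would otherwise shift field positions.
        text = BROKEN_CHARS.matcher(text).replaceAll("");
        return text.trim();
    }
}
```

You would call RawMessageCleaner.clean(rawMessage) as the first line of the 
parse method and hand the result to the existing parsing logic, so the 
device-specific parsing stays separate from the site-specific cleaning. 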

In terms of normalisation post-parse, I agree, that is 100% a job for Stellar 
and the fieldTransformations capability. Something I would like to see is a 
means to use that transformation step to map to a well-known (though loosely 
enforced) schema provided by a governance framework, but that is a much bigger 
topic of conversation.

Note, of course, that not everything has to be parsed just because it's in the 
message. A relatively loose-fitting parser which pulls out the relevant data 
for the use case would be fine, and likely a lot more tolerant of noise than 
something that felt the need for every field. We do, after all, store the 
original_string for you if you really, absolutely have to have everything, so a 
more schema-on-read philosophy certainly applies and will likely side-step a 
lot of your issues. 
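
To illustrate what I mean by loose-fitting, here is a minimal sketch. The 
class, the key=value message layout, and the src/action keys are all invented 
for the example; only original_string comes from the discussion above. The 
point is that each field is located by its own pattern rather than by 
position, so noise elsewhere in the message does not cause a parse exception.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example, not a Metron class: extracts only the fields the
// use case needs and ignores everything else in the message.
final class LooseParser {
    // Assumed message style: key=value pairs in any order, with arbitrary noise.
    private static final Pattern SRC_IP =
        Pattern.compile("\\bsrc=(\\d{1,3}(?:\\.\\d{1,3}){3})\\b");
    private static final Pattern ACTION = Pattern.compile("\\baction=(\\w+)");

    static Map<String, Object> parse(String message) {
        Map<String, Object> fields = new LinkedHashMap<>();
        // Always keep the raw message so nothing is lost (schema-on-read).
        fields.put("original_string", message);
        putIfFound(fields, "ip_src_addr", SRC_IP, message);
        putIfFound(fields, "action", ACTION, message);
        return fields;
    }

    // A missing or mangled field is simply left out instead of throwing.
    private static void putIfFound(Map<String, Object> fields, String name,
                                   Pattern pattern, String message) {
        Matcher m = pattern.matcher(message);
        if (m.find()) {
            fields.put(name, m.group(1));
        }
    }
}
```

With this approach a message in which the fields have moved or extra junk has 
appeared still yields whatever could be found, and anything else can be 
recovered later from original_string. 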

Simon

> On 26 Apr 2017, at 14:37, Casey Stella <ceste...@gmail.com> wrote:
> 
> Ok, that's another story.  Hmmmm, we don't generally pre-parse because we
> try to not assume any particular format there (i.e. it could be strings,
> could be byte arrays).  Maybe the right answer is to pass the raw,
> non-normalized data (best effort type of thing) through the parser and do
> the normalization post-parse... or is there a problem with that?
> 
> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <alinazem...@gmail.com> wrote:
> 
>> Hi Casey,
>> 
>> It is actually a pre-parse process, not a post-parse one. This type of
>> noise affects the position of an attribute, for example, and gives us a
>> parsing exception. The timestamp example was not a good one because that is
>> actually a post-parse exception.
>> 
>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <ceste...@gmail.com> wrote:
>> 
>>> So, further transformation post-parse was one of the motivating reasons
>> for
>>> Stellar (to do that transformation post-parse).  Is there a capability
>> that
>>> it's lacking that we can add to fit your usecase?
>>> 
>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <alinazem...@gmail.com>
>>> wrote:
>>> 
>>>> I've created a Jira ticket regarding this feature.
>>>> 
>>>> https://issues.apache.org/jira/browse/METRON-893
>>>> 
>>>> 
>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <alinazem...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Currently, we are using normal regex at the Java source code to
>> handle
>>>>> those situations. However, it would be nice to have a separate bolt
>> and
>>>>> deal with them separately. Yeah, I can create a Jira issue regarding
>>>> that.
>>>>> The main reason I am asking for such a feature is the fact that lack
>> of
>>>>> such a feature makes the process of creating some parser for the
>>>> community
>>>>> a little painful for us. We need to maintain two different versions,
>>> one
>>>>> for community another for the internal use case. Clearly, noise is an
>>>>> inevitable part of real world use cases.
>>>>> 
>>>>> Cheers,
>>>>> Ali
>>>>> 
>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
>> ottobackwa...@gmail.com
>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Are you doing this cleansing all in the parser or are you using any
>>>>>> Stellar to do it?
>>>>>> Can you create a jira?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazem...@gmail.com)
>>>>>> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> 
>>>>>> We are facing certain use cases in Metron production that happen to
>> be
>>>>>> related to noisy stream. For example, a wrong timestamp, duplicate
>>>>>> hostname/IP address, etc. To deal with the normalization we have
>> added
>>>> an
>>>>>> additional step for the corresponding parsers to do the data
>> cleaning.
>>>>>> Clearly, parsing is a standard factor which is mostly related to the
>>>>>> device
>>>>>> that is generating the data and can be used for the same type of
>>> device
>>>>>> everywhere, but normalization is very production dependent and there
>>> is
>>>>>> no
>>>>>> point of mixing normalization with parsing. It would be nice to
>> have a
>>>>>> separate bolt in the parsing topologies dedicated to the production
>>>>>> related cleaning process. In that case, everybody can easily
>> contribute
>>>> to
>>>>>> Metron community with additional parsers without being worried about
>>>>>> mixing
>>>>>> parsers and data cleaning process.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Ali
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> A.Nazemian
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> A.Nazemian
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> A.Nazemian
>> 
