Re: Performance comparison between Grok and Java regex

Simon Elliston Ball Wed, 11 Jul 2018 08:37:32 -0700

A streaming token parser might well get you good performance for that format... 
maybe something like an antlr grammar or even a simple scanner. Regex is not 
the only pattern :)
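
To make the "simple scanner" idea concrete, here is a rough sketch in plain Java. It makes one pass over the record and splits the top-level key=value pairs without any regex. The class name KeyValueScanner, the method scan and the "timestamp" field name are made up for illustration and are not part of Metron or java-grok. It also assumes keys contain no whitespace and that a value runs until the next key= token, so everything after Message= ends up as one value that would still need a second pass for the "Name: value" sub-fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a single-pass key=value scanner, not a Metron parser.
final class KeyValueScanner {

    // Hypothetical field name for the text that precedes the first key=value pair.
    private static final String LEADING = "timestamp";

    static Map<String, String> scan(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        String currentKey = LEADING;
        int valueStart = 0; // where the value of currentKey begins
        int wordStart = 0;  // where the current whitespace-delimited word begins
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                wordStart = i + 1;
            } else if (c == '=' && i > wordStart) {
                // The word right before '=' is a new key, so the previous
                // value ends where that word starts.
                put(fields, currentKey, line, valueStart, wordStart);
                currentKey = line.substring(wordStart, i);
                valueStart = i + 1;
                wordStart = i + 1;
            }
        }
        // Flush the last key; its value runs to the end of the record.
        put(fields, currentKey, line, valueStart, line.length());
        return fields;
    }

    private static void put(Map<String, String> fields, String key,
                            String line, int start, int end) {
        String value = line.substring(start, end).trim();
        // Skip the leading pseudo-field when the record starts directly with a key.
        if (!value.isEmpty() || !key.equals(LEADING)) {
            fields.put(key, value);
        }
    }

    public static void main(String[] args) {
        String record = "12/02/2017 05:14:43 PM LogName=Security "
                + "SourceName=Microsoft Windows security auditing. "
                + "EventCode=4625 Message=An account failed to log on.";
        scan(record).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A scanner like this visits each character once and never backtracks, which is where the throughput gain over a large generated regex would be expected to come from. Whether it actually wins at your volume is something only a benchmark on real records can show.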


It would also be great to see such a parser contributed back to the community 
if possible, and I am sure we would be happy to help maintain and improve it in 
the open source.

Simon

> On 11 Jul 2018, at 16:26, Muhammed Irshad <[email protected]> wrote:
> 
> Otto Fowler,
> 
> Yes, I am OK with the trade-offs. In the case of Active Directory log
> records, can I parse them using a non-regex custom parser? I think we need
> some pattern matching library, right, as it is plain text? One of the dummy
> AD records of my use case would look like the one below.
> 
> 
> 12/02/2017 05:14:43 PM LogName=Security SourceName=Microsoft Windows
> security auditing. EventCode=4625 EventType=0 Type=Information ComputerName=
> dc1.ad.ecorp.com TaskCategory=Logon OpCode=Info
> RecordNumber=95055509895231650867 Keywords=Audit Success Message=An account
> failed to log on. Subject: Security ID: NULL SID Account Name: - Account
> Domain: - Logon ID: 0x0 Logon Type: 3 Account For Which Logon Failed:
> Security ID: NULL SID Account Name: K1560365938U$ Account Domain: ECORP
> Failure Information: Failure Reason: Unknown user name or bad password.
> Status: 0xC000006D Sub Status: 0xC000006A Network Information: Workstation
> Name: K1560365938U Source Network Address: 192.168.151.95 Source Port:
> 53176 Detailed Authentification Information: Logon Process: NtLmSsp
> Authentification Package: NTLM Transited Services: - Package Name (NTLM
> ONLY): - Key Length: 0 This event is generated when a logon request fails.
> It is generated on the computer where access was attempted. The Subject
> fields indicate the account on the local system which requested the logon.
> This is most commonly a service such as the Server service, or a local
> process such as Winlogon.exe or Services.exe. The Logon Type field
> indicates the kind of logon that was requested. The most common types are 2
> (interactive) and 3 (network). The Process Information fields indicate
> which account and process on the system requested the logon. The Network
> Information fields indicate where a remote logon request originated.
> Workstation name is not always available and may be left blank in some
> cases. The authentication information fields provide detailed information
> about this specific logon request. Transited services indicate which
> intermediate services have participated in this logon request. Package name
> indicates which sub-protocol was used among the NTLM protocols
> 
> On Wed, Jul 11, 2018 at 8:44 PM, Otto Fowler <[email protected]>
> wrote:
> 
>> I am not saying it is faster, just giving some info.
>> 
>> Also, that part of the documentation is not referring to regex vs. grok,
>> but grok versus a custom non-regex parser, at least as I read it.
>> 
>> If you have the ability to build, deploy, test and maintain a custom
>> parser ( unless you will be submitting it to the project? ), then in most
>> cases where performance
>> is the top issue ( or rather throughput ) you are most likely going to get
>> better results that way.  Accepting that you are ok with the tradeoffs.
>> 
>> If you have 10M mps, parsing might not be your bottleneck.
>> 
>> 
>> 
>> 
>> 
>> On July 11, 2018 at 11:01:19, Muhammed Irshad ([email protected])
>> wrote:
>> 
>> Otto Fowler,
>> 
>> Thanks for the reply. I saw that it uses the same Java regex under the hood.
>> I got a bit sceptical after seeing this open issue
>> <https://github.com/thekrakken/java-grok/issues/75> in java-grok, which says
>> grok is much slower when compared with pure regex. The fix is not available
>> yet in Metron, as it needs a few changes in the API and the issue to be
>> closed. As the data volume in my requirement is so huge, I had to double
>> check and confirm before I go with one. Also, the Metron documentation
>> <https://metron.apache.org/current-book/metron-platform/metron-parsers/index.html>
>> itself says the following under the Parser Adapter section.
>> 
>> "Grok parser adapters are designed primarily for someone who is not a Java
>> coder for quickly standing up a parser adapter for lower velocity
>> topologies. Grok relies on Regex for message parsing, which is much slower
>> than purpose-built Java parsers, but is more extensible. Grok parsers are
>> defined via a config file and the topology does not need to be recompiled
>> in order to make changes to them."
>> 
>> On Wed, Jul 11, 2018 at 8:01 PM, Otto Fowler <[email protected]>
>> wrote:
>> 
>>> Java-Grok IS java regex. It is just a DSL over Java regex. It takes grok
>>> expressions ( that can reference other expressions and be compound ) and
>>> parses/resolves them and then builds one big regex out of them.
>>> Also, Groks, once parsed / used are re-used, so at that point they are
>>> like compiled regex’s.
>>> 
>>> That is not to say that that takes 0 time, but it may help you to
>>> understand.
>>> 
>>> https://github.com/thekrakken/java-grok/blob/master/src/main/java/io/krakens/grok/api/Grok.java
>>> https://github.com/thekrakken/java-grok/blob/master/src/main/java/io/krakens/grok/api/GrokCompiler.java
>>> 
>>> On July 11, 2018 at 07:13:38, Muhammed Irshad ([email protected])
>>> wrote:
>>> 
>>> Thanks a lot Kevin for replying. Which thread are you mentioning ? The
>>> stackoverflow link ? I could not see any such option.
>>> 
>>> On Wed, Jul 11, 2018 at 3:04 PM, Kevin Waterson <[email protected]>
>>> wrote:
>>> 
>>>> Like the thread says, the two regex engines are wildly different.
>>>> However, you can increase the number of threads using the -w option in
>>>> grok.
>>>> 
>>>> Kevin
>>>> 
>>>> On Wed, Jul 11, 2018 at 5:35 PM Muhammed Irshad <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi All,
>>>>> 
>>>>> I am trying to write a custom Java parser for parsing AD logs. I am
>>>>> expecting a log flow of 10 million AD events per second. Does using Java
>>>>> regex for parsing offer any benefit over the Grok parser in terms of
>>>>> performance? Is there any performance benchmark or insight regarding
>>>>> the same?
>>>>> 
>>>>> I found this stackoverflow question
>>>>> <https://stackoverflow.com/questions/43222863/logstash-grok-filter-is-slower-than-java-regex-pattern-matching>
>>>>> which inspired me for this post.
>>>>> 
>>>>> --
>>>>> Muhammed Irshad K T
>>>>> Senior Software Engineer
>>>>> +919447946359
>>>>> [email protected]
>>>>> Skype : muhammed.irshad.k.t
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Muhammed Irshad K T
>>> Senior Software Engineer
>>> +919447946359
>>> [email protected]
>>> Skype : muhammed.irshad.k.t
>>> 
>>> 
>> 
>> 
>> --
>> Muhammed Irshad K T
>> Senior Software Engineer
>> +919447946359
>> [email protected]
>> Skype : muhammed.irshad.k.t
>> 
>> 
> 
> 
> -- 
> Muhammed Irshad K T
> Senior Software Engineer
> +919447946359
> [email protected]
> Skype : muhammed.irshad.k.t
