A streaming token parser might well get you good performance for that format... maybe something like an ANTLR grammar or even a simple scanner. Regex is not the only pattern :)
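As a rough illustration of the "simple scanner" idea, here is a minimal sketch in Java. It is only an assumption about how such a scanner could look: the class name `AdRecordScanner` and the `timestamp` key are hypothetical, it is not wired into Metron's parser interface, and it assumes the record is a single string of whitespace-separated `Key=Value` pairs in which `Message` is the last key and its value is free-form text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical single-pass scanner for Key=Value AD records; not part of Metron or java-grok. */
final class AdRecordScanner {

    /**
     * Walks the record once, splitting on '=' and treating the run of
     * non-whitespace characters directly before each '=' as a key.
     * A value runs until the next key starts, so values may contain spaces.
     */
    static Map<String, String> scan(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        String pendingKey = "timestamp"; // assumption: text before the first key is the timestamp
        int valueStart = 0;

        for (int i = 0; i < record.length(); i++) {
            if (record.charAt(i) != '=') {
                continue;
            }
            // Walk back to the whitespace that precedes the key.
            int keyStart = i;
            while (keyStart > valueStart && !Character.isWhitespace(record.charAt(keyStart - 1))) {
                keyStart--;
            }
            if (keyStart == i) {
                continue; // '=' with no key in front of it: leave it inside the current value
            }
            fields.put(pendingKey, record.substring(valueStart, keyStart).trim());
            pendingKey = record.substring(keyStart, i);
            valueStart = i + 1;
            if (pendingKey.equals("Message")) {
                break; // assumption: Message is the last key; the rest is free-form text
            }
        }
        fields.put(pendingKey, record.substring(valueStart).trim());
        return fields;
    }

    public static void main(String[] args) {
        String sample = "12/02/2017 05:14:43 PM LogName=Security "
                + "SourceName=Microsoft Windows security auditing. EventCode=4625 "
                + "Keywords=Audit Success Message=An account failed to log on. Logon Type: 3";
        scan(sample).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

The scanner makes one pass over the characters and uses no regex. It does not handle values that themselves contain `=` before the `Message` field, and the colon-separated sub-fields inside `Message` (for example `Logon Type: 3`) would need a second, similar pass.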
It would also be great to see such a parser contributed back to the community if possible, and I'm sure we would be happy to help maintain and improve it in the open source.

Simon

> On 11 Jul 2018, at 16:26, Muhammed Irshad <irshadkt....@gmail.com> wrote:
>
> Otto Fowler,
>
> Yes, I am OK with the trade-offs. In the case of Active Directory log records,
> can I parse them using a non-regex custom parser? I think we need some pattern
> matching library, right, as it is plain text? One dummy AD
> record for my use case would look like the one below.
>
>
> 12/02/2017 05:14:43 PM LogName=Security SourceName=Microsoft Windows
> security auditing. EventCode=4625 EventType=0 Type=Information
> ComputerName=dc1.ad.ecorp.com TaskCategory=Logon OpCode=Info
> RecordNumber=95055509895231650867 Keywords=Audit Success Message=An account
> failed to log on. Subject: Security ID: NULL SID Account Name: - Account
> Domain: - Logon ID: 0x0 Logon Type: 3 Account For Which Logon Failed:
> Security ID: NULL SID Account Name: K1560365938U$ Account Domain: ECORP
> Failure Information: Failure Reason: Unknown user name or bad password.
> Status: 0xC000006D Sub Status: 0xC000006A Network Information: Workstation
> Name: K1560365938U Source Network Address: 192.168.151.95 Source Port:
> 53176 Detailed Authentification Information: Logon Process: NtLmSsp
> Authentification Package: NTLM Transited Services: - Package Name (NTLM
> ONLY): - Key Length: 0 This event is generated when a logon request fails.
> It is generated on the computer where access was attempted. The Subject
> fields indicate the account on the local system which requested the logon.
> This is most commonly a service such as the Server service, or a local
> process such as Winlogon.exe or Services.exe. The Logon Type field
> indicates the kind of logon that was requested. The most common types are 2
> (interactive) and 3 (network).
> The Process Information fields indicate
> which account and process on the system requested the logon. The Network
> Information fields indicate where a remote logon request originated.
> Workstation name is not always available and may be left blank in some
> cases. The authentication information fields provide detailed information
> about this specific logon request. Transited services indicate which
> intermediate services have participated in this logon request. Package name
> indicates which sub-protocol was used among the NTLM protocols
>
> On Wed, Jul 11, 2018 at 8:44 PM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
>> I am not saying it is faster, just giving some info.
>>
>> Also, that part of the documentation is not referring to regex v. grok,
>> but grok versus a custom non-regex parser, at least as I read it.
>>
>> If you have the ability to build, deploy, test and maintain a custom
>> parser ( unless you will be submitting it to the project? ), then in most
>> cases where performance is the top issue ( or rather throughput ) you are
>> most likely going to get better results that way. Accepting that you are
>> ok with the tradeoffs.
>>
>> If you have 10M mps, parsing might not be your bottleneck.
>>
>>
>> On July 11, 2018 at 11:01:19, Muhammed Irshad (irshadkt....@gmail.com)
>> wrote:
>>
>> Otto Fowler,
>>
>> Thanks for the reply. I saw it uses the same Java regex under the hood. I
>> got a bit sceptical on seeing this open issue
>> <https://github.com/thekrakken/java-grok/issues/75> in java-grok, which
>> says grok is much slower when compared with pure regex. The fix is not
>> available yet in Metron, as it needs a few changes in the API and the
>> issue to be closed. As the data volume is so huge in my requirement, I had
>> to double check and confirm before I go with one. Also, the Metron
>> documentation
>> <https://metron.apache.org/current-book/metron-platform/metron-parsers/index.html>
>> itself makes the statement below under the Parser Adapter section.
>>
>> "Grok parser adapters are designed primarily for someone who is not a Java
>> coder for quickly standing up a parser adapter for lower velocity
>> topologies. Grok relies on Regex for message parsing, which is much slower
>> than purpose-built Java parsers, but is more extensible. Grok parsers are
>> defined via a config file and the topology does not need to be recompiled
>> in order to make changes to them."
>>
>> On Wed, Jul 11, 2018 at 8:01 PM, Otto Fowler <ottobackwa...@gmail.com>
>> wrote:
>>
>>> Java-Grok IS java regex. It is just a DSL over Java regex. It takes grok
>>> expressions ( that can reference other expressions and be compound ) and
>>> parses/resolves them and then builds one big regex out of them.
>>> Also, Groks, once parsed / used are re-used, so at that point they are
>>> like compiled regexes.
>>>
>>> That is not to say that that takes 0 time, but it may help you to
>>> understand.
>>>
>>> https://github.com/thekrakken/java-grok/blob/master/src/main/java/io/krakens/grok/api/Grok.java
>>> https://github.com/thekrakken/java-grok/blob/master/src/main/java/io/krakens/grok/api/GrokCompiler.java
>>>
>>> On July 11, 2018 at 07:13:38, Muhammed Irshad (irshadkt....@gmail.com)
>>> wrote:
>>>
>>> Thanks a lot Kevin for replying. Which thread are you referring to? The
>>> stackoverflow link? I could not see any such option.
>>>
>>> On Wed, Jul 11, 2018 at 3:04 PM, Kevin Waterson <kevin.water...@gmail.com>
>>> wrote:
>>>
>>>> Like the thread says, the two regex engines are wildly different,
>>>> however.. you can increase the number of threads using the -w option in
>>>> grok.
>>>>
>>>> Kevin
>>>>
>>>> On Wed, Jul 11, 2018 at 5:35 PM Muhammed Irshad <irshadkt....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am trying to write a custom Java parser for parsing AD logs. I am
>>>>> expecting a log flow of 10 million AD events per second.
>>>>> Is using Java regex to parse
>>>>> a benefit over using the Grok parser in terms of performance? Are there
>>>>> any performance benchmarks or insights regarding the same?
>>>>>
>>>>> I found this stackoverflow
>>>>> <https://stackoverflow.com/questions/43222863/logstash-grok-filter-is-slower-than-java-regex-pattern-matching>
>>>>> question, which inspired this post.
>>>>>
>>>>> --
>>>>> Muhammed Irshad K T
>>>>> Senior Software Engineer
>>>>> +919447946359
>>>>> irshadkt....@gmail.com
>>>>> Skype : muhammed.irshad.k.t
>>>>>
>>>>
>>>
>>>
>>> --
>>> Muhammed Irshad K T
>>> Senior Software Engineer
>>> +919447946359
>>> irshadkt....@gmail.com
>>> Skype : muhammed.irshad.k.t
>>>
>>
>>
>> --
>> Muhammed Irshad K T
>> Senior Software Engineer
>> +919447946359
>> irshadkt....@gmail.com
>> Skype : muhammed.irshad.k.t
>>
>
>
> --
> Muhammed Irshad K T
> Senior Software Engineer
> +919447946359
> irshadkt....@gmail.com
> Skype : muhammed.irshad.k.t