[
https://issues.apache.org/jira/browse/METRON-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Justin Leet updated METRON-1795:
--------------------------------
Fix Version/s: 0.7.1
> General Purpose Regex Parser
> ----------------------------
>
> Key: METRON-1795
> URL: https://issues.apache.org/jira/browse/METRON-1795
> Project: Metron
> Issue Type: New Feature
> Reporter: Jagdeep Singh
> Priority: Minor
> Fix For: 0.7.1
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> We have implemented a general purpose regex parser for Metron that we are
> interested in contributing back to the community.
>
> While the Metron Grok parser provides some regex based capability today, the
> intention of this general purpose regex parser is to:
> # Allow for more advanced parsing scenarios (specifically, dealing with
> multiple regex lines for devices that contain several log formats within them)
> # Give users and developers of Metron additional options for parsing
> # With the new parser chaining and regex routing feature available in
> Metron, this gives some additional flexibility to logically separate a flow
> by:
> # Regex routing to segregate logs at a device level and handle envelope
> unwrapping
> # This general purpose regex parser to parse an entire device type that
> contains multiple log formats within the single device (for example, RHEL
> logs)
> At the high-level control flow is like this:
> # Identify the record type if incoming raw message.
> # Find and apply the regular expression of corresponding record type to
> extract the fields (using named groups).
> # Apply the message header regex to extract the fields in the header part of
> the message (using named groups).
>
> The parser config uses the following structure:
>
> {code:java}
> "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))"
> "messageHeaderRegex":
> "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
> "fields": [
> {
> "recordType": "kernel",
> "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
> },
> {
> "recordType": "syslog",
> "regex":
> ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
> }
> ]
> {code}
>
> Where:
> * *recordTypeRegex* is used to distinctly identify a record type. It inputs
> a valid regular expression and may also have named groups, which would be
> extracted into fields.
> * *messageHeaderRegex* is used to specify a regular expression to extract
> fields from a message part which is common across all the messages (i.e,
> syslog fields, standard headers)
> * *fields*: json list of objects containing recordType and regex. The
> expression that is evaluated is based on the output of the recordTypeRegex
> * Note: *recordTypeRegex* and *messageHeaderRegex* could be specified as
> lists also (as a JSON array), where the list will be evaluated in order until
> a matching regular expression is found.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)