[jira] [Updated] (METRON-1795) General Purpose Regex Parser

Jagdeep Singh (JIRA) Wed, 26 Sep 2018 22:23:34 -0700


     [ 
https://issues.apache.org/jira/browse/METRON-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jagdeep Singh updated METRON-1795:
----------------------------------
    Description: 
We have implemented a general purpose regex parser for Metron that we are 
interested in contributing back to the community.

 

While the Metron Grok parser provides some regex-based capability today, the 
intention of this general purpose regex parser is to:
 # Allow for more advanced parsing scenarios (specifically, handling multiple 
regular expressions for devices that emit several log formats)
 # Give users and developers of Metron additional options for parsing
 # Combine with the new parser chaining and regex routing feature in Metron to 
logically separate a flow into:
 ## Regex routing, which segregates logs at the device level and handles 
envelope unwrapping
 ## This general purpose regex parser, which parses an entire device type that 
contains multiple log formats within a single device (for example, RHEL logs)

At a high level, the control flow is:
 # Identify the record type of the incoming raw message.
 # Find the regular expression for that record type and apply it to extract 
the fields (using named groups).
 # Apply the message header regex to extract the fields in the header part of 
the message (using named groups).
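
The three steps above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the contributed parser: the class RegexFlowSketch, its methods, and the sample log line are hypothetical, and treating the whole match of the record type regex as the record type is an assumption. Only the regular expressions are taken from the config in this ticket. Group names are recovered by scanning the pattern text, which ignores corner cases such as an escaped "\(?<name>".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the described control flow; not the Metron implementation.
final class RegexFlowSketch {

    // Matches "(?<name>" group openers. Lookbehinds such as "(?<=" and "(?<!"
    // are skipped because '=' and '!' are not letters.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Lists the named groups declared in a regular expression, in order.
    static List<String> groupNames(String regex) {
        List<String> names = new ArrayList<>();
        Matcher m = GROUP_NAME.matcher(regex);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Applies the regex once and copies each named group that matched into
    // 'out'. Returns the whole matched text, or null if nothing matched.
    static String extract(String regex, String message, Map<String, String> out) {
        Matcher m = Pattern.compile(regex).matcher(message);
        if (!m.find()) {
            return null;
        }
        for (String name : groupNames(regex)) {
            String value = m.group(name);
            if (value != null) {
                out.put(name, value);
            }
        }
        return m.group();
    }

    static Map<String, String> parse(String message,
                                     String recordTypeRegex,
                                     String messageHeaderRegex,
                                     Map<String, String> fieldRegexByType) {
        Map<String, String> out = new LinkedHashMap<>();
        // Step 1: identify the record type (assumed here to be the whole match).
        String recordType = extract(recordTypeRegex, message, out);
        // Step 2: apply the regex registered for that record type, if any.
        if (recordType != null) {
            String fieldRegex = fieldRegexByType.get(recordType);
            if (fieldRegex != null) {
                extract(fieldRegex, message, out);
            }
        }
        // Step 3: apply the header regex shared by all record types.
        extract(messageHeaderRegex, message, out);
        return out;
    }

    // Runs the flow on a made-up syslog line using regexes from the ticket.
    static Map<String, String> demo() {
        Map<String, String> fieldRegexByType = new HashMap<>();
        fieldRegexByType.put("kernel", ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))");
        return parse(
            "<13>Sep 26 10:15:01 host1 kernel: eth0 up",
            "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
            "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
            fieldRegexByType);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

For the made-up sample line, the sketch yields the fields process, eventInfo, syslogpriority, timestamp and syslogHost. Note that eventInfo keeps the leading space after "kernel:", because the lookbehind anchors immediately after the colon.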

 

The parser config uses the following structure:

  
{code:java}
{
  "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
  "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
  "fields": [
    {
      "recordType": "kernel",
      "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
    },
    {
      "recordType": "syslog",
      "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
    }
  ]
}
{code}
 

Where:
 * *recordTypeRegex* uniquely identifies a record type. It takes a valid 
regular expression and may also contain named groups, which are extracted 
into fields.
 * *messageHeaderRegex* specifies a regular expression that extracts fields 
from the part of the message that is common to all messages (e.g., syslog 
fields, standard headers).
 * *fields* is a JSON list of objects, each containing a recordType and a 
regex. The regex that is evaluated is chosen based on the output of 
recordTypeRegex.
 * Note: *recordTypeRegex* and *messageHeaderRegex* can also be specified as 
lists (JSON arrays), in which case the list is evaluated in order until a 
matching regular expression is found.
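
The list form mentioned in the note can be pictured with a small Java sketch. It assumes that "evaluated in order" means the first regular expression that finds a match wins and the rest are skipped. The class OrderedRegexSketch, its methods, and the extra sshd entry are hypothetical; only the kernel|syslog expression comes from this ticket.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of in-order list evaluation; not the Metron implementation.
final class OrderedRegexSketch {

    // Tries each regex in list order and returns the text matched by the
    // first one that matches; later entries are not evaluated.
    static Optional<String> firstMatch(List<String> regexes, String message) {
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(message);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        return Optional.empty();
    }

    static Optional<String> demo(String message) {
        List<String> recordTypeRegexes = Arrays.asList(
            // Hypothetical additional entry, added only for illustration.
            "(?<process>(?<=\\s)\\b(sshd)\\b(?=\\[|:))",
            // Expression from the config in this ticket.
            "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))");
        return firstMatch(recordTypeRegexes, message);
    }

    public static void main(String[] args) {
        System.out.println(demo("<13>Sep 26 10:15:01 host1 kernel: eth0 up"));
    }
}
```

Returning an empty Optional when no entry matches leaves the decision about unmatched messages to the caller, which is one possible design and not necessarily what the contributed parser does.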



> General Purpose Regex Parser
> ----------------------------
>
>                 Key: METRON-1795
>                 URL: https://issues.apache.org/jira/browse/METRON-1795
>             Project: Metron
>          Issue Type: New Feature
>            Reporter: Jagdeep Singh
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
