[
https://issues.apache.org/jira/browse/METRON-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661752#comment-16661752
]
ASF GitHub Bot commented on METRON-1795:
----------------------------------------
GitHub user jagdeepsingh2 opened a pull request:
https://github.com/apache/metron/pull/1245
METRON-1795: Initial Commit for Regular Expressions Parser
## Contributor Comments
Contributing a new general-purpose, regular-expression-based parser.
## Pull Request Checklist
Thank you for submitting a contribution to Apache Metron.
Please refer to our [Development
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
for the complete guide to follow for contributions.
Please refer also to our [Build Verification
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow
these guidelines and ask you to double check the following:
### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? If not one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
**Yes. Jira created for this PR.
https://issues.apache.org/jira/browse/METRON-1795**
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
**Yes.**
- [ ] Has your PR been rebased against the latest commit within the target
branch (typically master)?
**Yes**
### For code changes:
- [ ] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
**N/A as this PR is for a new feature.**
- [ ] Have you included steps or a guide to how the change may be verified
and tested manually?
**Yes. The included JUnit tests can be used to test the new parser.**
- [ ] Have you ensured that the full suite of tests and checks have been
executed in the root metron folder via:
```
mvn -q clean integration-test install &&
dev-utilities/build-utils/verify_licenses.sh
```
**Yes.**
- [ ] Have you written or updated unit tests and or integration tests to
verify your changes?
**I have included the unit tests.**
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
**N/A**
- [ ] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
**Yes**
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in
which it is rendered by building and verifying the site-book? If not then run
the following commands and then verify changes via
`site-book/target/site/index.html`:
```
cd site-book
mvn site
```
**Yes.**
#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up
for your personal repository such that your branches are built there before
submitting a pull request.
Note: This is a follow-up to an earlier PR for METRON-1795, which was
created and subsequently closed because of a corrupted git commit history. The
following comments were posted on the earlier PR; I am reposting them here with
my disposition.
@nickwallen commented 27 days ago
Thanks for the contribution @jagdeepsingh2. To take this any further we
need at a minimum the following items.
**An explanation of what itch this scratches (Why is this needed over Grok
parser?)**
This question was answered in the associated Jira ticket
(https://issues.apache.org/jira/browse/METRON-1795). In a nutshell:
- Allow for more advanced parsing scenarios (specifically, dealing with
multiple regex lines for devices that contain several log formats within them).
- Give users and developers of Metron additional options for parsing.
- With the new parser chaining and regex routing feature available in Metron,
this gives some additional flexibility to logically separate a flow by:
  - Regex routing to segregate logs at a device level and handle envelope
  unwrapping.
  - This general purpose regex parser to parse an entire device type that
  contains multiple log formats within the single device (for example, RHEL logs).
Also, according to the GrokParser documentation
(https://cwiki.apache.org/confluence/display/METRON/Parsing+Topology), the Grok
parser is intended for low-volume scenarios only, whereas we have also tested
this parser (RegularExpressionsParser) in very high-volume scenarios.
**Documented Instructions on how to use your parser. Include a README.md in
your code contribution.**
I have updated the README.md file in the metron-parsers project.
**A test plan including in your PR description showing us how to spin-up
and test your parser**
I have included the junit test for this parser, included the JavaDoc and
also updated the README.md file in the metron-parsers project. The
documentation when used in conjunction with unit tests is enough to test and
spin-up this parser.
**A description of how you have personally tested this**
We have unit and integration tested this parser for lots of different
devices. This parser has also been successfully running in our production
environment for more than six months now.
mmiklavc commented 22 days ago
**@jagdeepsingh2 Some emphasis on the configuration options for this parser
would be particularly useful.
Please refer to
https://github.com/apache/metron/tree/master/metron-platform/metron-parsers for
some good examples of how we document existing Metron parsers.**
Thanks, I have added the documentation to the current PR.
jagdeepsingh2 commented 19 days ago
@mmiklavc Yeah, I performed a rebase yesterday as I had to pull the latest
changes from upstream. What is the best way out? Should I discard this PR and
create a fresh and clean PR?
mmiklavc commented 18 days ago
@jagdeepsingh2 - you could try this -
https://stackoverflow.com/questions/134882/undoing-a-git-rebase, but at this
point it might be better to just open a new PR bc pushing up to github is going
to cause some additional drama as well. You'll want to keep the default
checklist that's populated in the description when you open the PR. Please note
the comments from @nickwallen and myself regarding what should also be included
in your description.
In general, once you've pushed a branch to the public it's better to just
git merge, otherwise you can get into trouble like this. We flatten PR's once
they're committed to master anyhow.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jagdeepsingh2/metron master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/1245.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1245
----
commit fefb21c5b74d8021107986e2936017042ae54d0e
Author: jagdeep <jagdeep.singh.2@...>
Date: 2018-10-24T04:24:47Z
METRON-1795: Initial Commit for Regular Expressions Parser
----
> General Purpose Regex Parser
> ----------------------------
>
> Key: METRON-1795
> URL: https://issues.apache.org/jira/browse/METRON-1795
> Project: Metron
> Issue Type: New Feature
> Reporter: Jagdeep Singh
> Priority: Minor
>
> We have implemented a general purpose regex parser for Metron that we are
> interested in contributing back to the community.
>
> While the Metron Grok parser provides some regex based capability today, the
> intention of this general purpose regex parser is to:
> # Allow for more advanced parsing scenarios (specifically, dealing with
> multiple regex lines for devices that contain several log formats within them)
> # Give users and developers of Metron additional options for parsing
> # With the new parser chaining and regex routing feature available in
> Metron, this gives some additional flexibility to logically separate a flow
> by:
> # Regex routing to segregate logs at a device level and handle envelope
> unwrapping
> # This general purpose regex parser to parse an entire device type that
> contains multiple log formats within the single device (for example, RHEL
> logs)
> At a high level, the control flow is:
> # Identify the record type of the incoming raw message.
> # Find and apply the regular expression for the corresponding record type to
> extract the fields (using named groups).
> # Apply the message header regex to extract the fields in the header part of
> the message (using named groups).
>
> The parser config uses the following structure:
>
> {code:java}
> "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
> "messageHeaderRegex":
> "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
> "fields": [
> {
> "recordType": "kernel",
> "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
> },
> {
> "recordType": "syslog",
> "regex":
> ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
> }
> ]
> {code}
>
> Where:
> * *recordTypeRegex* is used to uniquely identify a record type. It takes
> a valid regular expression and may also contain named groups, which are
> extracted into fields.
> * *messageHeaderRegex* specifies a regular expression that extracts
> fields from the part of a message that is common across all messages (e.g.,
> syslog fields, standard headers).
> * *fields*: a JSON list of objects, each containing a recordType and a regex.
> The expression that is evaluated is chosen based on the output of the
> recordTypeRegex.
> * Note: *recordTypeRegex* and *messageHeaderRegex* can also be specified as
> lists (JSON arrays), in which case the list is evaluated in order until
> a matching regular expression is found.
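
The three-step control flow and the config keys described above can be illustrated with a minimal Java sketch. This is not the code from the pull request: the class `RegexParserSketch`, its `parse` and `extract` methods, the sample log line, and the choice to treat the whole `recordTypeRegex` match as the record type are all assumptions made for illustration. Only the regular expressions are taken from the ticket.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the RegularExpressionsParser from the PR.
class RegexParserSketch {
    // Regexes copied from the ticket's example config.
    static final String RECORD_TYPE_REGEX =
            "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))";
    static final String MESSAGE_HEADER_REGEX =
            "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?"
            + "(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?"
            + "(?<syslogHost>(?<=\\s).*?(?=\\s))";
    static final String KERNEL_REGEX = ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))";

    // Finds named-group declarations such as (?<timestamp> in a regex source string.
    // Lookbehinds, (?<= and (?<!, are skipped because a group name starts with a letter.
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Copies each named group of the first match into 'out' (values are trimmed).
    static boolean extract(String regex, String message, Map<String, String> out) {
        Matcher m = Pattern.compile(regex).matcher(message);
        if (!m.find()) {
            return false;
        }
        Matcher names = GROUP_NAME.matcher(regex);
        while (names.find()) {
            String value = m.group(names.group(1));
            if (value != null) {
                out.put(names.group(1), value.trim());
            }
        }
        return true;
    }

    static Map<String, String> parse(String message, String recordTypeRegex,
                                     String messageHeaderRegex,
                                     Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        // Step 1: identify the record type (assumption: the whole match is the type).
        Matcher recordTypeMatcher = Pattern.compile(recordTypeRegex).matcher(message);
        if (!recordTypeMatcher.find()) {
            return out; // unknown record type: nothing is extracted in this sketch
        }
        String recordType = recordTypeMatcher.group();
        extract(recordTypeRegex, message, out);
        // Step 2: apply the regex registered for that record type.
        String fieldRegex = fields.get(recordType);
        if (fieldRegex != null) {
            extract(fieldRegex, message, out);
        }
        // Step 3: apply the header regex shared by all record types.
        extract(messageHeaderRegex, message, out);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("kernel", KERNEL_REGEX);
        // Made-up sample line, not taken from the ticket.
        String line = "<86>Apr 15 17:47:28 host1 kernel: device eth0 entered promiscuous mode";
        System.out.println(parse(line, RECORD_TYPE_REGEX, MESSAGE_HEADER_REGEX, fields));
    }
}
```

Group names are recovered by scanning the regex source because `java.util.regex` only exposes them directly in recent JDK releases. A production parser would also precompile its patterns once rather than on every message, and would need to handle the list form of `recordTypeRegex` and `messageHeaderRegex`, which this sketch omits.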