[
https://issues.apache.org/jira/browse/METRON-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953469#comment-15953469
]
ASF GitHub Bot commented on METRON-817:
---------------------------------------
GitHub user justinleet opened a pull request:
https://github.com/apache/incubator-metron/pull/505
METRON-817: Customise output file path patterns for HDFS indexing
## Contributor Comments
Primarily this affects HdfsWriter: the output path changes from a fixed
path (`/apps/metron/.../<sensor>`) to one that can be defined via a Stellar
function. Specifically, the base path is still defined the same way (the
`/apps/metron/.../` portion), but the `<sensor>` portion is dropped and can now
be defined by a Stellar function. By default, the original behavior of
`<sensor>` is used. The function is defined in the `<sensor>.json` file, as
described in the new README.md for metron-writer.
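To make the path resolution concrete, here is a minimal Python sketch of the idea (it is not the Java code in this PR). `resolve_output_dir` and `by_source_ip` are hypothetical names, the callable is only a stand-in for an evaluated Stellar expression, and `BASE_PATH` is taken from the dev-environment listing in the Testing section.

```python
from typing import Callable, Optional

# Base path as it appears in the dev-environment listing in the Testing section.
BASE_PATH = "/apps/metron/indexing/indexed"


def resolve_output_dir(
    sensor: str,
    message: dict,
    path_function: Optional[Callable[[dict], str]] = None,
) -> str:
    """Return the directory a message would be written to.

    path_function is a hypothetical stand-in for an evaluated Stellar
    expression. When it is absent, the sensor name is used, which matches
    the original behavior.
    """
    suffix = path_function(message) if path_function else sensor
    return f"{BASE_PATH}/{suffix}"


def by_source_ip(message: dict) -> str:
    # Stand-in for the Stellar expression FORMAT('ipsrc-%s', ip_src_addr).
    return "ipsrc-%s" % message["ip_src_addr"]


msg = {"ip_src_addr": "192.168.138.158"}
print(resolve_output_dir("bro", msg))                # default: sensor name
print(resolve_output_dir("bro", msg, by_source_ip))  # function-defined path
```

The point of the sketch is only that the base path stays fixed while the final path component comes either from the sensor name or from the configured function.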
### Notes
- This requires tracking open handlers a bit more carefully (and if you're
reviewing, please validate that it happens correctly). When the outputFile is
closed, we remove the SourceHandler from HdfsWriter's map.
- I'm slightly concerned about the correctness of the implementation, but
it seems necessary to ensure that we don't leave a bunch of SourceHandlers
lying around as data changes (and we don't want an enormous number of output
files being written to).
- If there's a cleaner way to manage this, I'd love to hear it and can
refactor pretty easily. It throws off the rotation count (because we kill the
SourceHandler from the map itself), but I doubt we care about that since it
really only shows up in the output filename anyway.
- This also adds an argument for max open files. This is a Flux-level
config. I defaulted it to 500, chosen as an arbitrary round number that
isn't enormous.
- If someone has a default with any real reasoning behind it, I'll go
ahead and change it.
- In HdfsWriter, we iterate through the messages, apply the Stellar
function, and then call the relevant handler. The entire group of messages is
treated as a single pass/fail (the same as the old behavior), rather than
each message individually. The try/catch could potentially be moved into the
for loop, but I don't think there's an explicit link between the messages and
the tuples that we can exploit to fail per message. I don't think it needs to
be addressed here, but I'm curious whether there are thoughts on this.
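The bookkeeping described in these notes can be sketched roughly as follows. This is a simplified Python illustration, not the Java implementation: `HdfsWriterSketch`, `write_batch`, and the in-memory `SourceHandler` are hypothetical, and raising an error when the open-file limit is reached is an assumption, since the notes above don't say what happens at the limit.

```python
class SourceHandler:
    """Hypothetical stand-in for one open output file."""

    def __init__(self, path, on_close):
        self.path = path
        self.lines = []
        self._on_close = on_close

    def write(self, line):
        self.lines.append(line)

    def close(self):
        # Closing notifies the owner so the handler is dropped from its map.
        self._on_close(self.path)


class HdfsWriterSketch:
    def __init__(self, max_open_files=500):
        self.max_open_files = max_open_files
        self.handlers = {}  # output path -> SourceHandler

    def _remove(self, path):
        self.handlers.pop(path, None)

    def _handler_for(self, path):
        handler = self.handlers.get(path)
        if handler is None:
            # Assumption: refuse to open more handlers than the limit allows.
            if len(self.handlers) >= self.max_open_files:
                raise RuntimeError("too many open files: %d" % len(self.handlers))
            handler = SourceHandler(path, self._remove)
            self.handlers[path] = handler
        return handler

    def write_batch(self, messages, path_function):
        """Report the batch as one pass/fail.

        Messages written before a failure are not rolled back in this sketch.
        """
        try:
            for message in messages:
                self._handler_for(path_function(message)).write(message)
        except Exception as exc:
            return False, exc
        return True, None


writer = HdfsWriterSketch(max_open_files=2)
batch = [{"ip_src_addr": "192.168.66.1"}, {"ip_src_addr": "192.168.138.158"}]
ok, err = writer.write_batch(batch, lambda m: "ipsrc-" + m["ip_src_addr"])
print(ok, sorted(writer.handlers))
writer.handlers["ipsrc-192.168.66.1"].close()
print(sorted(writer.handlers))
```

Because the try/except wraps the whole loop, one bad message fails the batch, which mirrors the single pass/fail behavior described in the last note.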
### Testing
Unit tests are added to pretty much cover HdfsWriter, and this can be spun
up in a dev environment.
To test in dev
- Spin up a dev environment
- Validate that the output matches the old format in HDFS (Nothing has an
output function defined)
```
[hdfs@node1 vagrant]$ hdfs dfs -ls /apps/metron/indexing/indexed/
Found 3 items
drwxrwxr-x - storm hadoop 0 2017-04-03 13:11 /apps/metron/indexing/indexed/bro
drwxrwxr-x - storm hadoop 0 2017-04-03 13:11 /apps/metron/indexing/indexed/error
drwxrwxr-x - storm hadoop 0 2017-04-03 13:11 /apps/metron/indexing/indexed/snort
```
- Edit the indexing config for Bro to include an outputPathFunction in the
hdfs section, e.g. in `/usr/metron/0.3.1/config/zookeeper/indexing/bro.json`
```
{
  "hdfs" : {
    "index": "bro",
    "batchSize": 5,
    "enabled" : true,
    "outputPathFunction": "FORMAT('ipsrc-%s', ip_src_addr)"
  },
  "elasticsearch" : {
    "index": "bro",
    "batchSize": 5,
    "enabled" : true
  },
  "solr" : {
    "index": "bro",
    "batchSize": 5,
    "enabled" : true
  }
}
```
- Push the configs to ZooKeeper:
`/usr/metron/0.3.1/bin/zk_load_configs.sh -z node1:2181 -m PUSH -i /usr/metron/0.3.1/config/zookeeper/`
- Let some more data run through and check the output folders, e.g.
```
[hdfs@node1 vagrant]$ hdfs dfs -ls /apps/metron/indexing/indexed/
Found 5 items
drwxrwxr-x - storm hadoop 0 2017-04-03 13:11 /apps/metron/indexing/indexed/bro
drwxrwxr-x - storm hadoop 0 2017-04-03 13:11 /apps/metron/indexing/indexed/error
drwxrwxr-x - storm hadoop 0 2017-04-03 13:14 /apps/metron/indexing/indexed/ipsrc-192.168.138.158
drwxrwxr-x - storm hadoop 0 2017-04-03 13:14 /apps/metron/indexing/indexed/ipsrc-192.168.66.1
drwxrwxr-x - storm hadoop 0 2017-04-03 13:11 /apps/metron/indexing/indexed/snort
[hdfs@node1 vagrant]$ hdfs dfs -ls /apps/metron/indexing/indexed/ipsrc-192.168.138.158
Found 1 items
-rw-r--r-- 1 storm hadoop 223182 2017-04-03 13:14 /apps/metron/indexing/indexed/ipsrc-192.168.138.158/enrichment-null-0-0-1491225291377.json
```
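Before pushing an edited config, a quick local check that the `hdfs` section actually carries an `outputPathFunction` can catch typos. The following is a small Python sketch under that assumption; `hdfs_output_path_function` is a hypothetical helper, not part of Metron, and the sample text is the `hdfs` section of the bro config shown above.

```python
import json


def hdfs_output_path_function(config_text):
    """Return the hdfs outputPathFunction from an indexing config, or None."""
    config = json.loads(config_text)
    return config.get("hdfs", {}).get("outputPathFunction")


# Sample: the hdfs section of the bro indexing config from the steps above.
sample = """
{
  "hdfs": {
    "index": "bro",
    "batchSize": 5,
    "enabled": true,
    "outputPathFunction": "FORMAT('ipsrc-%s', ip_src_addr)"
  }
}
"""
print(hdfs_output_path_function(sample))
```

A `None` result means no function is configured, in which case output is expected to stay under the sensor-named directory.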
## Pull Request Checklist
Thank you for submitting a contribution to Apache Metron (Incubating).
Please refer to our [Development
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
for the complete guide to follow for contributions.
Please refer also to our [Build Verification
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow
these guidelines and ask you to double check the following:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified
and tested manually?
- [x] Have you ensured that the full suite of tests and checks have been
executed in the root incubating-metron folder via:
```
mvn -q clean integration-test install && build_utils/verify_licenses.sh
```
- [x] Have you written or updated unit tests and/or integration tests to
verify your changes?
- ~If adding new dependencies to the code, are these dependencies licensed
in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?~
- [x] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
### For documentation related changes:
- [x] Have you ensured that the format looks appropriate for the output in
which it is rendered by building and verifying the site-book? If not, run
the following commands and then verify the changes via
`site-book/target/site/index.html`:
```
cd site-book
bin/generate-md.sh
mvn site:site
```
#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up for
your personal repository such that your branches are built there before
submitting a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/justinleet/incubator-metron hdfs_path
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-metron/pull/505.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #505
----
commit e84c1393c293a809372386fdcf374f0c3dc50d9c
Author: justinjleet <[email protected]>
Date: 2017-03-31T03:34:21Z
Allowing for message guided output and adding doc
commit 9762cb6c473c6c11da70963f8de1e24f6f7b0502
Author: justinjleet <[email protected]>
Date: 2017-04-03T11:41:25Z
renamed parameters in override method for clarity, added test around
SourceFileNameFormat to ensure additions work
commit 6693740a85a5cd9d90ac4a413e2037410f927432
Author: justinjleet <[email protected]>
Date: 2017-04-03T11:47:57Z
Removing extraneous json change, and not tripping rat
commit 72348501c4f7876aaaa7930f600bf68bc98a0d61
Author: justinjleet <[email protected]>
Date: 2017-04-03T12:12:10Z
Adjusting output
commit 8b19e67c10821b12c8e09439fc7c89528372df77
Author: justinjleet <[email protected]>
Date: 2017-04-03T12:19:13Z
README adjustment
commit 10b9c52ad373e395f612a8253385d6e9783cb09e
Author: justinjleet <[email protected]>
Date: 2017-04-03T12:22:22Z
Renaming SourceFileNameFormat, cleaning up a couple minor annoyances in
SourceAwareMoveAction
----
> Customise output file path patterns for HDFS indexing
> -----------------------------------------------------
>
> Key: METRON-817
> URL: https://issues.apache.org/jira/browse/METRON-817
> Project: Metron
> Issue Type: Improvement
> Reporter: Justin Leet
> Assignee: Justin Leet
>
> We need to be able to customize the filepaths for HDFS indexing, to allow
> fields to be part of the naming. For example, if I have a 'tenant' field, I
> should be able to direct it to a path
> {code}
> /apps/metron/.../{tenant}/{sensor}/{date}/filename-34324-432434.json
> {code}