[
https://issues.apache.org/jira/browse/METRON-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636167#comment-16636167
]
ASF GitHub Bot commented on METRON-1801:
----------------------------------------
GitHub user nickwallen opened a pull request:
https://github.com/apache/metron/pull/1218
METRON-1801 Allow Customization of Elasticsearch Document ID
Currently, the Metron GUID is always used as the Elasticsearch document ID.
As documented in
[METRON-1677](https://issues.apache.org/jira/browse/METRON-1677), using a
randomized UUID like Java's `UUID.randomUUID()` can negatively impact
Elasticsearch performance. This change allows a user to customize the
identifier that is used by Elasticsearch when indexing documents.
We do this by allowing a user to specify the name of the message field
whose value is set as the document ID. The user can customize this by defining
a global variable called `es.document.id`. There are three usage scenarios
that I see.
* By default, Metron's GUID field is used as the source of the
document ID, which keeps the behavior backwards compatible. This applies both
when the value is set as below and when the global variable is not set at all.
```
es.document.id = guid
```
* If a user wants Elasticsearch to define its own document id, then
`es.document.id` should be set to a blank value or empty string. In this case,
the document ID will not be set by the client and Elasticsearch will define its
own.
```
es.document.id =
```
* If a user wants to set their own custom document ID, they should create
an enrichment that defines a new message field like `my_document_id`. They
should then use this new field to set the Elasticsearch document ID.
```
es.document.id = my_document_id
```
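The three scenarios above amount to a small lookup rule. The Python sketch below is illustrative only and is not the code in this PR (the actual change lives in the Java `ElasticsearchWriter`). The function name `resolve_document_id` is hypothetical, and the handling of a message that lacks the configured field is an assumption, since the description does not say what happens in that case.

```python
from typing import Any, Mapping, Optional

DEFAULT_ID_FIELD = "guid"  # Metron's GUID field, the backwards-compatible default


def resolve_document_id(global_config: Mapping[str, Any],
                        message: Mapping[str, Any]) -> Optional[str]:
    """Return the document ID the client should send to Elasticsearch,
    or None to let Elasticsearch generate its own."""
    # Scenario 1: global variable not set -> fall back to the Metron GUID field.
    field = global_config.get("es.document.id", DEFAULT_ID_FIELD)

    # Scenario 2: blank or empty value -> the client does not set an ID.
    if field is None or not str(field).strip():
        return None

    # Scenario 3 (and the default): use the value of the named message field.
    value = message.get(str(field).strip())
    # Assumption: a message that lacks the field is treated like scenario 2.
    return None if value is None else str(value)
```

For example, with a message containing both `guid` and `my_document_id`, an empty global config selects the GUID, a blank `es.document.id` yields `None`, and `es.document.id = my_document_id` selects the custom value.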
## TODO
I have a few more loose ends to tie up, but wanted to get a start on the
test plan and description in case the community has early feedback to offer.
- [ ] Allow user to set the `es.document.id` value in the Mpack.
- [ ] Document this global setting and its usage scenarios in a README.
- [ ] More unit/integration tests might be needed. Trying to determine
where those need to go.
- [ ] Run the UI e2e tests to ensure they remain happy.
- [ ] Fix issue with Solr integration tests.
## Changes
* The `ElasticsearchWriter` was updated to allow the document ID to be
configurable.
* A 'search by GUID' in the REST layer was implicitly using the document
ID, whereas it should be using the Metron GUID.
* Search results should use the Metron GUID as the ID returned to the UI.
All IDs visible to the user should always be the Metron GUID, not the document
ID.
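To illustrate the last two changes: once the document ID can differ from the Metron GUID, a lookup by GUID has to query the `guid` field rather than fetch by `_id`, and each result has to report `guid` as its ID. The following is a minimal Python sketch, assuming the standard Elasticsearch search response layout (`hits.hits[]` with `_id` and `_source`) and that `guid` is indexed as a keyword field. The helper names are hypothetical and this is not the REST layer's actual Java code.

```python
from typing import Any, Dict, List


def guid_query(guid: str) -> Dict[str, Any]:
    """Build a search body that matches on the Metron GUID field, not on _id."""
    return {"query": {"term": {"guid": guid}}}


def to_results(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map raw hits to results whose 'id' is the Metron GUID."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        # Deliberately ignore hit["_id"]: it may be generated by Elasticsearch.
        results.append({"id": source.get("guid"), "source": source})
    return results
```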
## Testing
1. Spin up a development environment. You may need to stop the PCAP and/or
Profiler topology to free up slots to allow indexing to occur.
```
cd metron-deployment/development/centos6
vagrant up
```
1. Ensure that alerts are visible in the Alerts UI.
1. Stop the indexing topologies using Ambari.
1. Log in to the VM.
```
vagrant ssh
sudo su -
```
1. Delete the existing indices in Elasticsearch.
```
curl -XDELETE http://node1:9200/bro*
curl -XDELETE http://node1:9200/snort*
```
1. Launch the REPL.
```
source /etc/default/metron
cd $METRON_HOME
bin/stellar -z $ZOOKEEPER
```
1. Change the configuration so that Elasticsearch generates its own unique
document ID. Define `es.document.id` to be empty or blank in the
global settings.
```
[Stellar]>>> g := CONFIG_GET("GLOBAL")
...
[Stellar]>>> g := SHELL_EDIT(g)
{
"es.clustername" : "metron",
"es.ip" : "node1:9300",
"es.date.format" : "yyyy.MM.dd.HH",
"es.document.id": " ",
"parser.error.topic" : "indexing",
"update.hbase.table" : "metron_update",
"update.hbase.cf" : "t",
"es.client.settings" : {
"client.transport.ping_timeout" : "500s"
},
"profiler.client.period.duration" : "15",
"profiler.client.period.duration.units" : "MINUTES",
"user.settings.hbase.table" : "user_settings",
"user.settings.hbase.cf" : "cf",
"bootstrap.servers" : "node1:6667",
"source.type.field" : "source:type",
"threat.triage.score.field" : "threat:triage:score",
"enrichment.writer.batchSize" : "15",
"enrichment.writer.batchTimeout" : "0",
"profiler.writer.batchSize" : "15",
"profiler.writer.batchTimeout" : "0",
"geo.hdfs.file" : "/apps/metron/geo/default/GeoLite2-City.mmdb.gz"
}
...
[Stellar]>>> CONFIG_PUT("GLOBAL", g)
```
1. Restart the indexing topology.
```
bin/start_elasticsearch_topology.sh
```
1. Open the Alerts UI and ensure that alerts are visible. Notice that the
ID listed in the table has not changed. This will always display the Metron
GUID, no matter what ID is used for the document.

1. Click on a GUID in the table to search for a single alert.

1. Create a meta-alert and ensure that alerts tied to the meta-alert are
still discoverable by GUID.

1. Open Kibana and verify that Elasticsearch is indeed generating its own
document IDs. You will notice an `_id` field which has been generated by
Elasticsearch. This will be different from the UUID generated by Metron and
stored as part of the document as `guid`.
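
The same check can be made outside Kibana by inspecting a search response. The Python sketch below assumes the standard search response layout; `hits_reusing_guid` is a hypothetical helper name. It takes an already-parsed response body, for example the JSON returned by a `_search` request against one of the `bro*` indices.

```python
from typing import Any, Dict, List


def hits_reusing_guid(response: Dict[str, Any]) -> List[str]:
    """Return the _id of every hit whose _id equals its stored Metron guid.

    After this test step the list is expected to be empty, because the
    document IDs no longer come from the guid field.
    """
    reused = []
    for hit in response.get("hits", {}).get("hits", []):
        guid = hit.get("_source", {}).get("guid")
        if guid is not None and hit.get("_id") == guid:
            reused.append(hit["_id"])
    return reused
```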

## Pull Request Checklist
- [ ] Is there a JIRA ticket associated with this PR? If not one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [ ] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
- [ ] Have you included steps or a guide to how the change may be verified
and tested manually?
- [ ] Have you ensured that the full suite of tests and checks have been
executed in the root metron folder via:
- [ ] Have you written or updated unit tests and or integration tests to
verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nickwallen/metron METRON-1801
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/1218.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1218
----
commit 23a96ccc0786d31dfe2190fe470b0afe2b52c936
Author: Nick Allen <nick@...>
Date: 2018-10-01T22:39:03Z
Can change source field used for document ID. Unable to /findOne in Alerts
UI
commit 2d0f478327981403eb2463b7efefe3d871441db0
Author: Nick Allen <nick@...>
Date: 2018-10-02T15:09:02Z
Cannot assume that ES doc ID == Metron GUID
commit d66c839eda55d5efd22f56f8f5920937e2703d14
Author: Nick Allen <nick@...>
Date: 2018-10-02T15:10:13Z
Removed unnecessary dependencies
commit 36c59211181480db571619b811456105b80d7b0d
Author: Nick Allen <nick@...>
Date: 2018-10-02T19:33:04Z
Search results need to use Metron GUID as ID, not the doc ID
commit 4ebb8000522540b6ff2a30e2d5ffeffcf6016bb4
Author: Nick Allen <nick@...>
Date: 2018-10-02T21:31:58Z
Small rename
commit 13d698df8cb46a3cf8ef747f7f62acc298ef8e4b
Author: Nick Allen <nick@...>
Date: 2018-10-02T21:38:00Z
Removed unncessary part of error msg
----
> Allow Customization of Elasticsearch Document ID
> ------------------------------------------------
>
> Key: METRON-1801
> URL: https://issues.apache.org/jira/browse/METRON-1801
> Project: Metron
> Issue Type: Sub-task
> Reporter: Nick Allen
> Assignee: Nick Allen
> Priority: Major
>
> The user should be able to customize the document ID that is set by the
> client when indexing documents into Elasticsearch. The user should be able
> to use the Metron GUID, define their own custom document ID, or choose to not
> have the document ID set by the client.
>
> This task covers Elasticsearch only. An additional task should be created to
> cover Solr.