Sandor Molnar created KNOX-2549:
-----------------------------------
Summary: Knox CM discovery may lose relevant audit events
Key: KNOX-2549
URL: https://issues.apache.org/jira/browse/KNOX-2549
Project: Apache Knox
Issue Type: Bug
Components: cm-discovery
Affects Versions: 1.5.0
Reporter: Sandor Molnar
Assignee: Sandor Molnar
Fix For: 1.6.0
*Test steps:*
1. Configure Knox service discovery; Knox is in HA
1. Updated hive on tez-cm config {{hive.server2.transport.mode}} from binary to
http
2. Restarted {{Hive on Tez}} service
*Expected:*
After 2 mins, check if the HIVE service have the updated endpoint URL in the
relevant topology
*{color:#de350b}Issue:{color}*
HIVE endpoint URL is updated on only one of the Knox instances.
Let's see the relevant gateway logs and the related commands from CM of my new
findings:
{noformat}
gateway.log - knox-1-host
-----------------------------
2020-12-03 00:32:23,751 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
https://my-cm-host:7183 for configuration changes...
2020-12-03 00:32:23,769 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
configuration activation events from Cluster 1 @ https://my-cm-host:7183 since
2020-12-03T08:31:23.681Z
2020-12-03 00:32:23,861 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
2020-12-03 00:32:23,861 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
2020-12-03 00:32:32,042 DEBUG knox.gateway
(ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96)) -
Skipping CM descriptor monitoring of
/var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false;
lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified =
2020-12-03T08:20:24.486931Z)
...
2020-12-03 00:33:23,862 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
https://my-cm-host:7183 for configuration changes...
2020-12-03 00:33:23,874 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
configuration activation events from Cluster 1 @ https://my-cm-host:7183 since
2020-12-03T08:32:23.769Z
2020-12-03 00:33:23,938 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any
activation event found within the given time period
gateway.log - knox-host-2
-----------------------------
2020-12-03 00:32:07,745 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
https://my-cm-host:7183 for configuration changes...
2020-12-03 00:32:07,920 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
configuration activation events from Cluster 1 @ https://my-cm-host:7183 since
2020-12-03T08:31:07.670Z
2020-12-03 00:32:08,039 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
2020-12-03 00:32:08,043 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
...
2020-12-03 00:33:08,045 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
https://my-cm-host:7183 for configuration changes...
2020-12-03 00:33:08,083 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
configuration activation events from Cluster 1 @ https://my-cm-host:7183 since
2020-12-03T08:32:07.922Z
2020-12-03 00:33:08,184 DEBUG discovery.cm
(PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true
AUDIT events fetched from CM
----------------------------
https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z
{
totalResults: 3,
items: [{
id: "7537adcb-d7da-4591-aa66-1f857584c099",
content: "Command Start on subject ATLAS-1 succeeded.",
timeOccurred: "2020-12-03T08:32:20.366Z",
timeReceived: "2020-12-03T08:32:20.369Z",
category: "AUDIT_EVENT",
severity: "INFORMATIONAL",
alert: false,
attributes: [{
values: [
"SvcCmdArgs{targetRoles=[],
args=[]}"
],
name: "COMMAND_ARGS"
},
{
values: [
"Successfully started service."
],
name: "COMMAND_RESULT"
},
{
values: [
"556"
],
name: "COMMAND_ID"
},
{
values: [
"clouderaManager"
],
name: "USER"
},
{
values: [
"Cluster 1"
],
name: "CLUSTER_DISPLAY_NAME"
},
{
values: [
"https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099"
],
name: "URL"
},
{
values: [
"ATLAS-1"
],
name: "SERVICE_DISPLAY_NAME"
},
{
values: [
"SUCCEEDED"
],
name: "COMMAND_STATUS"
},
{
values: [
"EV_SERVICE_STARTED"
],
name: "EVENTCODE"
},
{
values: [
"false"
],
name: "ALERT_SUPPRESSED"
},
{
values: [
"Start"
],
name: "COMMAND"
},
{
values: [
"ATLAS-1"
],
name: "SERVICE"
},
{
values: [
"ATLAS"
],
name: "SERVICE_TYPE"
},
{
values: [
"1"
],
name: "CLUSTER_ID"
},
{
values: [
"Cluster 1"
],
name: "CLUSTER"
},
{
values: [
"COMMAND_SUCCEEDED_WITH_TARGET"
],
name: "MESSAGE_CODES"
}
]
},
{
id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2",
content: "User admin created a new revision. Message
was: Modified Server HTTPS Port.",
timeOccurred: "2020-12-03T08:32:01.769Z",
timeReceived: "2020-12-03T08:32:01.772Z",
category: "AUDIT_EVENT",
severity: "INFORMATIONAL",
alert: false,
attributes: [{
values: [
"257"
],
name: "REVISION"
},
{
values: [
"admin"
],
name: "USER"
},
{
values: [
"Cluster 1"
],
name: "CLUSTER_DISPLAY_NAME"
},
{
values: [
"https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2"
],
name: "URL"
},
{
values: [
"ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63"
],
name: "ROLE"
},
{
values: [
"ATLAS-1-ATLAS_SERVER-BASE"
],
name: "ROLE_CONFIG_GROUP"
},
{
values: [
"ATLAS-1"
],
name: "SERVICE_DISPLAY_NAME"
},
{
values: [
"EV_REVISION_CREATED"
],
name: "EVENTCODE"
},
{
values: [
"false"
],
name: "ALERT_SUPPRESSED"
},
{
values: [
"ATLAS-1"
],
name: "SERVICE"
},
{
values: [
"ATLAS"
],
name: "SERVICE_TYPE"
},
{
values: [
"1"
],
name: "CLUSTER_ID"
},
{
values: [
"Cluster 1"
],
name: "CLUSTER"
},
{
values: [
"REVISION_CREATED"
],
name: "MESSAGE_CODES"
}
]
},
{
id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd",
content: "User admin started service Start.",
timeOccurred: "2020-12-03T08:31:58.360Z",
timeReceived: "2020-12-03T08:31:58.361Z",
category: "AUDIT_EVENT",
severity: "INFORMATIONAL",
alert: false,
attributes: [{
values: [
"SvcCmdArgs{targetRoles=[DbRole{id=65,
name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63,
hostName=knox-host-2}], args=[]}"
],
name: "COMMAND_ARGS"
},
{
values: [
"556"
],
name: "COMMAND_ID"
},
{
values: [
"admin"
],
name: "USER"
},
{
values: [
"Cluster 1"
],
name: "CLUSTER_DISPLAY_NAME"
},
{
values: [
"https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd"
],
name: "URL"
},
{
values: [
"ATLAS-1"
],
name: "SERVICE_DISPLAY_NAME"
},
{
values: [
"STARTED"
],
name: "COMMAND_STATUS"
},
{
values: [
"EV_SERVICE_STARTED"
],
name: "EVENTCODE"
},
{
values: [
"false"
],
name: "ALERT_SUPPRESSED"
},
{
values: [
"Start"
],
name: "COMMAND"
},
{
values: [
"ATLAS-1"
],
name: "SERVICE"
},
{
values: [
"ATLAS"
],
name: "SERVICE_TYPE"
},
{
values: [
"1"
],
name: "CLUSTER_ID"
},
{
values: [
"Cluster 1"
],
name: "CLUSTER"
},
{
values: [
"SERVICE_STARTED"
],
name: "MESSAGE_CODES"
}
]
}
]
}
{noformat}
On {{knox-host-2}} everything worked as expected. However, on {{knox-host-1}},
my ATLAS service configuration change got lost:
- The start command successfully finished on {{2020-12-03T08:32:20.366Z}}
- Polling configuration analyzer fetched audit events from CM server at
{{2020-12-03 08:32:23,769Z}} since {{2020-12-03T08:31:23.681Z}}. However, CM
did +_not_+ return the start command that finished 3 seconds ago, even it
should have returned it
- A minute later, the polling configuration analyzer fetched audit events from
CM server at {{2020-12-03 08:33:23,874Z}} since {{2020-12-03T08:32:23.769Z}}
(the last query timestamp). This query returned nothing which is correct since
there was no activation within this time frame
As a result, the command with ID of {{7537adcb-d7da-4591-aa66-1f857584c099}}
has been lost on {{knox-host-1}}. It's important to note that the first
timestamp, when Knox issued a CM query, was sort of close to the timestamp when
the successful start event occurred. There seems to be a lag between the CM
server/agent communication which may cause this issue. This means Knox should
be smarter in this respect and try to mitigate the chance of missing relevant
audit events.
I'm going to implement the following approach:
- Knox will look back a little bit more time than is does currently: instead
of checking the {{[lastQueryTimestamp -> currentTimestamp]}} interval, we'll
check {{[lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]}}
- Knox will maintain a cache of all relevant events that it processed to avoid
re-processing them
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
