[
https://issues.apache.org/jira/browse/KNOX-2549?focusedWorklogId=565983&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-565983
]
ASF GitHub Bot logged work on KNOX-2549:
----------------------------------------
Author: ASF GitHub Bot
Created on: 14/Mar/21 23:11
Start Date: 14/Mar/21 23:11
Worklog Time Spent: 10m
Work Description: smolnar82 merged pull request #412:
URL: https://github.com/apache/knox/pull/412
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 565983)
Time Spent: 20m (was: 10m)
> Knox CM discovery may lose relevant audit events
> ------------------------------------------------
>
> Key: KNOX-2549
> URL: https://issues.apache.org/jira/browse/KNOX-2549
> Project: Apache Knox
> Issue Type: Bug
> Components: cm-discovery
> Affects Versions: 1.5.0
> Reporter: Sandor Molnar
> Assignee: Sandor Molnar
> Priority: Major
> Fix For: 1.6.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> *Test steps:*
> 1. Configure Knox service discovery; Knox is in HA
> 1. Updated hive on tez-cm config {{hive.server2.transport.mode}} from binary
> to http
> 2. Restarted {{Hive on Tez}} service
> *Expected:*
> After 2 mins, check if the HIVE service have the updated endpoint URL in the
> relevant topology
> *{color:#de350b}Issue:{color}*
> HIVE endpoint URL is updated on only one of the Knox instances.
>
> Let's see the relevant gateway logs and the related commands from CM of my
> new findings:
> {noformat}
> gateway.log - knox-1-host
> -----------------------------
> 2020-12-03 00:32:23,751 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:32:23,769 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:31:23.681Z
> 2020-12-03 00:32:23,861 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
> 2020-12-03 00:32:23,861 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
> 2020-12-03 00:32:32,042 DEBUG knox.gateway
> (ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96))
> - Skipping CM descriptor monitoring of
> /var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false;
> lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified =
> 2020-12-03T08:20:24.486931Z)
> ...
> 2020-12-03 00:33:23,862 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:33:23,874 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:32:23.769Z
> 2020-12-03 00:33:23,938 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any
> activation event found within the given time period
> gateway.log - knox-host-2
> -----------------------------
> 2020-12-03 00:32:07,745 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:32:07,920 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:31:07.670Z
> 2020-12-03 00:32:08,039 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
> 2020-12-03 00:32:08,043 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
> ...
> 2020-12-03 00:33:08,045 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:33:08,083 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:32:07.922Z
> 2020-12-03 00:33:08,184 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true
> AUDIT events fetched from CM
> ----------------------------
> https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z
> {
> totalResults: 3,
> items: [{
> id: "7537adcb-d7da-4591-aa66-1f857584c099",
> content: "Command Start on subject ATLAS-1 succeeded.",
> timeOccurred: "2020-12-03T08:32:20.366Z",
> timeReceived: "2020-12-03T08:32:20.369Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
> "SvcCmdArgs{targetRoles=[],
> args=[]}"
> ],
> name: "COMMAND_ARGS"
> },
> {
> values: [
> "Successfully started service."
> ],
> name: "COMMAND_RESULT"
> },
> {
> values: [
> "556"
> ],
> name: "COMMAND_ID"
> },
> {
> values: [
> "clouderaManager"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099"
> ],
> name: "URL"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "SUCCEEDED"
> ],
> name: "COMMAND_STATUS"
> },
> {
> values: [
> "EV_SERVICE_STARTED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "Start"
> ],
> name: "COMMAND"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "COMMAND_SUCCEEDED_WITH_TARGET"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> },
> {
> id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2",
> content: "User admin created a new revision. Message
> was: Modified Server HTTPS Port.",
> timeOccurred: "2020-12-03T08:32:01.769Z",
> timeReceived: "2020-12-03T08:32:01.772Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
> "257"
> ],
> name: "REVISION"
> },
> {
> values: [
> "admin"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2"
> ],
> name: "URL"
> },
> {
> values: [
>
> "ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63"
> ],
> name: "ROLE"
> },
> {
> values: [
> "ATLAS-1-ATLAS_SERVER-BASE"
> ],
> name: "ROLE_CONFIG_GROUP"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "EV_REVISION_CREATED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "REVISION_CREATED"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> },
> {
> id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd",
> content: "User admin started service Start.",
> timeOccurred: "2020-12-03T08:31:58.360Z",
> timeReceived: "2020-12-03T08:31:58.361Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
>
> "SvcCmdArgs{targetRoles=[DbRole{id=65,
> name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63,
> hostName=knox-host-2}], args=[]}"
> ],
> name: "COMMAND_ARGS"
> },
> {
> values: [
> "556"
> ],
> name: "COMMAND_ID"
> },
> {
> values: [
> "admin"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd"
> ],
> name: "URL"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "STARTED"
> ],
> name: "COMMAND_STATUS"
> },
> {
> values: [
> "EV_SERVICE_STARTED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "Start"
> ],
> name: "COMMAND"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "SERVICE_STARTED"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> }
> ]
> }
> {noformat}
> On {{knox-host-2}} everything worked as expected. However, on
> {{knox-host-1}}, my ATLAS service configuration change got lost:
> - The start command successfully finished on {{2020-12-03T08:32:20.366Z}}
> - Polling configuration analyzer fetched audit events from CM server at
> {{2020-12-03 08:32:23,769Z}} since {{2020-12-03T08:31:23.681Z}}. However, CM
> did +_not_+ return the start command that finished 3 seconds ago, even it
> should have returned it
> - A minute later, the polling configuration analyzer fetched audit events
> from CM server at {{2020-12-03 08:33:23,874Z}} since
> {{2020-12-03T08:32:23.769Z}} (the last query timestamp). This query returned
> nothing which is correct since there was no activation within this time frame
> As a result, the command with ID of {{7537adcb-d7da-4591-aa66-1f857584c099}}
> has been lost on {{knox-host-1}}. It's important to note that the first
> timestamp, when Knox issued a CM query, was sort of close to the timestamp
> when the successful start event occurred. There seems to be a lag between the
> CM server/agent communication which may cause this issue. This means Knox
> should be smarter in this respect and try to mitigate the chance of missing
> relevant audit events.
> I'm going to implement the following approach:
> - Knox will look back a little bit more time than is does currently: instead
> of checking the {{[lastQueryTimestamp -> currentTimestamp]}} interval, we'll
> check {{[lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]}}
> - Knox will maintain a cache of all relevant events that it processed to
> avoid re-processing them
--
This message was sent by Atlassian Jira
(v8.3.4#803005)