[
https://issues.apache.org/jira/browse/KNOX-2549?focusedWorklogId=565982&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-565982
]
ASF GitHub Bot logged work on KNOX-2549:
----------------------------------------
Author: ASF GitHub Bot
Created on: 14/Mar/21 23:11
Start Date: 14/Mar/21 23:11
Worklog Time Spent: 10m
Work Description: smolnar82 opened a new pull request #412:
URL: https://github.com/apache/knox/pull/412
## What changes were proposed in this pull request?
To mitigate the chance of losing relevant audit events from CM, Knox needs
to look back a little bit more than the last query timestamp. To be able to do
this without processing relevant events again and again a cache was introduced
to have them in memory for a short time.
Additionally, I removed made the maximum audit event resultset size so that
every audit event are fetched from CM within the given time interval.
## How was this patch tested?
(Manual testing with Knox HA and a live CM.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 565982)
Remaining Estimate: 0h
Time Spent: 10m
> Knox CM discovery may lose relevant audit events
> ------------------------------------------------
>
> Key: KNOX-2549
> URL: https://issues.apache.org/jira/browse/KNOX-2549
> Project: Apache Knox
> Issue Type: Bug
> Components: cm-discovery
> Affects Versions: 1.5.0
> Reporter: Sandor Molnar
> Assignee: Sandor Molnar
> Priority: Major
> Fix For: 1.6.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> *Test steps:*
> 1. Configure Knox service discovery; Knox is in HA
> 1. Updated hive on tez-cm config {{hive.server2.transport.mode}} from binary
> to http
> 2. Restarted {{Hive on Tez}} service
> *Expected:*
> After 2 mins, check if the HIVE service have the updated endpoint URL in the
> relevant topology
> *{color:#de350b}Issue:{color}*
> HIVE endpoint URL is updated on only one of the Knox instances.
>
> Let's see the relevant gateway logs and the related commands from CM of my
> new findings:
> {noformat}
> gateway.log - knox-1-host
> -----------------------------
> 2020-12-03 00:32:23,751 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:32:23,769 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:31:23.681Z
> 2020-12-03 00:32:23,861 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
> 2020-12-03 00:32:23,861 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
> 2020-12-03 00:32:32,042 DEBUG knox.gateway
> (ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96))
> - Skipping CM descriptor monitoring of
> /var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false;
> lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified =
> 2020-12-03T08:20:24.486931Z)
> ...
> 2020-12-03 00:33:23,862 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:33:23,874 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:32:23.769Z
> 2020-12-03 00:33:23,938 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any
> activation event found within the given time period
> gateway.log - knox-host-2
> -----------------------------
> 2020-12-03 00:32:07,745 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:32:07,920 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:31:07.670Z
> 2020-12-03 00:32:08,039 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
> 2020-12-03 00:32:08,043 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
> ...
> 2020-12-03 00:33:08,045 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:33:08,083 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:32:07.922Z
> 2020-12-03 00:33:08,184 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true
> AUDIT events fetched from CM
> ----------------------------
> https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z
> {
> totalResults: 3,
> items: [{
> id: "7537adcb-d7da-4591-aa66-1f857584c099",
> content: "Command Start on subject ATLAS-1 succeeded.",
> timeOccurred: "2020-12-03T08:32:20.366Z",
> timeReceived: "2020-12-03T08:32:20.369Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
> "SvcCmdArgs{targetRoles=[],
> args=[]}"
> ],
> name: "COMMAND_ARGS"
> },
> {
> values: [
> "Successfully started service."
> ],
> name: "COMMAND_RESULT"
> },
> {
> values: [
> "556"
> ],
> name: "COMMAND_ID"
> },
> {
> values: [
> "clouderaManager"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099"
> ],
> name: "URL"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "SUCCEEDED"
> ],
> name: "COMMAND_STATUS"
> },
> {
> values: [
> "EV_SERVICE_STARTED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "Start"
> ],
> name: "COMMAND"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "COMMAND_SUCCEEDED_WITH_TARGET"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> },
> {
> id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2",
> content: "User admin created a new revision. Message
> was: Modified Server HTTPS Port.",
> timeOccurred: "2020-12-03T08:32:01.769Z",
> timeReceived: "2020-12-03T08:32:01.772Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
> "257"
> ],
> name: "REVISION"
> },
> {
> values: [
> "admin"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2"
> ],
> name: "URL"
> },
> {
> values: [
>
> "ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63"
> ],
> name: "ROLE"
> },
> {
> values: [
> "ATLAS-1-ATLAS_SERVER-BASE"
> ],
> name: "ROLE_CONFIG_GROUP"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "EV_REVISION_CREATED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "REVISION_CREATED"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> },
> {
> id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd",
> content: "User admin started service Start.",
> timeOccurred: "2020-12-03T08:31:58.360Z",
> timeReceived: "2020-12-03T08:31:58.361Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
>
> "SvcCmdArgs{targetRoles=[DbRole{id=65,
> name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63,
> hostName=knox-host-2}], args=[]}"
> ],
> name: "COMMAND_ARGS"
> },
> {
> values: [
> "556"
> ],
> name: "COMMAND_ID"
> },
> {
> values: [
> "admin"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd"
> ],
> name: "URL"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "STARTED"
> ],
> name: "COMMAND_STATUS"
> },
> {
> values: [
> "EV_SERVICE_STARTED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "Start"
> ],
> name: "COMMAND"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "SERVICE_STARTED"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> }
> ]
> }
> {noformat}
> On {{knox-host-2}} everything worked as expected. However, on
> {{knox-host-1}}, my ATLAS service configuration change got lost:
> - The start command successfully finished on {{2020-12-03T08:32:20.366Z}}
> - Polling configuration analyzer fetched audit events from CM server at
> {{2020-12-03 08:32:23,769Z}} since {{2020-12-03T08:31:23.681Z}}. However, CM
> did +_not_+ return the start command that finished 3 seconds ago, even it
> should have returned it
> - A minute later, the polling configuration analyzer fetched audit events
> from CM server at {{2020-12-03 08:33:23,874Z}} since
> {{2020-12-03T08:32:23.769Z}} (the last query timestamp). This query returned
> nothing which is correct since there was no activation within this time frame
> As a result, the command with ID of {{7537adcb-d7da-4591-aa66-1f857584c099}}
> has been lost on {{knox-host-1}}. It's important to note that the first
> timestamp, when Knox issued a CM query, was sort of close to the timestamp
> when the successful start event occurred. There seems to be a lag between the
> CM server/agent communication which may cause this issue. This means Knox
> should be smarter in this respect and try to mitigate the chance of missing
> relevant audit events.
> I'm going to implement the following approach:
> - Knox will look back a little bit more time than is does currently: instead
> of checking the {{[lastQueryTimestamp -> currentTimestamp]}} interval, we'll
> check {{[lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]}}
> - Knox will maintain a cache of all relevant events that it processed to
> avoid re-processing them
--
This message was sent by Atlassian Jira
(v8.3.4#803005)