[
https://issues.apache.org/jira/browse/KNOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301294#comment-17301294
]
ASF subversion and git services commented on KNOX-2549:
-------------------------------------------------------
Commit d0d5417d22f359d5910d934c37175fa00213b694 in knox's branch
refs/heads/master from Sandor Molnar
[ https://gitbox.apache.org/repos/asf?p=knox.git;h=d0d5417 ]
KNOX-2549 - Fetching audit events from CM beyond previous query timestamp (#412)
To mitigate the chance of losing relevant audit events from CM, Knox needs to
look back a little bit more than the last query timestamp. To be able to do
this without processing relevant events again and again a cache was introduced
to have them in memory for a short time.
Additionally, I removed made the maximum audit event resulset size so that
every audit events are fetched from CM within the given time interval.
> Knox CM discovery may lose relevant audit events
> ------------------------------------------------
>
> Key: KNOX-2549
> URL: https://issues.apache.org/jira/browse/KNOX-2549
> Project: Apache Knox
> Issue Type: Bug
> Components: cm-discovery
> Affects Versions: 1.5.0
> Reporter: Sandor Molnar
> Assignee: Sandor Molnar
> Priority: Major
> Fix For: 1.6.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> *Test steps:*
> 1. Configure Knox service discovery; Knox is in HA
> 1. Updated hive on tez-cm config {{hive.server2.transport.mode}} from binary
> to http
> 2. Restarted {{Hive on Tez}} service
> *Expected:*
> After 2 mins, check if the HIVE service have the updated endpoint URL in the
> relevant topology
> *{color:#de350b}Issue:{color}*
> HIVE endpoint URL is updated on only one of the Knox instances.
>
> Let's see the relevant gateway logs and the related commands from CM of my
> new findings:
> {noformat}
> gateway.log - knox-1-host
> -----------------------------
> 2020-12-03 00:32:23,751 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:32:23,769 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:31:23.681Z
> 2020-12-03 00:32:23,861 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
> 2020-12-03 00:32:23,861 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
> 2020-12-03 00:32:32,042 DEBUG knox.gateway
> (ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96))
> - Skipping CM descriptor monitoring of
> /var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false;
> lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified =
> 2020-12-03T08:20:24.486931Z)
> ...
> 2020-12-03 00:33:23,862 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:33:23,874 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:32:23.769Z
> 2020-12-03 00:33:23,938 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any
> activation event found within the given time period
> gateway.log - knox-host-2
> -----------------------------
> 2020-12-03 00:32:07,745 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:32:07,920 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:31:07.670Z
> 2020-12-03 00:32:08,039 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
> 2020-12-03 00:32:08,043 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
> ...
> 2020-12-03 00:33:08,045 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @
> https://my-cm-host:7183 for configuration changes...
> 2020-12-03 00:33:08,083 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying
> configuration activation events from Cluster 1 @ https://my-cm-host:7183
> since 2020-12-03T08:32:07.922Z
> 2020-12-03 00:33:08,184 DEBUG discovery.cm
> (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event
> relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true
> AUDIT events fetched from CM
> ----------------------------
> https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z
> {
> totalResults: 3,
> items: [{
> id: "7537adcb-d7da-4591-aa66-1f857584c099",
> content: "Command Start on subject ATLAS-1 succeeded.",
> timeOccurred: "2020-12-03T08:32:20.366Z",
> timeReceived: "2020-12-03T08:32:20.369Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
> "SvcCmdArgs{targetRoles=[],
> args=[]}"
> ],
> name: "COMMAND_ARGS"
> },
> {
> values: [
> "Successfully started service."
> ],
> name: "COMMAND_RESULT"
> },
> {
> values: [
> "556"
> ],
> name: "COMMAND_ID"
> },
> {
> values: [
> "clouderaManager"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099"
> ],
> name: "URL"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "SUCCEEDED"
> ],
> name: "COMMAND_STATUS"
> },
> {
> values: [
> "EV_SERVICE_STARTED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "Start"
> ],
> name: "COMMAND"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "COMMAND_SUCCEEDED_WITH_TARGET"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> },
> {
> id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2",
> content: "User admin created a new revision. Message
> was: Modified Server HTTPS Port.",
> timeOccurred: "2020-12-03T08:32:01.769Z",
> timeReceived: "2020-12-03T08:32:01.772Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
> "257"
> ],
> name: "REVISION"
> },
> {
> values: [
> "admin"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2"
> ],
> name: "URL"
> },
> {
> values: [
>
> "ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63"
> ],
> name: "ROLE"
> },
> {
> values: [
> "ATLAS-1-ATLAS_SERVER-BASE"
> ],
> name: "ROLE_CONFIG_GROUP"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "EV_REVISION_CREATED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "REVISION_CREATED"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> },
> {
> id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd",
> content: "User admin started service Start.",
> timeOccurred: "2020-12-03T08:31:58.360Z",
> timeReceived: "2020-12-03T08:31:58.361Z",
> category: "AUDIT_EVENT",
> severity: "INFORMATIONAL",
> alert: false,
> attributes: [{
> values: [
>
> "SvcCmdArgs{targetRoles=[DbRole{id=65,
> name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63,
> hostName=knox-host-2}], args=[]}"
> ],
> name: "COMMAND_ARGS"
> },
> {
> values: [
> "556"
> ],
> name: "COMMAND_ID"
> },
> {
> values: [
> "admin"
> ],
> name: "USER"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER_DISPLAY_NAME"
> },
> {
> values: [
>
> "https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd"
> ],
> name: "URL"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE_DISPLAY_NAME"
> },
> {
> values: [
> "STARTED"
> ],
> name: "COMMAND_STATUS"
> },
> {
> values: [
> "EV_SERVICE_STARTED"
> ],
> name: "EVENTCODE"
> },
> {
> values: [
> "false"
> ],
> name: "ALERT_SUPPRESSED"
> },
> {
> values: [
> "Start"
> ],
> name: "COMMAND"
> },
> {
> values: [
> "ATLAS-1"
> ],
> name: "SERVICE"
> },
> {
> values: [
> "ATLAS"
> ],
> name: "SERVICE_TYPE"
> },
> {
> values: [
> "1"
> ],
> name: "CLUSTER_ID"
> },
> {
> values: [
> "Cluster 1"
> ],
> name: "CLUSTER"
> },
> {
> values: [
> "SERVICE_STARTED"
> ],
> name: "MESSAGE_CODES"
> }
> ]
> }
> ]
> }
> {noformat}
> On {{knox-host-2}} everything worked as expected. However, on
> {{knox-host-1}}, my ATLAS service configuration change got lost:
> - The start command successfully finished on {{2020-12-03T08:32:20.366Z}}
> - Polling configuration analyzer fetched audit events from CM server at
> {{2020-12-03 08:32:23,769Z}} since {{2020-12-03T08:31:23.681Z}}. However, CM
> did +_not_+ return the start command that finished 3 seconds ago, even it
> should have returned it
> - A minute later, the polling configuration analyzer fetched audit events
> from CM server at {{2020-12-03 08:33:23,874Z}} since
> {{2020-12-03T08:32:23.769Z}} (the last query timestamp). This query returned
> nothing which is correct since there was no activation within this time frame
> As a result, the command with ID of {{7537adcb-d7da-4591-aa66-1f857584c099}}
> has been lost on {{knox-host-1}}. It's important to note that the first
> timestamp, when Knox issued a CM query, was sort of close to the timestamp
> when the successful start event occurred. There seems to be a lag between the
> CM server/agent communication which may cause this issue. This means Knox
> should be smarter in this respect and try to mitigate the chance of missing
> relevant audit events.
> I'm going to implement the following approach:
> - Knox will look back a little bit more time than is does currently: instead
> of checking the {{[lastQueryTimestamp -> currentTimestamp]}} interval, we'll
> check {{[lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]}}
> - Knox will maintain a cache of all relevant events that it processed to
> avoid re-processing them
--
This message was sent by Atlassian Jira
(v8.3.4#803005)