[
https://issues.apache.org/jira/browse/EAGLE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15945052#comment-15945052
]
ASF GitHub Bot commented on EAGLE-971:
--------------------------------------
GitHub user qingwen220 opened a pull request:
https://github.com/apache/eagle/pull/895
EAGLE-971: fix a bug that duplicated queues are generated under a monitored
stream
https://issues.apache.org/jira/browse/EAGLE-971
New policies for alert spec generation:
1. Each alert bolt has no more than 'coordinator.policiesPerBolt' policies.
2. Each alert bolt has no more than 'coordinator.streamsPerBolt' queues if 'reuseBoltInStreams' is true.
3. No two queues on one alert bolt share the same StreamGroup.
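Rules 2 and 3 can be sketched as a guard applied when the coordinator considers placing a new queue on a bolt. This is a minimal illustration under assumed names (the class, method, and key format below are hypothetical, not Eagle's actual coordinator code); a "StreamGroup" is represented here as a string key derived from the stream id and its partition columns.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of rules 2 and 3 above; names and key format are assumptions.
final class BoltAssignmentSketch {
    static final int STREAMS_PER_BOLT = 3; // corresponds to 'coordinator.streamsPerBolt'

    // boltStreamGroups holds one key per queue already placed on the bolt.
    static boolean canAddQueue(Set<String> boltStreamGroups, String streamGroupKey,
                               boolean reuseBoltInStreams) {
        if (boltStreamGroups.contains(streamGroupKey)) {
            return false; // rule 3: no duplicate StreamGroup on one bolt
        }
        if (reuseBoltInStreams && boltStreamGroups.size() >= STREAMS_PER_BOLT) {
            return false; // rule 2: at most 'streamsPerBolt' queues per bolt
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> groups = new HashSet<>();
        String key = "HADOOP_JMX_METRIC_STREAM_SANDBOX|GROUPBY(site,host,component,metric)";
        System.out.println(canAddQueue(groups, key, true)); // true: first queue for this group
        groups.add(key);
        System.out.println(canAddQueue(groups, key, true)); // false: duplicate StreamGroup
    }
}
```

The duplicate-queue bug reported below is exactly a violation of the rule-3 check.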
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/qingwen220/eagle EAGLE-971
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/eagle/pull/895.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #895
----
commit c4950daa2bd6f1805664fab8593e95d5baaf2531
Author: Zhao, Qingwen <[email protected]>
Date: 2017-03-28T12:27:06Z
fix a bug that duplicated queues are generated under a monitored stream
----
> Duplicated queues are generated under a monitored stream
> --------------------------------------------------------
>
> Key: EAGLE-971
> URL: https://issues.apache.org/jira/browse/EAGLE-971
> Project: Eagle
> Issue Type: Bug
> Affects Versions: v0.5.0
> Reporter: Zhao, Qingwen
> Assignee: Zhao, Qingwen
>
> This issue is caused by the wrong routing spec generated by the coordinator.
> Here is the procedure to reproduce it.
> 1. Set {{policiesPerBolt = 2, streamsPerBolt = 3, reuseBoltInStreams = true}} in the server config.
> 2. Create four policies that share the same partition and consume the same stream:
> {code}
> from HADOOP_JMX_METRIC_STREAM_SANDBOX[metric == "hadoop.namenode.rpc.callqueuelength"]#window.length(2)
> select site, host, component, metric, min(convert(value, "long")) as minValue
> group by site, host, component, metric
> having minValue >= 10000
> insert into HADOOP_JMX_METRIC_STREAM_SANDBOX_CALL_QUEUE_EXCEEDS_OUT;
>
> from HADOOP_JMX_METRIC_STREAM_SANDBOX[metric == "hadoop.namenode.rpc.callqueuelength"]#window.length(30)
> select site, host, component, metric, min(convert(value, "long")) as minValue
> group by site, host, component, metric
> having minValue >= 10000
> insert into HADOOP_JMX_METRIC_STREAM_SANDBOX_CALL_QUEUE_EXCEEDS_OUT;
>
> from HADOOP_JMX_METRIC_STREAM_SANDBOX[metric == "hadoop.namenode.hastate.failed.count"]#window.length(2)
> select site, host, component, metric, timestamp, min(value) as minValue
> group by site, host, component, metric
> insert into HADOOP_JMX_METRIC_STREAM_SANDBOX_NN_NO_RESPONSE_OUT;
>
> from HADOOP_JMX_METRIC_STREAM_SANDBOX[metric == "hadoop.namenode.hastate.failed.count.test"]#window.length(3)
> select site, host, component, metric, count(value) as cnt
> group by site, host, component, metric
> insert into HADOOP_JMX_METRIC_STREAM_SANDBOX_NN_NO_RESPONSE_OUT;
> {code}
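All four policies consume HADOOP_JMX_METRIC_STREAM_SANDBOX and group by the same four columns, so they reduce to a single StreamGroup and should share one queue. A minimal sketch of that reduction, under hypothetical names and key format (not Eagle's actual coordinator code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the four policies above collapse to one StreamGroup key,
// so a single queue is expected for all of them.
final class StreamGroupKeySketch {
    static String streamGroupKey(String streamId, List<String> groupByColumns) {
        return streamId + "|GROUPBY" + groupByColumns;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("site", "host", "component", "metric");
        Set<String> expectedQueues = new HashSet<>();
        // One iteration per policy; all four share the same stream and partition.
        for (int policy = 0; policy < 4; policy++) {
            expectedQueues.add(streamGroupKey("HADOOP_JMX_METRIC_STREAM_SANDBOX", cols));
        }
        System.out.println(expectedQueues.size()); // 1: a single queue is expected
    }
}
```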
> After creating the four policies, the routing spec is
> {code}
> routerSpecs: [
>   {
>     streamId: "HADOOP_JMX_METRIC_STREAM_SANDBOX",
>     partition: {
>       streamId: "HADOOP_JMX_METRIC_STREAM_SANDBOX",
>       type: "GROUPBY",
>       columns: [ "site", "host", "component", "metric" ],
>       sortSpec: null
>     },
>     targetQueue: [
>       {
>         partition: {
>           streamId: "HADOOP_JMX_METRIC_STREAM_SANDBOX",
>           type: "GROUPBY",
>           columns: [ "site", "host", "component", "metric" ],
>           sortSpec: null
>         },
>         workers: [
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt9" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt0" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt1" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt2" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt3" }
>         ]
>       },
>       {
>         partition: {
>           streamId: "HADOOP_JMX_METRIC_STREAM_SANDBOX",
>           type: "GROUPBY",
>           columns: [ "site", "host", "component", "metric" ],
>           sortSpec: null
>         },
>         workers: [
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt9" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt0" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt1" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt2" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt3" }
>         ]
>       },
>       {
>         partition: {
>           streamId: "HADOOP_JMX_METRIC_STREAM_SANDBOX",
>           type: "GROUPBY",
>           columns: [ "site", "host", "component", "metric" ],
>           sortSpec: null
>         },
>         workers: [
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt9" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt0" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt1" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt2" },
>           { topologyName: "ALERT_UNIT_TOPOLOGY_APP_SANDBOX", boltId: "alertBolt3" }
>         ]
>       }
>     ]
>   }
> ]
> {code}
> and the alert spec is
> {code}
> boltPolicyIdsMap: {
>   alertBolt9: [ "NameNodeWithOneNoResponse", "NameNodeHAHasNoResponse", "CallQueueLengthExceeds30Times", "CallQueueLengthExceeds2Times" ],
>   alertBolt0: [ "NameNodeWithOneNoResponse", "NameNodeHAHasNoResponse", "CallQueueLengthExceeds30Times", "CallQueueLengthExceeds2Times" ],
>   alertBolt1: [ "NameNodeWithOneNoResponse", "NameNodeHAHasNoResponse", "CallQueueLengthExceeds30Times", "CallQueueLengthExceeds2Times" ],
>   alertBolt2: [ "NameNodeWithOneNoResponse", "NameNodeHAHasNoResponse", "CallQueueLengthExceeds30Times", "CallQueueLengthExceeds2Times" ],
>   alertBolt3: [ "NameNodeWithOneNoResponse", "NameNodeHAHasNoResponse", "CallQueueLengthExceeds30Times", "CallQueueLengthExceeds2Times" ]
> }
> {code}
> 3. Produce messages into the Kafka topic 'hadoop_jmx_metrics_sandbox' and trigger NameNodeWithOneNoResponse:
> {code}
> {"timestamp": 1490250963445, "metric": "hadoop.namenode.hastate.failed.count", "component": "namenode", "site": "artemislvs", "value": 0.0, "host": "localhost"}
> {code}
> Then the same alert message is sent three times, once per duplicated queue.
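The duplication shown in the routing spec above can be caught mechanically: reduce each queue in a routerSpec to its StreamGroup key and flag repeats. A minimal sketch under assumed, hypothetical names and key format (not Eagle's coordinator code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: find queues in one routerSpec that repeat a StreamGroup key.
final class DuplicateQueueCheck {
    static List<String> duplicateKeys(List<String> queueStreamGroupKeys) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String key : queueStreamGroupKeys) {
            if (!seen.add(key)) {
                dups.add(key); // key already placed on this stream: duplicate queue
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // The three queues in the routerSpec above all share this key.
        String key = "HADOOP_JMX_METRIC_STREAM_SANDBOX|GROUPBY(site,host,component,metric)";
        List<String> dups = duplicateKeys(Arrays.asList(key, key, key));
        System.out.println(dups.size()); // 2 redundant copies, matching the triplicated alerts
    }
}
```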
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)