Send netdisco-users mailing list submissions to
[email protected]
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.sourceforge.net/lists/listinfo/netdisco-users
or, via email, send a message with subject or body 'help' to
[email protected]
You can reach the person managing the list at
[email protected]
When replying, please edit your Subject line so it is more specific
than "Re: Contents of netdisco-users digest..."
Today's Topics:
1. Lot of job activity but nothing is really happening
(Pavel Skovajsa)
2. Re: Lot of job activity but nothing is really happening
(Oliver Gorwits)
--- Begin Message ---
Hello,
For the last month or so we seem to be having an issue with Netdisco (latest
version). I already worked on this with folks over at the #netdisco chat room
a couple of weeks ago, but it seems it is not solved.
The symptoms are that for almost any device, none of the arpnip, macsuck
and discover jobs get run at all. This is easily visible in the GUI by
picking any device and looking at the last_xyz field.
On the other hand, the netdisco server is super busy; there is feverish job
activity, which can easily be seen with watch -n1 'ps aux | grep nd2'. Note
that I anonymized the IPs.
netdisco 43647 2.0 0.3 282360 96756 ? S 16:00 0:00 nd2: #25 poll: #108499921: arpnip 99.99.99.5
netdisco 43649 2.2 0.3 281452 95992 ? S 16:00 0:00 nd2: #13 poll: #108498350: arpnip 99.99.99.5
netdisco 43650 2.3 0.3 281316 95864 ? S 16:00 0:00 nd2: #5 poll: #108494531: arpnip 99.99.99.5
netdisco 43653 2.3 0.3 281148 95700 ? S 16:00 0:00 nd2: #26 poll: #108495935: arpnip 99.99.99.5
netdisco 43655 2.5 0.3 282088 96624 ? S 16:00 0:00 nd2: #9 poll: #108496811: arpnip 99.99.99.196
netdisco 43656 2.5 0.3 282700 97104 ? S 16:00 0:00 nd2: #17 poll: #108496067: arpnip 99.99.99.193
netdisco 43659 2.8 0.3 282160 96720 ? S 16:00 0:00 nd2: #7 poll: #108495423: arpnip 99.99.99.129
netdisco 43660 2.6 0.3 281424 95940 ? S 16:00 0:00 nd2: #14 poll: #108498390: arpnip 99.99.99.4
netdisco 43663 2.8 0.3 282588 97164 ? S 16:00 0:00 nd2: #10 poll: #108494599: arpnip 99.99.99.34
netdisco 43665 3.1 0.3 281988 96392 ? S 16:00 0:00 nd2: #8 poll: #108496966: arpnip 99.99.99.69
netdisco 43667 3.1 0.3 282368 96924 ? S 16:01 0:00 nd2: #24 poll: #108496728: arpnip 99.99.99.36
netdisco 43670 3.3 0.3 282000 96580 ? S 16:01 0:00 nd2: #3 poll: #108496497: arpnip 99.99.99.7
netdisco 43672 3.6 0.3 282740 97272 ? S 16:01 0:00 nd2: #4 poll: #108500244: arpnip 99.99.99.5
netdisco 43674 3.6 0.3 281788 96360 ? S 16:01 0:00 nd2: #22 poll: #108497237: arpnip 99.99.99.14
netdisco 43676 3.7 0.3 282320 96900 ? S 16:01 0:00 nd2: #20 poll: #108496890: arpnip 10.208.77.228
netdisco 43677 3.7 0.3 282868 97248 ? S 16:01 0:00 nd2: #16 poll: #108496626: arpnip 10.240.52.162
netdisco 43680 3.8 0.3 282208 96756 ? S 16:01 0:00 nd2: #21 poll: #108497958: arpnip 99.99.99.1
netdisco 43682 4.2 0.3 282360 96888 ? S 16:01 0:00 nd2: #15 poll: #108495989: arpnip 10.216.54.167
netdisco 43684 5.3 0.3 282056 96632 ? S 16:01 0:00 nd2: #6 poll: #108498053: arpnip 99.99.99.138
netdisco 43685 5.0 0.3 282156 96724 ? S 16:01 0:00 nd2: #19 poll: #108494786: arpnip 99.99.99.7
netdisco 43688 5.6 0.3 282220 96636 ? S 16:01 0:00 nd2: #23 poll: #108498366: arpnip 10.34.1.3
netdisco 43691 13.6 0.3 281808 96360 ? S 16:01 0:00 nd2: #18 poll: #108496464: arpnip 172.21.228.15
netdisco 43693 16.4 0.3 282360 96884 ? S 16:01 0:00 nd2: #12 poll: #108498792: arpnip 10.208.70.194
netdisco 43695 23.3 0.3 281704 96120 ? S 16:01 0:00 nd2: #11 poll: #108497080: arpnip 99.99.99.227
After turning on debug and watching it for some time, I noticed that these
jobs are mostly not doing anything useful: all of them end in SNMP timeouts,
and netdisco is somehow stuck in a loop over a fixed set of 95 devices (which
is strange, since we have around 13k network devices in netdisco). See below:
netdisco@mdnetdisco:~/logs$ egrep 'try_connect' netdisco-backend* | awk '{print $5}' | sort | uniq -c | wc -l
95
Most of the uniq -c output for the last 24 hours looks like this:
3472 [10.217.72.129:161]
3470 [10.217.72.130:161]
3465 [10.217.72.162:161]
3467 [10.218.29.225:161]
3458 [10.218.77.2:161]
3463 [10.223.120.54:161]
3469 [10.240.1.65:161]
3471 [10.240.1.66:161]
3471 [10.240.52.162:161]
3472 [10.242.0.5:161]
3458 [10.242.0.69:161]
3470 [10.242.1.5:161]
3469 [10.255.10.60:161]
3467 [10.34.1.3:161]
3460 [10.34.1.4:161]
3458 [10.34.1.5:161]
3464 [10.47.120.32:161]
3465 [10.68.120.244:161]
3463 [10.68.120.73:161]
3463 [10.69.120.69:161]
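For readers who want to reproduce the tally outside the shell, here is a minimal Python sketch of the same counting pipeline. It assumes, as the awk command does, that the target is the 5th whitespace-separated field of each matching line; the function names and the sample lines are made up for illustration and do not reflect the real Netdisco log format.

```python
from collections import Counter

def count_targets(lines, marker="try_connect", field=5):
    """Count matching lines per target, like egrep | awk '{print $N}' | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue  # same filtering role as egrep 'try_connect'
        parts = line.split()
        if len(parts) >= field:
            counts[parts[field - 1]] += 1  # awk fields are 1-based
    return counts

def seconds_between_attempts(count, hours=24):
    """Average spacing between logged attempts over the given window."""
    return hours * 3600 / count

# Placeholder lines (f1..f4 are dummy fields, not real log content).
sample = [
    "f1 f2 f3 f4 [10.34.1.3:161] try_connect",
    "f1 f2 f3 f4 [10.34.1.4:161] try_connect",
    "f1 f2 f3 f4 [10.34.1.3:161] try_connect",
    "f1 f2 f3 f4 [10.34.1.5:161] something_else",
]
counts = count_targets(sample)
print(len(counts))                                # distinct targets (wc -l equivalent)
print(round(seconds_between_attempts(3472), 1))   # spacing for 3472 lines in 24 h
```

With the figures above, 3472 logged attempts in 24 hours works out to roughly one try_connect line every 25 seconds for a single device. Note that this counts log lines, which is not necessarily the same as the number of jobs.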
So, poor thing, it keeps trying to reach these over and over, stuck in
some kind of loop. Does anybody have any idea? I tried adjusting the
*_min_age settings, but they seem to be ignored. Here are the settings:
snmpver: 2
snmpforce_v2: ['0.0.0.0/0']
expire_devices: 30
expire_nodes: 90
expire_nodes_archive: 60
discover_min_age: 60200
macsuck_min_age: 43200
arpnip_min_age: 43200
dns:
  max_outstanding: 100
# the default is 2 x number of CPU cores, let's change to 4x number of CPU cores
workers:
  tasks: 'AUTO * 4'
# the default is 50 minutes
jobs_stale_after: '6 hours'
schedule:
  macwalk:
    when: '20 8,14,21 * * *'
  arpwalk:
    when: '50 8,14,21 * * *'
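As a side note on how these numbers relate, the sketch below compares the gaps between the scheduled macwalk runs with macsuck_min_age. It assumes the *_min_age values are in seconds (43200 would then be 12 hours) and that a device polled more recently than min_age is skipped by the next scheduled run; both are assumptions about the settings, not something confirmed in this thread. The helper name is made up.

```python
def cron_gaps_seconds(times):
    """Gaps in seconds between consecutive daily run times, wrapping past midnight.

    times: list of (hour, minute) tuples taken from the cron 'when' field.
    """
    minutes = sorted(h * 60 + m for h, m in times)
    gaps = []
    for i, start in enumerate(minutes):
        nxt = minutes[(i + 1) % len(minutes)]
        # A single daily run wraps to a full day (1440 minutes).
        gaps.append(((nxt - start) % 1440 or 1440) * 60)
    return gaps

macsuck_min_age = 43200                       # from the config above, assumed seconds
macwalk_times = [(8, 20), (14, 20), (21, 20)]  # '20 8,14,21 * * *'

gaps = cron_gaps_seconds(macwalk_times)
for (h, m), gap in zip(sorted(macwalk_times), gaps):
    verdict = "shorter" if gap < macsuck_min_age else "not shorter"
    print(f"{h:02d}:{m:02d} -> next run in {gap} s ({verdict} than min_age)")
```

Under those assumptions, every gap between scheduled runs (6, 7 and 11 hours) is shorter than the 12-hour min_age, so a device polled successfully in one run would be considered too recent at the very next one. This does not explain the retry loop by itself, but it is worth keeping in mind when judging whether min_age is really being ignored.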
Regards,
Pavel
--- End Message ---
--- Begin Message ---
Hi Pavel
Thanks for describing the issue so well and providing the investigation
below.
If you take one of the IPs that is having an SNMP timeout, can you find
its entry in the device_skip DB table and see what it shows? The default
maximum number of deferrals is 10, so after that many the device should
not be retried...
...UNLESS the job is submitted from the web or CLI, which overrides the
device_skip deferrals.
So I wonder whether these jobs were submitted in the past from the web or
CLI (a discoverall/*walk might do it, I need to check the code) and they
never go idle due to this logic.
There were some commits recently to make sure that web and CLI submitted
jobs always get run (as this is one way to unstick a job that is
deferred if you don't want to wait a week for a retry). The row in the
admin DB table (for jobs) will have a non-null value in the username
field if that's the case, I think.
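The rule described above can be sketched roughly as follows. This is only an illustration of the logic as stated in this message, not Netdisco's actual code: the function name and parameters are hypothetical, and whether the cut-off is "fewer than 10" or "10 or fewer" deferrals is an assumption.

```python
from typing import Optional

def should_attempt(deferrals: int, username: Optional[str],
                   max_deferrals: int = 10) -> bool:
    """Decide whether a job for a device would be attempted.

    deferrals: how often the device has already been deferred
               (the counter kept per device in device_skip).
    username:  submitter recorded on the job row; None is taken here to
               mean the job came from the scheduler, not the web or CLI.
    """
    if username is not None:
        # Web/CLI submissions override the deferral limit.
        return True
    # Scheduled jobs stop being retried once the limit is reached
    # (assumed boundary: strictly fewer than max_deferrals).
    return deferrals < max_deferrals

print(should_attempt(3, None))      # scheduled job, few deferrals
print(should_attempt(10, None))     # scheduled job, limit reached
print(should_attempt(50, "admin"))  # user-submitted job, limit overridden
```

If the suspicion in this message is right, jobs in the third category would keep being attempted regardless of how many times the device has timed out, which would match the symptoms.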
As you can see, I am rubber duck debugging here :-)
regards,
oliver.
On 2018-06-05 21:16, Pavel Skovajsa wrote:
Hello,
For the last month or so we seem to be having an issue with Netdisco
(latest version). I already worked on this with folks over at the
#netdisco chat room a couple of weeks ago, but it seems it is not
solved.
The symptoms are that for almost any device, none of the arpnip,
macsuck and discover jobs get run at all. This is easily visible in
the GUI by picking any device and looking at the last_xyz field.
On the other hand, the netdisco server is super busy; there is
feverish job activity, which can easily be seen with watch -n1 'ps aux |
grep nd2'. Note that I anonymized the IPs.
netdisco 43647 2.0 0.3 282360 96756 ? S 16:00 0:00 nd2: #25 poll: #108499921: arpnip 99.99.99.5
netdisco 43649 2.2 0.3 281452 95992 ? S 16:00 0:00 nd2: #13 poll: #108498350: arpnip 99.99.99.5
netdisco 43650 2.3 0.3 281316 95864 ? S 16:00 0:00 nd2: #5 poll: #108494531: arpnip 99.99.99.5
netdisco 43653 2.3 0.3 281148 95700 ? S 16:00 0:00 nd2: #26 poll: #108495935: arpnip 99.99.99.5
netdisco 43655 2.5 0.3 282088 96624 ? S 16:00 0:00 nd2: #9 poll: #108496811: arpnip 99.99.99.196
netdisco 43656 2.5 0.3 282700 97104 ? S 16:00 0:00 nd2: #17 poll: #108496067: arpnip 99.99.99.193
netdisco 43659 2.8 0.3 282160 96720 ? S 16:00 0:00 nd2: #7 poll: #108495423: arpnip 99.99.99.129
netdisco 43660 2.6 0.3 281424 95940 ? S 16:00 0:00 nd2: #14 poll: #108498390: arpnip 99.99.99.4
netdisco 43663 2.8 0.3 282588 97164 ? S 16:00 0:00 nd2: #10 poll: #108494599: arpnip 99.99.99.34
netdisco 43665 3.1 0.3 281988 96392 ? S 16:00 0:00 nd2: #8 poll: #108496966: arpnip 99.99.99.69
netdisco 43667 3.1 0.3 282368 96924 ? S 16:01 0:00 nd2: #24 poll: #108496728: arpnip 99.99.99.36
netdisco 43670 3.3 0.3 282000 96580 ? S 16:01 0:00 nd2: #3 poll: #108496497: arpnip 99.99.99.7
netdisco 43672 3.6 0.3 282740 97272 ? S 16:01 0:00 nd2: #4 poll: #108500244: arpnip 99.99.99.5
netdisco 43674 3.6 0.3 281788 96360 ? S 16:01 0:00 nd2: #22 poll: #108497237: arpnip 99.99.99.14
netdisco 43676 3.7 0.3 282320 96900 ? S 16:01 0:00 nd2: #20 poll: #108496890: arpnip 10.208.77.228
netdisco 43677 3.7 0.3 282868 97248 ? S 16:01 0:00 nd2: #16 poll: #108496626: arpnip 10.240.52.162
netdisco 43680 3.8 0.3 282208 96756 ? S 16:01 0:00 nd2: #21 poll: #108497958: arpnip 99.99.99.1
netdisco 43682 4.2 0.3 282360 96888 ? S 16:01 0:00 nd2: #15 poll: #108495989: arpnip 10.216.54.167
netdisco 43684 5.3 0.3 282056 96632 ? S 16:01 0:00 nd2: #6 poll: #108498053: arpnip 99.99.99.138
netdisco 43685 5.0 0.3 282156 96724 ? S 16:01 0:00 nd2: #19 poll: #108494786: arpnip 99.99.99.7
netdisco 43688 5.6 0.3 282220 96636 ? S 16:01 0:00 nd2: #23 poll: #108498366: arpnip 10.34.1.3
netdisco 43691 13.6 0.3 281808 96360 ? S 16:01 0:00 nd2: #18 poll: #108496464: arpnip 172.21.228.15
netdisco 43693 16.4 0.3 282360 96884 ? S 16:01 0:00 nd2: #12 poll: #108498792: arpnip 10.208.70.194
netdisco 43695 23.3 0.3 281704 96120 ? S 16:01 0:00 nd2: #11 poll: #108497080: arpnip 99.99.99.227
After turning on debug and watching it for some time, I noticed that
these jobs are mostly not doing anything useful: all of them end in
SNMP timeouts, and netdisco is somehow stuck in a loop over a fixed set
of 95 devices (which is strange, since we have around 13k network
devices in netdisco). See below:
netdisco@mdnetdisco:~/logs$ egrep 'try_connect' netdisco-backend* | awk '{print $5}' | sort | uniq -c | wc -l
95
Most of the uniq -c output for the last 24 hours looks like this:
3472 [10.217.72.129:161]
3470 [10.217.72.130:161]
3465 [10.217.72.162:161]
3467 [10.218.29.225:161]
3458 [10.218.77.2:161]
3463 [10.223.120.54:161]
3469 [10.240.1.65:161]
3471 [10.240.1.66:161]
3471 [10.240.52.162:161]
3472 [10.242.0.5:161]
3458 [10.242.0.69:161]
3470 [10.242.1.5:161]
3469 [10.255.10.60:161]
3467 [10.34.1.3:161]
3460 [10.34.1.4:161]
3458 [10.34.1.5:161]
3464 [10.47.120.32:161]
3465 [10.68.120.244:161]
3463 [10.68.120.73:161]
3463 [10.69.120.69:161]
So, poor thing, it keeps trying to reach these over and over,
stuck in some kind of loop. Does anybody have any idea? I tried
adjusting the *_min_age settings, but they seem to be ignored. Here
are the settings:
snmpver: 2
snmpforce_v2: ['0.0.0.0/0']
expire_devices: 30
expire_nodes: 90
expire_nodes_archive: 60
discover_min_age: 60200
macsuck_min_age: 43200
arpnip_min_age: 43200
dns:
  max_outstanding: 100
# the default is 2 x number of CPU cores, let's change to 4x number of CPU cores
workers:
  tasks: 'AUTO * 4'
# the default is 50 minutes
jobs_stale_after: '6 hours'
schedule:
  macwalk:
    when: '20 8,14,21 * * *'
  arpwalk:
    when: '50 8,14,21 * * *'
Regards,
Pavel
--- End Message ---