I think that may be it. The OID I was using is no longer valid. So the
SNMP response that came back had numbers in it, but it also looks like
the checksum was broken.
Not clear to me why I thought I could do this without doing the index thing.
I hate doing the index thing.
bp
On 10/24/2014 10:32 PM, Forrest Christian (List Account) via Af wrote:
A power cycle and a reboot should be identical in almost every case.
The reboot actually triggers a hardware reset internally in the
processor, which should clear everything out. Of course as soon as I
say that it is identical, someone will find an example where it is not.
I'm not where I can look at the trace you sent, but I'm surprised it
contains errors. I do know that the unit will return a response which
may look like this if the oid is invalid.
Did you adjust your oids in cacti after the removal of the mystery
expansion unit from the table? If not, this is likely the problem.
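A minimal sketch of why removing a unit could invalidate the OIDs, assuming the expansion table is indexed by position on the bus. The OID prefix, column number, and serial labels below are hypothetical placeholders, not values from the SiteMonitor MIB:

```python
# Hypothetical enterprise prefix and column number; NOT the real SiteMonitor MIB.
BASE = "1.3.6.1.4.1.99999.1"
SYNC_EVENTS_COLUMN = 5


def oid_for(units, serial, column=SYNC_EVENTS_COLUMN):
    """Build the OID for one column of a unit, using its 1-based table position as the index."""
    index = units.index(serial) + 1
    return f"{BASE}.{column}.{index}"


# Assumed table contents before and after zeroing the phantom unit's serial.
before = ["MYSTERY", "SYNCINJ"]
after = ["SYNCINJ"]

old_oid = oid_for(before, "SYNCINJ")
new_oid = oid_for(after, "SYNCINJ")
print(old_oid)  # the OID a poller would still be configured with
print(new_oid)  # the OID the unit would answer on after the removal
```

If the table really is positional, any data source that points past the removed row would need its index adjusted by hand.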
In regards to the unit being there from the factory... My guess is if
you had this unit listed in there from the get go, then it probably
was the expansion unit we use to test the expansion bus here. It's
supposed to be factory reset before shipping but it would not shock me
if it wasn't. We actually had a short period that a largish
percentage went out not factory reset due to a tester software
issue. Not really a problem but we hate to have them go out in any
other state.
On Oct 24, 2014 5:08 PM, "Bill Prince via Af" <[email protected]> wrote:
You mean from the web GUI? Sure.
I presume a power cycle does something different from a reboot?
I was always curious about this particular SiteMonitor, as it came
up with the extra device on the expansion bus from the get-go.
I'd never worried about it, and then I saw the discussion about
getting rid of old devices with the zeroed-serial trick.
Don't go there! It's a trap!
bp
On 10/24/2014 2:52 PM, George Skorup (Cyber Broadcasting) via Af
wrote:
Can you post a screenshot of your expansion, binary and analog tabs?
Also, I bet if you power-cycle it, it will be fine again. I was
working with Forrest on a bug where the SyncInjector and some
other newer modules would mysteriously disappear from the bus. He
was able to reproduce and get a fixed up firmware load for the
modules. Something about one thing booting up faster than
another, or something like that.
On 10/24/2014 4:41 PM, Bill Prince via Af wrote:
Gotcha!
I removed all the Data Sources except one (PWR1). Suddenly
that data was making it into cacti.
Then I added back in all the Data Sources coming _JUST_ from the
SiteMonitor itself. That also worked.
Then I added in one of the Data Sources from the SyncInjector
(sync events), which happens to be the only unit on the
expansion bus past where I removed the non-existent unit.
This broke it again.
So I have apparently uncovered a bug where removing a unit from
the expansion bus (by zeroing the serial number) causes the
SiteMonitor to break SNMP responses. I think it's probably
just a bad checksum, but I will leave that up to him. I
forwarded the pcap trace to him.
I will probably also swap out the SiteMonitor that has the problem.
Thanks guys!
bp
On 10/24/2014 1:57 PM, Bill Prince via Af wrote:
Then again....
Not sure why I didn't notice this the first (or second)
time. Wireshark is telling me I have a malformed packet;
either a broken header or bad checksum. So even though the
SNMP response is coming in with the expected data, it's getting
dropped before it gets into cacti because of the malformed packet.
This would explain why removing a unit on the expansion bus
changed things...
bp
On 10/24/2014 1:32 PM, Bill Prince via Af wrote:
OK. Confirmed. The SiteMonitor is getting the SNMP
requests, and it is responding with the expected values.
I ran a pcap trace both at the SiteMonitor as well as at the
ethernet port on the cacti server. SNMP requests/responses
are going both ways (and at both ends). In fact, spine appears
to be doing 3 retries.
One thing I didn't expect is that just before the SNMP
requests, there are two attempts to open a telnet session to the
SiteMonitor. Not sure where that is coming from, except
perhaps for the Manage plugin (which I de-installed several
weeks ago).
So something is broken inside cacti. How/why this was
caused by zeroing a serial number from a non-existent
expansion unit is completely baffling to me.
I also have no clue how to fix it, because cacti "thinks"
there was no response.
bp
On 10/24/2014 11:16 AM, George Skorup (Cyber Broadcasting) via
Af wrote:
I am thoroughly confused. Is your community string correct?
Can you increase the device SNMP timeout, like 1000ms instead
of 250ms? What's your device down detection set to? Is it
showing down in the device list?
I have seen some base units go kinda screwy and respond
slower and a reboot doesn't fix it, they needed a power-cycle.
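One way to try the longer timeout outside of cacti is to run Net-SNMP's `snmpget` by hand with `-t` (timeout in seconds) and `-r` (retries). The sketch below only builds the command line. SNMP v2c, the `public` community string, and the standard sysUpTime OID are assumptions here, so substitute the values the device actually uses:

```python
import shlex


def snmpget_command(host, oid, community="public", timeout_ms=1000, retries=3):
    """Build a Net-SNMP snmpget command with an explicit timeout and retry count."""
    timeout_s = timeout_ms / 1000  # snmpget's -t option takes seconds
    return [
        "snmpget", "-v", "2c", "-c", community,
        "-t", str(timeout_s), "-r", str(retries),
        host, oid,
    ]


# 1.3.6.1.2.1.1.3.0 is sysUpTime; version and community are assumed, not from the thread.
cmd = snmpget_command("10.13.114.254", "1.3.6.1.2.1.1.3.0")
print(shlex.join(cmd))
```

Running the same command with `timeout_ms=250` gives a rough comparison against the poller's current setting.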
On 10/24/2014 11:25 AM, Bill Prince via Af wrote:
Now thrice.
No joy in Mudville.
bp
On 10/24/2014 8:07 AM, Bill Prince via Af wrote:
Yah. Twice now.
bp
On 10/23/2014 11:06 PM, George Skorup (Cyber Broadcasting)
via Af wrote:
Gotta be the poller cache. Did you try a rebuild?
On 10/23/2014 11:03 PM, Bill Prince via Af wrote:
Getting closer. When I look in the SNMP cache, there
is no entry for the device.
Looking in the log (without debug), I get:
10/23/2014 08:34:25 PM - SPINE: Poller[0] Host[797] TH[1] DS[12316]
WARNING: SNMP timeout detected [250 ms], ignoring host
'10.13.114.254'
So there is something causing the SNMP request to barf
inside cacti. When I do an snmpget from the CLI, it
all looks fine. Likewise, the realtime plugin is
working fine too.
So when realtime is doing the SNMP queries outside the
poller, they are fine. They fail only when spine is doing the
SNMP requests.
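For anyone sifting through a long cacti log, the spine warning quoted above can be picked apart with a regular expression. This is a rough sketch that assumes the message is on a single line with the hyperlinks stripped, and it is only checked against that one example:

```python
import re

# The warning from the log above, joined onto one line.
LINE = (
    "10/23/2014 08:34:25 PM - SPINE: Poller[0] Host[797] TH[1] DS[12316] "
    "WARNING: SNMP timeout detected [250 ms], ignoring host '10.13.114.254'"
)

PATTERN = re.compile(
    r"Host\[(?P<host_id>\d+)\].*?DS\[(?P<ds_id>\d+)\].*?"
    r"SNMP timeout detected \[(?P<timeout_ms>\d+) ms\], "
    r"ignoring host '(?P<ip>[^']+)'"
)


def parse_timeout(line):
    """Return the host id, data source id, timeout and IP from a spine timeout warning, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "host_id": int(m["host_id"]),
        "ds_id": int(m["ds_id"]),
        "timeout_ms": int(m["timeout_ms"]),
        "ip": m["ip"],
    }


info = parse_timeout(LINE)
print(info)
```

Counting these per data source would show whether every DS on the host times out or only the ones past the removed unit.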
bp
On 10/23/2014 4:12 PM, George Skorup (Cyber Broadcasting)
via Af wrote:
You divided by zero, didn't you?
Are you sure your modules are in the same order as before?
On 10/23/2014 1:29 PM, Bill Prince via Af wrote:
I noticed an "Expansion Unit" on one of my SiteMonitors
this morning. It said something about "Device
Removed" or something like that.
Remembering the discussion the other day on this topic,
I put a "0" in the Serial # for the non-existent unit,
rescanned, & rebooted.
Now, none of the OIDs work in Cacti. If I do a
simple snmpget on any of the OIDs that I use, the
correct information comes back. Several of the OIDs are
on the base unit anyway, so they would not have moved,
and further, the OIDs don't reference the serial number.
So... what did I do, and how do I fix it?