On 06/10/2012 19:43, Ryan Lane wrote:
On Sat, Oct 6, 2012 at 9:43 AM, Damian Zaremba
<[email protected]> wrote:
1) DNS is broken/half working/annoying/argh
phoenix:~ damian$ dig wmflabs.org NS @labs-ns0.wikimedia.org

; <<>> DiG 9.6-ESV-R4-P3 <<>> wmflabs.org NS @labs-ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17397
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;wmflabs.org.            IN    NS

;; Query time: 150 msec
;; SERVER: 208.80.152.33#53(208.80.152.33)
;; WHEN: Sat Oct  6 17:33:03 2012
;; MSG SIZE  rcvd: 29

phoenix:~ damian$ dig wmflabs.org NS @labs-ns1.wikimedia.org

; <<>> DiG 9.6-ESV-R4-P3 <<>> wmflabs.org NS @labs-ns1.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46082
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;wmflabs.org.            IN    NS

;; ANSWER SECTION:
wmflabs.org.        3600    IN    NS    labs-ns1.wikimedia.org.
wmflabs.org.        3600    IN    NS    labs-ns0.wikimedia.org.

;; Query time: 175 msec
;; SERVER: 208.80.154.19#53(208.80.154.19)
;; WHEN: Sat Oct  6 17:33:09 2012
;; MSG SIZE  rcvd: 85

Also, the SOA is wrong, as it still points to virt0:
phoenix:~ damian$ dig wmflabs.org SOA @labs-ns1.wikimedia.org

; <<>> DiG 9.6-ESV-R4-P3 <<>> wmflabs.org SOA @labs-ns1.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46569
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;wmflabs.org.            IN    SOA

;; ANSWER SECTION:
wmflabs.org.        3600    IN    SOA    virt0.wikimedia.org. hostmaster.wikimedia.org. 1349449000 1800 3600 86400 7200

;; Query time: 128 msec
;; SERVER: 208.80.154.19#53(208.80.154.19)
;; WHEN: Sat Oct  6 17:33:39 2012
;; MSG SIZE  rcvd: 92


Seems the DNS servers are only pointing at a single LDAP backend, and
the LDAP backend went non-responsive for a little while. I added a bug
for this:

https://bugzilla.wikimedia.org/show_bug.cgi?id=40825
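
One way to spot this kind of split between the two nameservers is to compare the header lines dig prints for each of them. The following is only a sketch, not something that was run as part of this thread: `dig_summary` is a hypothetical helper that reads dig output on stdin and reports the status, whether the `aa` (authoritative answer) flag was set, and the answer count.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): summarise the header
# of a dig reply read from stdin.
dig_summary() {
  awk '
    /->>HEADER<<-/ {
      # e.g. ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17397"
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
    }
    /^;; flags:/ {
      # e.g. ";; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0"
      flags = $0
      sub(/^;; flags: */, "", flags)   # drop the leading label
      sub(/;.*/, "", flags)            # keep only the flag list
      aa = ((" " flags " ") ~ / aa /) ? "yes" : "no"
      for (i = 1; i <= NF; i++)
        if ($i == "ANSWER:") { answers = $(i + 1); sub(/,$/, "", answers) }
    }
    END { printf "status=%s aa=%s answers=%s\n", status, aa, answers }
  '
}

# Usage (needs network access; server names are the ones quoted above):
#   for ns in labs-ns0.wikimedia.org labs-ns1.wikimedia.org; do
#     printf "%s: " "$ns"; dig wmflabs.org NS "@$ns" | dig_summary
#   done
```

A server that replies NOERROR with no `aa` flag and zero answers for a zone it is supposed to serve, as labs-ns0 does above, is the symptom to look for.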

2) Instance reboots tend to result in instances never coming back - please
could someone fix bots-cb (same as sql2, first reboot took it down, second
results in 'failed').

Due to the same issue as sql2. It wasn't defined in libvirt. This is
likely due to when we did the cold migrations off the old hardware.
I'm going to run a script on Monday to solve this problem for any
future reboots.
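
The script mentioned here is not shown in the thread. As a rough sketch of the kind of check involved, and assuming a plain-text list of the instances that should exist on a host is available, that list can be compared against the domains libvirt actually has defined. `undefined_instances` and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not the script referred to above): print every name
# in the expected list ($1) that is missing from the list of
# libvirt-defined domains ($2). Both files hold one name per line.
undefined_instances() {
  expected=$1
  defined=$2
  # -F fixed strings, -x whole-line match, -v invert the match:
  # keep expected names that have no exact match among the defined ones.
  # grep exits 1 when nothing is printed; treat that as success.
  grep -Fxv -f "$defined" "$expected" || [ "$?" -eq 1 ]
}

# Usage on a compute host (assumption: expected.txt is maintained elsewhere):
#   virsh list --all --name | sed '/^$/d' > defined.txt
#   undefined_instances expected.txt defined.txt
```

Any name this prints would need to be defined in libvirt again before a reboot of that instance could succeed.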

3) Logins randomly fail due to key auth timing out (seems to be related to
nfs crapping out)

Due to DNS

4) Home dirs sometimes randomly drop their mounts (seems to be related to
nfs crapping out also, dmesg just shows rpc timeouts)

Due to DNS

(Yes, I know it's a Saturday, but as the guy in Code Rush said: writing
software is different from selling real estate. When you sell real estate,
the people you sell to sleep at night. When they go to sleep, you have to
stop selling real estate. Computers never sleep.)

Meh. No problems there. If something is broken I'm going to fix it
whether it's Saturday or not ;).

- Ryan
// Forwarding reply to list

bots-cb seems to still not be back, never started pinging again after you rebooted it. Trying a reboot still reports 'Failed' with no console output. I assume it failed to boot for some reason but have no other means of poking it :)

Damian

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l