Specs:

CentOS 5.2 dom0 and domU, both running 2.6.18-53.1.6.el5xen
OpenAFS 1.4.4 rebuilt (for our defaults) from OpenAFS's rpm
domU has a 45 GB AFS cache volume mounted from the dom0

The domU is a webserver running lighttpd, Drupal, and a fair amount of custom Python. The dom0 does nothing but run Xen guests. The machines are new (dual quad-core Xeons at 2 GHz, 32 GB on the dom0, 8 GB in the domU). This happens on multiple identical machines (cloned from a common source).

On a frequent basis (sometimes as often as every few minutes), we lose contact with any AFS server that we are hitting heavily, for a couple of minutes at a time:

  Jan 30 10:28:48 www4 kernel: afs: Lost contact with volume location server 149.169.146.57 in cell mars.asu.edu
  Jan 30 10:30:03 www4 kernel: afs: volume location server 149.169.146.57 in cell mars.asu.edu is back up

I see corresponding errors in lighttpd's log:

  2009-01-30 10:28:56: (mod_fastcgi.c.2618) FastCGI-stderr: Traceback (most recent call last):
  IOError: [Errno 110] Connection timed out: '/afs/mars.asu.edu/themis-data/pds/browse/i267xx/I26712018.png'

It isn't isolated to a single AFS server; any of the servers in the cell can trigger the behavior.

Ideas?

...Chris

--
Chris Kurtz, [email protected]
Systems Manager
Mars Space Flight Facility
Arizona State University
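
For anyone wanting to quantify how long and how often these outages occur, the "Lost contact" and "is back up" kernel messages can be paired up per server. The following is only a minimal sketch in Python: the function names and regular expressions are hypothetical, written against the two message shapes quoted above and not against the full set of messages the OpenAFS client can emit. Syslog timestamps carry no year, so the year is passed in as an assumption, and month-name parsing assumes an English/C locale.

```python
import re
from datetime import datetime

# Patterns match only the two message shapes quoted in this post (assumption).
_PREFIX = r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: afs: "
LOST = re.compile(_PREFIX + r"Lost contact with (.+? server) (\S+) in cell (\S+)")
BACK = re.compile(_PREFIX + r"(.+? server) (\S+) in cell (\S+) is back up")


def parse_ts(text, year):
    """Parse a syslog timestamp such as 'Jan 30 10:28:48' in the given year."""
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def outages(lines, year=2009):
    """Return (kind, address, cell, start, seconds) for each completed outage."""
    down = {}      # (kind, address, cell) -> time contact was first lost
    finished = []
    for line in lines:
        m = LOST.match(line)
        if m:
            ts, kind, addr, cell = m.groups()
            # Keep the earliest "lost" time if the message repeats.
            down.setdefault((kind, addr, cell), parse_ts(ts, year))
            continue
        m = BACK.match(line)
        if m:
            ts, kind, addr, cell = m.groups()
            start = down.pop((kind, addr, cell), None)
            if start is not None:
                seconds = (parse_ts(ts, year) - start).total_seconds()
                finished.append((kind, addr, cell, start, seconds))
    return finished


if __name__ == "__main__":
    sample = [
        "Jan 30 10:28:48 www4 kernel: afs: Lost contact with volume location "
        "server 149.169.146.57 in cell mars.asu.edu",
        "Jan 30 10:30:03 www4 kernel: afs: volume location server "
        "149.169.146.57 in cell mars.asu.edu is back up",
    ]
    for kind, addr, cell, start, seconds in outages(sample):
        print(f"{start} {kind} {addr} ({cell}) down for {seconds:.0f} s")
```

Run against the two lines quoted above, this reports a single outage of 75 seconds for 149.169.146.57. Feeding it a full /var/log/messages would show whether the outages cluster around particular times or servers.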
