Re: [OpenAFS] Tracking down AFS Fileserver corruption

Jack Neely Mon, 28 Nov 2011 11:58:07 -0800

On Mon, Nov 28, 2011 at 08:34:00PM +0100, Stephan Wiesand wrote:
> Hi Jack,
> 
> no help, just a few dumb questions inline:
> 
> On Nov 28, 2011, at 19:13 , Jack Neely wrote:
> 
> > Folks,
> > 
> > I'm deploying new OpenAFS 1.6.0 DAFS file servers on fully updated RHEL
> > 6.1 servers and I've stumbled across a data corruption problem.  The ext4
> > filesystems on the vice mounts are not getting corrupted, just the AFS
> > volume data.
> > 
> > Our /vicep[ab] mounts are provided by an EMC Clariion SAN array using
> > the PowerPath driver.  Each of the two vice mounts has 4 paths and is
> > not partitioned.  I've directly formatted the /dev/emcpower[ab] block
> > device as ext4.  Of course, the /dev/emcpowerX device is mounted on
> > /vicepX.
> 
> emcpower{a,b} map to sd{c,e} ?
>


emcpowera is made of the paths: sdc sde sdg sdi

emcpowerb is made of the paths: sdb sdd sdf sdh

Here's the information from the powermt tool:
http://pastebin.com/sfmJX5Kc
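
To read the kernel messages against that layout, a small lookup like the one below may help. It is only a sketch: the table is copied from the path lists above, and the helper name owner_of is made up for illustration.

```python
# Path membership as listed above (taken from the powermt output).
PATHS = {
    "emcpowera": ["sdc", "sde", "sdg", "sdi"],
    "emcpowerb": ["sdb", "sdd", "sdf", "sdh"],
}

def owner_of(sd_name, paths=PATHS):
    """Return the PowerPath pseudo device that a native sd path belongs to.

    Returns None if the sd device is not one of the listed paths.
    """
    for pseudo, members in paths.items():
        if sd_name in members:
            return pseudo
    return None

if __name__ == "__main__":
    # The two devices named in the "unknown partition table" messages.
    for dev in ("sdc", "sde"):
        print(dev, "->", owner_of(dev))
```

Both devices named in the kernel messages, sdc and sde, are paths of emcpowera.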

> > Every hour our OCS Inventory agent runs which eventually runs "fdisk -l"
> > to get statistics for the storage on the server.  When I was moving test
> > volumes to the new server and the agent ran fdisk -l the kernel would
> > print:
> > 
> >    Nov 28 13:01:39 xxx kernel: sdc: unknown partition table
> >    Nov 28 13:01:39 xxx kernel: sde: unknown partition table
> >    Nov 28 13:01:49 xxx kernel: sdc: unknown partition table
> >    Nov 28 13:01:49 xxx kernel: sde: unknown partition table
> 
> If the devices aren't partitioned, why would it ever find a partition table?

It shouldn't.  But why does it keep looking (and cause corruption)?
Before I figured out that the corruption was happening at the same time
as these messages, I didn't think that there was any connection.
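
One way to check that timing more systematically would be something like the sketch below. It assumes the syslog format shown in the quoted messages, and the volume move window (moves) is a hypothetical value that would have to be filled in from when each move actually started and finished. The year is also an assumption, since syslog lines do not carry one.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Nov 28 13:01:39 xxx kernel: sdc: unknown partition table
RESCAN_RE = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<dev>sd[a-z]+): unknown partition table"
)

def parse_rescans(lines, year=2011):
    """Return (timestamp, device) for each 'unknown partition table' line."""
    events = []
    for line in lines:
        m = RESCAN_RE.match(line)
        if m:
            # %b is locale dependent; this assumes English month names.
            ts = datetime.strptime(f"{year} {m.group('ts')}",
                                   "%Y %b %d %H:%M:%S")
            events.append((ts, m.group("dev")))
    return events

def rescans_during(events, windows):
    """Return the events whose timestamp falls inside any (start, end) window."""
    return [(ts, dev) for ts, dev in events
            if any(start <= ts <= end for start, end in windows)]

if __name__ == "__main__":
    log = [
        "Nov 28 13:01:39 xxx kernel: sdc: unknown partition table",
        "Nov 28 13:01:39 xxx kernel: sde: unknown partition table",
        "Nov 28 13:01:49 xxx kernel: sdc: unknown partition table",
        "Nov 28 13:01:49 xxx kernel: sde: unknown partition table",
    ]
    # Hypothetical window for a single volume move.
    moves = [(datetime(2011, 11, 28, 13, 0, 0),
              datetime(2011, 11, 28, 13, 5, 0))]
    for ts, dev in rescans_during(parse_rescans(log), moves):
        print(ts.isoformat(), dev)
```

Feeding it /var/log/messages and the real move times would show whether every corrupted volume lines up with a rescan, or only some of them.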

> 
> This may have changed, but Red Hat used to not support setups with 
> filesystems on unpartitioned block devices, I believe.
> 

I have a support case open with Red Hat as well and they have not
indicated this.  In fact, not partitioning SAN devices (especially large
ones) seems to be accepted practice nowadays.

> > and the volume being moved at that exact time would be corrupt.  Usually
> > the server would soon detect this and salvage the volume, but the level
> > of corruption has varied.
> 
> I don't have experience with running 1.6 servers in production yet, but since 
> the AFS fileserver is entirely running in userland, it should not cause this 
> kind of corruption. That being said, there's an open BZ regarding ext4 
> corruption due to Ceph userland processes...
> 

The ext4 file system is not corrupted, so I think the AFS daemons are
somehow being disturbed and not writing complete data.

> > The above messages and corruption only seem to happen when volume moves
> > are in progress.  Running fdisk -l on an idle server produces no
> > messages.
> 
> Any messages if you run bonnie++ or iozone on the filesystem when the agent 
> runs?
> 

Haven't tried yet.  Good idea though.
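
Short of a full bonnie++ or iozone run, a simple write-and-verify load could also show whether plain file I/O on the vice partition gets damaged while fdisk -l runs. The sketch below is not a replacement for either benchmark, just a checksum test. The function names and the file naming are made up for illustration, and the demo writes to a temporary directory rather than to a real /vicepX mount.

```python
import hashlib
import os
import tempfile

def write_files(directory, count=8, size=1 << 20):
    """Write `count` files of random data; return {path: sha256 hex digest}."""
    digests = {}
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(directory, f"verify-{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to disk
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests

def verify_files(digests):
    """Re-read every file and return the paths whose contents changed."""
    bad = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            bad.append(path)
    return bad

if __name__ == "__main__":
    # Demo on a temporary directory; on the server this would point at a
    # scratch directory on the vice partition instead.
    with tempfile.TemporaryDirectory() as scratch:
        digests = write_files(scratch, count=4, size=64 * 1024)
        print("damaged files:", verify_files(digests))
```

The idea would be to call write_files in a loop while the inventory agent (or a manual fdisk -l) runs, then call verify_files. Note that an immediate re-read may be served from the page cache, so verifying after a remount or after dropping caches would be a stronger check.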

> > Other things cause the above messages to be re-printed, such as running
> > fsck -yf /dev/emcpowera.
> 
> Is this safe to do on a mounted ext4 filesystem?
> 

I ran fsck on the unmounted SAN LUN to make sure I didn't have file
system corruption.  I was surprised that it seemed to trigger partition
rescans again....

Jack

> >  They occur during the early hours of the
> > morning as well from something that appears to be related to a cron job
> > I've not tracked down yet.  
> > 
> > I need some help in figuring out what is causing the corruption and,
> > more importantly, how to fix things.
> 
> If the AFS fileserver could be run under a different account than root, one 
> could be completely confident it's not the culprit. As things are, I'm only 
> 99% confident...
> 
> Best regards,
>       Stephan
> > 
> > Thanks,
> > Jack Neely
> > 
> > -- 
> > Jack Neely <[email protected]>
> > Linux Czar, OIT Campus Linux Services
> > Office of Information Technology, NC State University
> > GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89
> > _______________________________________________
> > OpenAFS-info mailing list
> > [email protected]
> > https://lists.openafs.org/mailman/listinfo/openafs-info
> 
> -- 
> Stephan Wiesand
> DESY -DV-
> Platanenallee 6
> 15738 Zeuthen, Germany
> 

-- 
Jack Neely <[email protected]>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89
