I rebooted to upgrade to kernel 3.4.1. I accidentally had the
combination of uvesafb, nouveau KMS, and nvidia-drivers enabled, which
caused my system to go blank after rebooting. I was not able to SSH
into the machine, so I did the magic-sysrq REISUB to reboot into my
previous kernel. When it booted into the previous kernel (3.3.5), I
saw a whole bunch of "I/O error" messages scrolling by, for every disk
in my RAID array. I had never seen these errors before. I hoped it
was just some module confusion because I was booting a different
kernel. I was able to boot into my root filesystem, but the RAID did
not assemble. After blacklisting nouveau and rebooting into 3.4.1,
none of those I/O errors appeared, but mdraid failed with
this message:

 * Starting up RAID devices ...
 * mdadm main: failed to get exclusive lock on mapfile
mdadm: /dev/md2 is already in use.
mdadm: /dev/md1 is already in use.
 [ !! ]

Oh no! Heart beating quickly... terabytes of data... Google found
nothing useful for these messages.

My mdadm.conf has not changed at all, and no physical disks have been
added or removed in over a year. I have, of course, updated hundreds
of packages since my last reboot, including mdadm.

/proc/mdstat shows that it's not detecting all of the member
disks/partitions:

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : inactive sdb1[0](S)
      1048575868 blocks super 1.1

md2 : inactive sdf2[5](S)
      904938415 blocks super 1.1

unused devices: <none>


Those normally include all of the disks sdb through sdf, with
partitions 1 and 2 from each disk.

My mdadm.conf has always had only two ARRAY lines (for /dev/md1 and
/dev/md2) with the UUIDs of the arrays. Previously, the member disks
were always automatically detected and assembled when I booted and
started mdadm. Running mdadm --query --examine on the partitions
showed that they still contained valid RAID information, so I felt
confident in trying to reassemble the arrays.
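
For reference, the check amounted to this; the shell glob is my
assumption about what covers all the member partitions, so adjust it
to your own device names:

mdadm --query --examine /dev/sd[bcdef][12]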

To fix it, I did:

/etc/init.d/mdraid stop

to stop the arrays (I could also have run "mdadm -Ss", which is what
the stop script did anyway).
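
For clarity, those short options expand to the long form below (my
reading of the mdadm man page, not necessarily the exact command the
init script runs):

mdadm --stop --scan    # -Ss: stop every active array mdadm knows about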

Then I edited mdadm.conf and added a DEVICE line:

DEVICE /dev/sd[bcdef][12]
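
The relevant part of the config then looked roughly like this; the
ARRAY lines are shown with placeholder UUIDs, since the real values
are specific to my arrays:

DEVICE /dev/sd[bcdef][12]
# placeholder UUIDs below - substitute the real ones (see mdadm --examine output)
ARRAY /dev/md1 UUID=<uuid-of-md1>
ARRAY /dev/md2 UUID=<uuid-of-md2>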

So now I was telling it specifically where to look. I then restarted mdraid:

/etc/init.d/mdraid start

et voilà! My RAID was back and functioning. I don't know whether this
was the result of a change in kernel or mdadm behavior, or simply of
my REISUB leaving the RAID in a strange state.
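
To double-check that everything really was back, a couple of standard
checks (the usual commands, not something specific to this fix):

cat /proc/mdstat          # both arrays should be active with all members listed
mdadm --detail /dev/md1   # per-array view: state, member devices, UUID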
