On Wed, 26 Mar 2008, Jon Peatfield wrote:
On Wed, 26 Mar 2008, Jan Schulze wrote:
Hi all,
I have a disk array with about 4.5 TB and would like to use it as one large
logical volume with an ext3 file system. When mounting the logical volume,
I get an "Input/Output error: can't read superblock".
Do you get any interesting kernel messages in the output of dmesg (or
/var/log/messages etc)? Which exact kernel is this (uname -r) and what arch
(i386/x86_64 etc; uname -m)?
And what driver/hardware?
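Something along these lines, run right after the failed mount, would tell us
most of it:

  uname -r; uname -m
  dmesg | tail -n 50
  tail -n 100 /var/log/messages | grep -i -e error -e sd
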
I'm using SL 4.2 with kernel 2.6 and this is what I did so far:
- used parted to create a gpt disk label (mklabel gpt) and one large
partition (mkpart primary ext3 0s -1s)
- used parted to enable LVM flag on device (set 1 LVM on)
I know it would be slow but can you test that you can read/write to all of
/dev/sda1?
Using dd's "seek" parameter, this should not take too much time. But if
creating the GPT label & partition was successful, chances are the whole
device is accessible.
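For example (the offsets below are purely illustrative for a ~4.5 TB device,
and the second command overwrites whatever is at that spot, so only do this
before running mkfs):

  # read 1 MiB from near the end of the device ('skip' positions the input)
  dd if=/dev/sda1 of=/dev/null bs=1M count=1 skip=4000000
  # write 1 MiB of zeros just past the 2 TiB mark ('seek' positions the output)
  dd if=/dev/zero of=/dev/sda1 bs=1M count=1 seek=2097300
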
- created one physical volume, one volume group and one logical volume
(pvcreate /dev/sda1, vgcreate raid6 /dev/sda1, lvcreate -l 1189706 -n
vol1 raid6)
- created an ext3 filesystem and explicitly specified a 4K blocksize, as
this should allow a filesystem size of up to 16 TB (mkfs.ext3 -m 0 -b
4096 /dev/raid6/vol1)
For some reason my EL4 notes tell me that we also specify -N (number of
inodes), as well as -E (set RAID stride), -J size= (set journal size) and -O
sparse_super,dir_index,filetype, though most of that is probably the default
these days...
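For what it's worth, such an invocation might look roughly like this - all of
the numbers are made up and would need adjusting to the actual array:

  mkfs.ext3 -m 0 -b 4096 -N 20000000 -J size=400 -E stride=64 \
            -O sparse_super,dir_index,filetype /dev/raid6/vol1
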
Specifying the stripe width is also supposed to be a good idea, as is
aligning the start of the partition to a stripe boundary (although that's
more likely to be useful without LVM on top).
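The stride is just the RAID chunk size divided by the filesystem block size;
for example (numbers purely illustrative):

  # 256 KiB chunk size, 4 KiB fs blocks  ->  -E stride=64
  # 8+2 RAID6, i.e. 8 data disks         ->  full stripe = 64 * 8 = 512 blocks

Newer e2fsprogs can also be told the full stripe width, but I don't think the
EL4 version knows about that.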
However, mounting (mount /dev/raid6/vol1 /raid) gives the superblock error
mentioned above.
Everything works as expected when using an ext2 filesystem (with LVM) or an
ext3 filesystem (without LVM). Using a smaller volume (< 2 TB) works with
ext3+LVM as well. Only the combination of > 2 TB + ext3 + LVM gives me
trouble.
Any ideas or suggestions?
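One thing worth checking before digging deeper is whether the kernel, LVM and
the filesystem all agree on the sizes involved - a quick sketch using the
names from above:

  cat /proc/partitions                    # does sda1 look like ~4.5 TB? (sizes in 1 KiB blocks)
  pvdisplay /dev/sda1 | grep -i size
  lvdisplay /dev/raid6/vol1 | grep -i 'lv size'
  dumpe2fs -h /dev/raid6/vol1 | grep -i 'block count'
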
We found that in at least some combinations of kernel/hardware (drivers
really, I expect), support for >2TB block devices was still rather flaky
(at least in the early versions of EL4).
We ended up getting our RAID boxes to present multiple LUNs, each under 2TB,
which we could then set up as PVs and join back together into a single VG,
still giving us an LV bigger than 2TB. I'm rather conservative in such
things, so we still avoid big block devices at the moment.
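In case it's useful, the rough shape of that setup (device names and sizes are
made up; here the array presents three LUNs of ~1.5TB each):

  pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1
  vgcreate bigvg /dev/sdb1 /dev/sdc1 /dev/sdd1
  vgdisplay bigvg | grep Free                # note the number of free extents
  lvcreate -l <free-extents> -n vol1 bigvg   # one LV spanning all three PVs, >2TB in total
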
[ obviously, with single disk sizes growing at the rate they are, the >2TB
block-device code is going to get a LOT more testing! ]
We're successfully using devices up to 7 TB with a single XFS
filesystem on them, under SL4/5 (but I think we started doing this with
4.3, not 4.2). I have no hope of being able to check (xfs_repair) those
should this ever become necessary, though - from what I've read it would
require more RAM than fits into a server today.
However, some of the tools (e.g. ext2/3 fsck) still seemed to fail at about
3.5TB, so we ended up needing to build the 'very latest' tools to be able to
run fsck properly (the ones included in EL4 - and EL5, I think - get into an
infinite loop at some point while scanning the inode tables).
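Building them is simple enough if you need to do the same (the version number
and device name here are just placeholders):

  tar xzf e2fsprogs-<version>.tar.gz
  cd e2fsprogs-<version>
  ./configure && make
  # run the freshly built fsck straight out of the build tree,
  # without replacing the system-installed one:
  ./e2fsck/e2fsck -f /dev/<vg>/<lv>
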
Currently we try to avoid 'big' ext3 LVs; the one where we discovered the
fsck problems was originally ~6.8TB, but we ended up splitting it into
several smaller LVs since even with working tools it still took ~2 days to
fsck... (and longer to dump/copy/restore it all!)
Some of my co-workers swear by XFS for 'big' volumes, but then we do have SGI
boxes where XFS (well, CXFS) is the expected default fs. I've not done much
testing with XFS on SL, mainly because TUV don't like XFS much...
I think it's still the best choice for large (> 2 TB) filesystems. The xfs
available in SL4 contrib has done very well here. There are some
interesting effects when such a filesystem runs full: you have to
remount it with the "inode64" option in order to be able to create new
files, and then you discover that quite a few applications are not ready
for 64-bit inode numbers. But that aside, it has done very well. No other
headaches. We're now beginning to deploy large (>10TB) XFS filesystems
under SL5.
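In practice that just means adding "inode64" to the mount options, e.g.
(device and mount point are made up):

  # in /etc/fstab:
  /dev/bigvg/xfs1  /data  xfs  defaults,inode64  0 0
  # or by hand:
  umount /data && mount -o inode64 /dev/bigvg/xfs1 /data
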
This being said, we now also have lustre OSTs (using a modified ext3)
7.5 TB in size. No problems so far, but then none of them has run full or
required an fsck yet.
--
Stephan Wiesand
DESY - DV -
Platanenallee 6
15738 Zeuthen, Germany