On Thu, 5 Mar 2009, Miles O'Neal wrote:

recap: new 64 bit Intel quadcore server with Adaptec SATA RAID
controller, 16x1TB drives.  1 drive JBOD for OS.  The rest are
set up as RAID6 with 1 spare.  We've tried EL5.1 + all yum updates,
and EL5.2 stock.  We can't get /dev/sdb1 (12TB) stable with ext2
or xfs (ext3 blows up in the journal setup).

So I decided to carve /dev/sdb up into a dozen partitions and
use LVM.  Initially I want to use one partition per LV and make
each of those one xfs FS.  Then as things grow I can add a PV
(one partition per PV) into the appropriate VG and grow the LV/FS.
Between typos and missteps, I've had to build up and tear down the
LV pieces several times.  And now I get messages such as

 Aborting - please provide new pathname for what used to be 
/dev/disk/by-path/pci-0000:01:00.0-scsi-0:0:1:0-part6
or
 Device /dev/sdb6 not found (or ignored by filtering).

I clean it all up, wipe out all the files in /etc/lvm/*/*
(including cache/.cache), and try again, still broken.
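The teardown/cleanup each time amounts to something like this (the VG and device names here are illustrative, not the real ones):

```shell
# Tear down top-down: LVs, then the VG, then the PV labels.
lvremove -f VolGroup00            # remove every LV in the VG
vgremove VolGroup00               # remove the VG itself
pvremove -ff /dev/sdb6            # wipe the LVM label from each PV

# Zero the start of the partition so no stale metadata survives.
dd if=/dev/zero of=/dev/sdb6 bs=512 count=2048

# Clear LVM's own bookkeeping on the host.
rm -f /etc/lvm/cache/.cache /etc/lvm/archive/* /etc/lvm/backup/*
```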

I tried rebooting.  Still broken.

How can I fix this short of a full reinstall?

The whole LVM system feels really kludgy.  I suppose there's
not a better alternative at this time?

Just a little history for you to mull over... A while ago, block devices over 2TB (1TB on some systems!) were not supported properly, since with the default block size (512 bytes) the block numbers went over 2^32 (2^31 was the limit on some systems, including earlier Linux kernels, because the type had been left signed by accident).
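For reference, the arithmetic behind those limits is easy to check:

```shell
# 512-byte sectors with 32-bit sector numbers cap a device at 2 TiB:
echo $(( 2**32 * 512 ))   # 2199023255552 bytes = 2 TiB
# An accidentally signed 32-bit count halves that to 1 TiB:
echo $(( 2**31 * 512 ))   # 1099511627776 bytes = 1 TiB
```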

While 64-bit block offsets have since been added, each device driver needs to implement them properly, so even if people say 'it works for me with device xxx', you can't be sure it will work for you unless you also have device xxx and exactly the same (or newer, I guess) drivers.

To work around this, most RAID devices have long supported a 'hack' where a larger volume is exported to the host as a number of slices - typically each under 2TB - as new LUNs on the same bus,target (or whatever). The host sees these as individual 'disks', but you can fairly easily join things back together by putting the PVs into the same VG and then slicing things up into the LVs of your choice.
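Rejoining such slices under LVM only takes a few commands; a minimal sketch, with hypothetical device names and sizes (sdb1 and sdc1 standing in for two sub-2TB LUN slices):

```shell
# Turn each sub-2TB LUN slice into a PV.
pvcreate /dev/sdb1 /dev/sdc1

# Pool them into one VG ('bigvg' is a made-up name).
vgcreate bigvg /dev/sdb1 /dev/sdc1

# Carve an LV spanning both slices and put a filesystem on it.
lvcreate -L 3T -n data bigvg
mkfs.xfs /dev/bigvg/data
```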

Note that splitting your large device into smaller partitions isn't the same, since the kernel will still need to use large block offsets to access the parts more than 2TB into the device.

Some people think I'm really silly/old-fashioned for still doing things this way (using large devices works for them, and it might work for me with some hardware), but there is little extra overhead and my sanity is better preserved by not tempting fate.

e.g. one box I'm using at the moment has a ~4TB RAID (a Dell PERC/6i, in fact), but it is configured to present that as three devices, which we then partition and set up as PVs as if they were different disks. e.g.

$ grep sd /proc/partitions
   8     0 1843200000 sda
   8     1     104391 sda1
   8     2 1843089255 sda2
   8    16 1843200000 sdb
   8    17 1843193646 sdb1
   8    32  705822720 sdc
   8    33  705815743 sdc1

The actual numbers for the 'disk' slice sizes don't really matter, but they happened to be based on what Dell suggested for a different box we have with a PERC/6i controller (it wouldn't fit as two '<2TB' devices, so we needed at least three...)

$ pvscan
  PV /dev/sda2   VG TempLobeSys00   lvm2 [1.72 TB / 0    free]
  PV /dev/sdb1   VG TempLobeSys00   lvm2 [1.72 TB / 78.00 GB free]
  PV /dev/sdc1   VG TempLobeSys00   lvm2 [673.09 GB / 0    free]
  Total: 3 [4.09 TB] / in use: 3 [4.09 TB] / in no VG: 0 [0   ]

$ vgdisplay
  --- Volume group ---
  VG Name               TempLobeSys00
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                7
  Open LV               7
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               4.09 TB
  PE Size               32.00 MB
  Total PE              134034
  Alloc PE / Size       131538 / 4.01 TB
  Free  PE / Size       2496 / 78.00 GB
  VG UUID               h393E0-mHmv-s4gf-Py2x-mJuU-OWUc-VoWtuP

$ df -hlP
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/TempLobeSys00-root  2.0G  644M  1.3G  35% /
/dev/mapper/TempLobeSys00-scratch   50G  9.2G   38G  20% /local
/dev/mapper/TempLobeSys00-var  3.9G  1.5G  2.2G  41% /var
/dev/mapper/TempLobeSys00-tmp  9.7G  152M  9.1G   2% /tmp
/dev/mapper/TempLobeSys00-usr   12G  5.3G  5.6G  49% /usr
/dev/sda1              99M   25M   70M  26% /boot
tmpfs                  12G     0   12G   0% /dev/shm
/dev/mapper/TempLobeSys00-tardis  4.0T  9.3G  3.9T   1% /local/tardis

This box happens to use XFS for the larger fs, but we also had it working with ext3 - though I realize this isn't as big as your 12TB example.

Re-configuring your RAID controllers to export as <2TB slices isn't fun, but it should be possible without a re-install (if a bit fiddly).
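And once everything is pooled in one VG, the grow-as-you-go plan you describe falls out naturally; a sketch, again with hypothetical names (VG 'bigvg', an XFS LV mounted at /export):

```shell
# When more space is needed: add a new slice as a PV,
# extend the VG, the LV, and finally the filesystem.
pvcreate /dev/sdd1
vgextend bigvg /dev/sdd1
lvextend -L +1T /dev/bigvg/data
xfs_growfs /export     # XFS grows online, while mounted
```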

--
/--------------------------------------------------------------------\
| "Computers are different from telephones.  Computers do not ring." |
|       -- A. Tanenbaum, "Computer Networks", p. 32                  |
|--------------------------------------------------------------------|
| Jon Peatfield, _Computer_ Officer, DAMTP,  University of Cambridge |
| Mail:  [email protected]     Web:  http://www.damtp.cam.ac.uk/ |
\--------------------------------------------------------------------/
