Hi Adam,

On Fri, Sep 27, 2013 at 08:58:10AM +0100, Dr A. J. Trickett wrote:
> I've pretty much decided to get a flash drive as the root file system, my 
> preferred "bidder" are currently building with Intel 335 drives. I'm not sure 
> exactly what combination and mix to go for.
> 
> I don't think the 180 GB drive is large enough on it's own, so I could get a 
> pair of them and then LVM them together and put a single ext4 over the two.

I know you say later in the thread that you back up all your
important stuff, but for me the lost productivity involved in a
disk failure is worth a lot more than the cost of the disk itself.

The problem with using LVM to concatenate two drives together for
more space is that you've doubled the chance of a failure. SSDs
aren't particularly more reliable than conventional HDDs, and the
HDD is usually the first thing to break.

    http://www.zdnet.com/ssd-infant-mortality-ii-7000003945/
    http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-9.html

It's all a few years old, and I've heard plenty of anecdotes from
people who've "never seen an SSD failure", but personally I am not
yet prepared to believe they are any more reliable. Nor any less.

When a physical volume in LVM disappears, the physical extents that
were on it are obviously no longer available. Depending on the LVM
allocation policy in use, some logical volumes would then be missing
parts of their data, or all of it.

The system won't even let you activate a volume group that has
physical volumes missing, although you can override this if you tell
it that you really know what you are doing. That would allow you to
still use the logical volumes that didn't have bits missing.
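
For what it's worth, the override is something like this (the volume
group name is just a placeholder):

    # Activate the VG even though a PV is missing; LVs with extents
    # on the missing PV stay incomplete, the rest become usable again
    vgchange -ay --partial myvg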

You're in a bit of an awkward situation here because you want:

- Tons of storage
- Performance
- Reliability

and you haven't got the cash for all three. There isn't going to be
a single correct answer, and there are a variety of trade-offs you
can make depending on what your priorities are.

I will try to think up what is bound to be an incomplete list.

My own preferred answer though would be something like this:

- Two SSDs in my desktop in a RAID-1, mass storage in a separate
  device with some sort of RAID configuration, and a decent backup
  regime.

  I don't consider that over the top in the home. HP Microservers
  are back on cashback offer and make great, fairly low-power
  networked storage devices. There are a bunch of other cheap
  dedicated NAS devices suitable for home and small office use as
  well.

  The advantage of having the mass storage in a separate device is
  that it's a lot easier to manage. If disks die you can replace
  them without downtime. If it's not performing well enough you can
  add disks. Even SSDs, when that starts to make sense.

  You'll probably upgrade your desktop machines a lot more often
  than the file server, because the file server doesn't need much
  grunt. No need to keep redesigning how the storage will work with
  each desktop upgrade.

But it's not for everyone: it's inevitably more complicated and
expensive.

So let's say the file server is a no-go. It's got to all be in the
desktop.

- Two SSDs and two HDDs in two separate RAID-1s

  Each SSD should be big enough for your OS and whatever other
  performance storage you feel you need. Potentially that could be
  quite small - your OS should easily fit in about 2G without you
  trying hard. I fit Debian wheezy on a 512M CF card in one of my
  devices without doing anything special, but admittedly it has only
  vi for an editor and I even removed the less command. ;-)

  Mirror the HDDs as well for redundancy (if using Linux software
  RAID consider RAID-10 for the HDDs - it works with only two
  devices and performs better than RAID-1. Stick to RAID-1 for the
  SSDs though because that RAID level supports TRIM/discard).
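
  A hedged sketch of the mdadm commands, assuming the SSDs turn up
  as sda/sdb and the HDDs as sdc/sdd (placeholder names, one big
  partition on each):

    # RAID-1 across the two SSD partitions (TRIM/discard passes through)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1

    # RAID-10 across the two HDD partitions (md RAID-10 is happy
    # with just two devices)
    mdadm --create /dev/md1 --level=10 --raid-devices=2 \
        /dev/sdc1 /dev/sdd1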

  If you can afford two small SSDs then I think you could try
  stretching to two SSDs plus two bigger HDDs, because HDDs are
  really cheap.

Can't afford two SSDs?

- One SSD, two HDDs

  There's really no excuse to not have two HDDs. Again put your OS
  and performance stuff on the SSD, put the "bulk" stuff on the HDD
  mirror.

  It's even more important than normal to have good backups.

If you have decided to have a desktop with some SSD storage and some
HDD storage in it, then regardless of whether they're mirrored or
not, it's now a case of working out where to put the data and how to
make the smaller SSD storage speed up the larger HDD storage.

Linux has a bunch of interesting options for caching slow storage
with faster storage.

- ZFS on Linux

  ZFS is going places in the Linux world. Ubuntu and Debian have
  packages for it now (though don't expect to be able to call up
  Mark Shuttleworth at 3am and ask him to assign some minions to fix
  it or anything, like). But at least there's no more downloading
  kernel source from a strange website and having to build it
  yourself.

  ZFS supports the concept of tiered storage. You build your storage
  pool with a tier of fast stuff like SSDs, and a tier of slow stuff
  like HDDs, and it does the right thing.

  I am not a ZFS expert but tiered storage works like this:

    ZFS has the concept of "cache" devices and "log" devices. It
    calls cache devices L2ARC and it calls log devices ZIL (ZFS
    Intent Log).

    If you tell it that a device is an L2ARC device then it will
    copy hot storage extents into that device and consult it first
    in future when wanting to access them. So this is a fast read
    cache. It only speeds up read operations.

    Since it's only used for read operations you don't need to
    mirror it. In fact you cannot mirror it; if you add more than
    one then ZFS stripes them. If one dies then it reads the data
    from disk again and puts it on the other L2ARC devices. If all
    L2ARC devices are dead then it gets the data from disk every
    time, which works but is slower.

    If you tell it that a device is a ZIL device then writes go to
    the ZIL first, and are later asynchronously written to the
    slower tier. Point being that the ZIL device is going to be
    really quick, so it can tell the OS that this write has
    definitely hit persistent storage and you can be on your way
    now.

    You really should mirror ZIL devices. Otherwise, if one goes pop
    then you lose all the data that was written to it but hadn't yet
    made it onto the slower media. This might only be a few
    megabytes, but that's enough to ruin your file system and your
    day. So, unlike L2ARC devices, you can (and should) mirror those.

  If you have multiple SSDs then you can partition each one into two
  bits, the larger partition being for L2ARC and the smaller one
  being for ZIL. You then end up with two L2ARC devices and one
  mirrored ZIL device.

  If you'd be willing to investigate this, you would actually end up
  with just a single pool of storage to manage, and it would be
  really simple.
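
  To give a flavour, and purely as a hedged sketch (pool and device
  names are placeholders - check the zpool man page before copying
  anything), it might look like:

    # Mirrored pool across the two HDDs
    zpool create tank mirror sdc sdd

    # Larger SSD partitions as L2ARC (read cache; striped, no
    # mirror needed)
    zpool add tank cache sda1 sdb1

    # Smaller SSD partitions as a mirrored ZIL (log) device
    zpool add tank log mirror sda2 sdb2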

  ZFS itself has many years of testing, but it's a comparative
  newcomer on Linux and integration with your favourite distribution
  may not be completely there. That means, for example, that you
  might encounter difficulties with installation, with upgrades, and
  in other areas where the operating system config needs to know
  about the partition layout. Probably nothing that can't be worked
  around by
  some time on the command line, but maybe you are not comfortable
  with that.

  Also there's unlikely to be much third party support for times of
  trouble, just community support (from the ZFS on Linux community).

- bcache http://bcache.evilpiepirate.org/
  flashcache https://github.com/facebook/flashcache/
  enhanceio https://github.com/stec-inc/EnhanceIO
  dm-cache https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/device-mapper/cache.txt

  These are Linux kernel answers to L2ARC and ZIL. They cache one
  set of devices with another set of devices. They are all pretty
  new. bcache is a Google project, Flashcache is a Facebook project,
  EnhanceIO is a fork of Flashcache that is trying to get more
  community support, dm-cache is more closely tied to the existing
  device-mapper and LVM projects.

  They have different levels of readiness. I think bcache got
  included in the upstream 3.10 kernel, and dm-cache has been there
  since 3.9. The others maybe not yet.
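
  If you did want to experiment with bcache, the setup is roughly
  this (a sketch only; device names are placeholders and the exact
  steps vary by version, so check the bcache docs):

    # SSD partition as the cache device, HDD partition as the
    # backing device, created and attached in one go (bcache-tools)
    make-bcache -C /dev/sda1 -B /dev/sdb1

    # The cached device then appears as /dev/bcache0; format and
    # mount it as normal
    mkfs.ext4 /dev/bcache0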

  In any case I personally would regard these as highly experimental
  and wouldn't use them unless I could stand to see them break
  unexpectedly and lose all the data at any time. Since it's a layer
  in front of *all your storage* the stakes are kind of high.

  You can take a slightly safer (and slower) path with all of them
  by using them as a read cache only (like L2ARC), but software bugs
  could still mean corruption being written back.

- mdadm write-mostly http://linux.die.net/man/4/md (search for
  "write-mostly")

  "write-mostly" is a feature of Linux software RAID-1 where you
  tell it that some devices should only be read from when there's no
  other option. Writes still go to both.

  Let's say that you could not afford two SSDs. You have just one
  SSD and a pair of HDDs. Example:

  /dev/sda - 180G SSD
  /dev/sdb - 2T HDD
  /dev/sdc - 2T HDD

  Partition sdb and sdc into two bits, one of 180G and the other the
  rest. Make a RAID-10 of sdb1 and sdc1, call it md0. Make a RAID-1
  on top of *that* of sda and md0, call it md1. Tell it that its
  component md0 is write-mostly. Make a RAID-10 of sdb2 and sdc2,
  call it md2. You end up with this:

  /dev/md1 -    180G RAID-1
  /dev/md2 - ~1,820G RAID-10
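
  As a rough sketch (using the layout above; as ever, double-check
  device names before running anything):

    # RAID-10 of the two 180G HDD partitions
    mdadm --create /dev/md0 --level=10 --raid-devices=2 \
        /dev/sdb1 /dev/sdc1

    # RAID-1 of the SSD and md0, with md0 marked write-mostly so
    # that reads are served from the SSD wherever possible
    mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/sda --write-mostly /dev/md0

    # RAID-10 of the remaining HDD space for bulk storage
    mdadm --create /dev/md2 --level=10 --raid-devices=2 \
        /dev/sdb2 /dev/sdc2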

  In theory you have all the speedy read advantages of an SSD, but
  you've mirrored it onto an HDD so you can continue working even if
  your SSD goes pop. It is of course a lot more complicated.

  People report some success with this strategy:

  http://marc.info/?l=linux-raid&m=126496930530289&w=2

  Of course this only speeds up reads; writes still have to hit the
  HDDs. You may want to reserve some SSD space outside of the md
  array to use for guaranteed fast write space, accepting that it
  won't be redundant in the face of failure.

- LVM

  LVM's great when you're not entirely sure what your needs will be
  or if they change often.

  I earlier cautioned against using LVM to concatenate a bunch of
  drives without redundancy, but you can use it in other ways.

  Once you've ended up with a block device that is fast and a block
  device that is slow, you can use them both as LVM physical volumes
  and put them both into the same volume group.

  You then create a logical volume for each kind of data you store,
  e.g. one for VM images, one for photos, and so on.

  You can specify the allocation policy of each LV in order to tell
  it where to place the physical extents - you make sure that the
  extents you need to perform well are placed on the fast PV. Best of
  all, if you get it wrong or change your mind or your needs change,
  you can move the extents on a live system without unmounting
  anything.
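
  A sketch of how that might look (names and sizes are only
  examples):

    # Both the fast and the slow device become PVs in one VG
    pvcreate /dev/md1 /dev/md2
    vgcreate vg0 /dev/md1 /dev/md2

    # VM images on the fast PV, photos on the slow one
    lvcreate -n vmimages -L 50G vg0 /dev/md1
    lvcreate -n photos -L 500G vg0 /dev/md2

    # Changed your mind? Move an LV's extents to the other PV, live
    pvmove -n vmimages /dev/md1 /dev/md2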

  If you only had the one SSD, you could combine this with the
  "write-mostly" approach above. In that example, md1 would be the
  fast PV and md2 the slow one.

So in summary, given the constraint of having to fit all the storage
into the desktop, this is how I personally would approach this:

  I'd buy two SSDs and two HDDs. If I couldn't afford that, I'd buy
  one SSD and two HDDs.

  If I felt brave enough for ZFS then I'd do that, because it makes
  tiered storage fairly simple.

  Otherwise I'd go the route of LVM in order to balance the
  differing needs of the data across the different types of storage,
  whilst still retaining redundancy. I'd use md "write-mostly"
  underneath the LVM if I only had one SSD.

  Every Linux distribution should have an installer that supports
  software RAID (might have to use "alternate" ISO on Ubuntu), so
  just do that and set the write-mostly afterwards, once it's
  booted, if necessary.
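
  Setting write-mostly after the event can be done via sysfs; a
  hedged sketch, assuming md1 with member md0 as in the earlier
  example:

    # Mark the md0 member of md1 as write-mostly
    echo writemostly > /sys/block/md1/md/dev-md0/state

  (Or remove and re-add the member with mdadm's --write-mostly
  flag.)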

  No matter what scheme I ended up with I'd probably make some
  effort to keep /boot outside of all the complicated stuff,
  because:

  - that really simplifies booting
  - /boot is quite small anyway (512M should be ample)
  - Aside from kernel upgrades, /boot isn't read or written after
    boot, so it doesn't matter if it's on slow media - a RAID-1 of
    HDDs is fine for it.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting
