http://www.slideshare.net/mclee/beyond-the-file-system-designing-large-scale-file-storage-and-serving
Beyond the File System - Designing Large
Scale File Storage and Serving - Presentation Transcript
- Beyond the File System Designing Large Scale File Storage and
Serving Cal Henderson
- Hello!
Web Builder 2.0 2
- Big file systems? • Too vague! • What is a file system? • What
constitutes big? • Some requirements would be nice
- 1. Scalable: Looking at storage and serving infrastructures
- 2. Reliable: Looking at redundancy, failure rates, on the fly
changes
- 3. Cheap: Looking at upfront costs, TCO and lifetimes
- Four buckets • Storage • Serving • BCP • Cost
- Storage
- The storage stack • File protocol: NFS, CIFS, SMB • File system: ext,
reiserFS, NTFS • Block protocol: SCSI, SATA, FC • RAID: Mirrors, Stripes
• Hardware: Disks and stuff
- Hardware overview • The storage scale, from lower to higher: Internal,
DAS, SAN, NAS
- Internal storage • A disk in a computer – SCSI, IDE, SATA • 4
disks in 1U is common • 8 for half depth boxes
- DAS • Direct attached storage • Disk shelf, connected by SCSI/SATA •
HP MSA30 – 14 disks in 3U
- SAN • Storage Area Network • Dumb disk shelves • Clients connect
via a ‘fabric’ • Fibre Channel, iSCSI, Infiniband – Low level protocols
- NAS • Network Attached Storage • Intelligent disk shelf • Clients
connect via a network • NFS, SMB, CIFS – High level protocols
- Of course, it’s more confusing than that
- Meet the LUN • Logical Unit Number • A slice of storage space •
Originally for addressing a single drive: – c1t2d3 – Controller,
Target, Disk (Slice) • Now means a virtual partition/volume – LVM,
Logical Volume Management
- NAS vs SAN • With SAN, a single host (initiator) owns a single
LUN/volume • With NAS, multiple hosts own a single LUN/volume • NAS head –
NAS access to a SAN
- SAN Advantages Virtualization within a SAN offers some nice
features: • Real-time LUN replication • Transparent backup • SAN
booting for host replacement
- Some Practical Examples • There are a lot of vendors •
Configurations vary • Prices vary wildly • Let’s look at a couple –
Ones I happen to have experience with – Not an endorsement ;)
- NetApp Filers • Heads and shelves, up to 500TB in 260U • FC SAN with
1 or 2 NAS heads
- Isilon IQ • 2U Nodes, 3-96 nodes/cluster, 6-600 TB •
FC/InfiniBand SAN with NAS head on each node
- Scaling: Vertical vs Horizontal
- Vertical scaling • Get a bigger box • Bigger disk(s) • More disks
• Limited by current tech – size of each disk and total number in
appliance
- Horizontal scaling • Buy more boxes • Add more servers/appliances
• Scales forever* *sort of
- Storage scaling approaches • Four common models: • Huge FS •
Physical nodes • Virtual nodes • Chunked space
- Huge FS • Create one giant volume with growing space – Sun’s ZFS
– Isilon IQ • Expandable on-the-fly? • Upper limits – Always limited
somewhere
- Huge FS • Pluses – Simple from the application side – Logically
simple – Low administrative overhead • Minuses – All your eggs in one
basket – Hard to expand – Has an upper limit
- Physical nodes • Application handles distribution to multiple
physical nodes – Disks, Boxes, Appliances, whatever • One ‘volume’ per
node • Each node acts by itself • Expandable on-the-fly – add more
nodes • Scales forever
- Physical Nodes • Pluses – Limitless expansion – Easy to expand –
Unlikely to all fail at once • Minuses – Many ‘mounts’ to manage – More
administration
- Virtual nodes • Application handles distribution to multiple
virtual volumes, contained on multiple physical nodes • Multiple
volumes per node • Flexible • Expandable on-the-fly – add more nodes •
Scales forever
- Virtual Nodes • Pluses – Limitless expansion – Easy to expand –
Unlikely to all fail at once – Addressing is logical, not physical –
Flexible volume sizing, consolidation • Minuses – Many ‘mounts’ to
manage – More administration
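
The "logical, not physical" addressing of virtual nodes can be sketched in a few lines. This is only an illustration, assuming a hypothetical lookup table from virtual volume number to physical host; the host names and the path layout are made up and are not taken from the talk.

```python
# Hypothetical mapping of virtual volumes to physical nodes (names are made up).
VOLUME_TO_NODE = {
    1: "storage-a",
    2: "storage-a",
    3: "storage-b",
    4: "storage-b",
}


def locate(volume: int, filename: str) -> str:
    """Return the physical location of a file stored on a virtual volume."""
    node = VOLUME_TO_NODE[volume]
    return f"{node}:/vol{volume}/{filename}"


def move_volume(volume: int, new_node: str) -> None:
    """Re-home a virtual volume; volume numbers stored by the app stay valid."""
    VOLUME_TO_NODE[volume] = new_node


before = locate(3, "photo.jpg")  # volume 3 currently lives on storage-b
move_volume(3, "storage-c")      # e.g. after adding a new box
after = locate(3, "photo.jpg")   # same volume number, new physical home
```

Because the application only ever stores the volume number, moving or consolidating a volume changes one table entry rather than every file record.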
- Chunked space • Storage layer writes parts of files to different
physical nodes • A higher-level RAID striping • High performance for
large files – read multiple parts simultaneously
- Chunked space • Pluses – High performance – Limitless size •
Minuses – Conceptually complex – Can be hard to expand on the fly –
Can’t manually poke it
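
A rough sketch of chunked space, assuming fixed-size chunks placed round-robin over a list of nodes. The function names, node names and placement rule are illustrative assumptions; real systems choose placement in more elaborate ways.

```python
def stripe(data: bytes, chunk_size: int, nodes: list[str]) -> list[tuple[str, bytes]]:
    """Split data into fixed-size chunks and assign them round-robin to nodes."""
    placements = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        node = nodes[index % len(nodes)]
        placements.append((node, data[start:start + chunk_size]))
    return placements


def reassemble(placements: list[tuple[str, bytes]]) -> bytes:
    """Concatenate chunks in order; each read could come from a different node."""
    return b"".join(chunk for _node, chunk in placements)


placements = stripe(b"abcdefghij", chunk_size=4, nodes=["node-1", "node-2"])
restored = reassemble(placements)
```

Since consecutive chunks sit on different nodes, a reader can fetch several parts of a large file at the same time, which is where the performance gain comes from.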
- Real Life Case Studies
- GFS – Google File System • Developed by … Google • Proprietary •
Everything we know about it is based on talks they’ve given • Designed
to store huge files for fast access
- GFS – Google File System • Single ‘Master’ node holds metadata –
SPOF (single point of failure) – Shadow master allows warm swap • Grid
of ‘chunkservers’ – 64bit filenames – 64 MB file chunks
- GFS – Google File System [Diagram: Master; chunk replicas 1(a), 2(a), 1(b)]
- GFS – Google File System • Client reads metadata from master then
file parts from multiple chunkservers • Designed for big files
(>100MB) • Master server allocates access leases • Replication is
automatic and self repairing – Synchronously for atomicity
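
With fixed 64 MB chunks, a client can work out which chunk holds a given byte offset before asking the master where that chunk lives. The snippet below shows only that arithmetic; the function name is hypothetical and this is not Google's client code.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted on the slide


def chunk_for_offset(offset: int) -> tuple[int, int]:
    """Return (chunk index, offset within that chunk) for a byte offset."""
    return divmod(offset, CHUNK_SIZE)


# Byte 200 MB into a file falls 8 MB into the fourth chunk (index 3).
index, within = chunk_for_offset(200 * 1024 * 1024)
```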
- GFS – Google File System • Reading is fast (parallelizable) – But
requires a lease • Master server is required for all reads and writes
- MogileFS – OMG Files • Developed by Danga / SixApart • Open
source • Designed for scalable web app storage
- MogileFS – OMG Files • Single metadata store (MySQL) – MySQL
Cluster avoids SPOF • Multiple ‘tracker’ nodes locate files • Multiple
‘storage’ nodes store files
- MogileFS – OMG Files [Diagram: Tracker, MySQL, Tracker]
- MogileFS – OMG Files • Replication of file ‘classes’ happens
transparently • Storage nodes are not mirrored – replication is
piecemeal • Reading and writing go through trackers, but are performed
directly upon storage nodes
- Flickr File System • Developed by Flickr • Proprietary • Designed
for very large scalable web app storage
- Flickr File System • No metadata store – Deal with it yourself •
Multiple ‘StorageMaster’ nodes • Multiple storage nodes with virtual
volumes
- Flickr File System [Diagram: three StorageMaster (SM) nodes]
- Flickr File System • Metadata stored by app – Just a virtual
volume number – App chooses a path • Virtual nodes are mirrored –
Locally and remotely • Reading is done directly from nodes
- Flickr File System • StorageMaster nodes only used for write
operations • Reading and writing can scale separately
- Serving
- Serving files • Serving files is easy! [Diagram: Disk, Apache]
- Serving files • Scaling is harder [Diagram: three Disk and Apache
pairs]
- Serving files • This doesn’t scale well • Primary storage is
expensive – And takes a lot of space • In many systems, we only access
a small number of files most of the time
- Caching • Insert caches between the storage and serving nodes •
Cache frequently accessed content to reduce reads on the storage nodes
• Software (Squid, mod_cache) • Hardware (Netcache, Cacheflow)
- Why it works • Keep a smaller working set • Use faster hardware –
Lots of RAM – SCSI – Outer edge of disks (ZCAV) • Use more duplicates –
Cheaper, since they’re smaller
- Two models • Layer 4 – ‘Simple’ balanced cache – Objects in
multiple caches – Good for few objects requested many times • Layer 7 –
URL balanced cache – Objects in a single cache – Good for many objects
requested a few times
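
The difference between the two models comes down to whether the balancer looks at the URL. A minimal sketch, assuming three hypothetical cache hosts, round-robin for layer 4 and a URL hash for layer 7; real load balancers offer other strategies too.

```python
import hashlib
from itertools import count

CACHES = ["cache-1", "cache-2", "cache-3"]  # hypothetical cache hosts
_counter = count()


def pick_layer4() -> str:
    """Round-robin: ignores the URL, so any object can land in any cache."""
    return CACHES[next(_counter) % len(CACHES)]


def pick_layer7(url: str) -> str:
    """Hash the URL so a given object is always sent to the same cache."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return CACHES[int.from_bytes(digest[:4], "big") % len(CACHES)]


layer4_picks = [pick_layer4() for _ in range(4)]
layer7_picks = [pick_layer7("/photos/1.jpg") for _ in range(4)]
```

In `layer4_picks` the same object would be fetched and stored by every cache in turn, while `layer7_picks` names a single cache four times, so each object occupies space in only one place.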
- Replacement policies • LRU – Least recently used • GDSF – Greedy
dual size frequency • LFUDA – Least frequently used with dynamic aging
• All have advantages and disadvantages • Performance varies greatly
with each
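
LRU is the simplest of these policies to show in code. The class below is a sketch that counts objects rather than bytes, which is a simplifying assumption; production caches usually evict by total size.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache bounded by number of objects."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("a.jpg", b"A")
cache.put("b.jpg", b"B")
cache.get("a.jpg")        # touch a.jpg so b.jpg becomes the oldest entry
cache.put("c.jpg", b"C")  # over capacity: b.jpg is evicted
```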
- Cache Churn • How long do objects typically stay in cache? • If
it gets too short, we’re doing badly – But it depends on your traffic
profile • Make the cached object store larger
- Problems • Caching has some problems: – Invalidation is hard –
Replacement is dumb (even LFUDA) • Avoiding caching makes your life
(somewhat) easier
- CDN – Content Delivery Network • Akamai, Savvis, Mirror Image
Internet, etc • Caches operated by other people – Already in-place – In
lots of places • GSLB/DNS balancing
- Edge networks [Diagram: Origin]
- Edge networks [Diagram: Origin and eight Caches]
- CDN Models • Simple model – You push content to them, they serve
it • Reverse proxy model – You publish content on an origin, they proxy
and cache it
- CDN Invalidation • You don’t control the caches – Just like those
awful ISP ones • Once something is cached by a CDN, assume it can never
change – Nothing can be deleted – Nothing can be modified
- Versioning • When you start to cache things, you need to care
about versioning – Invalidation & Expiry – Naming & Sync
- Cache Invalidation • If you control the caches, invalidation is
possible • But remember ISP and client caches • Remove deleted content
explicitly – Avoid users finding old content – Save cache space
- Cache versioning • Simple rule of thumb: – If an item is
modified, change its name (URL) • This can be independent of the file
system!
- Virtual versioning • Database indicates version 3 of file • Web app
writes version number into URL (example.com/foo_3.jpg) • Request comes
through cache and is cached with the versioned URL (Cached: foo_3.jpg)
• mod_rewrite converts versioned URL to path (foo_3.jpg -> foo.jpg)
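
The rewrite step can be mimicked outside Apache to show what it does. This Python sketch stands in for the mod_rewrite rule; the `name_version.ext` pattern is inferred from the foo_3.jpg example, and the function names are hypothetical.

```python
import re

# Matches names like "foo_3.jpg": base name, version number, extension.
VERSIONED = re.compile(r"^(?P<base>.+)_(?P<version>\d+)\.(?P<ext>[A-Za-z0-9]+)$")


def versioned_url(base: str, ext: str, version: int) -> str:
    """Build the public name the web app writes into pages."""
    return f"{base}_{version}.{ext}"


def to_storage_path(name: str) -> str:
    """Strip the version so every version maps to the same stored file."""
    match = VERSIONED.match(name)
    if match is None:
        return name  # not a versioned name; pass through unchanged
    return f"{match.group('base')}.{match.group('ext')}"


public_name = versioned_url("foo", "jpg", 3)
stored_name = to_storage_path(public_name)
```

Bumping the version in the database changes the URL that caches see, so stale copies are simply never requested again, while the file on disk keeps its original name.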
- Authentication • Authentication inline layer – Apache / perlbal •
Authentication sideline – ICP (CARP/HTCP) • Authentication by URL –
FlickrFS
- Auth layer • Authenticator sits between client and storage •
Typically built into the cache software [Diagram: Authenticator, Cache,
Origin]
- Auth sideline • Authenticator sits beside the cache • Lightweight
protocol used for authenticator [Diagram: Cache, Origin, Authenticator]
- Auth by URL • Someone else performs authentication and gives URLs to
client (typically the web app) • URLs hold the ‘keys’ for accessing
files [Diagram: Web Server, Cache, Origin]
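
One common way to make a URL carry its own key is to sign it. The talk does not say how FlickrFS does this, so the following is a generic HMAC sketch under assumed names: the secret, the query parameter names and the expiry rule are all illustrative.

```python
import hashlib
import hmac

SECRET = b"example-secret"  # hypothetical secret shared by web app and file server


def sign(path: str, expires: int) -> str:
    """Compute a signature over the path and its expiry time."""
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_url(path: str, expires: int) -> str:
    """URL the web app hands to an authenticated client."""
    return f"{path}?expires={expires}&sig={sign(path, expires)}"


def is_valid(path: str, expires: int, sig: str, now: int) -> bool:
    """Check performed by the serving layer, with no session lookup."""
    if now > expires:
        return False
    return hmac.compare_digest(sig, sign(path, expires))


sig = sign("/photos/1.jpg", expires=2000)
url = make_url("/photos/1.jpg", expires=2000)
```

The serving side only needs the shared secret to check a request, so caches and origin servers do not have to talk to the web app's user database.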
- BCP
- Business Continuity Planning • How can I deal with the
unexpected? – The core of BCP • Redundancy • Replication
- Reality • On a long enough timescale, anything that can fail,
will fail • Of course, everything can fail • True reliability comes
only through redundancy
- Reality • Define your own SLAs • How long can you afford to be
down? • How manual is the recovery process? • How far can you roll
back? • How many node x boxes can fail at once?
- Failure scenarios • Disk failure • Storage array failure •
Storage head failure • Fabric failure • Metadata node failure • Power
outage • Routing outage
- Reliable by design • RAID avoids disk failures, but not head or
fabric failures • Duplicated nodes avoid host and fabric failures, but
not routing or power failures • Dual-colo avoids routing and power
failures, but may need duplication too
- Tend to all points in the stack • Going dual-colo: great • Taking
a whole colo offline because of a single failed disk: bad • We need a
combination of these
- Recovery times • BCP is not just about continuing when things
fail • How can we restore after they come back? • Host and colo level
syncing – replication queuing • Host and colo level rebuilding
- Reliable Reads & Writes • Reliable reads are easy – 2 or more
copies of files • Reliable writes are harder – Write 2 copies at once –
But what do we do when we can’t write to one?
- Dual writes • Queue up data to be written – Where? – Needs itself
to be reliable • Queue up journal of changes – And then read data from
the disk whose write succeeded • Duplicate whole volume after failure –
Slow!
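
The "journal of changes" option can be sketched as follows. Everything here is an assumption made for illustration: `Volume` is an in-memory stand-in for a storage node, and the journal is a plain list, whereas in practice the journal itself has to be stored reliably.

```python
class Volume:
    """In-memory stand-in for a storage node; `online` simulates failure."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.files = {}

    def write(self, path: str, data: bytes) -> None:
        if not self.online:
            raise IOError(f"{self.name} is unavailable")
        self.files[path] = data


def dual_write(path, data, primary, mirror, journal) -> None:
    """Write to both copies; journal any write that reached only one of them."""
    succeeded, failed = [], []
    for volume in (primary, mirror):
        try:
            volume.write(path, data)
            succeeded.append(volume)
        except IOError:
            failed.append(volume)
    if not succeeded:
        raise IOError("write failed on both copies")
    for volume in failed:
        journal.append((volume.name, path, succeeded[0].name))


def replay(journal, volumes) -> None:
    """Copy journaled files from the copy that succeeded to the one that missed."""
    remaining = []
    for target, path, source in journal:
        try:
            volumes[target].write(path, volumes[source].files[path])
        except IOError:
            remaining.append((target, path, source))  # still down; retry later
    journal[:] = remaining


a, b = Volume("a"), Volume("b")
journal = []
b.online = False
dual_write("/1.jpg", b"data", a, b, journal)  # only "a" receives the file
pending = list(journal)
b.online = True
replay(journal, {"a": a, "b": b})             # "b" catches up from "a"
```

Replaying a journal touches only the files that changed during the outage, which is why it is much faster than duplicating the whole volume.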
- Cost
- Judging cost • Per GB? • Per GB upfront and per year • Not as
simple as you’d hope – How about an example
- Hardware costs • Single cost = Cost of hardware / Usable GB
- Power costs • Recurring cost = Cost of power per year / Usable GB
- Power costs • Single cost = Power installation cost / Usable GB
- Space costs • Recurring cost = [Cost per U x U’s needed (inc
network)] / Usable GB
- Network costs • Single cost = Cost of network gear / Usable GB
- Misc costs • Single & recurring costs = [Support contracts + spare
disks + bus adaptors + cables] / Usable GB
- Human costs • Recurring cost = [Admin cost per node x Node count] /
Usable GB
- TCO • Total cost of ownership in two parts – Upfront – Ongoing •
Architecture plays a huge part in costing – Don’t get tied to hardware
– Allow heterogeneity – Move with the market
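
The per-GB cost formulas from the preceding slides can be combined into a small calculator. All figures below are invented purely to show the arithmetic, the field names are hypothetical, and misc costs (support contracts, spares, cables) are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class StorageCosts:
    usable_gb: float
    hardware: float                 # single cost
    power_install: float            # single cost
    network_gear: float             # single cost
    power_per_year: float           # recurring cost
    cost_per_u_per_year: float      # recurring cost per rack unit
    rack_units: int                 # U's needed, including network gear
    admin_per_node_per_year: float  # recurring cost
    node_count: int

    def upfront_per_gb(self) -> float:
        total = self.hardware + self.power_install + self.network_gear
        return total / self.usable_gb

    def recurring_per_gb(self) -> float:
        yearly = (
            self.power_per_year
            + self.cost_per_u_per_year * self.rack_units
            + self.admin_per_node_per_year * self.node_count
        )
        return yearly / self.usable_gb

    def tco_per_gb(self, years: int) -> float:
        return self.upfront_per_gb() + self.recurring_per_gb() * years


# Made-up example: 10 TB usable across 4 nodes in 10U.
costs = StorageCosts(
    usable_gb=10_000, hardware=20_000, power_install=1_000, network_gear=4_000,
    power_per_year=2_000, cost_per_u_per_year=100, rack_units=10,
    admin_per_node_per_year=500, node_count=4,
)
upfront = costs.upfront_per_gb()    # 25,000 / 10,000
recurring = costs.recurring_per_gb()  # 5,000 / 10,000 per year
three_year = costs.tco_per_gb(years=3)
```

Splitting the result into an upfront figure and a per-year figure makes it easier to compare architectures with different lifetimes.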
- (fin)
- Photo credits • flickr.com/photos/ebright/260823954/ •
flickr.com/photos/thomashawk/243477905/ •
flickr.com/photos/tom-carden/116315962/ •
flickr.com/photos/sillydog/287354869/ •
flickr.com/photos/foreversouls/131972916/ •
flickr.com/photos/julianb/324897/ •
flickr.com/photos/primejunta/140957047/ •
flickr.com/photos/whatknot/28973703/ •
flickr.com/photos/dcjohn/85504455/
- You can find these slides online: iamcal.com/talks/