Have you looked at Swift, OpenStack's S3 equivalent?
http://swift.openstack.org/

-salman

On Sat, Feb 25, 2012 at 11:17 PM, David Barrett <[email protected]> wrote:

> As luck would have it, I've been dealt a problem that finally calls for an
> honest-to-god distributed file system.  Basically, my company is in
> talks with a very security-conscious vendor that can't allow any of
> their data to touch Amazon S3, or indeed any other CDN.  This means
> that we have the fun task of building out a private CDN to store a
> whole bunch of ~1MB files, starting slow and quickly ramping up into
> the many-dozens-of-Terabytes range.  (So, big enough to require a bit
> of work, but not so big as to be silly to attempt.)  Based on the
> current (and ever-changing) state of the art, what do you advise?
>
> To give a bit more detail:
> - Files will be uploaded to us from another datacenter, so write speed
> needs to be ok, but it's not like we're generating files locally at
> some insane rate.
> - This isn't a super latency-sensitive problem (just hosting image
> thumbnails), but read performance needs to support interactive
> browsing.
> - Availability needs to be in the 99.9% range inside the rack, though
> the real availability problem is always at the network level.
> - Each file will generally be accessed at most once (they're images
> and thumbnails, and we'll set client-side caching very high).
> - The access pattern will be almost uniformly random (e.g., there won't
> be some power-law curve where a few files dominate).
> - We have 5 clusters (physical racks) available in 3 different
> datacenters, though we aren't obligated to use them all for this
> project.
> - The racks will have an IPsec LAN-to-LAN tunnel, likely with all
> storage boxes on the same VLAN.
> - Files will be named according to their sha1 hashes (quick sketch below).
>
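> To make that last point concrete, here's roughly what I mean by naming
> files after their hashes, as a throwaway Python sketch:
>
> import hashlib
>
> def sha1_name(path):
>     """Return the SHA-1 hex digest of a file's contents, read in chunks."""
>     h = hashlib.sha1()
>     with open(path, "rb") as f:
>         for chunk in iter(lambda: f.read(1 << 20), b""):
>             h.update(chunk)
>     return h.hexdigest()
>
> # A thumbnail uploaded to us would then be stored and served under
> # that hex digest; the name is derived purely from the bytes.
>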
> Thoughts?  Off the cuff I'd think an easy place to start is to buy a
> hot-swappable enclosure of consumer drives with a software RAID
> controller, make a few huge ext4 volumes, and then just
> fully-replicate all content between two racks in different datacenters
> using rsync or even Unison.  But my fear is that this'll break down
> after some number of files (we'll quickly get into millions-of-files
> territory).  Accordingly, I assume that distributed filesystems are
> designed to handle this much better.  I've read a bit on Tahoe, but
> it's daunting and I'm not sure where to start.  Zooko, any tips?
>
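> For the naive full-replication idea, something like the following is
> what I'm picturing, assuming a made-up peer hostname (rack-b.example)
> and content root (/srv/cdn/), with rsync running over the IPsec tunnel:
>
> import subprocess
>
> # Mirror the local content root to the peer rack.  -a preserves the
> # tree as-is and --partial keeps interrupted ~1MB transfers resumable.
> # The hostname and paths here are placeholders.
> def mirror_to_peer(src="/srv/cdn/", peer="rack-b.example",
>                    dest="/srv/cdn/"):
>     subprocess.run(["rsync", "-a", "--partial", src, peer + ":" + dest],
>                    check=True)
>
> if __name__ == "__main__":
>     mirror_to_peer()
>
> (The millions-of-files worry is basically that rsync has to walk the
> entire tree on every pass.)
>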
> Furthermore, while full replication between two racks is certainly
> doable at an acceptable cost, it ignores the performance and
> availability opportunities of that third datacenter.  Ideally I'd host
> something like 2/3 of the data in each datacenter, such that we get
> full 2x replication, without the extraneous cost of 3x replication (or
> 5x if we use all racks).  But my fear here is there will be a "read
> penalty" for 1/3 of all reads, as each datacenter won't be able to
> serve all its requests from local storage.  (Unless we start to get
> really tricky with DNS, perhaps having a subdomain per leading hex
> digit ("0" through "9" and "a" through "f"), configuring each
> datacenter to host precisely 2/3 of the data but a different 2/3 for
> each, and then using DNS so that requests always go to a datacenter
> that has the content...)
>
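> To make the 2/3 idea concrete, here's a toy placement function in
> Python; the datacenter labels are made up, and in practice the mapping
> would presumably live in the DNS configuration rather than in
> application code:
>
> # Each file lives in exactly 2 of the 3 datacenters, and each
> # datacenter ends up holding roughly 2/3 of the files.
> DATACENTERS = ["dc-a", "dc-b", "dc-c"]  # placeholder names
>
> def replicas(sha1_hex):
>     """Return the two datacenters that store the file with this name."""
>     # Bucket by the leading hex digit, the same key the DNS trick uses.
>     skipped = int(sha1_hex[0], 16) % len(DATACENTERS)
>     return [dc for i, dc in enumerate(DATACENTERS) if i != skipped]
>
> def pick_read_site(sha1_hex, nearby):
>     """Serve from the nearby datacenter if it has the file, else fall back."""
>     sites = replicas(sha1_hex)
>     return nearby if nearby in sites else sites[0]
>
> The read penalty I'm worried about is then just the fraction of
> requests where pick_read_site can't return the nearby datacenter.
>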
> Anyway, I'm not the sysadmin on this project, so somebody else with a
> lot more knowledge than me will actually implement this.  But I'd
> welcome some pointers.  Thanks!
>
> -david
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
