Re: diverting riak as a filesystem replacement

Jeremiah Peschka Sun, 25 Sep 2011 13:29:47 -0700

Responses inline
---
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP

On Sep 25, 2011, at 5:30 AM, pille wrote:

> hi,
> 
> i'm quite new to riak and only know it from the docs available online.
> to be honest, i did not search for a key/value store, but for a reliable (HA) 
> distributed, replicated filesystem that allows dynamic growth.

To be honest, what you're looking for is a SAN. EMC's Isilon line, Dell's 
EqualLogic, and HP's LeftHand devices all meet your needs very well. They don't 
require a lot of administrative knowledge, they're easy to set up and maintain, 
and they are very easy to expand. SANs provide the features and functionality 
that you're looking for and won't require any additional development or 
maintenance. Yes, they cost money, but they do just sorta work straight out of 
the box.

That being said, I answered the rest of these questions as if you weren't 
willing to just throw a bucket of money and SAN gear at your problem.

> 
> all these filesystems i've dealt with are either immature, abandoned, or are 
> limited in features like dynamic scaling, snapshotting or fail in 
> out-of-diskspace scenarios (as they don't give you high availability and data 
> protection at the same time).
> 
> somehow i stumbled upon this project and liked its features, despite not 
> being a filesystem at all. i can live with its flat structure if it'll bring 
> me all the other features i need.
> 
> so i'm now at the point where reading the online docs without any 
> hands-on experience leaves some questions unanswered.
> since i'm used to storing all data in a filesystem, our application's storage 
> interface would need a complete rewrite to interface with riak and provide 
> the same services as before. therefore i'd like to ask you to share your 
> knowledge and experience.
> 
> 1) are snapshots provided?
>   i guess they aren't, but i'm more interested whether i can use the 
> vectorclocks for that.
>   i only need one snapshot and live data to provide a consistent old view of 
> the data for our staging instance.

Snapshots are not provided. You could probably cook something up yourself, but 
there's no snapshotting involved that I know of. Vector clocks are used for 
determining object lineage and conflict resolution.

> 
> 2) how does riak deal with different storage capacities of the different 
> nodes? is it a problem, if some nodes provide less space than others? is data 
> distributed uniformly across all nodes or is its capacity taken into account?

AFAIK, data is distributed evenly across a number of virtual nodes (64 by 
default). Those virtual nodes are then distributed evenly across your physical 
nodes. I don't know of a way to change this, but I've been very wrong before.
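
To make the "evenly distributed" point concrete, here is a deliberately 
simplified sketch. It hands partitions out round-robin, which is an assumption 
made for illustration and not Riak's actual claim algorithm; the node names are 
made up. The point is only that disk capacity never enters the calculation.

```python
def assign_partitions(num_partitions, nodes):
    """Hand out ring partitions (virtual nodes) to physical nodes round-robin.

    Simplified illustration only: disk capacity is not an input at all.
    """
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def partitions_per_node(assignment):
    """Count how many partitions each physical node ended up with."""
    counts = {}
    for node in assignment.values():
        counts[node] = counts.get(node, 0) + 1
    return counts


# 64 partitions (the default mentioned above) over four hypothetical nodes.
ring = assign_partitions(64, ["node1", "node2", "node3", "node4"])
counts = partitions_per_node(ring)

if __name__ == "__main__":
    for node, n in sorted(counts.items()):
        print(node, n)
```

Every node gets the same share regardless of how big its disks are, so the 
smallest node is the one that fills up first.
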
> 
> 3) we've got quite huge files for a database to store. is that a problem? 
> what storage backend do you propose?
>   currently we see the following distribution, but i expect more in the range 
> from 512MB to 4GB to come in future:
>         <   1KB: 64053
>     1KB -   1MB: 873795
>     1MB -   2MB: 4776
>     2MB -   4MB: 3131
>     4MB -   8MB: 3136
>     8MB -  16MB: 2842
>    16MB -  32MB: 3136
>    32MB -  64MB: 4032
>    64MB - 128MB: 3118
>   128MB - 256MB: 3361
>   256MB - 512MB: 3221
>   512MB -   1GB: 1423
>     1GB -   2GB: 75

The largest object Riak KV handles with acceptable performance is about 64MB, 
and performance would probably start degrading before that. Luwak is an 
application built on top of Riak that probably meets your needs a lot better 
than plain old Riak KV: http://wiki.basho.com/Luwak.html

> 
> 4) is range access possible to read parts of a file^W value or do i need to 
> stream the whole file through? this would not perform well on huge values.

With Luwak it's possible to get a portion of the object using the optional 
Range parameter: http://wiki.basho.com/HTTP-Fetch-Luwak-Object.html
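
A minimal sketch of what a partial fetch could look like from a client, 
assuming the Range option is sent as a standard HTTP Range header. The host, 
port, /luwak path and key name below are assumptions for illustration; check 
the page linked above for the real details. The code only builds the request 
and does not send it.

```python
import urllib.request


def build_range_request(url, first_byte, last_byte):
    """Build a GET request asking for bytes first_byte..last_byte (inclusive)."""
    req = urllib.request.Request(url)
    req.add_header("Range", "bytes=%d-%d" % (first_byte, last_byte))
    return req


# Hypothetical node address and key; ask for the first 1KB of the value.
req = build_range_request("http://127.0.0.1:8098/luwak/bigfile", 0, 1023)

if __name__ == "__main__":
    print(req.get_method(), req.full_url, req.get_header("Range"))
    # To actually send it: urllib.request.urlopen(req).read()
```
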

> 
> 5) to reduce the impact of a disk failure on the storage backend i'd like 
> each disk of a server to be assigned to its own riak-node. i guess healing 
> the failed node after replacement is faster than raid recovery and less data 
> is at risk.
>   is it possible to reflect the hardware hierarchy in some way to influence 
> the place for replicas? CephFS offers this to make sure replicas are held on 
> different hardware or even in different locations.
>   e.g. a STORAGE is in a SERVER, which is in a RACK, which is in a 
> DATACENTER. replicas of a file in a STORAGE should never be placed inside the 
> same SERVER, (or RACK, or DATACENTER).

You can purchase Riak EDS which has multi-site replication. Otherwise, Riak is 
just going to throw data into N nodes in your cluster and it will be up to you 
to make sure those nodes are in different racks.

> 
> 6) what happens, if fewer than R or W nodes report data? does it mean not 
> found or not available? even if the data is on a currently offline node.

If fewer than R nodes respond, your read will fail. The R value means "this 
many nodes have to respond with data for it to be considered a successful 
read." Anything less than R would, thus, mean there was a failure.

If fewer than W nodes are able to write data, a hinted handoff will occur.
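
The read rule above boils down to a simple count. This is a toy sketch of that 
check only, with made-up reply values; it is not how Riak implements it and it 
ignores hinted handoff entirely.

```python
def read_succeeds(replies, r):
    """A read counts as successful when at least r replicas answered with data.

    replies holds one entry per replica asked; None means "no response".
    """
    answered = [reply for reply in replies if reply is not None]
    return len(answered) >= r


# Three replicas asked, R = 2 (example values, not a recommendation).
two_of_three = read_succeeds(["v1", "v1", None], 2)
one_of_three = read_succeeds(["v1", None, None], 2)

if __name__ == "__main__":
    print("2 of 3 replied:", two_of_three)
    print("1 of 3 replied:", one_of_three)
```
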

> 
> 7) can the client applications connect to some random node?
>   should it simply retry the next one in the list upon failure?

Client applications should connect to a random node, yes. Even better, you 
should put a load balancing proxy server in front of your Riak cluster so 
developers don't have to worry about writing their own load balancing code.

I'd retry on failure, but that's up to you. ;)
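
If you do end up without a proxy, the "pick a random node, try the next one on 
failure" idea can be sketched like this. The node names and the fetch callable 
are hypothetical stand-ins for whatever client library you use.

```python
import random


def fetch_with_retry(nodes, fetch, rng=random):
    """Try the nodes in random order and return the first successful result.

    fetch is any callable taking a node name; it should raise OSError
    (or a subclass such as ConnectionError) when the node is unreachable.
    """
    candidates = list(nodes)
    rng.shuffle(candidates)
    last_error = None
    for node in candidates:
        try:
            return fetch(node)
        except OSError as err:
            last_error = err  # remember the failure and move on
    raise ConnectionError("all nodes failed") from last_error


def fake_fetch(node):
    """Stand-in for a real client call: pretend riak1 is down."""
    if node == "riak1":
        raise OSError("connection refused")
    return "value from " + node


result = fetch_with_retry(["riak1", "riak2", "riak3"], fake_fetch)

if __name__ == "__main__":
    print(result)
```
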

> 
> 8) is the data reported back on read compared/verified with all replicas 
> to ensure consistency or just its metadata (if R>1)?

Yes, R nodes have to respond with *the same* copy of the data before a read is 
successful. You can quickly do this by comparing vector clocks and other 
assorted metadata.
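
For a feel of what comparing vector clocks means, here is a small sketch. It 
models a clock as a dict mapping an actor name to a counter, which is an 
assumption made for illustration and not Riak's internal representation.

```python
def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify two clocks: equal, one newer than the other, or conflicting."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"


same = compare({"x": 2, "y": 1}, {"x": 2, "y": 1})
newer = compare({"x": 3, "y": 1}, {"x": 2, "y": 1})
conflict = compare({"x": 3, "y": 1}, {"x": 2, "y": 2})

if __name__ == "__main__":
    print(same, "|", newer, "|", conflict)
```

Two replicas with equal clocks hold the same version, so comparing the clocks 
is far cheaper than comparing the values themselves.
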

> 
> 9) is data integrity in the storage backend secured through checksums?

I think it depends on the storage backend implementation. Doing a quick grep 
through the source code turns up the word "checksum" a lot, though.

> 
> these are the questions puzzling me at the moment.
> if you know some filesystem that matches my featurelist, please don't 
> hesitate to answer them off-topic ;-)

Other options include HDFS and MogileFS (http://danga.com/mogilefs/). Last.fm 
uses MogileFS.

> 
> cheers
>  pille
> 
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

