I'm also currently evaluating Cassandra for archiving immutable "files", medium-sized ones: 1-100 MB, with a 10 MB mean size. The number of objects will probably be in the low millions. We require neither high throughput nor concurrent access to these objects, though; our goals are availability and persistence.
Steady and safe is key, so not being able to stream efficiently isn't such a big deal for us compared to the immediate benefits Cassandra offers in terms of managing replication, scaling and availability. I don't have a real problem working within Cassandra's limitations, myself.

Splitting blobs into manageable chunks/columns, say 2-8 MB to fit (almost) comfortably within Thrift RPC semantics, works well enough for us. Also, splitting data over several column families helps compaction, and I suppose it could be an optimization to split data across multiple keys as well. I kinda like the idea of having a single record per file though, since it makes managing, deleting and referencing files easier.

To increase read performance and allow streaming/sendfile web serving of "hot" files, a cache like Varnish in front of a thin web service to the "file system" would work well. Think of Varnish as a buffer, translating (slow) segmented reads to full-bore large-object streaming, at the cost of initial latency and duplicated (frontend) storage. If you're building a web site, cache warmup to handle first-request latency should probably be part of your plan as well... etc, and so on, turned over 'til done... :)

While this sort of usage certainly needs special care, and definitely requires application-specific design, I haven't run into any blockers (yet). On the contrary, being forced to focus on identifying actual data access patterns throughout a system is both enlightening and rewarding, IMHO.

Not everything is a nail, and that's ok. :)

/d

On Sun, Dec 13, 2009 at 10:55 PM, Michael Koziarski <[email protected]> wrote:
> On Sun, Dec 13, 2009 at 9:05 AM, Ran Tavory <[email protected]> wrote:
>> As we're designing our systems for a move from mysql to Cassandra we're
>> considering moving our file storage to Cassandra as well. Is this wise?
>> We're currently using mogilefs to store media items (images) of average size
>> of 30Mb (400k images, and growing). Cassandra looks like a performance
>> improvement over mogilefs (saves roundtrip, no sql in the middle) but I was
>> wondering whether the fact that cassandra stores byte arrays should
>> encourage us to store images in it. Is Cassandra a good fit?
>
> I think that mogile would probably be a much better fit here. While
> you may save a tiny bit of round-tripping, those sql queries aren't
> likely going be an appreciable percentage of the total time taken to
> stream the binary out to the user.
>
>> Has anyone had any similar experience or can send guidelines?
>> To phrase the question in more general terms: What's cassandra's sweet spot
>> in terms of Value size per column or total row size?
>> Thanks
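
For illustration, here is a rough sketch of the "single record per file, one column per chunk" layout described in the reply. It is not code from the thread: the names (split_blob, iter_chunks, join_blob, the "chunk-%08d" column naming) and the 4 MB chunk size are assumptions, and a plain dict stands in for the row, so no Cassandra client calls are shown.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # assumed value, inside the 2-8 MB range mentioned in the reply


def split_blob(data, chunk_size=CHUNK_SIZE):
    """Split one file into {column_name: chunk_bytes} for a single row."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    columns = {}
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        # Zero-padded names sort lexically in the same order as numerically,
        # so a name-ordered read returns the chunks in file order.
        columns["chunk-%08d" % index] = data[offset:offset + chunk_size]
    return columns


def iter_chunks(columns):
    """Yield chunks in file order, one at a time (segmented read)."""
    for name in sorted(columns):
        yield columns[name]


def join_blob(columns):
    """Reassemble the whole file from its chunk columns."""
    return b"".join(iter_chunks(columns))
```

A thin web service could hand iter_chunks() to its response body so that a cache in front of it sees one continuous stream, while deleting a file stays a single-row operation.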
