To Jeff’s point on tactical vs. strategic, here’s the big picture for me on object storage:
– Object storage is 70% cheaper: Replicated flash block storage is extremely expensive, and more so with compute resources constantly attached. If one were to build a storage platform on top of a cloud provider’s compute and storage infrastructure, selective use of object storage is essential to even being in the ballpark of managed offerings on price. EBS is 8¢/GB/month. S3 is 2.3¢/GB/month. It’s over 70% cheaper.
– Local/block storage is priced on storage *provisioned*. Object storage is priced on storage *consumed*: It’s actually better than 70%. Local/block storage is priced based on the size of disks/volumes provisioned. While they may be resizable, resizing is generally inelastic. This typically produces a large gap between storage consumed vs. storage provisioned - and poor utilization. At 50% utilization, for example, 8¢ per provisioned GB works out to 16¢ per consumed GB, widening the gap vs. S3’s 2.3¢ to roughly 85%. Object storage is typically priced on storage that is actually consumed.
– Object storage integration is the simplest path to complete decoupling of CPU and storage: Block volumes are more fungible than local disk, but aren’t even close in flexibility to an SSTable that can be accessed by any function. Object is also the only sensible path to implementing a serverless database whose query facilities can be deployed on a function-as-a-service platform. That enables one to reach an idle compute cost of zero and an idle storage cost of 2.3¢/GB/month (S3).
– Object storage enables scale-to-zero: Object storage integration is the only path for most databases to provide a scale-to-zero offering that doesn’t rely on keeping hot NVMe or block storage attached 24/7 while a database receives zero queries per second.
– Scale to zero is one of the easiest paths to zero marginal cost (the other is multitenancy - and the two aren’t mutually exclusive): Database platforms operated in a cluster-as-a-service model incur a constant fixed cost of provisioned resources regardless of whether they are in use. That’s fine for platforms that pass the full cost of resources consumed back to someone — but it produces poor economics and resource waste. The ability to scale to zero dramatically reduces the cost of provisioning and maintaining an un/underutilized database.
– It’s not all or nothing: There are super sensible ways to pair local/block storage and object storage. One might be to store upper-level SSTable data components in object storage, and all other SSTable components (TOC, CompressionInfo, primary index, etc) on local flash (a rough sketch follows at the end of this mail). This gives you a way to rapidly enumerate and navigate SSTable metadata while only paying the cost of reads when fetching data (and possibly from upper-level SSTables only). Alternatively, one could offload only older partitions of data in TWCS - time series data older than 30 days, a year, etc.
– Mounting a bucket as a filesystem is unusable: Others have made this point. Naively mounting an object storage bucket as a filesystem produces uniformly *terrible* results. S3 time to first byte is in the 20-30ms+ range. S3 one-zone is closer to the 5-10ms range, which is on par with the seek latency of a spinning disk. Despite C* originally being designed to operate well on spinning disks, a filesystem-like abstraction backed by object storage today will result in awful surprises due to IO patterns like those found by Jon H. and Jordan recently.
– Object Storage can unify “database” and “lakehouse”: One could imagine Iceberg integration that enables manifest/snapshot-based querying of SSTables in an object store via Spark or similar platforms, with zero ETL or light cone contact with a production database process.
The reason people care about object is that it’s 70%+ cheaper than flash - and 90%+ cheaper if the software querying it isn’t always running, too.
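Sketching the “not all or nothing” idea from above: a hybrid layout could hang off a per-component routing rule along these lines. This is a hypothetical illustration only - not the actual storage/ChannelProxy interface - and the component names and level cutoff are made up:

/** Hypothetical sketch of an "it's not all or nothing" layout. */
public final class HybridTierRouter {
    enum Tier { LOCAL_FLASH, OBJECT_STORE }

    static Tier tierFor(String componentName, int sstableLevel) {
        // Keep TOC, CompressionInfo, primary index, summaries, and filters on
        // local flash so SSTable metadata can be enumerated and navigated
        // without any object-store round trips.
        boolean isDataComponent = componentName.endsWith("Data.db");
        // Only the bulky Data component of upper-level (older, colder) SSTables
        // goes to object storage; reads against it pay object-store latency,
        // everything else stays fast.
        return (isDataComponent && sstableLevel >= 2) ? Tier.OBJECT_STORE : Tier.LOCAL_FLASH;
    }
}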
On Mar 4, 2025, at 12:29 PM, Štefan Miklošovič <smikloso...@apache.org> wrote:
Jeff, when it comes to snapshots, there was already discussion in another thread I am not sure you are aware of (1); in (2) I am talking about Sidecar + snapshots specifically. One "caveat" of Sidecar is that you actually _need_ Sidecar if we ever contemplated Sidecar doing upload / backup (by whatever means). "s3 bucket as a mounted directory" bypasses the necessity of having Sidecar deployed. I do not want to repeat what I wrote in (2); all the reasoning is there. My primary motivation to do it by mounting is to 1) not use Sidecar if not needed, and 2) just save snapshot SSTables outside of table data dirs (which is imho a completely fine requirement on its own, e.g. putting snapshots on a different / slow disk etc - why does it have to be on the same disk as the data?).

The vibe I got from that thread was that what I was proposing is in general acceptable, and I wanted to figure out the details, as Blake was mentioning incremental backups as well. Other than that, I was thinking that we were pretty much settled on how that should work; that thread was up for a long time, so I was thinking people in general do not have a problem with it.

How I read your email, specifically this part: "This is the same feedback I gave the sidecar with the rsync to another machine proposal. If we stop doing one off tactical projects and map out the actual problem we're solving, we can get the right interfaces." - it seems to me that you are categorizing "s3 bucket mounted locally" as one of these "one off tactical projects", which in translation means that you would like to see that approach not implemented and that we should rather focus on doing everything via proxies?

One big downside of that which nobody has answered yet (seems like that to me, as far as I checked) is how this would actually look in terms of deployment and delivery. As I said in (1), we would need to code up a proxy for every remote storage - Azure, S3, GCP, to name a few. We would need to implement all of these for each and every cloud. Secondly, who would implement that, and where would that code live? Is it up to individuals to code it up internally? When we want to talk to s3, we need to put all the s3 dependencies on the classpath - who is going to integrate that, and is that even possible? Similar for other clouds. By mounting a dir, we just do not do anything with Cassandra's class path. It is as it was: simple, easy, and we interact with it as we are used to. I see that a proxy might be viable for some applications, but I think it also has non-trivial operational disadvantages.

(1) https://lists.apache.org/thread/8cz5fh835ojnxwtn1479q31smm5x7nxt
(2) https://lists.apache.org/thread/mttg75ps49qkob6km4l74fmp879v76qs

Mounted dirs give up the opportunity to change the IO model to account for different behaviors. The configurable channel proxy may suffer from the same IO constraints depending on implementation, too. But it may also become viable.
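For concreteness on the "proxy for every remote storage" point: even the minimal snapshot upload/restore case needs something like the sketch below for S3 alone (this assumes the AWS SDK v2 and its transitive dependencies are on Cassandra's classpath; the class and method names are hypothetical, not an existing Cassandra or Sidecar interface), and Azure / GCS would each need their own equivalent:

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Path;

/** Hypothetical per-provider proxy for the snapshot upload/restore case.
 *  An Azure or GCS variant would need its own SDK and its own class. */
public final class S3SnapshotProxy {
    private final S3Client s3 = S3Client.create(); // region/credentials from the default provider chain

    public void upload(String bucket, String key, Path sstableFile) {
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                     RequestBody.fromFile(sstableFile));
    }

    public void download(String bucket, String key, Path target) {
        s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(), target);
    }
}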
The snapshot outside of the mounted file system seems like you’re implicitly implementing a one-off in-process backup. This is the same feedback I gave the sidecar with the rsync to another machine proposal. If we stop doing one off tactical projects and map out the actual problem we’re solving, we can get the right interfaces. Also, you can probably just have the sidecar rsync thing do your snapshot to another directory on the host.
But if every sstable makes its way to s3, things like native backup, restoring from backup, and recovering from local volumes can look VERY different.
For what it's worth, it might come across as if I were rejecting this altogether, which is not the case; all I am trying to say is that we should just think about it more. It would be cool to know more about the experience of others when it comes to this - maybe somebody has already tried to mount it and it did not work as expected?
On the other hand, there is this "snapshots outside data dir" effort I am doing, and if we combined it with this, then I can imagine we could say "and if you deal with snapshots, use this proxy instead", which would transparently upload them to s3.
Then we would not need to do anything at all, code-wise. We would not need to store snapshots "outside of data dir" just to be able to place them in a directory which is mounted as an s3 bucket.
I don't know if it is possible to do it like that. Worth exploring, I guess.
I like mounted dirs for their simplicity, and I guess that for copying files they might be just enough. Plus we would not need to add all the s3 jars to the classpath either, etc ... I am not saying that using remote object storage is useless.
I am just saying that I don't see the difference. I have not measured it, but I can imagine that a mounted s3 dir would use, under the hood, the same calls to the s3 api. How else would it be done? You need to talk to remote s3 storage eventually anyway. So why does it matter if we call the s3 api from Java or by other means from some "s3 driver"? It is eventually using the same thing, no?
Mounting an s3 bucket as a directory is an easy but poor implementation of object-backed storage for databases.
Object storage is durable (most data loss is due to bugs, not concurrent hardware failures), cheap (it can be 5-10x cheaper) and ubiquitous. A huge number of modern systems are object-storage-only because the approximately infinite scale / cost / throughput tradeoffs often make up for the latency.
Outright dismissing object storage for Cassandra is short-sighted - it needs to be done in a way that makes sense, not just blindly copying over the block access patterns to object.
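As one concrete example of what "not blindly copying the block access patterns" means: an object-aware reader can turn what would be many small seek-and-read calls through a mounted filesystem into a single ranged GET. A rough sketch, assuming the AWS SDK v2 (names are illustrative, not an existing Cassandra interface):

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

/** Fetch only the byte range we actually need, in one round trip. */
public final class RangedObjectReader {
    public static byte[] readSlice(S3Client s3, String bucket, String key, long offset, int length) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .range("bytes=" + offset + "-" + (offset + length - 1)) // HTTP Range header: one GET instead of many small reads
                .build();
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
        return bytes.asByteArray();
    }
}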
I do not think we need this CEP, honestly. I don't want to diss this unnecessarily, but if you mount remote storage locally (e.g. mounting an s3 bucket as if it were any other directory on the node's machine), then what is this CEP good for?
Not to mention the necessity of putting all the dependencies needed to talk to the respective remote storage on Cassandra's class path, introducing potential problems with dependencies and their possible incompatibilities / different versions, etc ...

I’d love to see this implemented — where “this” is a proxy for some notion of support for remote object storage, perhaps usable by compaction strategies like TWCS to migrate data older than a threshold from a local filesystem to remote object storage.
It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals.
– Scott

Is anyone else interested in continuing to discuss this topic? I discussed this offline with Claude; he is no longer working on this.
It's a pity. I think this is a very valuable thing. Commitlog archiving and restore may be able to use the relevant code if it is completed.
Thanks for reviving this one!
Is there any update on this topic? It seems that things could make big progress if Jake Luciani can find someone who can make the FileSystemProvider code accessible.

At a high level I really like the idea of being able to better leverage cheaper storage, especially object stores like S3.
One important thing though - I feel pretty strongly that there's a big, deal-breaking downside. Backups, disk failure policies, snapshots and possibly repairs - none of which have been particularly great in the past - would get more complicated, and of course there's the issue of failure recovery being only partially possible if you're looking at a durable block store paired with an ephemeral one, with some of your data not replicated to the cold side. That introduces a failure case that's unacceptable for most teams and results in needing to implement potentially 2 different backup solutions. This is operationally complex, with a lot of surface area for headaches. I think a lot of teams would probably have an issue with the big question mark around durability, and I would probably avoid it myself.
On the other hand, I'm +1 if we approach it slightly differently - where _all_ the data is located on the cold storage, with the local hot storage used as a cache. This means we can use the cold directories for the complete dataset, simplifying backups and node replacements.
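Very roughly, the read path I have in mind looks like the sketch below. All paths and names are hypothetical and eviction/concurrency are ignored; the point is just that the local disk is purely a cache over the complete cold copy, so losing it loses nothing:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** "Cold is the source of truth, local NVMe is just a cache" read path. */
public final class ColdFirstReader {
    public static Path open(Path coldDir, Path hotCacheDir, String sstableFile) throws IOException {
        Path cached = hotCacheDir.resolve(sstableFile);
        if (Files.exists(cached)) {
            return cached; // cache hit: serve from local NVMe
        }
        Path cold = coldDir.resolve(sstableFile);
        Files.createDirectories(hotCacheDir);
        Files.copy(cold, cached, StandardCopyOption.REPLACE_EXISTING); // hydrate the cache on miss
        return cached;
    }
}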
For a little background, we had a ticket several years ago where I pointed out it was possible to do this *today* at the operating system level as long as you're using block devices (vs an object store) and LVM [1]. For example, this works well with GP3 EBS w/ low IOPS provisioning + local NVMe to get a nice balance of great read performance without going nuts on the cost for IOPS. I also wrote about this in a little more detail in my blog [2]. There's also the new mount point tech in AWS, which pretty much does exactly what I've suggested above [3] and is probably worth evaluating just to get a feel for it.
I'm not insisting we require LVM or the AWS S3 fs, since that would rule out other cloud providers, but I am pretty confident that the entire dataset should reside in the "cold" side of things for the practical and technical reasons I listed above. I don't think it massively changes the proposal, and should simplify things for everyone.
Jon
Is there still interest in this? Can we get some points down on electrons so that we all understand the issues?
While it is fairly simple to redirect the read/write to something other than the local system for a single node, this will not solve the problem for tiered storage.
Tiered storage will require that on read/write the primary key be assessed to determine whether the read/write should be redirected. My reasoning for this statement is that in a cluster with a replication factor greater than 1, a node will store data for the keys that would be allocated to it in a cluster with a replication factor of 1, as well as some keys from nodes earlier in the ring.
Even if we can get the primary keys for all the data we want to write to "cold storage" to map to a single node, a replication factor > 1 means that the data will also be placed in "normal storage" on subsequent nodes.
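To make that concrete, here is a toy sketch (hypothetical tokens and a SimpleStrategy-style walk of the ring; not Cassandra code) showing that with a replication factor greater than 1, a single key's replicas land on several nodes, so a storage decision based only on which node "owns" the token cannot isolate the data:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy demo: with RF > 1, one key's replicas spread across multiple nodes. */
public final class ReplicaPlacementDemo {
    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "node1");
        ring.put(100L, "node2");
        ring.put(200L, "node3");

        long keyToken = 150L; // hypothetical token for some partition key
        int rf = 3;

        // Walk the ring clockwise from the key's token, collecting RF distinct nodes.
        List<String> replicas = new ArrayList<>();
        Long t = ring.ceilingKey(keyToken);
        while (replicas.size() < rf) {
            if (t == null) t = ring.firstKey(); // wrap around the ring
            if (!replicas.contains(ring.get(t))) replicas.add(ring.get(t));
            t = ring.higherKey(t);
        }
        System.out.println(replicas); // [node3, node1, node2] -- every node holds a copy
    }
}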
To overcome this, we have to explore ways to route data to different storage based on the keys, and that different storage may have to be available on _all_ the nodes.
Have any of the partial solutions mentioned in this email chain (or others) solved this problem?
Claude