I've come around on the rsync vs. built-in topic, and I think this is the same. Having this managed in process gives us more options for control.
I think it's critical that 100% of the data be pushed to the object store. I mentioned this in my email on Dec 15, but nobody directly responded to that. I keep seeing messages that sound like people only want a subset of SSTables to live in the object store; I think that would be a massive mistake.

Jon

On Tue, Mar 4, 2025 at 8:11 AM Jeff Jirsa <jji...@gmail.com> wrote:

> Mounted dirs give up the opportunity to change the IO model to account for different behaviors. The configurable channel proxy may suffer from the same IO constraints depending on implementation, too. But it may also become viable.
>
> The snapshot outside of the mounted file system seems like you're implicitly implementing a one-off in-process backup. This is the same feedback I gave the sidecar with the rsync-to-another-machine proposal. If we stop doing one-off tactical projects and map out the actual problem we're solving, we can get the right interfaces. Also, you can probably just have the sidecar rsync thing do your snapshot to another directory on host.
>
> But if every sstable makes its way to s3, things like native backup, restoring from backup, and recovering from local volumes can look VERY different.
>
> On Mar 4, 2025, at 3:57 PM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> For what it's worth, as it might come across to somebody that I am rejecting this altogether (which is not the case; all I am trying to say is that we should just think about it more) - it would be cool to know more about the experience of others when it comes to this. Maybe somebody already tried to mount and it did not work as expected?
>
> On the other hand, there is this "snapshots outside data dir" effort I am doing, and if we did it with this, then I can imagine that we could say "and if you deal with snapshots, use this proxy instead", which would transparently upload them to s3.
>
> Then we would not need to do anything at all, code-wise. We would not need to store snapshots "outside of data dir" just to be able to place them in a directory which is mounted as an s3 bucket.
>
> I don't know if it is possible to do it like that. Worth exploring, I guess.
>
> I like mounted dirs for their simplicity, and I guess that for copying files it might be just enough. Plus we would not need to add all the s3 jars on the CP either, etc.
>
> On Tue, Mar 4, 2025 at 2:46 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> I am not saying that using remote object storage is useless.
>>
>> I am just saying that I don't see the difference. I have not measured it, but I can imagine that a mounted s3 bucket would use, under the hood, the same calls to the s3 api. How else would it be done? You need to talk to remote s3 storage eventually anyway. So why does it matter if we call the s3 api from Java or by other means from some "s3 driver"? It is eventually using the same thing, no?
>>
>> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Mounting an s3 bucket as a directory is an easy but poor implementation of object-backed storage for databases.
>>>
>>> Object storage is durable (most data loss is due to bugs, not concurrent hardware failures), cheap (can be 5-10x cheaper) and ubiquitous. A huge number of modern systems are object-storage-only because the approximately infinite scale / cost / throughput tradeoffs often make up for the latency.
>>> Outright dismissing object storage for Cassandra is short-sighted - it needs to be done in a way that makes sense, not just blindly copying the block access patterns over to object.
>>>
>>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>> I do not think we need this CEP, honestly. I don't want to diss this unnecessarily, but if you mount remote storage locally (e.g. mounting an s3 bucket as if it were any other directory on the node's machine), then what is this CEP good for?
>>>
>>> Not to mention the necessity to put all the dependencies needed to talk to the respective remote storage on Cassandra's class path, introducing potential problems with dependencies and their possible incompatibilities / different versions, etc.
>>>
>>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>
>>>> I’d love to see this implemented — where “this” is a proxy for some notion of support for remote object storage, perhaps usable by compaction strategies like TWCS to migrate data older than a threshold from a local filesystem to remote object.
>>>>
>>>> It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals.
>>>>
>>>> – Scott
>>>>
>>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>>
>>>> Is anyone else interested in continuing to discuss this topic?
>>>>
>>>> guo Maxwell <cclive1...@gmail.com> wrote on Fri, Sep 20, 2024 at 09:44:
>>>>
>>>>> I discussed this offline with Claude; he is no longer working on this.
>>>>>
>>>>> It's a pity. I think this is a very valuable thing. Commitlog archiving and restore may be able to use the relevant code if it is completed.
>>>>>
>>>>> Patrick McFadin <pmcfa...@gmail.com> wrote on Fri, Sep 20, 2024 at 2:01 AM:
>>>>>
>>>>>> Thanks for reviving this one!
>>>>>>
>>>>>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>
>>>>>>> Is there any update on this topic? It seems that things could make big progress if Jake Luciani can find someone who can make the FileSystemProvider code accessible.
>>>>>>>
>>>>>>> Jon Haddad <j...@jonhaddad.com> wrote on Sat, Dec 16, 2023 at 05:29:
>>>>>>>
>>>>>>>> At a high level I really like the idea of being able to better leverage cheaper storage, especially object stores like S3.
>>>>>>>>
>>>>>>>> One important thing though - I feel pretty strongly that there's a big, deal-breaking downside. Backups, disk failure policies, snapshots, and possibly repairs - which haven't been particularly great in the past - would get more complicated, and of course there's the issue of failure recovery being only partially possible if you're looking at a durable block store paired with an ephemeral one, with some of your data not replicated to the cold side. That introduces a failure case that's unacceptable for most teams, which results in needing to implement potentially 2 different backup solutions. This is operationally complex with a lot of surface area for headaches. I think a lot of teams would probably have an issue with the big question mark around durability, and I probably would avoid it myself.
>>>>>>>> On the other hand, I'm +1 if we approach it slightly differently - where _all_ the data is located on the cold storage, with the local hot storage used as a cache. This means we can use the cold directories for the complete dataset, simplifying backups and node replacements.
>>>>>>>>
>>>>>>>> For a little background, we had a ticket several years ago where I pointed out it was possible to do this *today* at the operating system level as long as you're using block devices (vs an object store) and LVM [1]. For example, this works well with GP3 EBS w/ low IOPS provisioning + local NVMe to get a nice balance of great read performance without going nuts on the cost for IOPS. I also wrote about this in a little more detail in my blog [2]. There's also the new Mountpoint tech in AWS, which pretty much does exactly what I've suggested above [3], that's probably worth evaluating just to get a feel for it.
>>>>>>>>
>>>>>>>> I'm not insisting we require LVM or the AWS S3 fs, since that would rule out other cloud providers, but I am pretty confident that the entire dataset should reside on the "cold" side of things for the practical and technical reasons I listed above. I don't think it massively changes the proposal, and it should simplify things for everyone.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
>>>>>>>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>>>>>>>> [3] https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
>>>>>>>>
>>>>>>>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren <cla...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Is there still interest in this? Can we get some points down on electrons so that we all understand the issues?
>>>>>>>>>
>>>>>>>>> While it is fairly simple to redirect the read/write to something other than the local system for a single node, this will not solve the problem for tiered storage.
>>>>>>>>>
>>>>>>>>> Tiered storage will require that on read/write the primary key be assessed to determine whether the read/write should be redirected. My reasoning for this statement is that in a cluster with a replication factor greater than 1, the node will store data for the keys that would be allocated to it in a cluster with a replication factor of 1, as well as some keys from nodes earlier in the ring.
>>>>>>>>>
>>>>>>>>> Even if we can get the primary keys for all the data we want to write to "cold storage" to map to a single node, a replication factor > 1 means that data will also be placed in "normal storage" on subsequent nodes.
>>>>>>>>>
>>>>>>>>> To overcome this, we have to explore ways to route data to different storage based on the keys, and that different storage may have to be available on _all_ the nodes.
>>>>>>>>>
>>>>>>>>> Have any of the partial solutions mentioned in this email chain (or others) solved this problem?
>>>>>>>>>
>>>>>>>>> Claude
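To make the "managed in process" idea at the top of the thread concrete, here is a minimal sketch, assuming the AWS SDK for Java v2, of mirroring every SSTable component to an object store from inside the database process. The SSTableMirror class, its method, and the bucket key layout are hypothetical illustrations rather than anything specified in the CEP; the point is that an in-process uploader owns its own retry, concurrency, and rate-limiting decisions, which is the control over the IO model that a mounted bucket hides behind filesystem semantics.

    // Hypothetical sketch: mirror all component files of one SSTable to S3 after
    // flush or compaction. Error handling, retries and parallelism are omitted.
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public final class SSTableMirror
    {
        private final S3Client s3 = S3Client.create(); // region/credentials come from the environment
        private final String bucket;

        public SSTableMirror(String bucket)
        {
            this.bucket = bucket;
        }

        /** Upload every component file (Data.db, Index.db, Statistics.db, ...) in the given directory. */
        public void mirror(Path sstableDir, String keyspace, String table) throws IOException
        {
            try (Stream<Path> components = Files.list(sstableDir))
            {
                components.filter(Files::isRegularFile)
                          .forEach(file -> s3.putObject(
                              PutObjectRequest.builder()
                                              .bucket(bucket)
                                              .key(keyspace + "/" + table + "/" + file.getFileName())
                                              .build(),
                              RequestBody.fromFile(file)));
            }
        }
    }

A mounted bucket eventually issues much the same PUT calls under the hood, which is Štefan's point; the difference argued for above is that in process, Cassandra decides when and how those calls happen and can treat the bucket as the authoritative copy of every SSTable.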
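Jeff's point about changing the IO model is easiest to see on the read path. The sketch below, again assuming the AWS SDK v2, shows a pluggable read channel that can be served either by a local FileChannel or by HTTP range requests against an object store. The SSTableReadChannel interface and both implementations are hypothetical and deliberately ignore caching, prefetching, and error handling; this is not Cassandra's actual ChannelProxy API.

    // Hypothetical sketch: the same read interface backed by a local file or by
    // ranged GETs against S3, so the IO strategy can differ per storage tier.
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    interface SSTableReadChannel extends AutoCloseable
    {
        ByteBuffer read(long position, int length) throws IOException;
    }

    final class LocalReadChannel implements SSTableReadChannel
    {
        private final FileChannel channel;

        LocalReadChannel(Path file) throws IOException
        {
            channel = FileChannel.open(file, StandardOpenOption.READ);
        }

        @Override
        public ByteBuffer read(long position, int length) throws IOException
        {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            channel.read(buffer, position);
            buffer.flip();
            return buffer;
        }

        @Override
        public void close() throws IOException
        {
            channel.close();
        }
    }

    final class S3ReadChannel implements SSTableReadChannel
    {
        private final S3Client s3;
        private final String bucket;
        private final String key;

        S3ReadChannel(S3Client s3, String bucket, String key)
        {
            this.s3 = s3;
            this.bucket = bucket;
            this.key = key;
        }

        @Override
        public ByteBuffer read(long position, int length)
        {
            // HTTP range requests are inclusive on both ends.
            String range = "bytes=" + position + "-" + (position + length - 1);
            byte[] bytes = s3.getObjectAsBytes(GetObjectRequest.builder()
                                                               .bucket(bucket)
                                                               .key(key)
                                                               .range(range)
                                                               .build())
                             .asByteArray();
            return ByteBuffer.wrap(bytes);
        }

        @Override
        public void close()
        {
            // The shared S3Client is owned and closed by the caller.
        }
    }

Everything the sketch omits is where the interesting design work sits: how many ranged reads to coalesce, what to cache locally, and when it is cheaper to pull the whole object, rather than copying block access patterns to object storage one read at a time.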
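The layout Jon argues for, with the complete dataset in object storage and the local disk acting only as a cache, could look roughly like this sketch. The CachingObjectStore class is a hypothetical illustration; eviction, concurrent downloads, and checksum verification are all omitted.

    // Hypothetical sketch: the object store holds 100% of the data; local disk
    // is only a read-through cache keyed by the object key.
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class CachingObjectStore
    {
        private final S3Client s3 = S3Client.create();
        private final String bucket;
        private final Path cacheDir;

        public CachingObjectStore(String bucket, Path cacheDir)
        {
            this.bucket = bucket;
            this.cacheDir = cacheDir;
        }

        /** Return a local path for the component, downloading it from S3 on a cache miss. */
        public Path fetch(String key) throws IOException
        {
            Path local = cacheDir.resolve(key);
            if (!Files.exists(local))
            {
                Files.createDirectories(local.getParent());
                s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(), local);
            }
            return local;
        }
    }

An LVM cache over EBS plus local NVMe, or AWS's Mountpoint for S3 (the LVM and Mountpoint links above), gets a similar effect below the database; doing it in process is what would let backup, restore, and node replacement simply read the complete dataset straight from the bucket.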
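Finally, the TWCS-style migration Scott describes (data older than a threshold moving to remote object) reduces, at its simplest, to a policy like the one below; the class is a hypothetical illustration. It also arguably sidesteps part of Claude's concern: with RF > 1 every replica must reach the same tiering decision, and an age threshold is something each replica can evaluate locally without any key-to-node routing.

    // Hypothetical sketch: decide whether an SSTable belongs on the remote tier
    // based on the age of its newest data.
    import java.time.Duration;
    import java.time.Instant;

    public final class AgeTieringPolicy
    {
        private final Duration threshold;

        public AgeTieringPolicy(Duration threshold)
        {
            this.threshold = threshold;
        }

        /** True if an SSTable whose newest cell has timestamp maxTimestamp should live on the remote tier. */
        public boolean belongsOnRemoteTier(Instant maxTimestamp, Instant now)
        {
            return Duration.between(maxTimestamp, now).compareTo(threshold) > 0;
        }
    }

For example, new AgeTieringPolicy(Duration.ofDays(30)).belongsOnRemoteTier(maxTimestamp, Instant.now()) would flag any SSTable whose newest data is more than a month old.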