I've come around on the rsync vs. built-in topic, and I think this is the same. Having this managed in process gives us more options for control.
I think it's critical that 100% of the data be pushed to the object store. I mentioned this in my email on Dec 15, but nobody directly responded to that. I keep seeing messages that sound like people only want a subset of SSTables to live in the object store; I think that would be a massive mistake.

Jon

On Tue, Mar 4, 2025 at 8:11 AM Jeff Jirsa <jji...@gmail.com> wrote:

> Mounted dirs give up the opportunity to change the IO model to account for different behaviors. The configurable channel proxy may suffer from the same IO constraints depending on implementation, too. But it may also become viable.
>
> The snapshot outside of the mounted file system seems like you're implicitly implementing a one-off in-process backup. This is the same feedback I gave the sidecar with the rsync-to-another-machine proposal. If we stop doing one-off tactical projects and map out the actual problem we're solving, we can get the right interfaces. Also, you can probably just have the sidecar rsync thing do your snapshot to another directory on host.
>
> But if every sstable makes its way to s3, things like native backup, restoring from backup, and recovering from local volumes can look VERY different.
>
> On Mar 4, 2025, at 3:57 PM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> For what it's worth, as it might come across to somebody that I am rejecting this altogether (which is not the case; all I am trying to say is that we should just think about it more) - it would be cool to know more about the experience of others when it comes to this. Maybe somebody already tried to mount and it did not work as expected?
>
> On the other hand, there is this "snapshots outside data dir" effort I am doing, and if we did it with this, then I can imagine that we could say "and if you deal with snapshots, use this proxy instead", which would transparently upload them to s3.
>
> Then we would not need to do anything at all, code-wise. We would not need to store snapshots "outside of data dir" just to be able to place them in a directory which is mounted as an s3 bucket.
>
> I don't know if it is possible to do it like that. Worth exploring, I guess.
>
> I like mounted dirs for their simplicity, and I guess that for copying files it might be just enough. Plus we would not need to add all the s3 jars on the CP either, etc.
>
> On Tue, Mar 4, 2025 at 2:46 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> I am not saying that using remote object storage is useless.
>>
>> I am just saying that I don't see the difference. I have not measured it, but I can imagine that a mounted s3 bucket would use, under the hood, the same calls to the s3 api. How else would it be done? You need to talk to remote s3 storage eventually anyway. So why does it matter if we call the s3 api from Java or by other means from some "s3 driver"? It is eventually using the same thing, no?
>>
>> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Mounting an s3 bucket as a directory is an easy but poor implementation of object-backed storage for databases.
>>>
>>> Object storage is durable (most data loss is due to bugs, not concurrent hardware failures), cheap (can be 5-10x cheaper) and ubiquitous. A huge number of modern systems are object-storage-only because the approximately infinite scale / cost / throughput tradeoffs often make up for the latency.
>>> Outright dismissing object storage for Cassandra is short-sighted - it needs to be done in a way that makes sense, not just blindly copying the block access patterns over to object.
>>>
>>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>> I do not think we need this CEP, honestly. I don't want to diss this unnecessarily, but if you mount remote storage locally (e.g. mounting an s3 bucket as if it were any other directory on the node's machine), then what is this CEP good for?
>>>
>>> Not to mention the necessity to put all the dependencies needed to talk to the respective remote storage on Cassandra's class path, introducing potential problems with dependencies and their possible incompatibilities / different versions, etc.
>>>
>>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>
>>>> I’d love to see this implemented — where “this” is a proxy for some notion of support for remote object storage, perhaps usable by compaction strategies like TWCS to migrate data older than a threshold from a local filesystem to remote object.
>>>>
>>>> It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals.
>>>>
>>>> – Scott
>>>>
>>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>>
>>>> Is anyone else interested in continuing to discuss this topic?
>>>>
>>>> guo Maxwell <cclive1...@gmail.com> wrote on Fri, Sep 20, 2024 at 09:44:
>>>>
>>>>> I discussed this offline with Claude; he is no longer working on this.
>>>>>
>>>>> It's a pity. I think this is a very valuable thing. Commitlog archiving and restore may be able to use the relevant code if it is completed.
>>>>>
>>>>> Patrick McFadin <pmcfa...@gmail.com> wrote on Fri, Sep 20, 2024 at 2:01 AM:
>>>>>
>>>>>> Thanks for reviving this one!
>>>>>>
>>>>>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>
>>>>>>> Is there any update on this topic? It seems that things could make big progress if Jake Luciani can find someone who can make the FileSystemProvider code accessible.
>>>>>>>
>>>>>>> Jon Haddad <j...@jonhaddad.com> wrote on Sat, Dec 16, 2023 at 05:29:
>>>>>>>
>>>>>>>> At a high level I really like the idea of being able to better leverage cheaper storage, especially object stores like S3.
>>>>>>>>
>>>>>>>> One important thing though - I feel pretty strongly that there's a big, deal-breaking downside. Backups, disk failure policies, snapshots, and possibly repairs - which haven't been particularly great in the past - would get more complicated, and of course there's the issue of failure recovery being only partially possible if you're looking at a durable block store paired with an ephemeral one, with some of your data not replicated to the cold side. That introduces a failure case that's unacceptable for most teams, which results in needing to implement potentially 2 different backup solutions. This is operationally complex with a lot of surface area for headaches. I think a lot of teams would probably have an issue with the big question mark around durability, and I probably would avoid it myself.
>>>>>>>> On the other hand, I'm +1 if we approach it slightly differently - where _all_ the data is located on the cold storage, with the local hot storage used as a cache. This means we can use the cold directories for the complete dataset, simplifying backups and node replacements.
>>>>>>>>
>>>>>>>> For a little background, we had a ticket several years ago where I pointed out it was possible to do this *today* at the operating system level as long as you're using block devices (vs an object store) and LVM [1]. For example, this works well with GP3 EBS w/ low IOPS provisioning + local NVMe to get a nice balance of great read performance without going nuts on the cost for IOPS. I also wrote about this in a little more detail in my blog [2]. There's also the new Mountpoint tech in AWS, which pretty much does exactly what I've suggested above [3], that's probably worth evaluating just to get a feel for it.
>>>>>>>>
>>>>>>>> I'm not insisting we require LVM or the AWS S3 fs, since that would rule out other cloud providers, but I am pretty confident that the entire dataset should reside on the "cold" side of things for the practical and technical reasons I listed above. I don't think it massively changes the proposal, and it should simplify things for everyone.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
>>>>>>>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>>>>>>>> [3] https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
>>>>>>>>
>>>>>>>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren <cla...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Is there still interest in this? Can we get some points down on electrons so that we all understand the issues?
>>>>>>>>>
>>>>>>>>> While it is fairly simple to redirect the read/write to something other than the local system for a single node, this will not solve the problem for tiered storage.
>>>>>>>>>
>>>>>>>>> Tiered storage will require that on read/write the primary key be assessed to determine whether the read/write should be redirected. My reasoning for this statement is that in a cluster with a replication factor greater than 1, the node will store data for the keys that would be allocated to it in a cluster with a replication factor of 1, as well as some keys from nodes earlier in the ring.
>>>>>>>>>
>>>>>>>>> Even if we can get the primary keys for all the data we want to write to "cold storage" to map to a single node, a replication factor > 1 means that data will also be placed in "normal storage" on subsequent nodes.
>>>>>>>>>
>>>>>>>>> To overcome this, we have to explore ways to route data to different storage based on the keys, and that different storage may have to be available on _all_ the nodes.
>>>>>>>>>
>>>>>>>>> Have any of the partial solutions mentioned in this email chain (or others) solved this problem?
>>>>>>>>>
>>>>>>>>> Claude
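To make the "managed in process" idea at the top of the thread concrete, here is a minimal sketch, assuming the AWS SDK for Java v2, of mirroring every SSTable component to an object store from inside the database process. The SSTableMirror class, its method, and the bucket key layout are hypothetical illustrations rather than anything specified in the CEP; the point is that an in-process uploader owns its own retry, concurrency, and rate-limiting decisions, which is the control over the IO model that a mounted bucket hides behind filesystem semantics.

    // Hypothetical sketch: mirror all component files of one SSTable to S3 after
    // flush or compaction. Error handling, retries and parallelism are omitted.
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public final class SSTableMirror
    {
        private final S3Client s3 = S3Client.create(); // region/credentials come from the environment
        private final String bucket;

        public SSTableMirror(String bucket)
        {
            this.bucket = bucket;
        }

        /** Upload every component file (Data.db, Index.db, Statistics.db, ...) in the given directory. */
        public void mirror(Path sstableDir, String keyspace, String table) throws IOException
        {
            try (Stream<Path> components = Files.list(sstableDir))
            {
                components.filter(Files::isRegularFile)
                          .forEach(file -> s3.putObject(
                              PutObjectRequest.builder()
                                              .bucket(bucket)
                                              .key(keyspace + "/" + table + "/" + file.getFileName())
                                              .build(),
                              RequestBody.fromFile(file)));
            }
        }
    }

A mounted bucket eventually issues much the same PUT calls under the hood, which is Štefan's point; the difference argued for above is that in process, Cassandra decides when and how those calls happen and can treat the bucket as the authoritative copy of every SSTable.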
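Jeff's point about changing the IO model is easiest to see on the read path. The sketch below, again assuming the AWS SDK v2, shows a pluggable read channel that can be served either by a local FileChannel or by HTTP range requests against an object store. The SSTableReadChannel interface and both implementations are hypothetical and deliberately ignore caching, prefetching, and error handling; this is not Cassandra's actual ChannelProxy API.

    // Hypothetical sketch: the same read interface backed by a local file or by
    // ranged GETs against S3, so the IO strategy can differ per storage tier.
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    interface SSTableReadChannel extends AutoCloseable
    {
        ByteBuffer read(long position, int length) throws IOException;
    }

    final class LocalReadChannel implements SSTableReadChannel
    {
        private final FileChannel channel;

        LocalReadChannel(Path file) throws IOException
        {
            channel = FileChannel.open(file, StandardOpenOption.READ);
        }

        @Override
        public ByteBuffer read(long position, int length) throws IOException
        {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            channel.read(buffer, position);
            buffer.flip();
            return buffer;
        }

        @Override
        public void close() throws IOException
        {
            channel.close();
        }
    }

    final class S3ReadChannel implements SSTableReadChannel
    {
        private final S3Client s3;
        private final String bucket;
        private final String key;

        S3ReadChannel(S3Client s3, String bucket, String key)
        {
            this.s3 = s3;
            this.bucket = bucket;
            this.key = key;
        }

        @Override
        public ByteBuffer read(long position, int length)
        {
            // HTTP range requests are inclusive on both ends.
            String range = "bytes=" + position + "-" + (position + length - 1);
            byte[] bytes = s3.getObjectAsBytes(GetObjectRequest.builder()
                                                               .bucket(bucket)
                                                               .key(key)
                                                               .range(range)
                                                               .build())
                             .asByteArray();
            return ByteBuffer.wrap(bytes);
        }

        @Override
        public void close()
        {
            // The shared S3Client is owned and closed by the caller.
        }
    }

Everything the sketch omits is where the interesting design work sits: how many ranged reads to coalesce, what to cache locally, and when it is cheaper to pull the whole object, rather than copying block access patterns to object storage one read at a time.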
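The layout Jon argues for, with the complete dataset in object storage and the local disk acting only as a cache, could look roughly like this sketch. The CachingObjectStore class is a hypothetical illustration; eviction, concurrent downloads, and checksum verification are all omitted.

    // Hypothetical sketch: the object store holds 100% of the data; local disk
    // is only a read-through cache keyed by the object key.
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class CachingObjectStore
    {
        private final S3Client s3 = S3Client.create();
        private final String bucket;
        private final Path cacheDir;

        public CachingObjectStore(String bucket, Path cacheDir)
        {
            this.bucket = bucket;
            this.cacheDir = cacheDir;
        }

        /** Return a local path for the component, downloading it from S3 on a cache miss. */
        public Path fetch(String key) throws IOException
        {
            Path local = cacheDir.resolve(key);
            if (!Files.exists(local))
            {
                Files.createDirectories(local.getParent());
                s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(), local);
            }
            return local;
        }
    }

An LVM cache over EBS plus local NVMe, or AWS's Mountpoint for S3 (the LVM and Mountpoint links above), gets a similar effect below the database; doing it in process is what would let backup, restore, and node replacement simply read the complete dataset straight from the bucket.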
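Finally, the TWCS-style migration Scott describes (data older than a threshold moving to remote object) reduces, at its simplest, to a policy like the one below; the class is a hypothetical illustration. It also arguably sidesteps part of Claude's concern: with RF > 1 every replica must reach the same tiering decision, and an age threshold is something each replica can evaluate locally without any key-to-node routing.

    // Hypothetical sketch: decide whether an SSTable belongs on the remote tier
    // based on the age of its newest data.
    import java.time.Duration;
    import java.time.Instant;

    public final class AgeTieringPolicy
    {
        private final Duration threshold;

        public AgeTieringPolicy(Duration threshold)
        {
            this.threshold = threshold;
        }

        /** True if an SSTable whose newest cell has timestamp maxTimestamp should live on the remote tier. */
        public boolean belongsOnRemoteTier(Instant maxTimestamp, Instant now)
        {
            return Duration.between(maxTimestamp, now).compareTo(threshold) > 0;
        }
    }

For example, new AgeTieringPolicy(Duration.ofDays(30)).belongsOnRemoteTier(maxTimestamp, Instant.now()) would flag any SSTable whose newest data is more than a month old.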