Supporting a filesystem mount is perfectly reasonable. If you wanted to use
that with the S3 mount, there's nothing that should prevent you from doing
so, and the filesystem version is probably the default implementation that
we'd want to ship with since, to your point, it doesn't require additional
dependencies.

Allowing support for object stores is just that - allowing it.  It's just
more functionality, more flexibility.  I don't claim to know every object
store or remote filesystem, there are going to be folks that want to do
something custom I can't think of right now and there's no reason to box
ourselves in.  The abstraction to allow it is a small one.  If folks want
to put SSTables on HDFS, they should be able to.  Who am I to police them,
especially if we can do it in a way that doesn't carry any risk, e.g. by
making it available as a separate plugin?
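
To give a rough idea of how small that abstraction could be, here's a
purely illustrative sketch -- hypothetical names and signatures, not the
API anyone has actually proposed:

    // Hypothetical sketch only.  The default implementation would wrap the
    // local filesystem; an S3, GCS, HDFS, etc. implementation could ship as
    // a separate plugin on its own class path.
    import java.io.IOException;
    import java.nio.channels.ReadableByteChannel;
    import java.nio.channels.WritableByteChannel;

    public interface SSTableStorageProvider
    {
        // Open an existing SSTable component for reading, wherever it lives.
        ReadableByteChannel openForRead(String keyspace, String table, String component) throws IOException;

        // Create a new SSTable component for writing.
        WritableByteChannel openForWrite(String keyspace, String table, String component) throws IOException;

        // Remove a component; a remote implementation might issue a DELETE
        // against a bucket instead of unlinking a file.
        void delete(String keyspace, String table, String component) throws IOException;
    }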

My comment with regard to not wanting to treat object store as tiered
storage has everything to do with what I want to do with the data.  I want
100% of it on object store (with a copy of hot data on the local node) for
multiple reasons, some of which (maybe all) Jeff and Scott have already raised:

* Analytics is easier, no transfer off C*
* Replace is dead simple, just pull the data off the mount when it boots
up.  Like using EBS, but way cheaper.  (thanks Scott for some numbers on
this)
* It makes backups easier, just copy the bucket.
* If you're using an object store for backups anyway, then I don't see why
you wouldn't keep all your data there
* I hadn't even really thought about scale to zero before, but I love this
too

Some folks want to treat the object store as a second tier, implying to me
that once the SSTables reach a certain age or aren't touched, they're
uploaded to the object store. Supporting this use case shouldn't be that
much different.  Maybe you don't care about the most recent data, and
you're OK with losing everything from the last few days, because you can
reload from Kafka.  You'd be treating C* as a temporary staging area for
whatever purpose, and you only want to do analytics on the cold data.   As
I've expressed already, it's just a difference in policy of when to
upload.  Yes, I'm being a bit hand wavy about this part, but it's an email
discussion.  I don't have this use case today, but it's certainly valid.
Or maybe it works in conjunction with witness replicas / transient
replication, allowing you to offload data from a node until there's a
failure, in which case C* can grab it.  I'm just throwing out ideas here.
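
Just to illustrate what I mean by a difference in upload policy -- again,
hand wavy, and the names below are made up rather than a proposal:

    // Hypothetical illustration of the "when to upload" policy idea.
    public interface UploadPolicy
    {
        boolean shouldUpload(long sstableAgeMillis);

        // Write-through: every SSTable lands in the object store as soon as
        // it is written (the "100% on object store" case above).
        UploadPolicy IMMEDIATE = age -> true;

        // Tiered: only SSTables older than the threshold are uploaded.
        static UploadPolicy olderThan(long thresholdMillis)
        {
            return age -> age >= thresholdMillis;
        }
    }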

Feel free to treat object store here as "secondary location".  It can be a
NAS, an S3 mount, a custom FUSE filesystem, the S3 API, or whatever else
people come up with.

In the past, I've made the case that this functionality can be achieved
with LVM cache pools [1][2], and even provided some benchmarks showing it
can be used.  I've also argued that we can do node replacements with
rsync.  While these things are both technically true, others have convinced
me that having this functionality as first class in the database makes it
easier for our users and thus better for the project.  Should someone have
to *just* understand all of LVM, or *just* understand the nuance of
potential data loss due to rsync's default one-second timestamp resolution?  I keep
relearning that whenever I say "*just* do X", that X isn't as convenient or
easy for other people as it is for me, and I need to relax a bit.

A worldview where someone *just* needs to know dozens of workarounds makes
the database harder for non-experts to use.
The bar for usability is constantly being raised, and whatever makes it
better for a first time user is better for the community.

Anyways, that's where I'm at now.  Easier = better, even if we're
reinventing some of the wheel to do so.  Sometimes you get a better wheel,
too.

Jon

[1] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
[2] https://issues.apache.org/jira/browse/CASSANDRA-8460



On Fri, Mar 7, 2025 at 10:08 AM Štefan Miklošovič <smikloso...@apache.org>
wrote:

> Because an earlier reply hinted that mounting a bucket yields "terrible
> results". That has moved the discussion, in my mind, practically to the
> place of "we are not going to do this", to which I explained that in this
> particular case I do not find the speed important, because the use cases
> you want to use it for do not have anything in common with what I want to
> use it for.
>
> Since then, in my eyes this was binary "either / or", I was repeatedly
> trying to get an opinion about being able to mount it regardless, to which,
> afaik, only you explicitly expressed an opinion that it is OK but you are
> not a fan of it:
>
> "I personally can't see myself using something that treats an object store
> as cold storage where SSTables are moved (implying they weren't there
> before), and I've expressed my concerns with this, but other folks seem to
> want it and that's OK."
>
> So my assumption was, except you being ok with it, that mounting is not
> viable, so it looks like we are forcing it.
>
> To be super honest, if we made custom storage providers / proxies possible
> and it was already in place, then my urge to do "something fast and
> functioning" (e.g. mounting a bucket) would not exist. I would not use
> mounted buckets if we had this already in place and configurable in such a
> way that we could say that everything except (e.g) snapshots would be
> treated as it is now.
>
> But, I can see how this will take a very, very long time to implement.
> This is a complex CEP to tackle. I remember this topic being discussed in
> the past as well. I _think_ there were at least two occasions when this was
> already discussed, that it might be ported / retrofitted from what Mick was
> showing etc. Nothing happened. Maybe mounting a bucket is not perfect and
> doing it the other way is a more fitting solution etc. but as the saying
> goes "perfect is the enemy of good".
>
> On Fri, Mar 7, 2025 at 6:32 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>
>> If that's not your intent, then you should be more careful with your
>> replies.  When you write something like this:
>>
>> > While this might work, what I find tricky is that we are forcing this
>> to users. Not everybody is interested in putting everything to a bucket and
>> server traffic from that. They just don't want to do that. Because reasons.
>> They are just happy with what they have etc, it works fine for years and so
>> on. They just want to upload SSTables upon snapshotting and call it a day.
>>
>> > I don't think we should force our worldview on them if they are not
>> interested in it.
>>
>> It comes off *extremely* negative.  You use the word "force" here
>> multiple times.
>>
>>
>>
>>
>> On Fri, Mar 7, 2025 at 9:18 AM Štefan Miklošovič <smikloso...@apache.org>
>> wrote:
>>
>>> I was explaining multiple times (1) that I don't have anything against
>>> what is discussed here.
>>>
>>> Having questions about what that is going to look like does not mean I
>>> am dismissive.
>>>
>>> (1) https://lists.apache.org/thread/ofh2q52p92cr89wh2l3djsm5n9dmzzsg
>>>
>>> On Fri, Mar 7, 2025 at 5:44 PM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>>> Nobody is saying you can't work with a mount, and this isn't a
>>>> conversation about snapshots.
>>>>
>>>> Nobody is forcing users to use object storage either.
>>>>
>>>> You're making a ton of negative assumptions here about both the
>>>> discussion, and the people you're having it with.  Try to be more open
>>>> minded.
>>>>
>>>>
>>>> On Fri, Mar 7, 2025 at 2:28 AM Štefan Miklošovič <
>>>> smikloso...@apache.org> wrote:
>>>>
>>>>> The only way I see that working is that, if everything was in a
>>>>> bucket, if you take a snapshot, these SSTables would be "copied" from live
>>>>> data dir (living in a bucket) to snapshots dir (living in a bucket).
>>>>> Basically, we would need to say "and if you go to take a snapshot on this
>>>>> table, instead of hardlinking these SSTables, do a copy". But this
>>>>> "copying" would be internal to a bucket itself. We would not need to
>>>>> "upload" from node's machine to s3.
>>>>>
>>>>> While this might work, what I find tricky is that we are forcing this
>>>>> to users. Not everybody is interested in putting everything to a bucket 
>>>>> and
>>>>> server traffic from that. They just don't want to do that. Because 
>>>>> reasons.
>>>>> They are just happy with what they have etc, it works fine for years and 
>>>>> so
>>>>> on. They just want to upload SSTables upon snapshotting and call it a day.
>>>>>
>>>>> I don't think we should force our worldview on them if they are not
>>>>> interested in it.
>>>>>
>>>>> On Fri, Mar 7, 2025 at 11:02 AM Štefan Miklošovič <
>>>>> smikloso...@apache.org> wrote:
>>>>>
>>>>>> BTW, snapshots are quite special because these are not "files", they
>>>>>> are just hard links. They "materialize" as regular files once underlying
>>>>>> SSTables are compacted away. How are you going to hardlink from local
>>>>>> storage to an object storage anyway? We will always need to "upload".
>>>>>>
>>>>>> On Fri, Mar 7, 2025 at 10:51 AM Štefan Miklošovič <
>>>>>> smikloso...@apache.org> wrote:
>>>>>>
>>>>>>> Jon,
>>>>>>>
>>>>>>> all "big three" support mounting a bucket locally. That being said,
>>>>>>> I do not think that completely ditching this possibility for Cassandra
>>>>>>> working with a mount, e.g. for just uploading snapshots there etc, is
>>>>>>> reasonable.
>>>>>>>
>>>>>>> GCP
>>>>>>>
>>>>>>>
>>>>>>> https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstart-mount-bucket
>>>>>>>
>>>>>>> Azure (this one is quite sophisticated), lot of options ...
>>>>>>>
>>>>>>>
>>>>>>> https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy?tabs=RHEL
>>>>>>>
>>>>>>> S3, lot of options how to mount that
>>>>>>>
>>>>>>> https://bluexp.netapp.com/blog/amazon-s3-as-a-file-system
>>>>>>>
>>>>>>> On Thu, Mar 6, 2025 at 4:17 PM Jon Haddad <j...@rustyrazorblade.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Assuming everything else is identical, might not matter for S3.
>>>>>>>> However, not every object store has a filesystem mount.
>>>>>>>>
>>>>>>>> Regarding sprawling dependencies, we can always make the provider
>>>>>>>> specific libraries available as a separate download and put them on 
>>>>>>>> their
>>>>>>>> own thread with a separate class path. I think in JVM dtest does this
>>>>>>>> already.  Someone just started asking about IAM for login, it sounds 
>>>>>>>> like a
>>>>>>>> similar problem.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 6, 2025 at 12:53 AM Benedict <bened...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think another way of saying what Stefan may be getting at is
>>>>>>>>> what does a library give us that an appropriately configured mount dir
>>>>>>>>> doesn’t?
>>>>>>>>>
>>>>>>>>> We don’t want to treat S3 the same as local disk, but this can be
>>>>>>>>> achieved easily with config. Is there some other benefit of direct
>>>>>>>>> integration? Well defined exceptions if we need to distinguish cases 
>>>>>>>>> is one
>>>>>>>>> that maybe springs to mind but perhaps there are others?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 6 Mar 2025, at 08:39, Štefan Miklošovič <smikloso...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>> That is cool but this still does not show / explain how it would
>>>>>>>>> look like when it comes to dependencies needed for actually talking to
>>>>>>>>> storages like s3.
>>>>>>>>>
>>>>>>>>> Maybe I am missing something here and please explain when I am
>>>>>>>>> mistaken but If I understand that correctly, for talking to s3 we 
>>>>>>>>> would
>>>>>>>>> need to use a library like this, right? (1). So that would be added 
>>>>>>>>> among
>>>>>>>>> Cassandra dependencies? Hence Cassandra starts to be biased against 
>>>>>>>>> s3? Why
>>>>>>>>> s3? Every time somebody comes up with a new remote storage support, 
>>>>>>>>> that
>>>>>>>>> would be added to classpath as well? How are these dependencies going 
>>>>>>>>> to
>>>>>>>>> play with each other and with Cassandra in general? Will all these 
>>>>>>>>> storage
>>>>>>>>> provider libraries for arbitrary clouds be even compatible with 
>>>>>>>>> Cassandra
>>>>>>>>> licence-wise?
>>>>>>>>>
>>>>>>>>> I am sorry I keep repeating these questions but this part of that
>>>>>>>>> I just don't get at all.
>>>>>>>>>
>>>>>>>>> We can indeed add an API for this, sure sure, why not. But for
>>>>>>>>> people who do not want to deal with this at all and just be OK with a 
>>>>>>>>> FS
>>>>>>>>> mounted, why would we block them doing that?
>>>>>>>>>
>>>>>>>>> (1)
>>>>>>>>> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever <m...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>    .
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It’s not an area where I can currently dedicate engineering
>>>>>>>>>>> effort. But if others are interested in contributing a feature like 
>>>>>>>>>>> this,
>>>>>>>>>>> I’d see it as valuable for the project and would be happy to 
>>>>>>>>>>> collaborate on
>>>>>>>>>>> design/architecture/goals.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Jake mentioned 17 months ago a custom FileSystemProvider we could
>>>>>>>>>> offer.
>>>>>>>>>>
>>>>>>>>>> None of us at DataStax has gotten around to providing that, but
>>>>>>>>>> to quickly throw something over the wall this is it:
>>>>>>>>>>
>>>>>>>>>> https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
>>>>>>>>>>
>>>>>>>>>>   (with a few friend classes under o.a.c.io.util)
>>>>>>>>>>
>>>>>>>>>> We then have a RemoteStorageProvider, private in another repo,
>>>>>>>>>> that implements that and also provides the RemoteFileSystemProvider 
>>>>>>>>>> that
>>>>>>>>>> Jake refers to.
>>>>>>>>>>
>>>>>>>>>> Hopefully that's a start to get people thinking about CEP level
>>>>>>>>>> details, while we get a cleaned abstract of RemoteStorageProvider and
>>>>>>>>>> friends to offer.
>>>>>>>>>>
>>>>>>>>>>
