I would be very cautious about reinventing the wheel here. Are we
confident we can implement this in a way that is bug-free and robust
enough? Do we think we can do a better job than the authors of the "s3
driver"? If they did a good job (which I assume they did), then a failed
operation while writing to or reading from remote storage mounted locally
should ideally be indistinguishable from a failure on local storage.

So, what exactly are these differences? I am not asking you specifically,
just in general. I think this should all be answered before we go down this
path, so we know what we are getting ourselves into.
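
To make the question concrete, here is roughly the kind of difference I
would want spelled out. This is only an illustrative sketch (assuming the
AWS SDK v2 for the direct path and a bucket mounted at some local path for
the other; the class and the exception handling are made up for
illustration, not a proposal): a mounted bucket surfaces remote failures as
a plain IOException on an ordinary filesystem call, while the direct SDK
path hands back typed exceptions with HTTP status codes that a caller could
react to explicitly.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import software.amazon.awssdk.core.exception.SdkClientException;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
import software.amazon.awssdk.services.s3.model.S3Exception;

public class FailureSurfaceSketch {

    // Reading through a locally mounted bucket: a remote failure shows up as
    // a generic IOException, exactly like a failing local disk would.
    static byte[] readViaMount(Path fileOnMountedBucket) throws IOException {
        return Files.readAllBytes(fileOnMountedBucket);
    }

    // Reading through the SDK: failures arrive as typed exceptions carrying
    // HTTP status codes, so "missing object", "throttled" and "endpoint
    // unreachable" are distinguishable and can be handled differently.
    static byte[] readViaSdk(S3Client s3, String bucket, String key) {
        try {
            return s3.getObjectAsBytes(
                    GetObjectRequest.builder().bucket(bucket).key(key).build())
                     .asByteArray();
        } catch (NoSuchKeyException e) {
            throw new IllegalStateException("object does not exist: " + key, e);
        } catch (S3Exception e) {
            throw new IllegalStateException("service error, HTTP " + e.statusCode(), e);
        } catch (SdkClientException e) {
            throw new IllegalStateException("could not reach the endpoint", e);
        }
    }
}

Whether that extra information actually buys us better behaviour than a
well-tested mount is exactly the question above.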

Plus there are the disadvantages of putting all the dependencies on the
classpath. I think we would never release something like that ourselves. At
most we would expose an API for others to integrate with, and it would be
their job to deal with all of that complexity.
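
To be clear about what I mean by "an API": something as small as the
following would probably be enough. These names are entirely hypothetical
(nothing like this exists in the tree today); the point is that Cassandra
would only ever see a narrow interface, while the S3/GCS/Azure clients and
their transitive dependencies live in the integrator's own bundle.

import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

/**
 * Hypothetical extension point for remote storage integrations. An
 * implementation would be selected by name, similar to how other pluggable
 * classes are configured in cassandra.yaml.
 */
public interface RemoteStorageProvider {
    /** Short name used to select the implementation, e.g. "s3". */
    String name();

    /** Open a remotely stored file for reading. */
    ReadableByteChannel openForRead(String path) throws IOException;

    /** Create or overwrite a remotely stored file for writing. */
    WritableByteChannel openForWrite(String path) throws IOException;

    /** True if the remote file exists; retries are the implementation's job. */
    boolean exists(String path) throws IOException;
}

Everything behind such an interface, including the failure handling
discussed above, would then be the integrator's problem rather than ours.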

On Tue, Mar 4, 2025 at 2:57 PM Brandon Williams <dri...@gmail.com> wrote:

> A failing remote API that you are calling and a failing filesystem that
> you are using have different implications.
>
> Kind Regards,
> Brandon
>
> On Tue, Mar 4, 2025 at 7:47 AM Štefan Miklošovič <smikloso...@apache.org>
> wrote:
> >
> > I am not saying that using remote object storage is useless.
> >
> > I am just saying that I don't see the difference. I have not measured
> it, but I imagine that a mounted S3 bucket would use the same S3 API calls
> under the hood. How else would it be done? You need to talk to the remote
> S3 storage eventually anyway. So why does it matter whether we call the S3
> API from Java or by some other means from some "s3 driver"? It ends up
> using the same thing, no?
> >
> > On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa <jji...@gmail.com> wrote:
> >>
> >> Mounting an S3 bucket as a directory is an easy but poor implementation
> of object-backed storage for databases.
> >>
> >> Object storage is durable (most data loss is due to bugs, not concurrent
> hardware failures), cheap (it can be 5-10x cheaper), and ubiquitous. A huge
> number of modern systems are object-storage-only because the approximately
> infinite scale / cost / throughput tradeoffs often make up for the latency.
> >>
> >> Outright dismissing object storage for Cassandra is short-sighted - it
> needs to be done in a way that makes sense, not by blindly copying block
> access patterns over to object storage.
> >>
> >>
> >> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič <smikloso...@apache.org>
> wrote:
> >>
> >> 
> >> I do not think we need this CEP, honestly. I don't want to diss it
> unnecessarily, but if you can mount remote storage locally (e.g. mount an
> S3 bucket as if it were any other directory on the node's machine), then
> what is this CEP good for?
> >>
> >> Not to mention the necessity of putting all the dependencies required to
> talk to the respective remote storage onto Cassandra's classpath,
> introducing potential problems with dependencies and their possible
> incompatibilities, conflicting versions, etc ...
> >>
> >> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas <sc...@paradoxica.net>
> wrote:
> >>>
> >>> I’d love to see this implemented — where “this” is a proxy for some
> notion of support for remote object storage, perhaps usable by compaction
> strategies like TWCS to migrate data older than a threshold from a local
> filesystem to remote object.
> >>>
> >>> It’s not an area where I can currently dedicate engineering effort.
> But if others are interested in contributing a feature like this, I’d see
> it as valuable for the project and would be happy to collaborate on
> design/architecture/goals.
> >>>
> >>> – Scott
> >>>
> >>> On Feb 26, 2025, at 6:56 AM, guo Maxwell <cclive1...@gmail.com> wrote:
> >>>
> >>> 
> >>> Is anyone else interested in continuing to discuss this topic?
> >>>
> >>> guo Maxwell <cclive1...@gmail.com> wrote on Fri, Sep 20, 2024 at 09:44:
> >>>>
> >>>> I discussed this offline with Claude; he is no longer working on this.
> >>>>
> >>>> It's a pity. I think this is a very valuable thing. Commitlog
> archiving and restore may be able to reuse the relevant code once it is
> completed.
> >>>>
> >>>> Patrick McFadin <pmcfa...@gmail.com> wrote on Fri, Sep 20, 2024 at 2:01 AM:
> >>>>>
> >>>>> Thanks for reviving this one!
> >>>>>
> >>>>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell <cclive1...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Is there any update on this topic? It seems that things could make
> big progress if Jake Luciani can find someone who can make the
> FileSystemProvider code accessible.
> >>>>>>
> >>>>>> Jon Haddad <j...@jonhaddad.com> wrote on Sat, Dec 16, 2023 at 05:29:
> >>>>>>>
> >>>>>>> At a high level, I really like the idea of being able to better
> leverage cheaper storage, especially object stores like S3.
> >>>>>>>
> >>>>>>> One important thing though - I feel pretty strongly that there's a
> big, deal-breaking downside. Backups, disk failure policies, snapshots,
> and possibly repairs (none of which have been particularly great in the
> past) would get more complicated, and of course there's the issue of
> failure recovery being only partially possible if you're looking at a
> durable block store paired with an ephemeral one, with some of your data
> not replicated to the cold side. That introduces a failure case that's
> unacceptable for most teams and results in needing to implement
> potentially two different backup solutions. This is operationally complex,
> with a lot of surface area for headaches. I think a lot of teams would
> have an issue with the big question mark around durability, and I would
> probably avoid it myself.
> >>>>>>>
> >>>>>>> On the other hand, I'm +1 if we approach it slightly differently -
> with _all_ of the data located on the cold storage and the local hot
> storage used as a cache. This means we can use the cold directories for
> the complete dataset, simplifying backups and node replacements.
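> >>>>>>>
> >>>>>>> To make that concrete, here's a rough sketch of the shape I have in
> mind - hypothetical class and directory names, nothing that exists today,
> just the read-through idea over a complete cold copy:
> >>>>>>>
> >>>>>>> import java.io.IOException;
> >>>>>>> import java.nio.file.Files;
> >>>>>>> import java.nio.file.Path;
> >>>>>>> import java.nio.file.StandardCopyOption;
> >>>>>>>
> >>>>>>> // The cold directory holds the complete dataset; the hot directory
> >>>>>>> // is only a cache of recently read files.
> >>>>>>> final class ReadThroughCache {
> >>>>>>>     private final Path hotDir;   // e.g. local NVMe
> >>>>>>>     private final Path coldDir;  // e.g. durable block or object-backed mount
> >>>>>>>
> >>>>>>>     ReadThroughCache(Path hotDir, Path coldDir) {
> >>>>>>>         this.hotDir = hotDir;
> >>>>>>>         this.coldDir = coldDir;
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     // Return a local path for the requested file, pulling it from
> >>>>>>>     // the cold copy on a miss. Backups and node replacement only
> >>>>>>>     // ever need to look at coldDir.
> >>>>>>>     Path open(String relativePath) throws IOException {
> >>>>>>>         Path hot = hotDir.resolve(relativePath);
> >>>>>>>         if (Files.exists(hot)) {
> >>>>>>>             return hot; // cache hit
> >>>>>>>         }
> >>>>>>>         Path cold = coldDir.resolve(relativePath);
> >>>>>>>         Files.createDirectories(hot.getParent());
> >>>>>>>         Files.copy(cold, hot, StandardCopyOption.REPLACE_EXISTING);
> >>>>>>>         return hot;
> >>>>>>>     }
> >>>>>>> }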
> >>>>>>>
> >>>>>>> For a little background, we had a ticket several years ago [1]
> where I pointed out that it is possible to do this *today* at the
> operating system level, as long as you're using block devices (vs an
> object store) and LVM. For example, this works well with GP3 EBS w/ low
> IOPS provisioning + local NVMe to get a nice balance of great read
> performance without going nuts on the cost for IOPS. I also wrote about
> this in a little more detail on my blog [2]. There's also the new
> Mountpoint tech in AWS [3], which pretty much does exactly what I've
> suggested above and is probably worth evaluating just to get a feel for it.
> >>>>>>>
> >>>>>>> I'm not insisting we require LVM or the AWS S3 fs, since that
> would rule out other cloud providers, but I am pretty confident that the
> entire dataset should reside on the "cold" side of things for the
> practical and technical reasons I listed above. I don't think it massively
> changes the proposal, and it should simplify things for everyone.
> >>>>>>>
> >>>>>>> Jon
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
> >>>>>>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
> >>>>>>> [3]
> https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren <cla...@apache.org>
> wrote:
> >>>>>>>>
> >>>>>>>> Is there still interest in this?  Can we get some points down on
> electrons so that we all understand the issues?
> >>>>>>>>
> >>>>>>>> While it is fairly simple to redirect reads/writes to something
> other than the local system for a single node, this does not solve the
> problem for tiered storage.
> >>>>>>>>
> >>>>>>>> Tiered storage will require that, on read/write, the primary key
> be assessed to determine whether the read/write should be redirected. My
> reasoning for this statement is that in a cluster with a replication
> factor greater than 1, a node will store data for the keys that would be
> allocated to it in a cluster with a replication factor of 1, as well as
> some keys from nodes earlier in the ring.
> >>>>>>>>
> >>>>>>>> Even if we can get the primary keys for all the data we want to
> write to "cold storage" to map to a single node, a replication factor > 1
> means that data will also be placed in "normal storage" on subsequent nodes.
> >>>>>>>>
> >>>>>>>> To overcome this, we have to explore ways to route data to
> different storage based on the keys, and that different storage may have
> to be available on _all_ the nodes.
> >>>>>>>>
> >>>>>>>> Have any of the partial solutions mentioned in this email chain
> (or others) solved this problem?
> >>>>>>>>
> >>>>>>>> Claude
>
