Hello,

I would love to discuss this and contribute feedback and design work for it. 
Since I'm not an experienced Java programmer I can't be hands-on with the code, 
but I'll pick this up and try to carry it forward.

Carlos

________________________________
From: guo Maxwell <cclive1...@gmail.com>
Sent: 26 February 2025 14:54
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external 
storage locations

Is anyone else interested in continuing to discuss this topic?

On Fri, Sep 20, 2024 at 9:44 AM guo Maxwell <cclive1...@gmail.com> wrote:
I discussed this offline with Claude; he is no longer working on this.

It's a pity. I think this is very valuable work; commitlog archiving and 
restore may be able to reuse the relevant code once it is completed.

On Fri, Sep 20, 2024 at 2:01 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
Thanks for reviving this one!

On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell <cclive1...@gmail.com> wrote:
Is there any update on this topic?  It seems things could make big progress if 
Jake Luciani can find someone to make the FileSystemProvider code accessible.
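
For anyone who hasn't looked at the NIO plumbing, here is a minimal sketch of 
what aliasing a path onto another provider could look like using only standard 
java.nio APIs. The class name, the mount URI, and the existence of a 
third-party provider for that scheme (e.g. an S3 NIO provider on the classpath) 
are assumptions for illustration, not part of the CEP.

import java.io.IOException;
import java.net.URI;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

/**
 * Illustrative sketch only: re-root reads onto a FileSystem obtained from a
 * third-party NIO FileSystemProvider (assumed to be on the classpath for the
 * given URI scheme). Class and method names are made up for the example.
 */
public final class AliasedStorage
{
    private final FileSystem remote;

    private AliasedStorage(FileSystem remote)
    {
        this.remote = remote;
    }

    /** Discover the provider by URI scheme and mount the remote file system once. */
    public static AliasedStorage mount(URI uri) throws IOException
    {
        return new AliasedStorage(FileSystems.newFileSystem(uri, Map.of()));
    }

    /** Open a read channel for a path resolved on the remote file system. */
    public SeekableByteChannel openForRead(String relativePath) throws IOException
    {
        Path remotePath = remote.getPath(relativePath);
        // Files.newByteChannel only requires the provider to implement newByteChannel,
        // so it also works with providers that cannot return a local FileChannel.
        return Files.newByteChannel(remotePath, StandardOpenOption.READ);
    }
}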

On Sat, Dec 16, 2023 at 5:29 AM Jon Haddad <j...@jonhaddad.com> wrote:
At a high level I really like the idea of being able to better leverage cheaper 
storage, especially object stores like S3.

One important thing though - I feel pretty strongly that there's a big, 
deal-breaking downside.  Backups, disk failure policies, snapshots and possibly 
repairs, which haven't been particularly great in the past, would get more 
complicated.  And of course there's the issue of failure recovery being only 
partially possible if you're pairing a durable block store with an ephemeral 
one, with some of your data not replicated to the cold side.  That introduces a 
failure case that's unacceptable for most teams and means potentially 
implementing two different backup solutions.  This is operationally complex, 
with a lot of surface area for headaches.  I think a lot of teams would have an 
issue with the big question mark around durability, and I'd probably avoid it 
myself.

On the other hand, I'm +1 if we approach it slightly differently - where _all_ 
the data is located on the cold storage, with the local hot storage used as a 
cache.  This means we can use the cold directories for the complete dataset, 
simplifying backups and node replacements.
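
To make the "cold storage is the source of truth, local disk is a cache" shape 
concrete, here is a minimal sketch of a read path using plain java.nio. The 
class and directory names are hypothetical; a real implementation would also 
need eviction, concurrency control, and write-through.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/**
 * Illustrative sketch only: the cold side holds the complete dataset, the hot
 * (local) side is purely a cache that is filled on demand.
 */
public final class ColdBackedCache
{
    private final Path coldRoot;  // complete dataset, e.g. durable block store or object-backed FS
    private final Path hotRoot;   // local NVMe used only as a cache

    public ColdBackedCache(Path coldRoot, Path hotRoot)
    {
        this.coldRoot = coldRoot;
        this.hotRoot = hotRoot;
    }

    public FileChannel openForRead(String sstableFile) throws IOException
    {
        Path hot = hotRoot.resolve(sstableFile);
        if (!Files.exists(hot))
        {
            // Cache miss: pull the immutable SSTable component down from cold storage.
            Files.createDirectories(hot.getParent());
            Files.copy(coldRoot.resolve(sstableFile), hot, StandardCopyOption.REPLACE_EXISTING);
        }
        return FileChannel.open(hot, StandardOpenOption.READ);
    }
}

Because every file also exists on the cold side, losing the hot volume only 
costs cache warmth, which is what simplifies backups and node replacements.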

For a little background, we had a ticket several years ago where I pointed out 
it was possible to do this *today* at the operating system level, as long as 
you're using block devices (vs an object store) and LVM [1].  For example, this 
works well with GP3 EBS with low IOPS provisioning plus local NVMe, giving a 
nice balance of great read performance without going nuts on the cost for IOPS. 
I also wrote about this in a little more detail in my blog [2].  There's also 
the new mount point tech in AWS, which does pretty much exactly what I've 
suggested above [3] and is probably worth evaluating just to get a feel for it.

I'm not insisting we require LVM or the AWS S3 fs, since that would rule out 
other cloud providers, but I am pretty confident that the entire dataset should 
reside in the "cold" side of things for the practical and technical reasons I 
listed above.  I don't think it massively changes the proposal, and should 
simplify things for everyone.

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-8460
[2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
[3] https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/


On Thu, Dec 14, 2023 at 1:56 AM Claude Warren <cla...@apache.org> wrote:
Is there still interest in this?  Can we get some points down on electrons so 
that we all understand the issues?

While it is fairly simple to redirect reads and writes to something other than 
the local file system for a single node, this will not solve the problem for 
tiered storage.

Tiered storage will require that, on each read/write, the primary key be 
assessed to determine whether the operation should be redirected.  My reasoning 
for this statement is that in a cluster with a replication factor greater than 
1, a node will store data for the keys that would be allocated to it in a 
cluster with a replication factor of 1, as well as some keys from nodes earlier 
in the ring.

Even if we could get the primary keys for all the data we want to write to "cold 
storage" to map to a single node, a replication factor > 1 means that data will 
also be placed in "normal storage" on subsequent nodes.

To overcome this, we have to explore ways to route data to different storage 
based on the keys, and that different storage may have to be available on _all_ 
the nodes.
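
As an illustration of that last point (not part of any proposal), here is a 
sketch of the kind of per-key decision every replica would have to make 
locally. The predicate, the ByteBuffer key type, and the directory layout are 
assumptions made up for the example.

import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.util.function.Predicate;

/**
 * Illustrative sketch only: every replica applies the same per-key rule when it
 * reads or writes, so a partition lands on "cold" or "normal" storage no matter
 * which node in the ring happens to hold that replica.
 */
public final class KeyTieredDirectories
{
    private final Path normalRoot;
    private final Path coldRoot;
    private final Predicate<ByteBuffer> isCold; // e.g. "partition belongs to an archived range" (assumption)

    public KeyTieredDirectories(Path normalRoot, Path coldRoot, Predicate<ByteBuffer> isCold)
    {
        this.normalRoot = normalRoot;
        this.coldRoot = coldRoot;
        this.isCold = isCold;
    }

    /**
     * With RF > 1 the same partition key is replicated to several nodes, so this
     * decision has to be made per key on all of them; pinning a single "cold node"
     * in the ring cannot work.
     */
    public Path directoryFor(ByteBuffer partitionKey)
    {
        return isCold.test(partitionKey) ? coldRoot : normalRoot;
    }
}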

Have any of the partial solutions mentioned in this email chain (or others) 
solved this problem?

Claude
