A failing remote API that you are calling and a failing filesystem you are using have different implications.

Kind Regards,
Brandon
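To make Brandon's distinction concrete, here is a minimal Java sketch of the two failure surfaces. ObjectStoreClient and RemoteStorageException are made-up stand-ins (not Cassandra or AWS SDK types) and the retry policy is only illustrative: a direct API call fails with an obviously remote error that can be retried or throttled, while the same outage behind a mounted bucket arrives as an ordinary filesystem IOException that generic disk-failure handling cannot distinguish from a dying local drive.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical stand-ins for an object-store SDK; not real Cassandra or AWS types.
    class RemoteStorageException extends IOException
    {
        RemoteStorageException(String msg) { super(msg); }
    }

    interface ObjectStoreClient
    {
        byte[] get(String key) throws RemoteStorageException;
    }

    final class FailureSurfaceSketch
    {
        // Direct API call: the failure is visibly remote, so retry/backoff is a natural
        // response and the node's own disks are never implicated.
        static byte[] readViaApi(ObjectStoreClient client, String key) throws RemoteStorageException
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return client.get(key);
                }
                catch (RemoteStorageException e)
                {
                    if (attempt == 3)
                        throw e;
                }
            }
        }

        // Mounted bucket: the same outage arrives as a plain IOException from a "local" path,
        // indistinguishable from a failing disk to any generic disk-failure policy.
        static byte[] readViaMount(Path fileOnMountedBucket) throws IOException
        {
            try (InputStream in = Files.newInputStream(fileOnMountedBucket))
            {
                return in.readAllBytes();
            }
        }
    }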
On Tue, Mar 4, 2025 at 7:47 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> I don't say that using remote object storage is useless.
>
> I am just saying that I don't see the difference. I have not measured it, but I can imagine that a mounted S3 bucket would use, under the hood, the same calls to the S3 API. How else would it be done? You need to talk to the remote S3 storage eventually anyway. So why does it matter whether we call the S3 API from Java or by some other means from an "S3 driver"? It ends up using the same thing, no?
>
> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> Mounting an S3 bucket as a directory is an easy but poor implementation of object-backed storage for databases.
>>
>> Object storage is durable (most data loss is due to bugs, not concurrent hardware failures), cheap (it can be 5-10x cheaper), and ubiquitous. A huge number of modern systems are object-storage-only because the approximately infinite scale / cost / throughput tradeoffs often make up for the latency.
>>
>> Outright dismissing object storage for Cassandra is short-sighted - it needs to be done in a way that makes sense, not by blindly copying the block access patterns over to object storage.
>>
>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>> I do not think we need this CEP, honestly. I don't want to diss this unnecessarily, but if you mount remote storage locally (e.g. mounting an S3 bucket as if it were any other directory on the node's machine), then what is this CEP good for?
>>
>> That is not even mentioning the necessity of putting all the dependencies needed to talk to the respective remote storage on Cassandra's classpath, introducing potential problems with dependencies and their possible incompatibilities / different versions, etc.
>>
>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas <sc...@paradoxica.net> wrote:
>>>
>>> I’d love to see this implemented — where “this” is a proxy for some notion of support for remote object storage, perhaps usable by compaction strategies like TWCS to migrate data older than a threshold from a local filesystem to remote object storage.
>>>
>>> It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals.
>>>
>>> – Scott
>>>
>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>
>>> Is anyone else interested in continuing to discuss this topic?
>>>
>>> On Fri, Sep 20, 2024 at 9:44 AM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>
>>>> I discussed this offline with Claude; he is no longer working on this.
>>>>
>>>> It's a pity. I think this is a very valuable thing. Commitlog archiving and restore may be able to reuse the relevant code if it is completed.
>>>>
>>>> On Fri, Sep 20, 2024 at 2:01 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>
>>>>> Thanks for reviving this one!
>>>>>
>>>>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>
>>>>>> Is there any update on this topic? It seems that things could make big progress if Jake Luciani can find someone who can make the FileSystemProvider code accessible.
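Since the thread keeps returning to a FileSystemProvider-based approach, a minimal sketch of that plumbing may help; it also speaks to Štefan's question above about calling the S3 API "from Java or by other means". The sketch assumes some third-party S3 FileSystemProvider is registered on the classpath for the "s3" scheme, and the bucket and file names are placeholders: the database code keeps using the standard java.nio.file API, and whichever provider claims the URI scheme performs the remote calls.

    import java.net.URI;
    import java.nio.channels.SeekableByteChannel;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;

    public class RemoteSSTableReadSketch
    {
        public static void main(String[] args) throws Exception
        {
            // FileSystems delegates to whichever installed FileSystemProvider claims the
            // scheme; "s3://example-bucket/" and the SSTable path below are placeholders.
            URI bucket = URI.create("s3://example-bucket/");
            try (FileSystem remote = FileSystems.newFileSystem(bucket, Map.of()))
            {
                Path sstable = remote.getPath("/ks/tbl/nb-1-big-Data.db");
                // Same channel-oriented java.nio.file API a local data directory is read with.
                try (SeekableByteChannel channel = Files.newByteChannel(sstable, StandardOpenOption.READ))
                {
                    System.out.println("component size = " + channel.size() + " bytes");
                }
            }
        }
    }

Whether such a provider issues a blocking S3 request per read, caches locally, or batches range reads is exactly the design question Jeff raises about not copying block access patterns onto object storage.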
>>>>>> On Sat, Dec 16, 2023 at 5:29 AM Jon Haddad <j...@jonhaddad.com> wrote:
>>>>>>>
>>>>>>> At a high level I really like the idea of being able to better leverage cheaper storage, especially object stores like S3.
>>>>>>>
>>>>>>> One important thing though - I feel pretty strongly that there's a big, deal-breaking downside. Backups, disk failure policies, snapshots, and possibly repairs (which haven't been particularly great in the past) would get more complicated, and of course there's the issue of failure recovery being only partially possible if you're looking at a durable block store paired with an ephemeral one, with some of your data not replicated to the cold side. That introduces a failure case that's unacceptable for most teams, and it results in needing to implement potentially two different backup solutions. This is operationally complex, with a lot of surface area for headaches. I think a lot of teams would probably have an issue with the big question mark around durability, and I would probably avoid it myself.
>>>>>>>
>>>>>>> On the other hand, I'm +1 if we approach it slightly differently - where _all_ the data is located on the cold storage, with the local hot storage used as a cache. This means we can use the cold directories for the complete dataset, simplifying backups and node replacements.
>>>>>>>
>>>>>>> For a little background, we had a ticket several years ago where I pointed out it was possible to do this *today* at the operating system level, as long as you're using block devices (vs an object store) and LVM [1]. For example, this works well with GP3 EBS with low IOPS provisioning plus local NVMe to get a nice balance of great read performance without going nuts on the cost for IOPS. I also wrote about this in a little more detail on my blog [2]. There's also the new Mountpoint tech in AWS, which does pretty much exactly what I've suggested above [3] and is probably worth evaluating just to get a feel for it.
>>>>>>>
>>>>>>> I'm not insisting we require LVM or the AWS S3 filesystem, since that would rule out other cloud providers, but I am pretty confident that the entire dataset should reside on the "cold" side of things for the practical and technical reasons I listed above. I don't think it massively changes the proposal, and it should simplify things for everyone.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
>>>>>>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>>>>>>> [3] https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
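A rough sketch of the read path Jon describes, where the cold store holds the complete dataset and the local disk is only a cache. RemoteObjectStore, CachingComponentFetcher, and the naming are invented for illustration; eviction, concurrency control, and range reads of large components are deliberately left out.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Hypothetical client for the authoritative "cold" object store.
    interface RemoteObjectStore
    {
        InputStream get(String key) throws IOException;
    }

    final class CachingComponentFetcher
    {
        private final Path cacheDir;          // local NVMe directory acting purely as a cache
        private final RemoteObjectStore cold; // holds the complete dataset

        CachingComponentFetcher(Path cacheDir, RemoteObjectStore cold)
        {
            this.cacheDir = cacheDir;
            this.cold = cold;
        }

        // Returns a local path for an SSTable component, pulling it from cold storage on a miss.
        Path fetch(String componentName) throws IOException
        {
            Path local = cacheDir.resolve(componentName);
            if (Files.exists(local))
                return local;                                  // cache hit

            Path partial = Files.createTempFile(cacheDir, "fetch-", ".part");
            try (InputStream in = cold.get(componentName))
            {
                Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            }
            // Rename into place so concurrent readers never observe a half-written file.
            return Files.move(partial, local, StandardCopyOption.ATOMIC_MOVE);
        }
    }

Because the cold copy is complete, a replacement node can start with an empty cache and hydrate it lazily, which is the simplification to backups and node replacement Jon points at.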
>>>>>>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren <cla...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Is there still interest in this? Can we get some points down on electrons so that we all understand the issues?
>>>>>>>>
>>>>>>>> While it is fairly simple to redirect reads/writes to something other than the local system for a single node, this will not solve the problem for tiered storage.
>>>>>>>>
>>>>>>>> Tiered storage will require that on each read/write the primary key be assessed to determine whether the read/write should be redirected. My reasoning for this statement is that in a cluster with a replication factor greater than 1, a node will store data for the keys that would be allocated to it in a cluster with a replication factor of 1, as well as some keys from nodes earlier in the ring.
>>>>>>>>
>>>>>>>> Even if we can get the primary keys for all the data we want to write to "cold storage" to map to a single node, a replication factor > 1 means that data will also be placed in "normal storage" on subsequent nodes.
>>>>>>>>
>>>>>>>> To overcome this, we have to explore ways to route data to different storage based on the keys, and that different storage may have to be available on _all_ the nodes.
>>>>>>>>
>>>>>>>> Have any of the partial solutions mentioned in this email chain (or others) solved this problem?
>>>>>>>>
>>>>>>>> Claude
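To see why Claude's replication point forces the routing decision onto every node, here is a tiny self-contained ring model; the node names, token values, and placement rule are simplified stand-ins for Cassandra's real replication strategies. With RF = 3, any key's data lands on three consecutive nodes, so a "cold keys live on one cold node" scheme cannot hold.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Tiny ring model illustrating Claude's point: with replication factor > 1, a key's
    // data lands on several nodes, so a key-based "cold storage" routing decision has to
    // be enforced on every replica. Node names and token values are made up for the example.
    public class ReplicaPlacementSketch
    {
        public static void main(String[] args)
        {
            // token -> node, i.e. a very small ring
            TreeMap<Long, String> ring = new TreeMap<>();
            ring.put(0L, "node1");
            ring.put(25L, "node2");
            ring.put(50L, "node3");
            ring.put(75L, "node4");

            long keyToken = 60L;   // some partition key hashing to token 60
            int rf = 3;

            System.out.println("replicas for token " + keyToken + ": " + replicas(ring, keyToken, rf));
            // prints: replicas for token 60: [node4, node1, node2]
        }

        // Walk clockwise from the first token >= the key's token, wrapping around the ring.
        static List<String> replicas(TreeMap<Long, String> ring, long token, int rf)
        {
            List<String> result = new ArrayList<>();
            Long t = ring.ceilingKey(token);
            while (result.size() < rf)
            {
                if (t == null)
                    t = ring.firstKey();           // wrap around
                result.add(ring.get(t));
                t = ring.higherKey(t);
            }
            return result;
        }
    }

Which is why any tiering scheme (for example compaction-driven migration such as the TWCS idea Scott mentions) has to make the hot/cold decision locally on each replica, rather than by dedicating particular nodes to cold data.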