>> Given the sidecar is running on the same node as the main C* process, the
>> only real resource isolation you have is in heap/GC? CPU/Memory/IO are all
>> still shared between the main C* process and the sidecar, and coordinating
>> those across processes is harder than coordinating them within a single
>> process. For example, if we wanted to have the compaction throughput,
>> streaming throughput, and analytics read throughput all tied back to a
>> single disk IO cap, that is harder with an external process.
>
> Relatively trivial, for CPU and memory, to run them in different
> containers/cgroups/etc, so you can put an exact cpu/memory limit on the
> sidecar. That's different from a jmx rate limiter/throttle, but (arguably)
> more precise, because it actually limits the underlying physical resource
> instead of a proxy for it in a config setting.
>
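For concreteness, the cgroup/container-based limiting described above might look something like the following on a systemd host. This is only an illustrative sketch: the unit name, jar path, and limit values are hypothetical, not recommendations.

```shell
# Hypothetical example: run the sidecar under its own transient systemd
# unit so that cgroup (v2) limits cap its CPU, memory, and IO weight.
# Unit name, jar path, and limit values are illustrative only.
sudo systemd-run --unit=cassandra-sidecar \
  -p CPUQuota=200% \
  -p MemoryMax=4G \
  -p IOWeight=100 \
  java -jar /opt/cassandra-sidecar/sidecar.jar

# The same limits could be expressed in a persistent unit file:
#   [Service]
#   CPUQuota=200%
#   MemoryMax=4G
#   IOWeight=100
```

This caps the physical resources themselves rather than a throughput proxy in a config setting, which is the distinction being drawn above.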
If we want to bring cgroups/containers/etc into the default deployment
mechanisms of C*, great. I am all for dividing C* up into multiple micro
services, given we solve all the problems I listed in the complexity section,
but the project needs to buy in to containers as the default mechanism for
running it for that to be viable in my mind.

>
>>
>>> - Complexity. Considering the existence of the Sidecar project, it would be
>>> less complex to avoid adding another (http?) service in Cassandra.
>>
>> Not sure that is really very complex; running an http service is pretty
>> easy? We already have netty in use to instantiate one from.
>>
>> I worry more about the complexity of having the matching schema for a set of
>> sstables being read. The complexity of new sstable versions/formats being
>> introduced. The complexity of having up-to-date data from memtables being
>> considered by this API without having to flush before every query of it.
>> The complexity of dealing with the new memtable API introduced in CEP-11.
>> The complexity of coordinating compaction/streaming adding and removing
>> files with these APIs reading them. There are a lot of edge cases to
>> consider for this external access to sstables that the main process
>> considers itself the “owner” of.
>>
>> All of this is not to say that I think separating things out into other
>> processes/services is bad. But I think we need to be very careful with how
>> we do it, or end users will end up running into all the sharp edges and the
>> feature will fail.
>>
>> -Jeremiah
>>
>>> On Mar 24, 2023, at 8:15 PM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>
>>> Hi Jeremiah,
>>>
>>> There are good reasons to not have these inside Cassandra. Consider the
>>> following.
>>> - Resource isolation. Having the said service running within the same JVM
>>> may negatively impact Cassandra storage's performance. It could be more
>>> beneficial to have them in Sidecar, which offers strong resource isolation
>>> guarantees.
>>> - Availability. If the Cassandra cluster is being bounced, using Sidecar
>>> would not affect the SBR/SBW functionality, e.g. SBR can still read
>>> SSTables via Sidecar endpoints.
>>> - Compatibility. Sidecar provides stable REST-based APIs, such as the
>>> SSTable upload endpoint, which would remain compatible with different
>>> versions of Cassandra. The current implementation supports versions 3.0
>>> and 4.0.
>>> - Complexity. Considering the existence of the Sidecar project, it would be
>>> less complex to avoid adding another (http?) service in Cassandra.
>>> - Release velocity. Sidecar, as an independent project, can have a quicker
>>> release cycle than Cassandra.
>>> - The features in Sidecar are mostly implemented based on various existing
>>> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
>>>
>>> Regarding authentication and authorization:
>>> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold
>>> up this CEP. It would be a feature that benefits all Sidecar endpoints.
>>>
>>> - Yifan
>>>
>>> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer <droh...@apple.com> wrote:
>>>> I agree that the analytics library will need to support vnodes. To be
>>>> clear, there’s nothing preventing the solution from working with vnodes
>>>> right now, and no assumptions about a 1:1 topology between a token and a
>>>> node. However, we don’t, today, have the ability to test vnode support
>>>> end-to-end. We are working towards that, and should be able to remove
>>>> the caveat from the released analytics library once we can properly
>>>> test vnode support.
>>>> If it helps, I can update the CEP to say something more like “Caveat:
>>>> Currently untested with vnodes - work is ongoing to remove this
>>>> limitation”?
>>>>
>>>> Doug
>>>>
>>>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>> >
>>>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>>>> > <jeremiah.jor...@gmail.com> wrote:
>>>> >>
>>>> >> I have concerns with the majority of this being in the sidecar and not
>>>> >> in the database itself. I think it would make sense for the server side
>>>> >> of this to be a new service exposed by the database, not in the sidecar.
>>>> >> That way it can properly integrate with the authentication and
>>>> >> authorization APIs, and be a first-class citizen in terms of having
>>>> >> unit/integration tests in the main DB ensuring no one breaks it.
>>>> >
>>>> > I don't think this can/should happen until it supports the database's
>>>> > default configuration with vnodes.