I have concerns with the majority of this living in the sidecar and not in the database itself. I think it would make sense for the server side of this to be a new service exposed by the database, not by the sidecar. That way it can properly integrate with the authentication and authorization APIs, and it becomes a first-class citizen in terms of having unit/integration tests in the main DB ensuring no one breaks it.
-Jeremiah

> On Mar 24, 2023, at 10:29 AM, Dinesh Joshi <djo...@apache.org> wrote:
>
> Hi Benjamin,
>
> I agree with your concern about the long-term maintenance of the code. Doug
> has contributed several patches to Cassandra over the years. Besides him,
> there will be several other maintainers taking on maintenance of this code,
> including Yifan and myself. Given how closely it is coupled with the
> Cassandra Sidecar project, I would prefer that we keep this within the
> Cassandra project umbrella as a separate repository and a sub-project.
>
> Thanks,
>
> Dinesh
>
> On 3/24/23 02:35, Benjamin Lerer wrote:
>> Hi Doug,
>>
>> Outside of the changes to the Cassandra Sidecar that are mentioned, what
>> the CEP proposes is the donation of a library for Spark integration. It
>> seems to me that this library could be offered as an open-source project
>> outside of the Cassandra project itself. If we accept Spark Bulk
>> Analytics as part of the Cassandra project, it means that the community
>> will commit to maintaining it and ensuring that for each Cassandra
>> release it is fully compatible. Considering our history with Hadoop
>> integration, which has basically been unmaintained for years, I am not
>> convinced that this is what we should do.
>>
>> We only started to expand the scope of the project recently, and I would
>> personally prefer that we do that slowly, starting with the drivers that
>> are critical for C*. Now, this is only my personal opinion, and other
>> people might have a different view on these things.
>>
>> On Thu, Mar 23, 2023 at 23:29, Miklosovic, Stefan
>> <stefan.mikloso...@netapp.com> wrote:
>>
>> Hi,
>>
>> I think this might be a great contribution in light of the recent removal
>> of the Hadoop integration (CASSANDRA-18323), as it will no longer be in
>> 5.0. If this CEP is adopted and delivered, I can see how it might be a
>> logical replacement for it.
>> Regards
>>
>> ________________________________________
>> From: Doug Rohrer <droh...@apple.com>
>> Sent: Thursday, March 23, 2023 18:33
>> To: dev@cassandra.apache.org
>> Cc: James Berragan
>> Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with
>> Spark Bulk Analytics
>>
>> Hi everyone,
>>
>> Wiki:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>>
>> We’d like to propose this CEP for adoption by the community.
>>
>> It is common for teams using Cassandra to find themselves looking for a
>> way to interact with large amounts of data for analytics workloads.
>> However, Cassandra’s standard APIs aren’t designed for large-scale data
>> egress/ingest, as the native read/write paths weren’t designed for bulk
>> analytics.
>>
>> We’re proposing this CEP for this exact purpose. It enables the
>> implementation of custom Spark (or similar) applications that can either
>> read or write large amounts of Cassandra data at line rate by accessing
>> the persistent storage of nodes in the cluster via the Cassandra Sidecar.
>>
>> This CEP proposes new APIs in the Cassandra Sidecar and a companion
>> library that integrates deeply with Apache Spark, allowing its users to
>> bulk import or export data from a running Cassandra cluster with minimal
>> to no impact on read/write traffic.
>>
>> We will shortly publish a branch with code that will accompany this CEP
>> to help readers understand it better.
>> As a reminder, please keep the discussion here on the dev list vs. in
>> the wiki, as we’ve found it easier to manage via email.
>>
>> Sincerely,
>>
>> Doug Rohrer & James Berragan
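[Editor's note: as background to the CEP summary quoted above — bulk readers of this kind commonly parallelize work by splitting Cassandra's token ring into contiguous sub-ranges, one per Spark task, so each task fetches only the data owned by its range. The sketch below illustrates that splitting for the default Murmur3Partitioner, whose token range is [-2^63, 2^63 - 1]; the function name and structure are illustrative only and are not taken from the CEP branch.]

```python
# Illustrative sketch: splitting the Murmur3 token ring into N contiguous
# sub-ranges, as a bulk reader might do to assign one range per Spark task.
# Names here are hypothetical, not from the CEP-28 code.

MIN_TOKEN = -(2 ** 63)      # Murmur3Partitioner minimum token
MAX_TOKEN = 2 ** 63 - 1     # Murmur3Partitioner maximum token

def split_token_ring(num_splits):
    """Divide the full token ring into num_splits contiguous (start, end) ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1
    step = total // num_splits
    splits = []
    start = MIN_TOKEN
    for i in range(num_splits):
        # Last split absorbs any remainder so the whole ring is covered.
        end = MAX_TOKEN if i == num_splits - 1 else start + step - 1
        splits.append((start, end))
        start = end + 1
    return splits

splits = split_token_ring(4)
print(len(splits))                                    # 4
print(splits[0][0] == MIN_TOKEN)                      # True
print(splits[-1][1] == MAX_TOKEN)                     # True
```

In a real connector, each such range would be turned into a CQL predicate (or, per this CEP, a request for the matching SSTable data via the Sidecar) and scheduled as an independent Spark partition.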