Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread guo Maxwell
Thanks Dinesh,
That will be great.

On Thu, May 4, 2023 at 11:06 PM, Dinesh Joshi wrote:

> Hi Guo,
>
> I would expect that there would be release artifacts for the sidecar as
> well as the library once this functionality is available.
>
> Dinesh
>
> On May 4, 2023, at 12:03 AM, guo Maxwell  wrote:
>
> This is very meaningful work, thanks. I would like to ask a question that
> is not particularly related to the CEP's code design itself but to the
> project's engineering management: what is the future development and
> release plan of this project?
> As far as I know, the Cassandra Sidecar project does not actually have a
> final release version yet. Since this project relies on Cassandra Sidecar,
> I think nobody wants the code to be merged and then go unreleased for a
> long time.
>
> On Thu, May 4, 2023 at 2:35 AM, Dinesh Joshi wrote:
>
>> If there aren't additional questions / comments I will start the VOTE
>> thread on this CEP tonight.
>>
>> On 2023/05/01 19:50:12 Dinesh Joshi wrote:
>> > Does anybody have any questions that we could answer about this
>> proposal?
>>
>
>
> --
> you are the apple of my eye !
>
>
> --
you are the apple of my eye !


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Dinesh Joshi
Hi Guo,

I would expect that there would be release artifacts for the sidecar as well as 
the library once this functionality is available.

Dinesh

> On May 4, 2023, at 12:03 AM, guo Maxwell  wrote:
> 
> This is very meaningful work, thanks. I would like to ask a question that
> is not particularly related to the CEP's code design itself but to the
> project's engineering management: what is the future development and
> release plan of this project?
> As far as I know, the Cassandra Sidecar project does not actually have a
> final release version yet. Since this project relies on Cassandra Sidecar,
> I think nobody wants the code to be merged and then go unreleased for a
> long time.
> 
> On Thu, May 4, 2023 at 2:35 AM, Dinesh Joshi <djo...@apache.org> wrote:
>> If there aren't additional questions / comments I will start the VOTE thread 
>> on this CEP tonight.
>> 
>> On 2023/05/01 19:50:12 Dinesh Joshi wrote:
>> > Does anybody have any questions that we could answer about this proposal?
> 
> 
> -- 
> you are the apple of my eye !



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread guo Maxwell
This is very meaningful work, thanks. I would like to ask a question that
is not particularly related to the CEP's code design itself but to the
project's engineering management: what is the future development and
release plan of this project?
As far as I know, the Cassandra Sidecar project does not actually have a
final release version yet. Since this project relies on Cassandra Sidecar,
I think nobody wants the code to be merged and then go unreleased for a
long time.

On Thu, May 4, 2023 at 2:35 AM, Dinesh Joshi wrote:

> If there aren't additional questions / comments I will start the VOTE
> thread on this CEP tonight.
>
> On 2023/05/01 19:50:12 Dinesh Joshi wrote:
> > Does anybody have any questions that we could answer about this proposal?
>


-- 
you are the apple of my eye !


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-03 Thread Dinesh Joshi
If there aren't additional questions / comments I will start the VOTE thread on 
this CEP tonight.

On 2023/05/01 19:50:12 Dinesh Joshi wrote:
> Does anybody have any questions that we could answer about this proposal?


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
We're reusing existing Cassandra code so the performance characteristics for 
parsing should be the same as Cassandra. I will need to check if we have 
benchmarks. If we do, we'll add them to the CEP wiki page.

On 2023/05/02 19:52:28 Sebastian Estevez wrote:
> Hey Dinesh,
> 
> Yeah it makes sense that the sstable streaming is network bound since it's
> mostly just moving files.
> 
> Do you have any performance stats on the sstable parsing side inside spark?
> 
> --Seb
> 
> On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi  wrote:
> 
> > It is line rate / network bound. We have a patch out in vert.x that should
> > use the zero copy path for it. But it's not a strict prereq for it.


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hey Dinesh,

Yeah it makes sense that the sstable streaming is network bound since it's
mostly just moving files.

Do you have any performance stats on the sstable parsing side inside spark?

--Seb

On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi  wrote:

> It is line rate / network bound. We have a patch out in vert.x that should
> use the zero copy path for it. But it's not a strict prereq for it.
>
> On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> > Hi folks,
> >
> > Great stuff thanks for sharing.
> >
> > The performance numbers I've seen so far are for the sidecar streaming
> > sstables (seems like this is just network bound?). What kind of perf are
> > you seeing at the Spark executors (at the per task level)?
> >
> > --Seb
> >
> > On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:
> >
> > > Does anybody have any questions that we could answer about this
> proposal?
> > >
> > > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero <
> frank.guerr...@gmail.com>
> > > wrote:
> > >
> > > Hi folks,
> > >
> > > We have updated the confluence page with the source code for CEP-28.
> > > There are two repositories with contributions. One is the patch [1]
> > > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > > Spark Analytics library. The second is a new repository [2] with
> > > contributions to the Cassandra Spark Analytics code
> > >
> > > We also have a README markdown file that you can follow to give the
> > > code a try:
> > >
> > >
> > >
> https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> > >
> > > Best,
> > > - Francisco
> > >
> > > [1] Apache Cassandra Sidecar bulk APIs source code:
> > > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > > [2] Apache Cassandra Spark Analytics source code:
> > > https://github.com/frankgh/cassandra-analytics
> > >
> > >

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
It is line rate / network bound. We have a patch out in vert.x that should use 
the zero copy path for it. But it's not a strict prereq for it.
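
For readers unfamiliar with the term, the "zero copy path" mentioned above can be illustrated with a minimal Python sketch. This is not the vert.x patch and not Sidecar code; the function name send_file_zero_copy is hypothetical. It only shows the general idea of letting the kernel move file bytes straight to a socket instead of copying them through user space.

```python
import socket
import tempfile


def send_file_zero_copy(sock, path):
    """Send a whole file over a connected stream socket and return bytes sent.

    socket.sendfile() uses os.sendfile() where the platform supports it, so
    the kernel moves bytes from the file to the socket without copying them
    through user space. Where it is unsupported, Python falls back to
    ordinary send() calls, so the result is the same but not zero copy.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

With transfers like this the sender does very little per-byte work, which is consistent with the streaming being described as network bound.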

On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> Hi folks,
> 
> Great stuff thanks for sharing.
> 
> The performance numbers I've seen so far are for the sidecar streaming
> sstables (seems like this is just network bound?). What kind of perf are
> you seeing at the Spark executors (at the per task level)?
> 
> --Seb
> 
> On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:
> 
> > Does anybody have any questions that we could answer about this proposal?
> >
> > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero 
> > wrote:
> >
> > Hi folks,
> >
> > We have updated the confluence page with the source code for CEP-28.
> > There are two repositories with contributions. One is the patch [1]
> > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > Spark Analytics library. The second is a new repository [2] with
> > contributions to the Cassandra Spark Analytics code
> >
> > We also have a README markdown file that you can follow to give the
> > code a try:
> >
> >
> > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> >
> > Best,
> > - Francisco
> >
> > [1] Apache Cassandra Sidecar bulk APIs source code:
> > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > [2] Apache Cassandra Spark Analytics source code:
> > https://github.com/frankgh/cassandra-analytics
> >
> >
> > --
> > Francisco Guerrero
> >
> >
> >
> 
> -- 
> All the best,
> 
> Sebastián
> 


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hi folks,

Great stuff thanks for sharing.

The performance numbers I've seen so far are for the sidecar streaming
sstables (seems like this is just network bound?). What kind of perf are
you seeing at the Spark executors (at the per task level)?

--Seb

On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:

> Does anybody have any questions that we could answer about this proposal?
>
> On Apr 27, 2023, at 1:24 PM, Francisco Guerrero 
> wrote:
>
> Hi folks,
>
> We have updated the confluence page with the source code for CEP-28.
> There are two repositories with contributions. One is the patch [1]
> for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> Spark Analytics library. The second is a new repository [2] with
> contributions to the Cassandra Spark Analytics code
>
> We also have a README markdown file that you can follow to give the
> code a try:
>
>
> https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
>
> Best,
> - Francisco
>
> [1] Apache Cassandra Sidecar bulk APIs source code:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> [2] Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
>
>
> --
> Francisco Guerrero
>
>
>

-- 
All the best,

Sebastián


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-01 Thread Dinesh Joshi
Does anybody have any questions that we could answer about this proposal?

> On Apr 27, 2023, at 1:24 PM, Francisco Guerrero  
> wrote:
> 
> Hi folks,
> 
> We have updated the confluence page with the source code for CEP-28.
> There are two repositories with contributions. One is the patch [1]
> for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> Spark Analytics library. The second is a new repository [2] with
> contributions to the Cassandra Spark Analytics code
> 
> We also have a README markdown file that you can follow to give the
> code a try:
> 
> https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> 
> Best,
> - Francisco
> 
> [1] Apache Cassandra Sidecar bulk APIs source code: 
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> [2] Apache Cassandra Spark Analytics source code: 
> https://github.com/frankgh/cassandra-analytics
> 
> 
> -- 
> Francisco Guerrero



RE: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-27 Thread Francisco Guerrero
Hi folks,


We have updated the Confluence page with the source code for CEP-28.
There are two repositories with contributions. One is the patch [1]
for Cassandra Sidecar with the bulk APIs that enable the Cassandra
Spark Analytics library. The second is a new repository [2] with
contributions to the Cassandra Spark Analytics code.

We also have a README markdown file that you can follow to give the
code a try:


https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md


Best,

- Francisco


[1] Apache Cassandra Sidecar bulk APIs source code:
https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis

[2] Apache Cassandra Spark Analytics source code:
https://github.com/frankgh/cassandra-analytics


On 2023/04/05 15:18:07 Doug Rohrer wrote:
> Sorry for the delay in responding here - yes, we can add some diagrams to
> the CEP - I’ll try to get that done by end-of-week.
>
> Thanks,
>
> Doug
>
> > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> >
> > Maybe some data flow diagrams could be added to the cep showing some
> > example operations for read/write?
> >
> >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> >>
> >> A lot of great discussions!
> >>
> >> On the sidecar front, especially what the role sidecar plays in terms
> >> of this CEP, I feel there might be some confusion. Once the code is
> >> published, we should have clarity.
> >> Sidecar does not read sstables nor do any coordination for analytics
> >> queries. It is local to the companion Cassandra instance. For bulk
> >> read, it takes snapshots and streams sstables to spark workers to
> >> read. For bulk write, it imports the sstables uploaded from spark
> >> workers. All commands are existing jmx/nodetool functionalities from
> >> Cassandra. Sidecar adds the http interface to them. It might be an
> >> over simplified description. The complex computation is performed in
> >> spark clusters only.
> >>
> >> In the long run, Cassandra might evolve into a database that does
> >> both OLTP and OLAP. (Not what this thread aims for)
> >> At the current stage, Spark is very suited for analytic purposes.
> >>
> >> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
> >>> I disagree with the first claim, as the process has all the
> >>> information it chooses to utilise about which resources it’s using
> >>> and what it’s using those resources for.
> >>>
> >>> The inability to isolate GC domains is something we cannot address,
> >>> but also probably not a problem if we were doing everything with
> >>> memory management as well as we could be.
> >>>
> >>> But, not worth detailing this thread for. Today we do very little
> >>> well on this front within the process, and a separate process is
> >>> well justified given the state of play.
> >>>
> >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
> >>>>
> >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
> >>>> ...
> >>>>> I think we might be underselling how valuable JVM isolation is,
> >>>>> especially for analytics queries that are going to pass the
> >>>>> entire dataset through heap somewhat constantly.
> >>>>
> >>>> Big +1 here. The JVM simply does not have significant granularity
> >>>> of control for resource utilization, but this is explicitly a
> >>>> feature of separate processes. Add in being able to separate GC
> >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> >>>> for the disparate workloads.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Derek
> >>>>
> >>>> --
> >>>> +---+
> >>>> | Derek Chen-Becker |
> >>>> | GPG Key available at https://keybase.io/dchenbecker and |
> >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
> >>>> +---+
-- 
Francisco Guerrero


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-12 Thread James Berragan
Hi Stefan, CDC is something we are also thinking about, and it is worthy of a 
separate discussion. We have tested Spark Streaming for CDC and I hope we can 
bolt it on in the future, but streaming technologies also come with more 
caveats and nuances (we have found it beneficial with CDC to store a small 
amount of state, which is at odds with Spark’s more stateless architecture). 
From that perspective I think it makes sense to keep CDC technology-agnostic 
and let the user plug in to whichever system they want (Spark Streaming, Flink, 
custom, etc.).

James.

> On Apr 11, 2023, at 1:19 PM, Miklosovic, Stefan 
>  wrote:
> 
> Doug,
> 
> thanks for the diagrams, really helpful.
> 
> Do you think there might be some extension to this CEP (it does not need to 
> be included from the very beginning, just food for thought at this point) 
> which would read data from the commit log / CDC?
> 
> The main motivation behind this is that when one looks at what is currently 
> possible with Spark, Cassandra often exists only as a sink when it comes to 
> streaming. For example, we can use the Kafka connector (1) so that data comes 
> into Kafka, is streamed to Spark as RDDs, and Spark saves it to Cassandra via 
> the Spark Cassandra Connector. Such a transformation / pipeline is indeed 
> possible.
> 
> We also have the Cassandra + Ignite integration (2, 3), so Ignite can act as 
> an in-memory caching layer on top of Cassandra, which enables users to do 
> transformations over IgniteRDD and run queries which are not normally 
> possible (e.g. SQL joins in Ignite over these caches). Very handy. But there 
> is no Ignite streamer which would consider Cassandra to be a realtime / near 
> realtime source.
> 
> So there is currently no integration (correct me if I am wrong) which would 
> have Cassandra as a _real time_ source.
> 
> Looking at these diagrams, since you are able to load data from Cassandra 
> SSTables, would it be possible to continually fetch the offset in the CDC 
> index file (these changes were first done in 4.0, I think; ask Josh McKenzie 
> about the details), read these mutations and send them via Sidecar to Spark?
> 
> Currently, the only solution I know of which does realtime-ish streaming of 
> mutations from CDC is the Debezium Cassandra connector (4), but it pushes 
> these mutations straight to Kafka only. I would love to have them in Spark 
> first, and then I can do whatever I want with them.
> 
> (1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> (2) 
> https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
> (3) 
> https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
> (4) https://github.com/debezium/debezium-connector-cassandra
> 
> 
> From: Doug Rohrer <droh...@apple.com>
> Sent: Tuesday, April 11, 2023 0:37
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
> Bulk Analytics
> 
> I’ve updated the CEP with two overview diagrams of the interactions between 
> Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
> better understand how things work, and thanks for the patience as it took a 
> bit longer than expected for me to find the time for this.
> 
> Doug
> 
> On Apr 5, 2023, at 11:18 AM, Doug Rohrer  wrote:
> 
> Sorry for the delay in responding here - yes, we can add some diagrams to the 
> CEP - I’ll try to get that done by end-of-week.
> 
> Thanks,
> 
> Doug
> 
> On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:
> 
> Maybe some data flow diagrams could be added to the cep showing some example 
> operations for read/write?
> 
> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
> 
> 
> A lot of great discussions!
> 
> On the sidecar front, especially what the role sidecar plays in terms of this 
> CEP, I feel there might be some confusion. Once the code is published, we 
> should have clarity.
> Sidecar does not read sstables nor do any coordination for analytics queries. 
> It is local to the companion Cassandra instance. For bulk read, it takes 
> snapshots and streams sstables to spark workers to read. For bulk write, it 
> imports the sstables uploaded from spark workers. All commands are existing 
> jmx/nodetool functionalities from Cassandra. Sidecar adds the http interface 
> to them. It might be an over simplified description. 
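
The division of labour described in the quoted text (snapshot and stream for bulk read, import for bulk write) can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the class ToySidecar, its method names, and the flat directory layout are hypothetical and are not the Sidecar API. The real Sidecar exposes existing jmx/nodetool operations over HTTP, and the real import step does more than move a file.

```python
import os
import shutil
import tempfile
from pathlib import Path


class ToySidecar:
    """Hypothetical stand-in for a sidecar local to one Cassandra node."""

    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)

    def create_snapshot(self, name):
        """Bulk read, step 1: pin the current SSTable files under a name."""
        snapshot_dir = self.data_dir / "snapshots" / name
        snapshot_dir.mkdir(parents=True)
        for sstable in sorted(self.data_dir.glob("*.db")):
            target = snapshot_dir / sstable.name
            try:
                os.link(sstable, target)  # hard link, no data copied
            except OSError:
                shutil.copy2(sstable, target)  # fallback if links unsupported
        return sorted(p.name for p in snapshot_dir.iterdir())

    def stream_sstable(self, snapshot, file_name, chunk_size=64 * 1024):
        """Bulk read, step 2: yield the raw bytes of one snapshotted file.

        The sidecar does not parse the file; the Spark side does that.
        """
        path = self.data_dir / "snapshots" / snapshot / file_name
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    def import_sstable(self, file_name, payload):
        """Bulk write: accept a file built by a Spark worker, then place it."""
        staging = self.data_dir / "staging"
        staging.mkdir(exist_ok=True)
        (staging / file_name).write_bytes(payload)
        os.replace(staging / file_name, self.data_dir / file_name)
        return file_name
```

Note that the toy never interprets SSTable contents, which mirrors the point that the complex computation happens in the Spark cluster only.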

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-11 Thread Miklosovic, Stefan
Doug,

thanks for the diagrams, really helpful.

Do you think there might be some extension to this CEP (it does not need to be 
included from the very beginning, just food for thought at this point) which 
would read data from the commit log / CDC?

The main motivation behind this is that when one looks at what is currently 
possible with Spark, Cassandra often exists only as a sink when it comes to 
streaming. For example, we can use the Kafka connector (1) so that data comes 
into Kafka, is streamed to Spark as RDDs, and Spark saves it to Cassandra via 
the Spark Cassandra Connector. Such a transformation / pipeline is indeed 
possible.

We have also Cassandra + Ignite integration (2, 3) so Ignite can act as 
in-memory caching layer on top of Cassandra which enables users to do 
transformations over IgniteRDD and queries which are not possible normally. 
(e.g. joins in SQL in Ignite over these caches etc). Very handy. But there is 
no Ignite streamer which would consider Cassandra to be a realtime / near 
realtime source.

So, there is currently no integration done (correct me if I am wrong) which 
would have Cassandra as _real time_ source.

Looking at these diagrams, since you are able to load data from Cassandra 
SSTables, would it be possible to continually fetch the offset in the CDC index 
file (these changes were first made in 4.0, I think; ask Josh McKenzie about 
the details), read these mutations and send them via Sidecar to Spark?
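
As a rough illustration of the "continually fetch the offset" idea, here is a 
minimal Python sketch. It assumes the CDC index file holds an integer offset on 
its first line and, once the segment is finished, the word COMPLETED on a 
second line; that layout should be checked against the CDC documentation for 
the Cassandra version in use. The names CdcIndexState, parse_cdc_index and 
unread_range are made up for this example and are not part of Cassandra or 
Sidecar.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CdcIndexState:
    offset: int      # bytes of the commit log segment assumed safe to read
    completed: bool  # True once the segment will receive no further writes


def parse_cdc_index(text: str) -> CdcIndexState:
    """Parse the (assumed) content of a CDC index file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty CDC index content")
    offset = int(lines[0])
    completed = len(lines) > 1 and lines[1] == "COMPLETED"
    return CdcIndexState(offset, completed)


def unread_range(previous_offset: int, state: CdcIndexState) -> Optional[Tuple[int, int]]:
    """Return the (start, end) byte range not yet consumed, or None if nothing is new."""
    if state.offset <= previous_offset:
        return None
    return (previous_offset, state.offset)
```

A reader would remember the last offset it consumed, call unread_range after 
each poll, and stop polling a segment once completed is True and the range 
comes back as None.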

Currently, the only solution I know of which does realtime-ish streaming of 
mutations from CDC is the Debezium Cassandra connector (4), but it pushes these 
mutations straight to Kafka only. I would love to have them in Spark first, and 
then I can do whatever I want with them.

(1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
(2) https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
(3) https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
(4) https://github.com/debezium/debezium-connector-cassandra


From: Doug Rohrer 
Sent: Tuesday, April 11, 2023 0:37
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
Bulk Analytics


I’ve updated the CEP with two overview diagrams of the interactions between 
Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
better understand how things work, and thanks for the patience as it took a bit 
longer than expected for me to find the time for this.

Doug

On Apr 5, 2023, at 11:18 AM, Doug Rohrer  wrote:

Sorry for the delay in responding here - yes, we can add some diagrams to the 
CEP - I’ll try to get that done by end-of-week.

Thanks,

Doug

On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:

Maybe some data flow diagrams could be added to the cep showing some example 
operations for read/write?

On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:


A lot of great discussions!

On the sidecar front, especially what the role sidecar plays in terms of this 
CEP, I feel there might be some confusion. Once the code is published, we 
should have clarity.
Sidecar does not read sstables nor do any coordination for analytics queries. 
It is local to the companion Cassandra instance. For bulk read, it takes 
snapshots and streams sstables to spark workers to read. For bulk write, it 
imports the sstables uploaded from spark workers. All commands are existing 
jmx/nodetool functionalities from Cassandra. Sidecar adds the http interface to 
them. It might be an over simplified description. The complex computation is 
performed in spark clusters only.

In the long run, Cassandra might evolve into a database that does both OLTP and 
OLAP. (Not what this thread aims for)
At the current stage, Spark is very suited for analytic purposes.

On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
I disagree with the first claim, as the process has all the information it 
chooses to utilise about which resources it’s using and what it’s using those 
resources for.

The inability to isolate GC domains is something we cannot address, but also 
probably not a problem if we were doing everything with memory management as 
well as we could be.

But, not worth detailing this thread for. Today we do very little well on this 
front within the process, and a separate process is well justified given the 
state of play.

On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:



On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
...

I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-11 Thread J. D. Jordan
Thanks for those. They are very helpful.

I think the CEP needs to call out all of the classes/interfaces from the 
cassandra-all jar that the “Spark driver” is using.

Given this CEP is exposing “sstables as an external API” I would think all the 
interfaces and code associated with using those would need to be treated as 
user API now?

For example the spark driver is actually calling the compaction classes and 
using the internal C* objects to process the data. I don’t think any of those 
classes have previously been considered “public” in any way.

Is said spark driver also being donated as part of the CEP?  Or just the code 
to implement the interfaces in the side car?

-Jeremiah

> On Apr 10, 2023, at 5:37 PM, Doug Rohrer wrote:
> 
> I’ve updated the CEP with two overview diagrams of the interactions between 
> Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
> better understand how things work, and thanks for the patience as it took a 
> bit longer than expected for me to find the time for this.
> 
> Doug
> 
>> On Apr 5, 2023, at 11:18 AM, Doug Rohrer wrote:
>> 
>> Sorry for the delay in responding here - yes, we can add some diagrams to 
>> the CEP - I’ll try to get that done by end-of-week.
>> 
>> Thanks,
>> 
>> Doug
>> 
>>> On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
>>> 
>>> Maybe some data flow diagrams could be added to the cep showing some 
>>> example operations for read/write?
>>> 
>>>> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
>>>> 
>>>> A lot of great discussions! 
>>>> 
>>>> On the sidecar front, especially what the role sidecar plays in terms of 
>>>> this CEP, I feel there might be some confusion. Once the code is 
>>>> published, we should have clarity.
>>>> Sidecar does not read sstables nor do any coordination for analytics 
>>>> queries. It is local to the companion Cassandra instance. For bulk read, 
>>>> it takes snapshots and streams sstables to spark workers to read. For 
>>>> bulk write, it imports the sstables uploaded from spark workers. All 
>>>> commands are existing jmx/nodetool functionalities from Cassandra. 
>>>> Sidecar adds the http interface to them. It might be an over simplified 
>>>> description. The complex computation is performed in spark clusters only.
>>>> 
>>>> In the long run, Cassandra might evolve into a database that does both 
>>>> OLTP and OLAP. (Not what this thread aims for) 
>>>> At the current stage, Spark is very suited for analytic purposes. 
>>>> 
>>>> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
>>>>> I disagree with the first claim, as the process has all the information 
>>>>> it chooses to utilise about which resources it’s using and what it’s 
>>>>> using those resources for.
>>>>> 
>>>>> The inability to isolate GC domains is something we cannot address, but 
>>>>> also probably not a problem if we were doing everything with memory 
>>>>> management as well as we could be.
>>>>> 
>>>>> But, not worth detailing this thread for. Today we do very little well 
>>>>> on this front within the process, and a separate process is well 
>>>>> justified given the state of play.
>>>>> 
>>>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
>>>>>> 
>>>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
>>>>>> ...
>>>>>> 
>>>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>>>> especially for analytics queries that are going to pass the entire
>>>>>>> dataset through heap somewhat constantly. 
>>>>>> 
>>>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>>>> control for resource utilization, but this is explicitly a feature of 
>>>>>> separate processes. Add in being able to separate GC domains and you 
>>>>>> can avoid a lot of noisy neighbor in-VM behavior for the disparate 
>>>>>> workloads.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Derek
>>>>>> 
>>>>>> -- 
>>>>>> +---+
>>>>>> | Derek Chen-Becker |
>>>>>> | GPG Key available at https://keybase.io/dchenbecker and   |
>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>>> +---+



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-10 Thread Doug Rohrer
I’ve updated the CEP with two overview diagrams of the interactions between 
Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
better understand how things work, and thanks for the patience as it took a bit 
longer than expected for me to find the time for this.

Doug

> On Apr 5, 2023, at 11:18 AM, Doug Rohrer  wrote:
> 
> Sorry for the delay in responding here - yes, we can add some diagrams to the 
> CEP - I’ll try to get that done by end-of-week.
> 
> Thanks,
> 
> Doug
> 
>> On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:
>> 
>> Maybe some data flow diagrams could be added to the cep showing some example 
>> operations for read/write?
>> 
>>> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
>>> 
>>> 
>>> A lot of great discussions! 
>>> 
>>> On the sidecar front, especially what the role sidecar plays in terms of 
>>> this CEP, I feel there might be some confusion. Once the code is published, 
>>> we should have clarity.
>>> Sidecar does not read sstables nor do any coordination for analytics 
>>> queries. It is local to the companion Cassandra instance. For bulk read, it 
>>> takes snapshots and streams sstables to spark workers to read. For bulk 
>>> write, it imports the sstables uploaded from spark workers. All commands 
>>> are existing jmx/nodetool functionalities from Cassandra. Sidecar adds the 
>>> http interface to them. It might be an over simplified description. The 
>>> complex computation is performed in spark clusters only.
>>> 
>>> In the long run, Cassandra might evolve into a database that does both OLTP 
>>> and OLAP. (Not what this thread aims for) 
>>> At the current stage, Spark is very suited for analytic purposes. 
>>> 
>>> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
>>>> I disagree with the first claim, as the process has all the information it 
>>>> chooses to utilise about which resources it’s using and what it’s using 
>>>> those resources for.
>>>> 
>>>> The inability to isolate GC domains is something we cannot address, but 
>>>> also probably not a problem if we were doing everything with memory 
>>>> management as well as we could be.
>>>> 
>>>> But, not worth detailing this thread for. Today we do very little well on 
>>>> this front within the process, and a separate process is well justified 
>>>> given the state of play.
>>>> 
>>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
>>>>> ...
>>>>> 
>>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>>> especially for analytics queries that are going to pass the entire
>>>>>> dataset through heap somewhat constantly. 
>>>>> 
>>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>>> control for resource utilization, but this is explicitly a feature of 
>>>>> separate processes. Add in being able to separate GC domains and you can 
>>>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Derek
>>>>> 
>>>>> 
>>>>> -- 
>>>>> +---+
>>>>> | Derek Chen-Becker |
>>>>> | GPG Key available at https://keybase.io/dchenbecker and   |
>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>> +---+
>>>>> 
>>>>> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-05 Thread Doug Rohrer
Sorry for the delay in responding here - yes, we can add some diagrams to the 
CEP - I’ll try to get that done by end-of-week.

Thanks,

Doug

> On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:
> 
> Maybe some data flow diagrams could be added to the cep showing some example 
> operations for read/write?
> 
>> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
>> 
>> 
>> A lot of great discussions! 
>> 
>> On the sidecar front, especially what the role sidecar plays in terms of 
>> this CEP, I feel there might be some confusion. Once the code is published, 
>> we should have clarity.
>> Sidecar does not read sstables nor do any coordination for analytics 
>> queries. It is local to the companion Cassandra instance. For bulk read, it 
>> takes snapshots and streams sstables to spark workers to read. For bulk 
>> write, it imports the sstables uploaded from spark workers. All commands are 
>> existing jmx/nodetool functionalities from Cassandra. Sidecar adds the http 
>> interface to them. It might be an over simplified description. The complex 
>> computation is performed in spark clusters only.
>> 
>> In the long run, Cassandra might evolve into a database that does both OLTP 
>> and OLAP. (Not what this thread aims for) 
>> At the current stage, Spark is very suited for analytic purposes. 
>> 
>> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
>>> I disagree with the first claim, as the process has all the information it 
>>> chooses to utilise about which resources it’s using and what it’s using 
>>> those resources for.
>>> 
>>> The inability to isolate GC domains is something we cannot address, but 
>>> also probably not a problem if we were doing everything with memory 
>>> management as well as we could be.
>>> 
>>> But, not worth detailing this thread for. Today we do very little well on 
>>> this front within the process, and a separate process is well justified 
>>> given the state of play.
>>> 
>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
>>>> ...
>>>> 
>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>> especially for analytics queries that are going to pass the entire
>>>>> dataset through heap somewhat constantly. 
>>>> 
>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>> control for resource utilization, but this is explicitly a feature of 
>>>> separate processes. Add in being able to separate GC domains and you can 
>>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>>> 
>>>> Cheers,
>>>> 
>>>> Derek
>>>> 
>>>> 
>>>> -- 
>>>> +---+
>>>> | Derek Chen-Becker |
>>>> | GPG Key available at https://keybase.io/dchenbecker and   |
>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>> +---+
>>>> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread J. D. Jordan
Maybe some data flow diagrams could be added to the cep showing some example 
operations for read/write?

> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> 
> A lot of great discussions! 
> 
> On the sidecar front, especially what the role sidecar plays in terms of 
> this CEP, I feel there might be some confusion. Once the code is published, 
> we should have clarity.
> Sidecar does not read sstables nor do any coordination for analytics 
> queries. It is local to the companion Cassandra instance. For bulk read, it 
> takes snapshots and streams sstables to spark workers to read. For bulk 
> write, it imports the sstables uploaded from spark workers. All commands are 
> existing jmx/nodetool functionalities from Cassandra. Sidecar adds the http 
> interface to them. It might be an over simplified description. The complex 
> computation is performed in spark clusters only.
> 
> In the long run, Cassandra might evolve into a database that does both OLTP 
> and OLAP. (Not what this thread aims for) 
> At the current stage, Spark is very suited for analytic purposes. 
> 
> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
>> I disagree with the first claim, as the process has all the information it 
>> chooses to utilise about which resources it’s using and what it’s using 
>> those resources for.
>> 
>> The inability to isolate GC domains is something we cannot address, but 
>> also probably not a problem if we were doing everything with memory 
>> management as well as we could be.
>> 
>> But, not worth detailing this thread for. Today we do very little well on 
>> this front within the process, and a separate process is well justified 
>> given the state of play.
>> 
>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
>>> 
>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
>>> ...
>>> 
>>>> I think we might be underselling how valuable JVM isolation is,
>>>> especially for analytics queries that are going to pass the entire
>>>> dataset through heap somewhat constantly. 
>>> 
>>> Big +1 here. The JVM simply does not have significant granularity of 
>>> control for resource utilization, but this is explicitly a feature of 
>>> separate processes. Add in being able to separate GC domains and you can 
>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>> 
>>> Cheers,
>>> 
>>> Derek
>>> 
>>> -- 
>>> +---+
>>> | Derek Chen-Becker |
>>> | GPG Key available at https://keybase.io/dchenbecker and   |
>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>> +---+



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Yifan Cai
A lot of great discussions!

On the sidecar front, especially the role sidecar plays in terms of
this CEP, I feel there might be some confusion. Once the code is published,
we should have clarity.
Sidecar does not read sstables, nor does it do any coordination for analytics
queries. It is local to the companion Cassandra instance. For bulk read, it
takes snapshots and streams sstables to spark workers to read. For bulk
write, it imports the sstables uploaded from spark workers. All commands
are existing jmx/nodetool functionalities from Cassandra. Sidecar adds the
http interface to them. This might be an oversimplified description. The
complex computation is performed in spark clusters only.

In the long run, Cassandra might evolve into a database that does both OLTP
and OLAP. (Not what this thread aims for)
At the current stage, Spark is very well suited for analytic purposes.

On Tue, Mar 28, 2023 at 9:06 AM Benedict  wrote:

> I disagree with the first claim, as the process has all the information it
> chooses to utilise about which resources it’s using and what it’s using
> those resources for.
>
> The inability to isolate GC domains is something we cannot address, but
> also probably not a problem if we were doing everything with memory
> management as well as we could be.
>
> But, not worth detailing this thread for. Today we do very little well on
> this front within the process, and a separate process is well justified
> given the state of play.
>
> On 28 Mar 2023, at 16:38, Derek Chen-Becker  wrote:
>
> 
>
> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch 
> wrote:
> ...
>
>> I think we might be underselling how valuable JVM isolation is,
>> especially for analytics queries that are going to pass the entire
>> dataset through heap somewhat constantly.
>>
>
> Big +1 here. The JVM simply does not have significant granularity of
> control for resource utilization, but this is explicitly a feature of
> separate processes. Add in being able to separate GC domains and you can
> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>
> Cheers,
>
> Derek
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Benedict
I disagree with the first claim, as the process has all the information it 
chooses to utilise about which resources it’s using and what it’s using those 
resources for.

The inability to isolate GC domains is something we cannot address, but also 
probably not a problem if we were doing everything with memory management as 
well as we could be.

But, not worth derailing this thread for. Today we do very little well on this 
front within the process, and a separate process is well justified given the 
state of play.

> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
> 
> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
> ...
> 
>> I think we might be underselling how valuable JVM isolation is,
>> especially for analytics queries that are going to pass the entire
>> dataset through heap somewhat constantly. 
> 
> Big +1 here. The JVM simply does not have significant granularity of 
> control for resource utilization, but this is explicitly a feature of 
> separate processes. Add in being able to separate GC domains and you can 
> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
> 
> Cheers,
> 
> Derek
> 
> -- 
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Derek Chen-Becker
On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch  wrote:
...

> I think we might be underselling how valuable JVM isolation is,
> especially for analytics queries that are going to pass the entire
> dataset through heap somewhat constantly.
>

Big +1 here. The JVM simply does not have significant granularity of
control for resource utilization, but this is explicitly a feature of
separate processes. Add in being able to separate GC domains and you can
avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.

Cheers,

Derek


-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
> One of the explicit goals of making an official sidecar project was to
> try to make it something the project does not break compatibility with
> as one of the main issues the third-party sidecars (that handle
> distributed control, backup, repair, etc ...) have is they break
> constantly because C* breaks the control interfaces (JMX and config
> files in particular) constantly. If it helps with the mental model,
> maybe think of the Cassandra sidecar as part of the Cassandra
> distribution and we try not to break the distribution? Just like we
> can't break CQL and break the CQL client ecosystem, we hopefully don't
> break control interfaces of the sidecar either.

Do we have tests which enforce this?  I agree we said we won’t break stuff, but 
agreeing to something and actually doing it are different things.  We have for 
years said “we won’t break interface X in a patch release”, but we always end 
up doing it if there is no test enforcing the contract with a comment saying 
not to break it.  Without such guards a contributor who has no clue about 
“what we said” changes it, and the reviewer misses it (and possibly also 
doesn’t know/remember “what we said” because we said it 3 years back)…

This is not impossible, we just need to make sure that we are pro-active about 
marking such things.  Maybe the answer is “running the side car integration 
tests” as part of C* patch CI?
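
One possible shape for such a guard, sketched in Python purely for 
illustration: a test that pins the signature of a control operation and fails 
with an explicit message when it changes or disappears. take_snapshot and 
PINNED_SIGNATURES are hypothetical stand-ins invented for this example; they 
are not real Cassandra or Sidecar APIs.

```python
import inspect


def take_snapshot(keyspace: str, table: str, tag: str) -> str:
    """Hypothetical control operation that external tools depend on."""
    return f"{keyspace}/{table}/snapshots/{tag}"


# CONTRACT: external tools rely on these signatures. Do not change an entry
# here without a deprecation cycle agreed on the dev list.
PINNED_SIGNATURES = {
    "take_snapshot": "(keyspace: str, table: str, tag: str) -> str",
}


def contract_violations(namespace: dict, pinned: dict) -> list:
    """Return a description of every pinned function that is missing or changed."""
    problems = []
    for name, expected in pinned.items():
        func = namespace.get(name)
        if func is None:
            problems.append(f"{name}: removed")
            continue
        actual = str(inspect.signature(func))
        if actual != expected:
            problems.append(f"{name}: expected {expected}, found {actual}")
    return problems
```

Run as part of patch CI, a non-empty result would fail the build, and the 
comment next to the pinned table tells the contributor why.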

> In addition to that, having
> this in a separate process gives us access to easy-to-use OS level
> protections over CPU time, memory, network, and disk via cgroups; as
> well as taking advantage of the existing isolation techniques kernels
> already offer to protect processes from each other e.g. CPU schedulers
> like CFS [1], network qdiscs like tc-fq/tc-prio[2, 3], and io
> schedulers like kyber/bfq [4].

How do we get this tuning to be part of the default install for all users of C* 
+ sidecar?



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Benedict
Fwiw I’m sceptical of the performance angle long term. You can do a lot more to 
control QoS when you understand what each query is doing, and what your SLOs 
are. You can also more efficiently apportion your resources (not leaving any 
lying fallow to ensure they are free later).

But, we’re a long way from that.

My personal view of the sidecar is to offer these sorts of facilities more 
rapidly than we might in Cassandra proper, but that we might eventually (when 
mature enough and Cassandra is ready for it) bring them in process.

Certainly, managing consistency (repair etc) and serving bulk operations should 
*long term* live in Cassandra IMO.

But that isn’t the state of the world today, so I support a separate process.

Though, I am nervous about the issues Jeremiah raises - we need to ensure we 
are not tightly coupling things and creating new problems. Managing other 
processes reliably and promptly seeing sstable changes and memtable flushes 
isn’t something that would be pretty, and we should probably offer weak 
guarantees about what’s visible when - ideally the sidecar would rely on file 
system watch notifications, or perhaps at most some fsync like functionality 
for flushing memtables.
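
To make the "weak guarantees about what's visible when" idea a little more 
concrete, here is a small Python sketch. Note that it uses periodic directory 
listings as a stand-in for real file system watch notifications, so it only 
ever reports what was visible at the moment of each listing. It assumes sstable 
data files can be recognised by a -Data.db suffix; list_data_files and 
diff_listings are names invented for this example.

```python
import os

DATA_SUFFIX = "-Data.db"  # assumption: sstable data components end with this suffix


def list_data_files(directory: str) -> set:
    """Return the names of sstable data files visible in `directory` right now."""
    return {name for name in os.listdir(directory) if name.endswith(DATA_SUFFIX)}


def diff_listings(before: set, after: set) -> tuple:
    """Compare two listings and return (added, removed) as sorted lists."""
    return sorted(after - before), sorted(before - after)
```

A sidecar-like process could call list_data_files on a timer and act on the 
difference; anything flushed or compacted between two listings is simply seen 
on the next pass, which is the weak guarantee described above.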

> On 28 Mar 2023, at 16:09, Joseph Lynch  wrote:
> 
> 
>> 
>> If we want to bring groups/containers/etc into the default deployment 
>> mechanisms of C*, great.  I am all for dividing it up into micro services 
>> given we solve all the problems I listed in the complexity section.
>> 
>> I am actually all for dividing C* up into multiple micro services, but the 
>> project needs to buy in to containers as the default mechanism for running 
>> it for that to be viable in my mind.
> 
> I was under the impression that with CEP-1 the project did buy into
> the direction of moving the workloads that are non-latency sensitive
> out of the main process? At the time of the discussion folks mentioned
> repair, bulk workloads, backup, restore, compaction etc ... as all
> possible things we would like to extract over time to the sidecar.
> 
> I don't think we want to go full on micro services, with like 12
> processes all handling one thing, but 2 seems like a good step? One
> for latency sensitive requests (reads/writes - the current process),
> and one for non latency sensitive requests (control plane, bulk work,
> etc ... - the sidecar).
> 
> -Joey



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Joseph Lynch
> If we want to bring groups/containers/etc into the default deployment 
> mechanisms of C*, great.  I am all for dividing it up into micro services 
> given we solve all the problems I listed in the complexity section.
>
> I am actually all for dividing C* up into multiple micro services, but the 
> project needs to buy in to containers as the default mechanism for running it 
> for that to be viable in my mind.

I was under the impression that with CEP-1 the project did buy into
the direction of moving the workloads that are non-latency sensitive
out of the main process? At the time of the discussion folks mentioned
repair, bulk workloads, backup, restore, compaction etc ... as all
possible things we would like to extract over time to the sidecar.

I don't think we want to go full on micro services, with like 12
processes all handling one thing, but 2 seems like a good step? One
for latency sensitive requests (reads/writes - the current process),
and one for non latency sensitive requests (control plane, bulk work,
etc ... - the sidecar).

-Joey


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Joseph Lynch
One of the explicit goals of making an official sidecar project was to
try to make it something the project does not break compatibility with
as one of the main issues the third-party sidecars (that handle
distributed control, backup, repair, etc ...) have is they break
constantly because C* breaks the control interfaces (JMX and config
files in particular) constantly. If it helps with the mental model,
maybe think of the Cassandra sidecar as part of the Cassandra
distribution and we try not to break the distribution? Just like we
can't break CQL and break the CQL client ecosystem, we hopefully don't
break control interfaces of the sidecar either.

On Tue, Mar 28, 2023 at 10:30 AM Jeremiah D Jordan
 wrote:
>
> - Resources isolation. Having the said service running within the same JVM 
> may negatively impact Cassandra storage's performance. It could be more 
> beneficial to have them in Sidecar, which offers strong resource isolation 
> guarantees.
>
> How does having this in a side car change the impact on “storage 
> performance”?  The side car reading sstables will have the same impact on 
> storage IO as the main process reading sstables.  Given the sidecar is 
> running on the same node as the main C* process, the only real resource 
> isolation you have is in heap/GC?  CPU/Memory/IO are all still shared between 
> the main C* process and the side car, and coordinating those across processes 
> is harder than coordinating them within a single process.  For example if we 
> wanted to have the compaction throughput, streaming throughput, and analytics 
> read throughput all tied back to a single disk IO cap, that is harder with an 
> external process.

I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that are going to pass the entire
dataset through heap somewhat constantly. In addition to that, having
this in a separate process gives us access to easy-to-use OS level
protections over CPU time, memory, network, and disk via cgroups; as
well as taking advantage of the existing isolation techniques kernels
already offer to protect processes from each other e.g. CPU schedulers
like CFS [1], network qdiscs like tc-fq/tc-prio[2, 3], and io
schedulers like kyber/bfq [4].

Mixing latency sensitive point queries with throughput sensitive ones
in the same JVM just seems fraught with peril and I don't buy that we will
build the same level of performance isolation that the kernel has.
Note you do not need containers to do this, the kernel by default uses
these isolation mechanisms to enforce fairness to resources, cgroups
just make it better (and can be used regardless of containerization).
This was the thinking behind backup/restore, repair, bulk operations,
etc ... living in a separate process.
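
For anyone who has not used cgroups directly, here is a small Python sketch of
what such limits look like. It only computes the strings one would write to
the cgroup v2 interface files cpu.max ("<quota> <period>", in microseconds)
and memory.max (bytes); it does not touch the file system. The helper names
and the example limits (2 CPUs, 4 GiB) are assumptions made up for this
illustration, not recommended settings for the sidecar.

```python
def cpu_max_value(cpus: float, period_us: int = 100_000) -> str:
    """Format a cgroup v2 cpu.max value allowing `cpus` CPUs' worth of time per period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    quota_us = int(cpus * period_us)
    return f"{quota_us} {period_us}"


def memory_max_value(gibibytes: float) -> str:
    """Format a cgroup v2 memory.max value in bytes."""
    if gibibytes <= 0:
        raise ValueError("gibibytes must be positive")
    return str(int(gibibytes * 1024 ** 3))


def sidecar_limits(cpus: float, memory_gib: float) -> dict:
    """Map cgroup v2 interface file names to the values that would be written."""
    return {
        "cpu.max": cpu_max_value(cpus),
        "memory.max": memory_max_value(memory_gib),
    }
```

For example, sidecar_limits(2, 4) yields "200000 100000" for cpu.max and
"4294967296" for memory.max, which an operator or service manager would then
apply to the cgroup the sidecar process runs in.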

As has been mentioned elsewhere, being able to run that workload on
different physical machines is even better to isolate, and I could
totally see a wonderful architecture in the future where you have
sidecar doing incremental backups from source nodes and restores every
~10 minutes to the "analytics" nodes where spark bulk readers are
pointed. For isolation the best would be a separate process on a
separate machine, followed by a separate process on the same machine,
followed by a separate thread on the same machine (historically what
C* does) ... now that's not to say we need to go straight to best, but
we probably shouldn't do the worst thing?

-Joey

[1] https://man7.org/linux/man-pages/man7/sched.7.html
[2] https://man7.org/linux/man-pages/man8/tc-fq.8.html
[3] https://man7.org/linux/man-pages/man8/tc-prio.8.html
[4] https://docs.kernel.org/block/index.html


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan


>> Given the sidecar is running on the same node as the main C* process, the 
>> only real resource isolation you have is in heap/GC?  CPU/Memory/IO are all 
>> still shared between the main C* process and the side car, and coordinating 
>> those across processes is harder than coordinating them within a single 
>> process. For example if we wanted to have the compaction throughput, 
>> streaming throughput, and analytics read throughput all tied back to a 
>> single disk IO cap, that is harder with an external process.
> 
> Relatively trivial, for CPU and memory, to run them in different 
> containers/cgroups/etc, so you can put an exact cpu/memory limit on the 
> sidecar. That's different from a jmx rate limiter/throttle, but (arguably) 
> more precise, because it actually limits the underlying physical resource 
> instead of a proxy for it in a config setting. 
> 

If we want to bring cgroups/containers/etc into the default deployment 
mechanisms of C*, great.  I am all for dividing it up into micro services given 
we solve all the problems I listed in the complexity section.

I am actually all for dividing C* up into multiple micro services, but the 
project needs to buy in to containers as the default mechanism for running it 
for that to be viable in my mind.

>  
>> 
>>> - Complexity. Considering the existence of the Sidecar project, it would be 
>>> less complex to avoid adding another (http?) service in Cassandra.
>> 
>> Not sure that is really very complex, running an http service is a pretty 
>> easy?  We already have netty in use to instantiate one from.
>> I worry more about the complexity of having the matching schema for a set of 
>> sstables being read.  The complexity of new sstable versions/formats being 
>> introduced.  The complexity of having up to date data from memtables being 
>> considered by this API without having to flush before every query of it.  
>> The complexity of dealing with the new memtable API introduced in CEP-11.  
>> The complexity of coordinating compaction/streaming adding and removing 
>> files with these APIs reading them.  There are a lot of edge cases to 
>> consider for this external access to sstables that the main process 
>> considers itself the “owner” of.
>> 
>> All of this is not to say that I think separating things out into other 
>> processes/services is bad.  But I think we need to be very careful with how 
>> we do it, or end users will end up running into all the sharp edges and the 
>> feature will fail.
>> 
>> -Jeremiah
>> 
>>> On Mar 24, 2023, at 8:15 PM, Yifan Cai wrote:
>>> 
>>> Hi Jeremiah, 
>>> 
>>> There are good reasons to not have these inside Cassandra. Consider the 
>>> following.
>>> - Resources isolation. Having the said service running within the same JVM 
>>> may negatively impact Cassandra storage's performance. It could be more 
>>> beneficial to have them in Sidecar, which offers strong resource isolation 
>>> guarantees.
>>> - Availability. If the Cassandra cluster is being bounced, using sidecar 
>>> would not affect the SBR/SBW functionality, e.g. SBR can still read 
>>> SSTables via sidecar endpoints. 
>>> - Compatibility. Sidecar provides stable REST-based APIs, such as uploading 
>>> SSTables endpoint, which would remain compatible with different versions of 
>>> Cassandra. The current implementation supports versions 3.0 and 4.0.
>>> - Complexity. Considering the existence of the Sidecar project, it would be 
>>> less complex to avoid adding another (http?) service in Cassandra.
>>> - Release velocity. Sidecar, as an independent project, can have a quicker 
>>> release cycle from Cassandra. 
>>> - The features in sidecar are mostly implemented based on various existing 
>>> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
>>> 
>>> Regarding authentication and authorization
>>> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold 
>>> up this CEP. It would be a feature that benefits all Sidecar endpoints.
>>> 
>>> - Yifan
>>> 
>>> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer wrote:
>>>> I agree that the analytics library will need to support vnodes. To be 
>>>> clear, there’s nothing preventing the solution from working with vnodes 
>>>> right now, and no assumptions about a 1:1 topology between a token and a 
>>>> node. However, we don’t, today, have the ability to test vnode support 
>>>> end-to-end. We are working towards that, however, and should be able to 
>>>> remove the caveat from the released analytics library once we can properly 
>>>> test vnode support.
>>>> If it helps, I can update the CEP to say something more like “Caveat: 
>>>> Currently untested with vnodes - work is ongoing to remove this 
>>>> limitation” if that helps?
>>>> 
>>>> Doug
>>>> 
>>>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams wrote:
>>>> > 
>>>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D 

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeff Jirsa
On Tue, Mar 28, 2023 at 7:30 AM Jeremiah D Jordan 
wrote:

> - Resources isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
>
>
> How does having this in a side car change the impact on “storage
> performance”?  The side car reading sstables will have the same impact on
> storage IO as the main process reading sstables.
>

This is true.


>  Given the sidecar is running on the same node as the main C* process, the
> only real resource isolation you have is in heap/GC?  CPU/Memory/IO are all
> still shared between the main C* process and the side car, and coordinating
> those across processes is harder than coordinating them within a single
> process. For example if we wanted to have the compaction throughput,
> streaming throughput, and analytics read throughput all tied back to a
> single disk IO cap, that is harder with an external process.
>

Relatively trivial, for CPU and memory, to run them in different
containers/cgroups/etc, so you can put an exact cpu/memory limit on the
sidecar. That's different from a jmx rate limiter/throttle, but (arguably)
more precise, because it actually limits the underlying physical resource
instead of a proxy for it in a config setting.
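
To make the cgroup point concrete, here is a minimal sketch of how a CPU and memory cap for the sidecar could be expressed as cgroup v2 settings. The `cpu.max` and `memory.max` file names are the standard cgroup v2 interface files; the `SidecarLimits` class, the helper function, and the example numbers are assumptions made up for illustration, not anything the Sidecar project ships. The sketch only computes the values and does not write to the cgroup filesystem.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SidecarLimits:
    """Hypothetical resource budget for a sidecar process."""
    cpu_cores: float           # e.g. 1.5 means one and a half CPUs
    memory_bytes: int          # hard memory ceiling in bytes
    period_us: int = 100_000   # scheduling period; 100ms is the cgroup v2 default


def cgroup_v2_settings(limits: SidecarLimits) -> Dict[str, str]:
    """Return the strings one would write into the cgroup v2 interface files."""
    if limits.cpu_cores <= 0 or limits.memory_bytes <= 0:
        raise ValueError("limits must be positive")
    # cpu.max holds "<quota> <period>" in microseconds: the group may run
    # for <quota> microseconds of CPU time in every <period> microseconds.
    quota_us = int(limits.cpu_cores * limits.period_us)
    return {
        "cpu.max": f"{quota_us} {limits.period_us}",
        "memory.max": str(limits.memory_bytes),
    }


# Example: cap the sidecar at 2 CPUs and 4 GiB of memory.
settings = cgroup_v2_settings(SidecarLimits(cpu_cores=2, memory_bytes=4 * 1024**3))
for filename, value in settings.items():
    print(filename, "=", value)
```

Unlike a throttle expressed in a config setting, these values are enforced by the kernel regardless of what the process itself does.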



>
> - Complexity. Considering the existence of the Sidecar project, it would
> be less complex to avoid adding another (http?) service in Cassandra.
>
>
> Not sure that is really very complex, running an http service is a pretty
> easy?  We already have netty in use to instantiate one from.
> I worry more about the complexity of having the matching schema for a set
> of sstables being read.  The complexity of new sstable versions/formats
> being introduced.  The complexity of having up to date data from memtables
> being considered by this API without having to flush before every query of
> it.  The complexity of dealing with the new memtable API introduced in
> CEP-11.  The complexity of coordinating compaction/streaming adding and
> removing files with these APIs reading them.  There are a lot of edge cases
> to consider for this external access to sstables that the main process
> considers itself the “owner” of.
>
> All of this is not to say that I think separating things out into other
> processes/services is bad.  But I think we need to be very careful with how
> we do it, or end users will end up running into all the sharp edges and the
> feature will fail.
>
> -Jeremiah
>
> On Mar 24, 2023, at 8:15 PM, Yifan Cai  wrote:
>
> Hi Jeremiah,
>
> There are good reasons to not have these inside Cassandra. Consider the
> following.
> - Resources isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
> - Availability. If the Cassandra cluster is being bounced, using sidecar
> would not affect the SBR/SBW functionality, e.g. SBR can still read
> SSTables via sidecar endpoints.
> - Compatibility. Sidecar provides stable REST-based APIs, such as
> uploading SSTables endpoint, which would remain compatible with different
> versions of Cassandra. The current implementation supports versions 3.0 and
> 4.0.
> - Complexity. Considering the existence of the Sidecar project, it would
> be less complex to avoid adding another (http?) service in Cassandra.
> - Release velocity. Sidecar, as an independent project, can have a quicker
> release cycle from Cassandra.
> - The features in sidecar are mostly implemented based on various existing
> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
>
> Regarding authentication and authorization
> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold
> up this CEP. It would be a feature that benefits all Sidecar endpoints.
>
> - Yifan
>
> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer  wrote:
>
>> I agree that the analytics library will need to support vnodes. To be
>> clear, there’s nothing preventing the solution from working with vnodes
>> right now, and no assumptions about a 1:1 topology between a token and a
>> node. However, we don’t, today, have the ability to test vnode support
>> end-to-end. We are working towards that, however, and should be able to
>> remove the caveat from the released analytics library once we can properly
>> test vnode support.
>> If it helps, I can update the CEP to say something more like “Caveat:
>> Currently untested with vnodes - work is ongoing to remove this limitation”
>> if that helps?
>>
>> Doug
>>
>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams 
>> wrote:
>> >
>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>> >  wrote:
>> >>
>> >> I have concerns with the majority of this being in the sidecar and not
>> in the database itself.  I think it would make sense for the 

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
> - Resources isolation. Having the said service running within the same JVM 
> may negatively impact Cassandra storage's performance. It could be more 
> beneficial to have them in Sidecar, which offers strong resource isolation 
> guarantees.

How does having this in a side car change the impact on “storage performance”?  
The side car reading sstables will have the same impact on storage IO as the 
main process reading sstables.  Given the sidecar is running on the same node 
as the main C* process, the only real resource isolation you have is in 
heap/GC?  CPU/Memory/IO are all still shared between the main C* process and 
the side car, and coordinating those across processes is harder than 
coordinating them within a single process.  For example if we wanted to have 
the compaction throughput, streaming throughput, and analytics read throughput 
all tied back to a single disk IO cap, that is harder with an external process.
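
A rough sketch of what "tied back to a single disk IO cap" could look like inside one process: a single token bucket that compaction, streaming, and analytics reads all draw from. This is purely illustrative; the class name, the consumer labels, and the numbers are assumptions, and it is not how Cassandra's existing throttles are implemented. Time is passed in explicitly so the behaviour is easy to follow.

```python
import threading


class SharedIoBudget:
    """One token bucket shared by every IO consumer in the process."""

    def __init__(self, bytes_per_sec, burst_bytes, now=0.0):
        self.rate = float(bytes_per_sec)
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)   # start with a full bucket
        self.last = now
        self.lock = threading.Lock()
        self.granted = {}                  # bytes granted per consumer

    def try_acquire(self, consumer, nbytes, now):
        """Grant nbytes to consumer if the shared budget allows it."""
        with self.lock:
            # Refill according to elapsed time, never above the burst size.
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if nbytes > self.tokens:
                return False
            self.tokens -= nbytes
            self.granted[consumer] = self.granted.get(consumer, 0) + nbytes
            return True


budget = SharedIoBudget(bytes_per_sec=100, burst_bytes=100)
a = budget.try_acquire("compaction", 60, now=0.0)  # fits in the budget
b = budget.try_acquire("streaming", 30, now=0.0)   # fits, 10 bytes left
c = budget.try_acquire("analytics", 30, now=0.0)   # refused, budget exhausted
d = budget.try_acquire("analytics", 30, now=0.5)   # granted after a refill
print(a, b, c, d, budget.granted)
```

With an external process, the equivalent of `budget` would have to live somewhere both processes can reach, which is the coordination cost being described.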

> - Complexity. Considering the existence of the Sidecar project, it would be 
> less complex to avoid adding another (http?) service in Cassandra.

Not sure that is really very complex; running an http service is pretty easy? 
We already have netty in use to instantiate one from.
I worry more about the complexity of having the matching schema for a set of 
sstables being read.  The complexity of new sstable versions/formats being 
introduced.  The complexity of having up to date data from memtables being 
considered by this API without having to flush before every query of it.  The 
complexity of dealing with the new memtable API introduced in CEP-11.  The 
complexity of coordinating compaction/streaming adding and removing files with 
these APIs reading them.  There are a lot of edge cases to consider for this 
external access to sstables that the main process considers itself the “owner” 
of.

All of this is not to say that I think separating things out into other 
processes/services is bad.  But I think we need to be very careful with how we 
do it, or end users will end up running into all the sharp edges and the 
feature will fail.

-Jeremiah

> On Mar 24, 2023, at 8:15 PM, Yifan Cai  wrote:
> 
> Hi Jeremiah, 
> 
> There are good reasons to not have these inside Cassandra. Consider the 
> following.
> - Resources isolation. Having the said service running within the same JVM 
> may negatively impact Cassandra storage's performance. It could be more 
> beneficial to have them in Sidecar, which offers strong resource isolation 
> guarantees.
> - Availability. If the Cassandra cluster is being bounced, using sidecar 
> would not affect the SBR/SBW functionality, e.g. SBR can still read SSTables 
> via sidecar endpoints. 
> - Compatibility. Sidecar provides stable REST-based APIs, such as uploading 
> SSTables endpoint, which would remain compatible with different versions of 
> Cassandra. The current implementation supports versions 3.0 and 4.0.
> - Complexity. Considering the existence of the Sidecar project, it would be 
> less complex to avoid adding another (http?) service in Cassandra.
> - Release velocity. Sidecar, as an independent project, can have a quicker 
> release cycle from Cassandra. 
> - The features in sidecar are mostly implemented based on various existing 
> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
> 
> Regarding authentication and authorization
> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold up 
> this CEP. It would be a feature that benefits all Sidecar endpoints.
> 
> - Yifan
> 
> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer wrote:
>> I agree that the analytics library will need to support vnodes. To be clear, 
>> there’s nothing preventing the solution from working with vnodes right now, 
>> and no assumptions about a 1:1 topology between a token and a node. However, 
>> we don’t, today, have the ability to test vnode support end-to-end. We are 
>> working towards that, however, and should be able to remove the caveat from 
>> the released analytics library once we can properly test vnode support.
>> If it helps, I can update the CEP to say something more like “Caveat: 
>> Currently untested with vnodes - work is ongoing to remove this limitation” 
>> if that helps?
>> 
>> Doug
>> 
>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams wrote:
>> > 
>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>> > <jeremiah.jor...@gmail.com> wrote:
>> >> 
>> >> I have concerns with the majority of this being in the sidecar and not in 
>> >> the database itself.  I think it would make sense for the server side of 
>> >> this to be a new service exposed by the database, not in the sidecar.  
>> >> That way it can be able to properly integrate with the authentication and 
>> >> authorization apis, and to make it a first class citizen in terms of 
>> >> having unit/integration tests in the main DB ensuring no one breaks it.
>> > 
>> > 

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
Complex predicates on non-partition-key columns naturally require pulling the 
entire data set into the Spark DataFrame to perform the query. We have some 
optimizations around column filtering and partition key predicates, utilizing 
the Filter.db/Summary.db/Index.db files so that the reader only reads the data 
it needs. We have chatted with Caleb about utilizing the index files for SAI, 
but at present that is purely theoretical.
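
To illustrate the idea behind the partition key optimization, here is a toy sketch of using a per-SSTable Bloom filter (the role Filter.db plays) to skip SSTables that cannot contain a key. The `ToyBloomFilter` class, its hashing scheme, and the SSTable names are assumptions for illustration only; they do not match Cassandra's on-disk format or the library's actual reader code.

```python
import hashlib


class ToyBloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def sstables_to_read(filters, partition_key):
    """Keep only the SSTables whose filter says the key may be present."""
    return [name for name, bf in filters.items()
            if bf.might_contain(partition_key)]


# Hypothetical SSTables: only the first one has had this key added.
# The second filter is left empty so the example is deterministic.
sstable_filters = {"nb-1-big": ToyBloomFilter(), "nb-2-big": ToyBloomFilter()}
sstable_filters["nb-1-big"].add(b"user:42")

candidates = sstables_to_read(sstable_filters, b"user:42")
print(candidates)
```

A filter hit only means "maybe"; the reader would still need the summary and index components to locate the partition, and a predicate on a non-partition-key column gets no help from any of this.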

In terms of internals, beyond some util/serializer classes, the writer part 
depends on the CQLSSTableWriter and the reader uses the SSTableSimpleIterator 
and the CompactionIterator.

James.

> On Mar 27, 2023, at 3:06 PM, Jeremy Hanna  wrote:
> 
> Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds 
> like you've been using this for some time.  I understand from the rejected 
> alternatives that the Spark Cassandra Connector was slower because it goes 
> through the read and write path for C* rather than this backdoor mechanism.  
> In your experience using this, under what circumstances have you seen that 
> this tool is not a good fit for analytics - such as complex predicates?  The 
> challenge with the Spark Cassandra Connector and previously the Hadoop 
> integration is that it had to do full table scans even to get small amounts 
> of data.  It sounds like this is similar in that it has to do a full table 
> scan but with the advantage of being faster and less load on the cluster.  In 
> other words, I'm asking if this has been a replacement for the Spark 
> Cassandra Connector or if there are cases in your work where SCC is a better 
> fit.
> 
> Also to Benjamin's point in the comments on the CEP itself, how coupled is 
> this to internals?  Are there going to be higher level APIs or is it going to 
> call internal storage classes directly?
> 
> Thanks!
> 
> Jeremy
> 
> 
>> On Mar 23, 2023, at 12:33 PM, Doug Rohrer  wrote:
>> 
>> Hi everyone,
>> 
>> Wiki: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>> 
>> We’d like to propose this CEP for adoption by the community.
>> 
>> It is common for teams using Cassandra to find themselves looking for a way 
>> to interact with large amounts of data for analytics workloads. However, 
>> Cassandra’s standard APIs aren’t designed for large scale data egress/ingest 
>> as the native read/write paths weren’t designed for bulk analytics.
>> 
>> We’re proposing this CEP for this exact purpose. It enables the 
>> implementation of custom Spark (or similar) applications that can either 
>> read or write large amounts of Cassandra data at line rates, by accessing 
>> the persistent storage of nodes in the cluster via the Cassandra Sidecar.
>> 
>> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
>> that allows deep integration into Apache Spark that allows its users to bulk 
>> import or export data from a running Cassandra cluster with minimal to no 
>> impact to the read/write traffic.
>> 
>> We will shortly publish a branch with code that will accompany this CEP to 
>> help readers understand it better.
>> 
>> As a reminder, please keep the discussion here on the dev list vs. in the 
>> wiki, as we’ve found it easier to manage via email.
>> 
>> Sincerely,
>> 
>> Doug Rohrer & James Berragan
> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread Jeremy Hanna
Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds like 
you've been using this for some time.  I understand from the rejected 
alternatives that the Spark Cassandra Connector was slower because it goes 
through the read and write path for C* rather than this backdoor mechanism.  In 
your experience using this, under what circumstances have you seen that this 
tool is not a good fit for analytics - such as complex predicates?  The 
challenge with the Spark Cassandra Connector and previously the Hadoop 
integration is that it had to do full table scans even to get small amounts of 
data.  It sounds like this is similar in that it has to do a full table scan 
but with the advantage of being faster and less load on the cluster.  In other 
words, I'm asking if this has been a replacement for the Spark Cassandra 
Connector or if there are cases in your work where SCC is a better fit.

Also to Benjamin's point in the comments on the CEP itself, how coupled is this 
to internals?  Are there going to be higher level APIs or is it going to call 
internal storage classes directly?

Thanks!

Jeremy


> On Mar 23, 2023, at 12:33 PM, Doug Rohrer  wrote:
> 
> Hi everyone,
> 
> Wiki: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
> 
> We’d like to propose this CEP for adoption by the community.
> 
> It is common for teams using Cassandra to find themselves looking for a way 
> to interact with large amounts of data for analytics workloads. However, 
> Cassandra’s standard APIs aren’t designed for large scale data egress/ingest 
> as the native read/write paths weren’t designed for bulk analytics.
> 
> We’re proposing this CEP for this exact purpose. It enables the 
> implementation of custom Spark (or similar) applications that can either read 
> or write large amounts of Cassandra data at line rates, by accessing the 
> persistent storage of nodes in the cluster via the Cassandra Sidecar.
> 
> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
> that allows deep integration into Apache Spark that allows its users to bulk 
> import or export data from a running Cassandra cluster with minimal to no 
> impact to the read/write traffic.
> 
> We will shortly publish a branch with code that will accompany this CEP to 
> help readers understand it better.
> 
> As a reminder, please keep the discussion here on the dev list vs. in the 
> wiki, as we’ve found it easier to manage via email.
> 
> Sincerely,
> 
> Doug Rohrer & James Berragan



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
On the Sidecar discussion, while Sidecar is the preferred mechanism for the 
reasons described, the API is sufficiently generic to plug in a user 
implementation (essentially, provide a list of SSTables for a token range and 
a mechanism to open an InputStream on any SSTable file component). A user 
could, for example, easily read from backup snapshots on a blob store.
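
A minimal sketch of the shape of such a pluggable source, written in Python for brevity: one call to list SSTables for a token range, one call to open a stream on a file component. Every name here (`SSTableSource`, `InMemorySnapshotSource`, the method names, the SSTable names) is hypothetical and not the library's real interface; the in-memory dictionary stands in for a backup snapshot on a blob store.

```python
import io
from abc import ABC, abstractmethod
from typing import BinaryIO, Dict, List, Tuple

TokenRange = Tuple[int, int]  # (start, end], an assumed representation


class SSTableSource(ABC):
    """Hypothetical plug-in point: where SSTables come from."""

    @abstractmethod
    def list_sstables(self, token_range: TokenRange) -> List[str]:
        """Names of SSTables that may hold data in the token range."""

    @abstractmethod
    def open_component(self, sstable: str, component: str) -> BinaryIO:
        """Open a readable stream on one file component of an SSTable."""


class InMemorySnapshotSource(SSTableSource):
    """Stand-in for a snapshot store; maps name -> (token span, components)."""

    def __init__(self, snapshot: Dict[str, Tuple[Tuple[int, int], Dict[str, bytes]]]):
        self._snapshot = snapshot

    def list_sstables(self, token_range: TokenRange) -> List[str]:
        start, end = token_range
        names = []
        for name, (span, _components) in self._snapshot.items():
            first, last = span  # smallest and largest token in the SSTable
            if first <= end and last > start:  # overlaps (start, end]
                names.append(name)
        return sorted(names)

    def open_component(self, sstable: str, component: str) -> BinaryIO:
        return io.BytesIO(self._snapshot[sstable][1][component])


source = InMemorySnapshotSource({
    "nb-1-big": ((0, 100), {"Data.db": b"rows-a", "Index.db": b"idx-a"}),
    "nb-2-big": ((200, 300), {"Data.db": b"rows-b", "Index.db": b"idx-b"}),
})
matching = source.list_sstables((50, 150))
data = source.open_component(matching[0], "Data.db").read()
print(matching, data)
```

A Sidecar-backed source and a blob-store-backed source would differ only in how these two methods are implemented.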

> On Mar 26, 2023, at 1:04 PM, Josh McKenzie  wrote:
> 
> I want to second what Yifan's spoken to, specifically in terms of resource 
> isolation and availability.
> 
> While the sidecar hasn't seen a ton of traffic and contributions since the 
> acceptance into the project and clearance of CEP-1, my intuition is that 
> that's due to the entrenched maturity of alternative sidecars out there since 
> we were slow as a project to build one, not out of a lack of demand for a 
> fully fleshed out sidecar. As functionality shows up in the ASF C* Sidecar, 
> there's going to be tension as operators are incentivized to run both their 
> bespoke sidecars they may be running alongside the ASF C* one. That's to be 
> expected and a necessary pain to take on during a transition that I 
> personally think is sorely needed.
> 
> Having bulk operations for analytics and for reading and writing SSTables is 
> a pretty compelling carrot, and the more folks we can get running the sidecar 
> and the more contributors active on it, the more we can expect to see 
> interest and work show up there (repair coordination, REST API's, etc - all 
> of which we've talked about before on ML or slack).
> 
> So I'm a strong +1 to it living in the sidecar.
> 
> On Sat, Mar 25, 2023, at 11:05 AM, Brandon Williams wrote:
>> Oh, that's significantly different and great news, please do!  Thanks
>> for the clarification, Doug!
>> 
>> Kind Regards,
>> Brandon
>> 
>> On Fri, Mar 24, 2023 at 4:42 PM Doug Rohrer wrote:
>> >
>> > I agree that the analytics library will need to support vnodes. To be 
>> > clear, there’s nothing preventing the solution from working with vnodes 
>> > right now, and no assumptions about a 1:1 topology between a token and a 
>> > node. However, we don’t, today, have the ability to test vnode support 
>> > end-to-end. We are working towards that, however, and should be able to 
>> > remove the caveat from the released analytics library once we can properly 
>> > test vnode support.
>> > If it helps, I can update the CEP to say something more like “Caveat: 
>> > Currently untested with vnodes - work is ongoing to remove this 
>> > limitation” if that helps?
>> >
>> > Doug
>> >
>> > > On Mar 24, 2023, at 11:43 AM, Brandon Williams wrote:
>> > >
>> > > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>> > > <jeremiah.jor...@gmail.com> wrote:
>> > >>
>> > >> I have concerns with the majority of this being in the sidecar and not 
>> > >> in the database itself.  I think it would make sense for the server 
>> > >> side of this to be a new service exposed by the database, not in the 
>> > >> sidecar.  That way it can be able to properly integrate with the 
>> > >> authentication and authorization apis, and to make it a first class 
>> > >> citizen in terms of having unit/integration tests in the main DB 
>> > >> ensuring no one breaks it.
>> > >
>> > > I don't think this can/should happen until it supports the database's
>> > > default configuration with vnodes.
>> >



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-26 Thread Josh McKenzie
I want to second what Yifan's spoken to, specifically in terms of resource 
isolation and availability.

While the sidecar hasn't seen a ton of traffic and contributions since the 
acceptance into the project and clearance of CEP-1, my intuition is that that's 
due to the entrenched maturity of alternative sidecars out there since we were 
slow as a project to build one, not out of a lack of demand for a fully fleshed 
out sidecar. As functionality shows up in the ASF C* Sidecar, there's going to 
be tension as operators are incentivized to run the ASF C* one alongside the 
bespoke sidecars they may already be running. That's to be expected and a 
necessary pain to take on during a transition that I personally think is sorely 
needed.

Having bulk operations for analytics and for reading and writing SSTables is a 
pretty compelling carrot, and the more folks we can get running the sidecar and 
the more contributors active on it, the more we can expect to see interest and 
work show up there (repair coordination, REST APIs, etc - all of which we've 
talked about before on the ML or Slack).

So I'm a strong +1 to it living in the sidecar.

On Sat, Mar 25, 2023, at 11:05 AM, Brandon Williams wrote:
> Oh, that's significantly different and great news, please do!  Thanks
> for the clarification, Doug!
> 
> Kind Regards,
> Brandon
> 
> On Fri, Mar 24, 2023 at 4:42 PM Doug Rohrer  wrote:
> >
> > I agree that the analytics library will need to support vnodes. To be 
> > clear, there’s nothing preventing the solution from working with vnodes 
> > right now, and no assumptions about a 1:1 topology between a token and a 
> > node. However, we don’t, today, have the ability to test vnode support 
> > end-to-end. We are working towards that, however, and should be able to 
> > remove the caveat from the released analytics library once we can properly 
> > test vnode support.
> > If it helps, I can update the CEP to say something more like “Caveat: 
> > Currently untested with vnodes - work is ongoing to remove this limitation” 
> > if that helps?
> >
> > Doug
> >
> > > On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
> > >
> > > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
> > >  wrote:
> > >>
> > >> I have concerns with the majority of this being in the sidecar and not 
> > >> in the database itself.  I think it would make sense for the server side 
> > >> of this to be a new service exposed by the database, not in the sidecar. 
> > >>  That way it can be able to properly integrate with the authentication 
> > >> and authorization apis, and to make it a first class citizen in terms of 
> > >> having unit/integration tests in the main DB ensuring no one breaks it.
> > >
> > > I don't think this can/should happen until it supports the database's
> > > default configuration with vnodes.
> >
> 


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-25 Thread Brandon Williams
Oh, that's significantly different and great news, please do!  Thanks
for the clarification, Doug!

Kind Regards,
Brandon

On Fri, Mar 24, 2023 at 4:42 PM Doug Rohrer  wrote:
>
> I agree that the analytics library will need to support vnodes. To be clear, 
> there’s nothing preventing the solution from working with vnodes right now, 
> and no assumptions about a 1:1 topology between a token and a node. However, 
> we don’t, today, have the ability to test vnode support end-to-end. We are 
> working towards that, however, and should be able to remove the caveat from 
> the released analytics library once we can properly test vnode support.
> If it helps, I can update the CEP to say something more like “Caveat: 
> Currently untested with vnodes - work is ongoing to remove this limitation” 
> if that helps?
>
> Doug
>
> > On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
> >
> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
> >  wrote:
> >>
> >> I have concerns with the majority of this being in the sidecar and not in 
> >> the database itself.  I think it would make sense for the server side of 
> >> this to be a new service exposed by the database, not in the sidecar.  
> >> That way it can be able to properly integrate with the authentication and 
> >> authorization apis, and to make it a first class citizen in terms of 
> >> having unit/integration tests in the main DB ensuring no one breaks it.
> >
> > I don't think this can/should happen until it supports the database's
> > default configuration with vnodes.
>


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Yifan Cai
Hi Jeremiah,

There are good reasons to not have these inside Cassandra. Consider the
following.
- Resource isolation. Having the said service running within the same JVM
may negatively impact Cassandra storage's performance. It could be more
beneficial to have them in Sidecar, which offers strong resource isolation
guarantees.
- Availability. If the Cassandra cluster is being bounced, using sidecar
would not affect the SBR/SBW functionality, e.g. SBR can still read
SSTables via sidecar endpoints.
- Compatibility. Sidecar provides stable REST-based APIs, such as the SSTable
upload endpoint, which would remain compatible with different versions of
Cassandra. The current implementation supports versions 3.0 and 4.0.
- Complexity. Considering the existence of the Sidecar project, it would be
less complex to avoid adding another (http?) service in Cassandra.
- Release velocity. Sidecar, as an independent project, can have a quicker
release cycle than Cassandra.
- The features in sidecar are mostly implemented based on various existing
tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.

Regarding authentication and authorization
- We will add it as a follow-on CEP in Sidecar, but we don't want to hold
up this CEP. It would be a feature that benefits all Sidecar endpoints.

- Yifan

On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer  wrote:

> I agree that the analytics library will need to support vnodes. To be
> clear, there’s nothing preventing the solution from working with vnodes
> right now, and no assumptions about a 1:1 topology between a token and a
> node. However, we don’t, today, have the ability to test vnode support
> end-to-end. We are working towards that, however, and should be able to
> remove the caveat from the released analytics library once we can properly
> test vnode support.
> If it helps, I can update the CEP to say something more like “Caveat:
> Currently untested with vnodes - work is ongoing to remove this limitation”
> if that helps?
>
> Doug
>
> > On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
> >
> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
> >  wrote:
> >>
> >> I have concerns with the majority of this being in the sidecar and not
> in the database itself.  I think it would make sense for the server side of
> this to be a new service exposed by the database, not in the sidecar.  That
> way it can be able to properly integrate with the authentication and
> authorization apis, and to make it a first class citizen in terms of having
> unit/integration tests in the main DB ensuring no one breaks it.
> >
> > I don't think this can/should happen until it supports the database's
> > default configuration with vnodes.
>
>


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Doug Rohrer
I agree that the analytics library will need to support vnodes. To be clear, 
there’s nothing preventing the solution from working with vnodes right now, and 
no assumptions about a 1:1 topology between a token and a node. However, we 
don’t yet have the ability to test vnode support end-to-end. We are working 
towards that and should be able to remove the caveat from the 
released analytics library once we can properly test vnode support.
If it helps, I can update the CEP to say something more like “Caveat: Currently 
untested with vnodes - work is ongoing to remove this limitation”.

Doug

> On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
> 
> On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>  wrote:
>> 
>> I have concerns with the majority of this being in the sidecar and not in 
>> the database itself.  I think it would make sense for the server side of 
>> this to be a new service exposed by the database, not in the sidecar.  That 
>> way it can be able to properly integrate with the authentication and 
>> authorization apis, and to make it a first class citizen in terms of having 
>> unit/integration tests in the main DB ensuring no one breaks it.
> 
> I don't think this can/should happen until it supports the database's
> default configuration with vnodes.



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Brandon Williams
On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
 wrote:
>
> I have concerns with the majority of this being in the sidecar and not in the 
> database itself.  I think it would make sense for the server side of this to 
> be a new service exposed by the database, not in the sidecar.  That way it 
> can be able to properly integrate with the authentication and authorization 
> apis, and to make it a first class citizen in terms of having 
> unit/integration tests in the main DB ensuring no one breaks it.

I don't think this can/should happen until it supports the database's
default configuration with vnodes.


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Jeremiah D Jordan
I have concerns with the majority of this being in the sidecar and not in the 
database itself.  I think it would make sense for the server side of this to be 
a new service exposed by the database, not in the sidecar.  That way it can 
properly integrate with the authentication and authorization APIs, and 
it becomes a first-class citizen in terms of having unit/integration tests in 
the main DB ensuring no one breaks it.

-Jeremiah

> On Mar 24, 2023, at 10:29 AM, Dinesh Joshi  wrote:
> 
> Hi Benjamin,
> 
> I agree with your concern about long term maintenance of the code. Doug
> has contributed several patches to Cassandra over the years. Besides him
> there will be several other maintainers that will take on maintenance of
> this code including Yifan and myself. Given how closely it is coupled
> with the Cassandra Sidecar project, I would prefer that we keep this
> within the Cassandra project umbrella as a separate repository and a
> sub-project.
> 
> Thanks,
> 
> Dinesh
> 
> 
> On 3/24/23 02:35, Benjamin Lerer wrote:
>> Hi Doug,
>> 
>> Outside of the changes to the Cassandra Sidecar that are mentioned, what
>> the CEP proposes is the donation of a library for Spark integration. It
>> seems to me that this library could be offered as an open source project
>> outside of the Cassandra project itself. If we accept Spark Bulk
>> Analytic as part of the Cassandra project it means that the community
>> will commit to maintain it and ensure that for each Cassandra release it
>> will be fully compatible. Considering our history with Hadoop
>> integration which has basically been unmaintained for years, I am not
>> convinced that it is what we should do.
>> We only started to expand the scope of the project recently and I would
>> personally prefer that we do that slowly starting with the drivers that
>> are critical for C*. Now, it is only my personal opinion and other
>> people might have a different view on those things.
>> 
>> On Thu, Mar 23, 2023 at 23:29, Miklosovic, Stefan
>> <stefan.mikloso...@netapp.com> wrote:
>> 
>>Hi,
>> 
>>I think this might be a great contribution in the light of removed
>>Hadoop integration recently (CASSANDRA-18323) as it will not be in
>>5.0 anymore. If this CEP is adopted and delivered, I can see how it
>>might be a logical replacement of that.
>> 
>>Regards
>> 
>>
>>From: Doug Rohrer <droh...@apple.com>
>>Sent: Thursday, March 23, 2023 18:33
>>To: dev@cassandra.apache.org
>>Cc: James Berragan
>>Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with
>>Spark Bulk Analytics
>> 
>> 
>> 
>> 
>> 
>>Hi everyone,
>> 
>>Wiki:
>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>> 
>>We’d like to propose this CEP for adoption by the community.
>> 
>>It is common for teams using Cassandra to find themselves looking
>>for a way to interact with large amounts of data for analytics
>>workloads. However, Cassandra’s standard APIs aren’t designed for
>>large scale data egress/ingest as the native read/write paths
>>weren’t designed for bulk analytics.
>> 
>>We’re proposing this CEP for this exact purpose. It enables the
>>implementation of custom Spark (or similar) applications that can
>>either read or write large amounts of Cassandra data at line rates,
>>by accessing the persistent storage of nodes in the cluster via the
>>Cassandra Sidecar.
>> 
>>This CEP proposes new APIs in the Cassandra Sidecar and a companion
>>library that allows deep integration into Apache Spark that allows
>>its users to bulk import or export data from a running Cassandra
>>cluster with minimal to no impact to the read/write traffic.
>> 
>>We will shortly publish a branch with code that will accompany this
>>CEP to help readers understand it better.
>> 
>>As a reminder, please keep the discussion here on the dev list vs.
>>in the wiki, as we’ve found it easier to manage via email.
>> 
>>Sincerely,
>> 
>>Doug Rohrer & James Berragan



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Dinesh Joshi
Hi Benjamin,

I agree with your concern about long term maintenance of the code. Doug
has contributed several patches to Cassandra over the years. Besides him
there will be several other maintainers that will take on maintenance of
this code including Yifan and myself. Given how closely it is coupled
with the Cassandra Sidecar project, I would prefer that we keep this
within the Cassandra project umbrella as a separate repository and a
sub-project.

Thanks,

Dinesh


On 3/24/23 02:35, Benjamin Lerer wrote:
> Hi Doug,
> 
> Outside of the changes to the Cassandra Sidecar that are mentioned, what
> the CEP proposes is the donation of a library for Spark integration. It
> seems to me that this library could be offered as an open source project
> outside of the Cassandra project itself. If we accept Spark Bulk
> Analytic as part of the Cassandra project it means that the community
> will commit to maintain it and ensure that for each Cassandra release it
> will be fully compatible. Considering our history with Hadoop
> integration which has basically been unmaintained for years, I am not
> convinced that it is what we should do.
> We only started to expand the scope of the project recently and I would
> personally prefer that we do that slowly starting with the drivers that
> are critical for C*. Now, it is only my personal opinion and other
> people might have a different view on those things.
> 
> On Thu, Mar 23, 2023 at 23:29, Miklosovic, Stefan
> <stefan.mikloso...@netapp.com> wrote:
> 
> Hi,
> 
> I think this might be a great contribution in the light of removed
> Hadoop integration recently (CASSANDRA-18323) as it will not be in
> 5.0 anymore. If this CEP is adopted and delivered, I can see how it
> might be a logical replacement of that.
> 
> Regards
> 
> 
> From: Doug Rohrer <droh...@apple.com>
> Sent: Thursday, March 23, 2023 18:33
> To: dev@cassandra.apache.org
> Cc: James Berragan
> Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with
> Spark Bulk Analytics
> 
> 
> 
> 
> 
> Hi everyone,
> 
> Wiki:
> 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
> 
> We’d like to propose this CEP for adoption by the community.
> 
> It is common for teams using Cassandra to find themselves looking
> for a way to interact with large amounts of data for analytics
> workloads. However, Cassandra’s standard APIs aren’t designed for
> large scale data egress/ingest as the native read/write paths
> weren’t designed for bulk analytics.
> 
> We’re proposing this CEP for this exact purpose. It enables the
> implementation of custom Spark (or similar) applications that can
> either read or write large amounts of Cassandra data at line rates,
> by accessing the persistent storage of nodes in the cluster via the
> Cassandra Sidecar.
> 
> This CEP proposes new APIs in the Cassandra Sidecar and a companion
> library that allows deep integration into Apache Spark that allows
> its users to bulk import or export data from a running Cassandra
> cluster with minimal to no impact to the read/write traffic.
> 
> We will shortly publish a branch with code that will accompany this
> CEP to help readers understand it better.
> 
> As a reminder, please keep the discussion here on the dev list vs.
> in the wiki, as we’ve found it easier to manage via email.
> 
> Sincerely,
> 
> Doug Rohrer & James Berragan
> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Miklosovic, Stefan
Good point, Benjamin.

You wrote: "library could be offered as an open source project outside of the 
Cassandra project itself".

If Cassandra's code makes integrations like these possible (which I guess is 
part of the CEP), is there any reason this has to live under the Cassandra 
project umbrella instead of hosting it in a separate repository?

We could certainly promote it on the website, on the ecosystem page (1).

The logical successor of the Hadoop integration (which we had in the repository 
until recently) does not have to be in the repository again. We might expose 
ourselves unnecessarily to the same risk we had with Hadoop if the code is no 
longer maintained for whatever reason, be it technological obsolescence or a 
shortage of maintainers.

(1) https://cassandra.apache.org/_/ecosystem.html


From: Benjamin Lerer 
Sent: Friday, March 24, 2023 10:35
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark 
Bulk Analytics




Hi Doug,

Outside of the changes to the Cassandra Sidecar that are mentioned, what the 
CEP proposes is the donation of a library for Spark integration. It seems to me 
that this library could be offered as an open source project outside of the 
Cassandra project itself. If we accept Spark Bulk Analytic as part of the 
Cassandra project it means that the community will commit to maintain it and 
ensure that for each Cassandra release it will be fully compatible. Considering 
our history with Hadoop integration which has basically been unmaintained for 
years, I am not convinced that it is what we should do.
We only started to expand the scope of the project recently and I would 
personally prefer that we do that slowly starting with the drivers that are 
critical for C*. Now, it is only my personal opinion and other people might 
have a different view on those things.

On Thu, Mar 23, 2023 at 23:29, Miklosovic, Stefan 
<stefan.mikloso...@netapp.com> wrote:
Hi,

I think this might be a great contribution in the light of removed Hadoop 
integration recently (CASSANDRA-18323) as it will not be in 5.0 anymore. If 
this CEP is adopted and delivered, I can see how it might be a logical 
replacement of that.

Regards


From: Doug Rohrer <droh...@apple.com>
Sent: Thursday, March 23, 2023 18:33
To: dev@cassandra.apache.org
Cc: James Berragan
Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk 
Analytics





Hi everyone,

Wiki: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to 
interact with large amounts of data for analytics workloads. However, 
Cassandra’s standard APIs aren’t designed for large scale data egress/ingest as 
the native read/write paths weren’t designed for bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation 
of custom Spark (or similar) applications that can either read or write large 
amounts of Cassandra data at line rates, by accessing the persistent storage of 
nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
that allows deep integration into Apache Spark that allows its users to bulk 
import or export data from a running Cassandra cluster with minimal to no 
impact to the read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help 
readers understand it better.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, 
as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Benjamin Lerer
Hi Doug,

Outside of the changes to the Cassandra Sidecar that are mentioned, what
the CEP proposes is the donation of a library for Spark integration. It
seems to me that this library could be offered as an open source project
outside of the Cassandra project itself. If we accept Spark Bulk Analytics
as part of the Cassandra project, it means that the community will commit to
maintaining it and ensuring that it is fully compatible with each Cassandra
release. Considering our history with the Hadoop integration, which has
basically been unmaintained for years, I am not convinced that this is what
we should do.
We only started to expand the scope of the project recently, and I would
personally prefer that we do that slowly, starting with the drivers that are
critical for C*. This is only my personal opinion, and other people might
have a different view on those things.

On Thu, Mar 23, 2023 at 23:29, Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi,
>
> I think this might be a great contribution in the light of removed Hadoop
> integration recently (CASSANDRA-18323) as it will not be in 5.0 anymore. If
> this CEP is adopted and delivered, I can see how it might be a logical
> replacement of that.
>
> Regards
>
> 
> From: Doug Rohrer 
> Sent: Thursday, March 23, 2023 18:33
> To: dev@cassandra.apache.org
> Cc: James Berragan
> Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark
> Bulk Analytics
>
>
>
>
>
> Hi everyone,
>
> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>
> We’d like to propose this CEP for adoption by the community.
>
> It is common for teams using Cassandra to find themselves looking for a
> way to interact with large amounts of data for analytics workloads.
> However, Cassandra’s standard APIs aren’t designed for large scale data
> egress/ingest as the native read/write paths weren’t designed for bulk
> analytics.
>
> We’re proposing this CEP for this exact purpose. It enables the
> implementation of custom Spark (or similar) applications that can either
> read or write large amounts of Cassandra data at line rates, by accessing
> the persistent storage of nodes in the cluster via the Cassandra Sidecar.
>
> This CEP proposes new APIs in the Cassandra Sidecar and a companion
> library that allows deep integration into Apache Spark that allows its
> users to bulk import or export data from a running Cassandra cluster with
> minimal to no impact to the read/write traffic.
>
> We will shortly publish a branch with code that will accompany this CEP to
> help readers understand it better.
>
> As a reminder, please keep the discussion here on the dev list vs. in the
> wiki, as we’ve found it easier to manage via email.
>
> Sincerely,
>
> Doug Rohrer & James Berragan
>


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-23 Thread Miklosovic, Stefan
Hi,

I think this might be a great contribution in light of the recent removal of 
the Hadoop integration (CASSANDRA-18323), which will no longer be in 5.0. If 
this CEP is adopted and delivered, I can see how it might be a logical 
replacement for it.

Regards


From: Doug Rohrer 
Sent: Thursday, March 23, 2023 18:33
To: dev@cassandra.apache.org
Cc: James Berragan
Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk 
Analytics





Hi everyone,

Wiki: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to 
interact with large amounts of data for analytics workloads. However, 
Cassandra’s standard APIs aren’t designed for large scale data egress/ingest as 
the native read/write paths weren’t designed for bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation 
of custom Spark (or similar) applications that can either read or write large 
amounts of Cassandra data at line rates, by accessing the persistent storage of 
nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
that allows deep integration into Apache Spark that allows its users to bulk 
import or export data from a running Cassandra cluster with minimal to no 
impact to the read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help 
readers understand it better.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, 
as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan


[DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-23 Thread Doug Rohrer
Hi everyone,

Wiki: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to 
interact with large amounts of data for analytics workloads. However, 
Cassandra’s standard APIs aren’t designed for large scale data egress/ingest as 
the native read/write paths weren’t designed for bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation 
of custom Spark (or similar) applications that can either read or write large 
amounts of Cassandra data at line rates, by accessing the persistent storage of 
nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
that integrates deeply with Apache Spark, allowing its users to bulk 
import or export data from a running Cassandra cluster with minimal to no 
impact on the read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help 
readers understand it better.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, 
as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan