Re: [DISCUSS] CASSANDRA-19113: Publishing dtest-shaded JARs on release

2023-11-28 Thread Doug Rohrer
+1 (nb, but not a vote, so ¯\_(ツ)_/¯ ) - would be lovely to not have to deal 
with this individually for each project in which we use the in-jvm dtest 
framework. As Francisco noted, we’re using this in the sidecar and Analytics 
projects now and I’ve had to jump through a lot of hoops to get everything 
building consistently.

I’ve got some minor modifications to the way in which the existing shading 
works that I can contribute back to the core Cassandra project (mostly, a few 
additional relocations and not using the user’s default Maven cache as the 
temporary installation location as it was difficult to make sure you had the 
correct dtest jar with a bunch of them in the `.m2` directory).

Doug

> On Nov 28, 2023, at 2:51 PM, Josh McKenzie  wrote:
> 
> Building these jars every time we run every CI job is just silly.
> 
> +1.
> 
> On Tue, Nov 28, 2023, at 2:08 PM, Francisco Guerrero wrote:
>> Hi Abe,
>> 
>> I'm +1 on this. Several Cassandra-ecosystem projects build the dtest jar in 
>> CI. We'd very much prefer to just consume shaded dtest jars from Cassandra 
>> releases for testing purposes.
>> 
>> Best,
>> - Francisco
>> 
>> On 2023/11/28 19:02:17 Abe Ratnofsky wrote:
>> > Hey folks - wanted to raise a separate thread to discuss publishing of 
>> > dtest-shaded JARs on release.
>> > 
>> > Currently, adjacent projects that want to use the jvm-dtest framework need 
>> > to build the shaded JARs themselves. This is a decent amount of work, and 
>> > is duplicated across each project. This is mainly relevant for projects 
>> > like Sidecar and Driver. Currently, those projects need to clone and build 
>> > apache/cassandra themselves, run ant dtest-jar, and move the JAR into the 
>> > appropriate place. Different build systems treat local JARs differently, 
>> > and the whole process can be a bit complicated. Would be great to be able 
>> > to treat these as normal dependencies.
>> > 
>> > https://issues.apache.org/jira/browse/CASSANDRA-19113
>> > 
>> > Any objections?
>> > 
>> > --
>> > Abe



Re: CASSANDRA-18941 produce size bounded SSTables from CQLSSTableWriter

2023-10-25 Thread Doug Rohrer
+1 (nb) - will be nice for the analytics writer to be able to size SSTables 
appropriately and efficiently.

Doug

> On Oct 24, 2023, at 10:36 PM, guo Maxwell  wrote:
> 
> 
> 
> On Wed, Oct 25, 2023 at 05:02, Chris Lohfink <clohfin...@gmail.com> wrote:
>> +1
>> 
>> On Tue, Oct 24, 2023 at 11:24 AM Brandon Williams wrote:
>>> +1
>>> 
>>> Kind Regards,
>>> Brandon
>>> 
>>> On Mon, Oct 23, 2023 at 6:22 PM Yifan Cai wrote:
>>> >
>>> > Hi,
>>> >
>>> > I want to propose merging the patch in CASSANDRA-18941 to 4.0 and up to 
>>> > trunk and hope we are all OK with it.
>>> >
>>> > In CASSANDRA-18941, I am adding the capability to produce size-bounded 
>>> > SSTables in CQLSSTableWriter for sorted data. It can greatly benefit 
>>> > Cassandra Analytics (https://github.com/apache/cassandra-analytics) for 
>>> > bulk writing SSTables, since it avoids buffering and sorting on flush, 
>>> > given the data source is sorted already in the bulk write process. 
>>> > Cassandra Analytics supports Cassandra 4.0 and depends on the 
>>> > cassandra-all 4.0.x library. Therefore, we are mostly interested in using 
>>> > the new capability in 4.0.
>>> >
>>> > CQLSSTableWriter is only used in offline tools and never in the code path 
>>> > of Cassandra server.
>>> >
>>> > Any objections to merging the patch to 4.0 and up to trunk?
>>> >
>>> > - Yifan



Re: [DISCUSS] putting versions into Deprecated annotations

2023-10-06 Thread Doug Rohrer
+1 on reason string, especially some way to indicate what replaces a method if 
it’s being moved into some other class/new method with more parameters/etc. 
I’ve found lots of cases (in code bases in general, not C* in particular) where 
something is marked as Deprecated but there’s no mention of a replacement even 
when there is one.

As someone who has spent a bunch of time using parts of Cassandra as a library, 
this would be hugely beneficial, but it would also clearly be useful for 
maintainers of the core codebase.
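
A minimal sketch of the idea under discussion, for illustration only. The JDK's `@Deprecated` annotation carries `since` and `forRemoval` but has no reason field, so this sketch puts the replacement hint in the Javadoc `@deprecated` tag. The `LegacyApi` and `DeprecationReport` classes and their methods are hypothetical and are not part of Cassandra.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class standing in for any API with deprecated members.
class LegacyApi {
    /** @deprecated Replaced by {@link #newRead(int)}. See CASSANDRA-N. */
    @Deprecated(since = "5.0", forRemoval = true)
    public void oldRead() {}

    /** @deprecated Unmaintained; no replacement. */
    @Deprecated(since = "4.1")
    public void oldWrite() {}

    public void newRead(int limit) {}
}

// Hypothetical helper: lists deprecated methods with their version metadata.
final class DeprecationReport {
    static List<String> describe(Class<?> type) {
        List<String> lines = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            // @Deprecated is retained at runtime, so reflection can read it.
            Deprecated d = m.getAnnotation(Deprecated.class);
            if (d != null) {
                lines.add(m.getName() + " since=" + d.since()
                          + " forRemoval=" + d.forRemoval());
            }
        }
        // getDeclaredMethods() returns methods in no particular order.
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) {
        describe(LegacyApi.class).forEach(System.out::println);
    }
}
```

A report like this gives the same overview as grepping for `forRemoval = true`, while the reason and replacement stay next to the method in its Javadoc.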

Doug

> On Oct 6, 2023, at 7:49 AM, Josh McKenzie  wrote:
> 
> Might be nice to support a 3rd param that's a String for the reason it's 
> deprecated. i.e. "Replaced by X",  "Unmaintained", "Obsolete", "See 
> CASSANDRA-N", link to a dev ML thread on pony mail, etc. That way if 
> someone comes across it in the codebase they have some context to follow up 
> on if it's the shape of a thing they need w/out having to go full-bore w/git 
> blame and JQL.
> 
> On Fri, Oct 6, 2023, at 4:43 AM, Miklosovic, Stefan wrote:
>> Hi list,
>> 
>> I have a ticket to discuss (1). 
>> 
>> When we deprecate APIs / methods etc, what I want to suggest is that we 
>> might start to explicitly add the version when that happened. For example, 
>> if you deprecated something which goes to 5.0, would you be so nice to do 
>> this?
>> 
>> @Deprecated(since = "5.0") 
>> 
>> Similarly, that annotation offers one more field - forRemoval, so using it 
>> like this: 
>> 
>> @Deprecated(since = "5.0", forRemoval = true) 
>> 
>> means that this is eligible to be deleted in Cassandra 6.0. 
>> 
>> With this information, it is much easier to just "grep" and see where we 
>> stand on deprecations eligible for deletion in the next version. 
>> Currently, we basically have to go through them one by one and figure out 
>> whether each is old enough to remove. I believe this would bring more 
>> transparency into what is planned to be removed and when, and it would make 
>> clearly visible what should be removed in the next version but has not been. 
>> 
>> Tangential question to this is if everything we deprecated is eligible for 
>> removal? In other words, are there any cases when forRemoval would be false? 
>> Could you elaborate on that and give such examples or do you all think that 
>> everything which is deprecated will be eventually removed?
>> 
>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18912
>> 
>> Thanks and regards



Re: [VOTE] Accept java-driver

2023-10-03 Thread Doug Rohrer
+1 (nb)

> On Oct 3, 2023, at 10:37 AM, C. Scott Andreas  wrote:
> 
> +1 (nb)
> 
> Accepting this donation would mark a huge milestone for the project.
> 
>> On Oct 3, 2023, at 4:25 AM, Josh McKenzie  wrote:
>> 
>> 
>>> I see now this will likely be instead apache/cassandra-java-driver
>> I was wondering about that. apache/java-driver seemed pretty broad. :)
>> 
>> From the linked page:
>> Check that all active committers have a signed CLA on record. TODO – attach 
>> list
>> I've been part of these discussions and work so am familiar with the status 
>> of it (as well as guidance and clearance from the foundation re: folks we 
>> couldn't reach) - but might be worthwhile to link to the sheet or perhaps 
>> instead provide a summary of the 49 java contributors, their CLA signing 
>> status, attempts to reach out, etc for other PMC members that weren't 
>> actively involved back when we were working through it.
>> 
>> As for my vote: +1
>> 
>> Thanks everyone for the hard work getting to this point. This really is a 
>> significant contribution to the project.
>> 
>> On Tue, Oct 3, 2023, at 6:48 AM, Brandon Williams wrote:
>>> +1
>>> 
>>> Kind Regards,
>>> Brandon
>>> 
>>> On Mon, Oct 2, 2023 at 11:53 PM Mick Semb Wever wrote:
>>> >
>>> > The donation of the java-driver is ready for its IP Clearance vote.
>>> > https://incubator.apache.org/ip-clearance/cassandra-java-driver.html
>>> >
>>> > The SGA has been sent to the ASF.  This does not require acknowledgement 
>>> > before the vote.
>>> >
>>> > Once the vote passes, and the SGA has been filed by the ASF Secretary, we 
>>> > will request ASF Infra to move the datastax/java-driver as-is to 
>>> > apache/java-driver
>>> >
>>> > This means all branches and tags, with all their history, will be kept.  
>>> > A cleaning effort has already cleaned up anything deemed not needed.
>>> >
>>> > Background for the donation is found in CEP-8: 
>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+DataStax+Drivers+Donation
>>> >
>>> > PMC members, please take note of (and check) the IP Clearance 
>>> > requirements when voting.
>>> >
>>> > The vote will be open for 72 hours (or longer). Votes by PMC members are 
>>> > considered binding. A vote passes if there are at least three binding +1s 
>>> > and no -1's.
>>> >
>>> > regards,
>>> > Mick
>>> 
>> 
> 
> 



Re: [Discuss] Enabling JMX in in-jvm dtests (by default)

2023-08-25 Thread Doug Rohrer
I’d agree that anywhere we’re calling `nodetoolResult` or `nodetool` in a test, 
it would be better to enable JMX and use it rather than the older mocks we set 
up to enable calling the mbeans directly. I don’t think enabling JMX by default 
is the right way to go mostly due to the added resources/time required to run 
the tests (it’s only a few seconds of additional startup/shutdown time, but 
when running lots of tests every second counts).  Also, all other features are 
only enabled when requested, so making JMX on by default would require us to 
change the general pattern and have a `without` method to turn off a feature?

Better, I think, just to require it to be explicitly turned on and then have 
the methods that call into nodetool on Instance just throw a clear exception if 
jmx is disabled.
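
A minimal sketch of that "explicit opt-in, fail loudly" behavior. `SketchFeature` and `SketchInstance` are hypothetical stand-ins for illustration and are not the real in-jvm dtest API; the nodetool call is stubbed out rather than run over JMX.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the dtest feature flags; not the real API.
enum SketchFeature { NETWORK, GOSSIP, JMX }

// Hypothetical stand-in for an in-jvm dtest instance.
final class SketchInstance {
    private final Set<SketchFeature> enabled = EnumSet.noneOf(SketchFeature.class);

    // Features are off unless the test asks for them explicitly.
    SketchInstance(SketchFeature... features) {
        enabled.addAll(Arrays.asList(features));
    }

    String nodetool(String... args) {
        if (!enabled.contains(SketchFeature.JMX)) {
            throw new IllegalStateException(
                "nodetool requires the JMX feature; enable it explicitly for this test");
        }
        // A real implementation would run the command over a JMX connection here.
        return "nodetool " + String.join(" ", args);
    }

    public static void main(String[] args) {
        System.out.println(new SketchInstance(SketchFeature.JMX).nodetool("status"));
        try {
            new SketchInstance(SketchFeature.GOSSIP).nodetool("status");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the design is that a test missing the feature fails immediately with a message naming the fix, instead of silently falling back to mocked MBeans.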

Doug

> On Aug 25, 2023, at 6:35 AM, Brandon Williams  wrote:
> 
> I would prefer to have one standard way to do it, and given the
> options I would prefer it be proper JMX instead of mocking.
> 
> Kind Regards,
> Brandon
> 
> On Fri, Aug 25, 2023 at 4:20 AM Miklosovic, Stefan
>  wrote:
>> 
>> Hi list,
>> 
>> I want to gather feedback on this comment (1).
>> 
>> Long story short, until the JMX feature was introduced, we kind of hacked / 
>> mocked the calls to MBeans from IInstance, like this (2). If you notice, 
>> there are a lot of methods throwing UnsupportedOperationException because we 
>> had no proper JMX connection in place. That in turn means that tests which 
>> call nodetool commands that use these MBeans / operations are not 
>> possible.
>> 
>> The fix I made in CASSANDRA-18572 will use the JMX feature and hook 
>> nodetool to a proper JMX connection, where we are not mocking anything. 
>> It will use the same code paths as in production.
>> 
>> However, this is happening only if one uses JMX feature. So all existing 
>> tests calling nodetool without this feature will still use it like it was. 
>> The patch I made takes care of both scenarios.
>> 
>> My question is if we should not make JMX feature turned on by default. That 
>> way we might further simplify the code base and get rid of the hacks.
>> 
>> Another possibility is to not turn it on by default but we would add JMX 
>> feature to each test which is using nodetool. That would also mean that any 
>> future test which will use nodetool will fail if it does not have JMX 
>> feature enabled.
>> 
>> What would you like to see - dual solution (proper JMX connection if such 
>> feature is used as well as the legacy way) or only one solution with a 
>> proper JMX? (enabled by default or not).
>> 
>> Regards
>> 
>> (1) 
>> https://issues.apache.org/jira/browse/CASSANDRA-18572?focusedCommentId=17758920&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17758920
>> (2) 
>> https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/mock/nodetool/InternalNodeProbe.java



Re: [DISCUSS] CASSANDRA-18743 Deprecation of metrics-reporter-config

2023-08-16 Thread Doug Rohrer
My only concern about removal in 5.1 would be that removing it in a “minor” 
release would really be a breaking change, and semver says that should happen 
in a major version.

If we really want to be semver compliant, it shouldn’t be removed until 6.0 
(or, if we remove it in the next release, we should call that 6.0, but that 
conflicts with the idea of a “yearly major” so I’m not sure where we land at 
the end of the day).
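
A minimal sketch of the semver rule being applied here: a breaking change such as removing a dependency is only allowed when the major version increases. `SemverCheck` is a hypothetical helper for illustration and assumes plain `major.minor` version strings.

```java
// Hypothetical helper illustrating the semver rule; not project tooling.
final class SemverCheck {
    // Extracts the major component from a "major.minor[.patch]" string.
    private static int major(String version) {
        return Integer.parseInt(version.split("\\.")[0]);
    }

    // A removal is semver-compliant only if the major version increases.
    static boolean removalAllowed(String deprecatedIn, String removedIn) {
        return major(removedIn) > major(deprecatedIn);
    }

    public static void main(String[] args) {
        System.out.println("5.0 -> 5.1: " + removalAllowed("5.0", "5.1"));
        System.out.println("5.0 -> 6.0: " + removalAllowed("5.0", "6.0"));
    }
}
```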

Doug

> On Aug 16, 2023, at 4:14 PM, Abe Ratnofsky  wrote:
> 
> There's consensus here to deprecate metrics-reporter-config in 5.0.
> 
> Is there any objection to removing it in 5.1?
> 
>> On Aug 11, 2023, at 10:01 AM, Maxim Muzafarov  wrote:
>> 
>> +1
>> 
>> The rationale for deprecating/removing this library is not just that
>> it is obsolete and doesn't get updates. In fact, when the
>> metrics-reporter-config [1] was added the dropwizard metrics library
>> (formerly com.yammer.metrics [2]) didn't support exporting metrics to
>> files like csv, so it made sense at that time. Now it is fully covered
>> by the dropwizard reporters [3], so users can achieve the same
>> behaviour without the need for metrics-reporter-config. And that's why
>> I have a lot of doubts about it being used by anyone, but deprecation
>> is friendlier because there's no rush to remove it. :-)
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-4430
>> [2] https://issues.apache.org/jira/browse/CASSANDRA-5838
>> [3] https://metrics.dropwizard.io/4.2.0/getting-started.html#other-reporting
>> 
>> On Fri, 11 Aug 2023 at 16:50, Caleb Rackliffe  
>> wrote:
>>> 
>>> +1
>>> 
>>>> On Aug 11, 2023, at 8:10 AM, Brandon Williams wrote:
>>>> 
>>>> +1
>>>> 
>>>> Kind Regards,
>>>> Brandon
>>>> 
>>>>> On Fri, Aug 11, 2023 at 8:08 AM Ekaterina Dimitrova wrote:
>>>>> 
>>>>> 
>>>>> “The rationale for this proposed deprecation is that the upcoming 5.0 
>>>>> release is a good time to evaluate dependencies that are no longer 
>>>>> receiving updates and will become risks in the future.”
>>>>> 
>>>>> Thank you for raising it, I support your proposal for deprecation
>>>>> 
>>>>>> On Fri, 11 Aug 2023 at 8:55, Abe Ratnofsky wrote:
>>>>>> 
>>>>>> Hey folks,
>>>>>> 
>>>>>> Opening a thread to get input on a proposed dependency deprecation in 
>>>>>> 5.0: metrics-reporter-config has been archived for 3 years and not 
>>>>>> updated in nearly 6 years.
>>>>>> 
>>>>>> This project has a minor security issue with its usage of unsafe YAML 
>>>>>> loading via snakeyaml’s unprotected Constructor: 
>>>>>> https://nvd.nist.gov/vuln/detail/CVE-2022-1471
>>>>>> 
>>>>>> This CVE is reasonable to suppress, since operators should be able to 
>>>>>> trust their YAML configuration files.
>>>>>> 
>>>>>> The rationale for this proposed deprecation is that the upcoming 5.0 
>>>>>> release is a good time to evaluate dependencies that are no longer 
>>>>>> receiving updates and will become risks in the future.
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-18743
>>>>>> 
>>>>>> —
>>>>>> Abe
>>>>>> 
>>>>> 



Re: [VOTE] Release dtest-api 0.0.16

2023-08-16 Thread Doug Rohrer
+1 (nb) - Thanks Dinesh!

Doug

> On Aug 16, 2023, at 5:34 PM, Dinesh Joshi  wrote:
> 
> Proposing the test build of in-jvm dtest API 0.0.16 for release.
> 
> Repository:
> https://gitbox.apache.org/repos/asf?p=cassandra-in-jvm-dtest-api.git
> 
> Candidate SHA:
> https://github.com/apache/cassandra-in-jvm-dtest-api/commit/1ba6ef93d0721741b5f6d6d72cba3da03fe78438
> tagged with 0.0.16
> 
> Artifacts:
> https://repository.apache.org/content/repositories/orgapachecassandra-1307/org/apache/cassandra/dtest-api/0.0.16/
> 
> Key signature: 53371F9B1B425A336988B6A03B6042413D323470
> 
> Changes since last release:
> 
> * CASSANDRA-18727 - JMXUtil.getJmxConnector should retry connection attempts
> 
> The vote will be open for 24 hours. Everyone who has tested the build
> is invited to vote. Votes by PMC members are considered binding. A
> vote passes if there are at least three binding +1s.
> 



Re: [VOTE] Release dtest-api 0.0.15

2023-05-24 Thread Doug Rohrer
+1 (nb)

> On May 24, 2023, at 11:32 AM, Brandon Williams  wrote:
> 
> +1
> 
> Kind Regards,
> Brandon
> 
> On Wed, May 24, 2023 at 10:31 AM Dinesh Joshi  wrote:
>> 
>> Proposing the test build of in-jvm dtest API 0.0.15 for release.
>> 
>> Repository:
>> https://gitbox.apache.org/repos/asf?p=cassandra-in-jvm-dtest-api.git
>> 
>> Candidate SHA:
>> https://github.com/apache/cassandra-in-jvm-dtest-api/commit/48af78d1d4b5f285d3dd4991afd4df3101e3983a
>> tagged with 0.0.15
>> 
>> Artifacts:
>> https://repository.apache.org/content/repositories/orgapachecassandra-1290/org/apache/cassandra/dtest-api/0.0.15/
>> 
>> Key signature: 53371F9B1B425A336988B6A03B6042413D323470
>> 
>> Changes since last release:
>> 
>> * CASSANDRA-18537: Add JMX utility class to in-jvm dtest to ease
>> development of new tests using JMX
>> 
>> The vote will be open for 24 hours. Everyone who has tested the build
>> is invited to vote. Votes by PMC members are considered binding. A
>> vote passes if there are at least three binding +1s.



Re: [VOTE] Release dtest-api 0.0.14

2023-05-15 Thread Doug Rohrer
+1 (nb)

Doug Rohrer

> On May 15, 2023, at 7:17 PM, Brandon Williams  wrote:
> 
> +1
> 
> Kind Regards,
> Brandon
> 
>> On Mon, May 15, 2023 at 5:12 PM Dinesh Joshi  wrote:
>> 
>> Proposing the test build of in-jvm dtest API 0.0.14 for release.
>> 
>> Repository:
>> https://gitbox.apache.org/repos/asf?p=cassandra-in-jvm-dtest-api.git
>> 
>> Candidate SHA:
>> https://github.com/apache/cassandra-in-jvm-dtest-api/commit/ea4b44e0ed0a4f0bbe9b18fb40ad927b49a73a32
>> tagged with 0.0.14
>> 
>> Artifacts:
>> https://repository.apache.org/content/repositories/orgapachecassandra-1289/org/apache/cassandra/dtest-api/0.0.14/
>> 
>> Key signature: 53371F9B1B425A336988B6A03B6042413D323470
>> 
>> Changes since last release:
>> 
>> * CASSANDRA-18511: Add support for JMX in jvm-dtest
>> 
>> The vote will be open for 24 hours. Everyone who has tested the build
>> is invited to vote. Votes by PMC members are considered binding. A
>> vote passes if there are at least three binding +1s.


Re: [VOTE] CEP-29 CQL NOT Operator

2023-05-12 Thread Doug Rohrer
+1 (nb)

> On May 8, 2023, at 4:52 AM, Piotr Kołaczkowski  wrote:
> 
> Let's vote.
> 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-29%3A+CQL+NOT+operator
> 
> Piotr Kołaczkowski
> e. pkola...@datastax.com
> w. www.datastax.com



Re: [VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-07 Thread Doug Rohrer
The vote passes with 12 +1s (8 binding) and no -1.

Thank you all for taking the time to consider CEP-28. This has been a 
years-long effort by a bunch of people, and we’re really excited to be able to 
share the Cassandra Analytics library with the community and work together to 
continue improving it.

Doug Rohrer

> On May 6, 2023, at 1:52 PM, Dinesh Joshi  wrote:
> 
> +1
> 
>> On May 4, 2023, at 9:46 AM, Doug Rohrer  wrote:
>> 
>> Hello all,
>> 
>> I’d like to put CEP-28 to a vote.
>> 
>> Proposal:
>> 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
>> 
>> Jira:
>> https://issues.apache.org/jira/browse/CASSANDRA-16222
>> 
>> Draft implementation:
>> 
>> - Apache Cassandra Spark Analytics source code: 
>> https://github.com/frankgh/cassandra-analytics
>> - Changes required for Sidecar: 
>> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
>> 
>> Discussion:
>> https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
>> 
>> The vote will be open for 72 hours. 
>> A vote passes if there are at least three binding +1s and no binding vetoes. 
>> 
>> 
>> Thanks,
>> 
>> Doug Rohrer
>> 
>> 
> 



[VOTE] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Doug Rohrer
Hello all,

I’d like to put CEP-28 to a vote.

Proposal:

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

Jira:
https://issues.apache.org/jira/browse/CASSANDRA-16222

Draft implementation:

- Apache Cassandra Spark Analytics source code: 
https://github.com/frankgh/cassandra-analytics
- Changes required for Sidecar: 
https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis

Discussion:
https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3

The vote will be open for 72 hours. 
A vote passes if there are at least three binding +1s and no binding vetoes. 


Thanks,

Doug Rohrer




Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-10 Thread Doug Rohrer
I’ve updated the CEP with two overview diagrams of the interactions between 
Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
better understand how things work, and thanks for the patience as it took a bit 
longer than expected for me to find the time for this.

Doug

> On Apr 5, 2023, at 11:18 AM, Doug Rohrer  wrote:
> 
> Sorry for the delay in responding here - yes, we can add some diagrams to the 
> CEP - I’ll try to get that done by end-of-week.
> 
> Thanks,
> 
> Doug
> 
>> On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:
>> 
>> Maybe some data flow diagrams could be added to the cep showing some example 
>> operations for read/write?
>> 
>>> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
>>> 
>>> 
>>> A lot of great discussions! 
>>> 
>>> On the sidecar front, especially what the role sidecar plays in terms of 
>>> this CEP, I feel there might be some confusion. Once the code is published, 
>>> we should have clarity.
>>> Sidecar does not read sstables nor do any coordination for analytics 
>>> queries. It is local to the companion Cassandra instance. For bulk read, it 
>>> takes snapshots and streams sstables to spark workers to read. For bulk 
>>> write, it imports the sstables uploaded from spark workers. All commands 
>>> are existing jmx/nodetool functionalities from Cassandra. Sidecar adds the 
>>> http interface to them. It might be an over simplified description. The 
>>> complex computation is performed in spark clusters only.
>>> 
>>> In the long run, Cassandra might evolve into a database that does both OLTP 
>>> and OLAP. (Not what this thread aims for) 
>>> At the current stage, Spark is very suited for analytic purposes. 
>>> 
>>> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
>>>> I disagree with the first claim, as the process has all the information it 
>>>> chooses to utilise about which resources it’s using and what it’s using 
>>>> those resources for.
>>>> 
>>>> The inability to isolate GC domains is something we cannot address, but 
>>>> also probably not a problem if we were doing everything with memory 
>>>> management as well as we could be.
>>>> 
>>>> But, not worth derailing this thread for. Today we do very little well on 
>>>> this front within the process, and a separate process is well justified 
>>>> given the state of play.
>>>> 
>>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>>>>> ...
>>>>> 
>>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>>> especially for analytics queries that are going to pass the entire
>>>>>> dataset through heap somewhat constantly. 
>>>>> 
>>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>>> control for resource utilization, but this is explicitly a feature of 
>>>>> separate processes. Add in being able to separate GC domains and you can 
>>>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Derek
>>>>> 
>>>>> 
>>>>> -- 
>>>>> +---+
>>>>> | Derek Chen-Becker |
>>>>> | GPG Key available at https://keybase.io/dchenbecker and   |
>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>> +---+
>>>>> 
> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-05 Thread Doug Rohrer
Sorry for the delay in responding here - yes, we can add some diagrams to the 
CEP - I’ll try to get that done by end-of-week.

Thanks,

Doug

> On Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:
> 
> Maybe some data flow diagrams could be added to the cep showing some example 
> operations for read/write?
> 
>> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
>> 
>> 
>> A lot of great discussions! 
>> 
>> On the sidecar front, especially what the role sidecar plays in terms of 
>> this CEP, I feel there might be some confusion. Once the code is published, 
>> we should have clarity.
>> Sidecar does not read sstables nor do any coordination for analytics 
>> queries. It is local to the companion Cassandra instance. For bulk read, it 
>> takes snapshots and streams sstables to spark workers to read. For bulk 
>> write, it imports the sstables uploaded from spark workers. All commands are 
>> existing jmx/nodetool functionalities from Cassandra. Sidecar adds the http 
>> interface to them. It might be an over simplified description. The complex 
>> computation is performed in spark clusters only.
>> 
>> In the long run, Cassandra might evolve into a database that does both OLTP 
>> and OLAP. (Not what this thread aims for) 
>> At the current stage, Spark is very suited for analytic purposes. 
>> 
>> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
>>> I disagree with the first claim, as the process has all the information it 
>>> chooses to utilise about which resources it’s using and what it’s using 
>>> those resources for.
>>> 
>>> The inability to isolate GC domains is something we cannot address, but 
>>> also probably not a problem if we were doing everything with memory 
>>> management as well as we could be.
>>> 
>>> But, not worth derailing this thread for. Today we do very little well on 
>>> this front within the process, and a separate process is well justified 
>>> given the state of play.
>>> 
>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
>>>> ...
>>>> 
>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>> especially for analytics queries that are going to pass the entire
>>>>> dataset through heap somewhat constantly. 
>>>> 
>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>> control for resource utilization, but this is explicitly a feature of 
>>>> separate processes. Add in being able to separate GC domains and you can 
>>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>>> 
>>>> Cheers,
>>>> 
>>>> Derek
>>>> 
>>>> 
>>>> -- 
>>>> +---+
>>>> | Derek Chen-Becker |
>>>> | GPG Key available at https://keybase.io/dchenbecker and   |
>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>> +---+
>>>> 



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Doug Rohrer
I agree that the analytics library will need to support vnodes. To be clear, 
there’s nothing preventing the solution from working with vnodes right now, and 
no assumptions about a 1:1 topology between a token and a node. However, we 
don’t, today, have the ability to test vnode support end-to-end. We are working 
towards that, however, and should be able to remove the caveat from the 
released analytics library once we can properly test vnode support.
If it helps, I can update the CEP to say something more like “Caveat: Currently 
untested with vnodes - work is ongoing to remove this limitation”.

Doug

> On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
> 
> On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>  wrote:
>> 
>> I have concerns with the majority of this being in the sidecar and not in 
>> the database itself.  I think it would make sense for the server side of 
>> this to be a new service exposed by the database, not in the sidecar.  That 
>> way it can be able to properly integrate with the authentication and 
>> authorization apis, and to make it a first class citizen in terms of having 
>> unit/integration tests in the main DB ensuring no one breaks it.
> 
> I don't think this can/should happen until it supports the database's
> default configuration with vnodes.



[DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-23 Thread Doug Rohrer
Hi everyone,

Wiki: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics

We’d like to propose this CEP for adoption by the community.

It is common for teams using Cassandra to find themselves looking for a way to 
interact with large amounts of data for analytics workloads. However, 
Cassandra’s standard APIs and native read/write paths weren’t designed for 
large-scale data egress/ingest or bulk analytics.

We’re proposing this CEP for this exact purpose. It enables the implementation 
of custom Spark (or similar) applications that can either read or write large 
amounts of Cassandra data at line rates, by accessing the persistent storage of 
nodes in the cluster via the Cassandra Sidecar.

This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
that integrates deeply with Apache Spark, allowing its users to bulk import 
or export data from a running Cassandra cluster with minimal to no impact on 
the read/write traffic.

We will shortly publish a branch with code that will accompany this CEP to help 
readers understand it better.

As a reminder, please keep the discussion here on the dev list vs. in the wiki, 
as we’ve found it easier to manage via email.

Sincerely,

Doug Rohrer & James Berragan

Please grant Wiki access for CEP

2023-03-21 Thread Doug Rohrer
Hi folks:

I’d like to post a CEP, but given it’s the first time I’m trying to contribute 
to the wiki, I don’t have access.

If someone with access could please grant user drohrer access to post, I’d 
greatly appreciate it.

Thanks,

Doug Rohrer