Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-22 Thread Michael Shuler
On 03/22/2018 05:30 PM, Michael Shuler wrote:

> Ubuntu 16.04 (Bionic (near release))

Ubuntu 18.04 (Bionic) :)

Michael

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-22 Thread Michael Shuler
As I mentioned in IRC and was pasted earlier in the thread, I believe
the easiest path is to follow the major releases of OpenJDK in the
long-term-support Linux OS releases. Currently, Debian Stable (Stretch),
Ubuntu 16.04 (Bionic (near release)), and Red Hat / CentOS 7 all have
OpenJDK 8 as the default JDK. For long-term support, they all have build
facilities in place for their supported architectures and developers
that care about security updates for users through their documented EOL
dates.

The current deb and rpm packages for Apache Cassandra all properly
depend on OpenJDK 8, so there's really nothing to be done here, until
the project decides to implicitly depend on a JDK version not easily
installable on the major OS LTS releases. (Users of older OS versions
may need to fiddle with yum and apt sources to get OpenJDK 8, but this
is a relatively solved problem.)

Users have the ability to deviate and set a JAVA_HOME env var to use a
custom-installed JDK of their liking, or go down the `alternatives` path
of their favorite OS.
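
As a rough illustration of how one might confirm which JDK a process actually
ended up on after changing JAVA_HOME or `alternatives`, here is a small Java
sketch. The class name JdkCheck and the map layout are made up for this
example; the system properties (java.home, java.version, java.vendor) and
System.getenv are standard Java. Note that java.home can legitimately differ
from JAVA_HOME (on Java 8 it typically points at the jre subdirectory).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Cassandra: reports which JDK the JVM is running on.
public class JdkCheck {

    /** Collects the running JVM's location/version and the JAVA_HOME env var (may be unset). */
    static Map<String, String> describe() {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("java.home", System.getProperty("java.home"));
        info.put("java.version", System.getProperty("java.version"));
        info.put("java.vendor", System.getProperty("java.vendor"));
        String envHome = System.getenv("JAVA_HOME");
        info.put("JAVA_HOME", envHome == null ? "(not set)" : envHome);
        return info;
    }

    public static void main(String[] args) {
        // Print one "key = value" line per entry, in insertion order.
        describe().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Running it with the same environment the service uses shows whether the
packaged OpenJDK or a custom-installed JDK was picked up.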

1) I don't think we should get into the business of distributing
Java, even if licensing allowed it.
2) The OS vendors are in the business of keeping users updated with
upstream releases of Java, so there's no reason not to utilize them.

Michael

On 03/22/2018 05:12 PM, Jason Brown wrote:
> See the legal-discuss@ thread:
> https://mail-archives.apache.org/mod_mbox/www-legal-discuss/201803.mbox/browser
> .
> 
> TL;DR jlink-based distributions are not gonna fly due to OpenJDK's license,
> so let's focus on other paths forward.
> 
> 
> On Thu, Mar 22, 2018 at 2:04 PM, Carl Mueller 
> wrote:
> 
>> Is OpenJDK really not addressing this at all? Is that because OpenJDK is
>> beholden to Oracle somehow? This is a major disservice to Apache and the
>> java ecosystem as a whole.
>>
>> When java was fully open sourced, it was supposed to free the ecosystem to
>> a large degree from Oracle. Why is OpenJDK being so uncooperative? Are they
>> that resource strapped? Can no one, from consulting empires, Google, IBM,
>> Amazon, and a host of other major companies take care of this?
>>
>> This is basically OpenSSL all over again.
>>
>> Deciding on a way to get a stable language runtime isn't our job. It's the
>> job of either the runtime authors (OpenJDK) or another group that should
>> form around it.
>>
>> There is no looming deadline on this, is there? Can we just let the dust
>> settle on this in the overall ecosystem to see what happens? And again,
>> what is the Apache Software Foundation's approach to this that affects so
>> many of their projects?
>>

Re: Optimizing queries for partition keys

2018-03-22 Thread Benjamin Lerer
You should check the 3.x release. CASSANDRA-10657 could have fixed your
problem.


On Thu, Mar 22, 2018 at 9:15 PM, Benjamin Lerer  wrote:

> Sylvain explained the problem in CASSANDRA-4536:
> " Let me note that in CQL3 a row that have no live column don't exist, so
> we can't really implement this with a range slice having an empty columns
> list. Instead we should do a range slice with a full-row slice predicate
> with a count of 1, to make sure we do have a live column before including
> the partition key. "
>
> By using ColumnFilter.selectionBuilder(); you do not select all the
> columns. As a consequence, some partitions might be returned when they
> should not be.
>
> On Thu, Mar 22, 2018 at 6:24 PM, Sam Klock  wrote:
>
>> Cassandra devs,
>>
>> We use workflows in some of our clusters (running 3.0.15) that involve
>> "SELECT DISTINCT key FROM..."-style queries.  For some tables, we
>> observed extremely poor performance under light load (i.e., a small
>> number of rows per second and frequent timeouts), which we eventually
>> traced to replicas shipping entire rows (which in some cases could store
>> on the order of MBs of data) to service the query.  That surprised us
>> (partly because 2.1 doesn't seem to behave this way), so we did some
>> digging, and we eventually came up with a patch that modifies
>> SelectStatement.java in the following way: if the selection in the query
>> only includes the partition key, then when building a ColumnFilter for
>> the query, use:
>>
>> builder = ColumnFilter.selectionBuilder();
>>
>> instead of:
>>
>> builder = ColumnFilter.allColumnsBuilder();
>>
>> to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
>> seems to repair the performance regression, and it doesn't appear to
>> break any functionality (based on the unit tests and some smoke tests we
>> ran involving insertions and deletions).
>>
>> We'd like to contribute this patch back to the project, but we're not
>> convinced that there aren't subtle correctness issues we're missing,
>> judging both from comments in the code and the existence of
>> CASSANDRA-5912, which suggests optimizing this kind of query is
>> nontrivial.
>>
>> So: does this change sound safe to make, or are there corner cases we
>> need to account for?  If there are corner cases, are there plausibly
>> ways of addressing them at the SelectStatement level, or will we need to
>> look deeper?
>>
>> Thanks,
>> SK
>>
>


Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-22 Thread Jason Brown
See the legal-discuss@ thread:
https://mail-archives.apache.org/mod_mbox/www-legal-discuss/201803.mbox/browser
.

TL;DR jlink-based distributions are not gonna fly due to OpenJDK's license,
so let's focus on other paths forward.


On Thu, Mar 22, 2018 at 2:04 PM, Carl Mueller 
wrote:

> Is OpenJDK really not addressing this at all? Is that because OpenJDK is
> beholden to Oracle somehow? This is a major disservice to Apache and the
> java ecosystem as a whole.
>
> When java was fully open sourced, it was supposed to free the ecosystem to
> a large degree from Oracle. Why is OpenJDK being so uncooperative? Are they
> that resource strapped? Can no one, from consulting empires, Google, IBM,
> Amazon, and a host of other major companies take care of this?
>
> This is basically OpenSSL all over again.
>
> Deciding on a way to get a stable language runtime isn't our job. It's the
> job of either the runtime authors (OpenJDK) or another group that should
> form around it.
>
> There is no looming deadline on this, is there? Can we just let the dust
> settle on this in the overall ecosystem to see what happens? And again,
> what is the Apache Software Foundation's approach to this that affects so
> many of their projects?
>

Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-22 Thread Carl Mueller
Is OpenJDK really not addressing this at all? Is that because OpenJDK is
beholden to Oracle somehow? This is a major disservice to Apache and the
java ecosystem as a whole.

When java was fully open sourced, it was supposed to free the ecosystem to
a large degree from Oracle. Why is OpenJDK being so uncooperative? Are they
that resource strapped? Can no one, from consulting empires, Google, IBM,
Amazon, and a host of other major companies take care of this?

This is basically OpenSSL all over again.

Deciding on a way to get a stable language runtime isn't our job. It's the
job of either the runtime authors (OpenJDK) or another group that should
form around it.

There is no looming deadline on this, is there? Can we just let the dust
settle on this in the overall ecosystem to see what happens? And again,
what is the Apache Software Foundation's approach to this that affects so
many of their projects?

On Wed, Mar 21, 2018 at 12:55 PM, Jason Brown  wrote:

> Well, that was quick. TL;DR Redistributing any part of the OpenJDK is
> basically a no-go.
>
> Thus, that option is off the table.
>
> On Wed, Mar 21, 2018 at 10:46 AM, Jason Brown 
> wrote:
>
> > ftr, I've sent a message to legal-discuss to inquire about the licensing
> > aspect of the OpenJDK as we've been discussing. I believe anyone can follow
> > the thread by subscribing to the legal-discuss@ ML, or you can wait for
> > updates on this thread as I get them.
> >
> > On Wed, Mar 21, 2018 at 9:49 AM, Jason Brown wrote:
> >
> >> If we went down this path, I can't imagine we would build OpenJDK
> >> ourselves, but probably build a release with jlink or javapackager. I
> >> haven't done homework on that yet, but I *think* it uses a blessed OpenJDK
> >> release for the packaging (or perhaps whatever JDK you happen to be
> >> compiling/building with). Thus as long as we build/release when an OpenJDK
> >> rev is released, we would hypothetically be ok from a security POV.
> >>
> >> That being said, Micke's points about multiple architectures and other
> >> OSes (Windows for sure, macOS not so sure) are a legit concern as those
> >> would need to be separate packages, with separate CI/testing and so on :(
> >>
> >> I'm not sure betting the farm on Linux distro support is the path to
> >> happiness, either. Not everyone uses one of the distros mentioned (RH,
> >> Ubuntu), nor does everyone use Linux (sure, the vast majority is Linux/x86,
> >> but we do support Windows deployment and macOS development).
> >>
> >> -Jason
> >>
> >>
> >>
> >> On Wed, Mar 21, 2018 at 9:26 AM, Michael Burman 
> >> wrote:
> >>
> >>> On 03/21/2018 04:52 PM, Josh McKenzie wrote:
> >>>
> >>>> This would certainly mitigate a lot of the core problems with the new
> >>>> release model. Has there been any public statements of plans/intent
> >>>> with regards to distros doing this?
> >>>>
> >>> Since the latest official LTS version is Java 8, that's the only one
> >>> with publicly available information. For RHEL, OpenJDK 8 will receive
> >>> updates until October 2020.  "A major version of OpenJDK is supported
> >>> for a period of six years from the time that it is first introduced in
> >>> any version of RHEL, or until the retirement date of the underlying RHEL
> >>> platform, whichever is earlier." [1]
> >>>
> >>> [1] https://access.redhat.com/articles/1299013
> >>>
> >>>> In terms of the burden of bugfixes and security fixes if we bundled a
> >>>> JRE w/C*, cutting a patch release of C* with a new JRE distribution
> >>>> would be a really low friction process (add to build, check CI, green,
> >>>> done), so I don't think that would be a blocker for the concept.
> >>>>
> >>> And do we have someone actively monitoring CVEs for this? Would we ship
> >>> a version of OpenJDK which ensures that it works with all the major
> >>> distributions? Would we run tests against all the major distributions
> >>> for each of the OpenJDK versions we would ship after each CVE with each
> >>> Cassandra version? Who compiles the OpenJDK distribution we would create
> >>> (which wouldn't be the official one if we need to maintain support for
> >>> each distribution we support)? What if one build doesn't work for one
> >>> distro? Would we not update that CVE? OpenJDK builds that are in the
> >>> distros are not necessarily the pure ones from the upstream; they might
> >>> include patches that provide better support for the distribution - or
> >>> even fix bugs that are not yet in the upstream version.
> >>>
> >>> I guess we also need the Windows versions, maybe the PowerPC & ARM
> >>> versions also at some point. I'm not sure if we plan to support J9 or
> >>> other JVMs at some point.
> >>>
> >>> We would also need to create CVE reports after each Java CVE for
> >>> Cassandra as well, I would assume, since it would affect us separately
> >>> (and updating only the Java wouldn't help).

Re: Optimizing queries for partition keys

2018-03-22 Thread Benjamin Lerer
Sylvain explained the problem in CASSANDRA-4536:
" Let me note that in CQL3 a row that have no live column don't exist, so
we can't really implement this with a range slice having an empty columns
list. Instead we should do a range slice with a full-row slice predicate
with a count of 1, to make sure we do have a live column before including
the partition key. "

By using ColumnFilter.selectionBuilder(); you do not select all the
columns. As a consequence, some partitions might be returned when they
should not be.
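
To make the concern concrete, here is a toy Java model of it. This is only a
sketch under stated assumptions: it does not use Cassandra's storage engine or
the real ColumnFilter class, and every name in it (DistinctKeysSketch,
sampleData, distinctKeys, the fetchColumns flag) is invented for illustration.
A null value stands in for a deleted (non-live) cell.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: partitions are maps of column name -> value, null meaning "not live".
public class DistinctKeysSketch {

    /** Builds two partitions: k1 has a live cell, k2 holds only deleted data. */
    static Map<String, Map<String, String>> sampleData() {
        Map<String, String> live = new LinkedHashMap<>();
        live.put("v", "1");
        Map<String, String> dead = new LinkedHashMap<>();
        dead.put("v", null); // hypothetical stand-in for a tombstoned cell

        Map<String, Map<String, String>> partitions = new LinkedHashMap<>();
        partitions.put("k1", live);
        partitions.put("k2", dead);
        return partitions;
    }

    /**
     * Returns the partition keys a "SELECT DISTINCT key" style read would report.
     * With fetchColumns == false the reader never looks at cells, so it cannot
     * tell whether a partition still has a live column.
     */
    static List<String> distinctKeys(Map<String, Map<String, String>> partitions,
                                     boolean fetchColumns) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : partitions.entrySet()) {
            if (!fetchColumns) {
                keys.add(e.getKey()); // no liveness information available
                continue;
            }
            // Stop at the first live cell, mirroring the "count of 1" idea in the quote.
            boolean hasLive = false;
            for (String value : e.getValue().values()) {
                if (value != null) {
                    hasLive = true;
                    break;
                }
            }
            if (hasLive) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> data = sampleData();
        System.out.println(distinctKeys(data, true));  // prints [k1]
        System.out.println(distinctKeys(data, false)); // prints [k1, k2]
    }
}
```

In this model, skipping the column data returns k2 even though it has no live
column, which is the kind of extra partition described above.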

On Thu, Mar 22, 2018 at 6:24 PM, Sam Klock  wrote:

> Cassandra devs,
>
> We use workflows in some of our clusters (running 3.0.15) that involve
> "SELECT DISTINCT key FROM..."-style queries.  For some tables, we
> observed extremely poor performance under light load (i.e., a small
> number of rows per second and frequent timeouts), which we eventually
> traced to replicas shipping entire rows (which in some cases could store
> on the order of MBs of data) to service the query.  That surprised us
> (partly because 2.1 doesn't seem to behave this way), so we did some
> digging, and we eventually came up with a patch that modifies
> SelectStatement.java in the following way: if the selection in the query
> only includes the partition key, then when building a ColumnFilter for
> the query, use:
>
> builder = ColumnFilter.selectionBuilder();
>
> instead of:
>
> builder = ColumnFilter.allColumnsBuilder();
>
> to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
> seems to repair the performance regression, and it doesn't appear to
> break any functionality (based on the unit tests and some smoke tests we
> ran involving insertions and deletions).
>
> We'd like to contribute this patch back to the project, but we're not
> convinced that there aren't subtle correctness issues we're missing,
> judging both from comments in the code and the existence of
> CASSANDRA-5912, which suggests optimizing this kind of query is nontrivial.
>
> So: does this change sound safe to make, or are there corner cases we
> need to account for?  If there are corner cases, are there plausibly
> ways of addressing them at the SelectStatement level, or will we need to
> look deeper?
>
> Thanks,
> SK
>


Optimizing queries for partition keys

2018-03-22 Thread Sam Klock
Cassandra devs,

We use workflows in some of our clusters (running 3.0.15) that involve
"SELECT DISTINCT key FROM..."-style queries.  For some tables, we
observed extremely poor performance under light load (i.e., a small
number of rows per second and frequent timeouts), which we eventually
traced to replicas shipping entire rows (which in some cases could store
on the order of MBs of data) to service the query.  That surprised us
(partly because 2.1 doesn't seem to behave this way), so we did some
digging, and we eventually came up with a patch that modifies
SelectStatement.java in the following way: if the selection in the query
only includes the partition key, then when building a ColumnFilter for
the query, use:

builder = ColumnFilter.selectionBuilder();

instead of:

builder = ColumnFilter.allColumnsBuilder();

to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
seems to repair the performance regression, and it doesn't appear to
break any functionality (based on the unit tests and some smoke tests we
ran involving insertions and deletions).
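
The decision described above can be sketched in isolation as follows. This is
a stand-alone illustration, not the actual patch: BuilderChoiceSketch,
BuilderKind and chooseBuilder are hypothetical names, and the enum values only
stand in for ColumnFilter.selectionBuilder() and
ColumnFilter.allColumnsBuilder().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the choice made when building the ColumnFilter.
public class BuilderChoiceSketch {

    enum BuilderKind { SELECTION, ALL_COLUMNS }

    /** Picks SELECTION only when every selected column is a partition key column. */
    static BuilderKind chooseBuilder(List<String> selected, Set<String> partitionKeyColumns) {
        boolean onlyPartitionKey =
                !selected.isEmpty() && partitionKeyColumns.containsAll(selected);
        return onlyPartitionKey ? BuilderKind.SELECTION : BuilderKind.ALL_COLUMNS;
    }

    public static void main(String[] args) {
        Set<String> pk = new HashSet<>(Arrays.asList("key"));
        System.out.println(chooseBuilder(Arrays.asList("key"), pk));          // prints SELECTION
        System.out.println(chooseBuilder(Arrays.asList("key", "value"), pk)); // prints ALL_COLUMNS
    }
}
```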

We'd like to contribute this patch back to the project, but we're not
convinced that there aren't subtle correctness issues we're missing,
judging both from comments in the code and the existence of
CASSANDRA-5912, which suggests optimizing this kind of query is nontrivial.

So: does this change sound safe to make, or are there corner cases we
need to account for?  If there are corner cases, are there plausibly
ways of addressing them at the SelectStatement level, or will we need to
look deeper?

Thanks,
SK




RE: Paying off tech debt and correctly naming things

2018-03-22 Thread Kenneth Brotman
Perfect!

-----Original Message-----
From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon Haddad
Sent: Thursday, March 22, 2018 8:10 AM
To: dev@cassandra.apache.org
Subject: Re: Paying off tech debt and correctly naming things

Cool.  I think there’s general agreement that doing this in as small bites as
possible is going to be the best approach.  I have no interest in mega patches.

>  The combined approach takes a
> change that's already non-trivially dealing with complex subsystem 
> changes and injects a bunch of trivial renaming noise across unrelated 
> subsystems into the signal of an actual logic refactor.

I agree.  This is why I like the idea of proactively working to improve the
readability of the codebase as a specific goal, rather than being wrapped into
some other unrelated patch.  Keeping the scope in check is the challenge.
Simple class and method renames, as several have pointed out, are easy enough
with IDEA.

I’ll start with class renames, as individual patches for each of them.  I’ll be
sure to call it out on the ML.  First one will be ColumnFamilyStore ->
TableStore.

Jon


RE: Paying off tech debt and correctly naming things

2018-03-22 Thread Kenneth Brotman
I agree with Jason Brown: this is a good topic to address, and the effect is
not trivial.  Going whole hog is too risky.  Better to work carefully and
meticulously on small areas at a time, with a warning to everyone so they can
chime in if they are working on that area and would prefer it left alone for
now.

I wouldn't want it to delay other things.  For example, I'm not sure where
everyone is on this, but it would be really cool if Reaper were integrated
into Cassandra, in time for version 4.0, so that it just worked away without
needing any installation or attention of any kind.

Kenneth Brotman

-----Original Message-----
From: Jason Brown [mailto:jasedbr...@gmail.com] 
Sent: Thursday, March 22, 2018 7:14 AM
To: dev@cassandra.apache.org
Subject: Re: Paying off tech debt and correctly naming things

Jon,

Thanks for bringing up this topic. I'll admit that I've been around this code 
base for long enough, and have enough accumulated history, that I probably 
can't fully appreciate the impact for a newcomer wrt naming.
However, as Josh points out, this situation probably happens to "every 
non-trivially aged code-base ever".

One thing I'd like to add is that with these types of large refactoring 
changes, the review effort is non-trivial. This is because the review still has 
to ensure that correctness is preserved and it's easy to overlook a seemingly 
innocuous change.

That being said, I am supportive of this effort. However, I believe it's going 
to be best, for contributor and reviewer, to break it up into smaller, more 
digestible pieces. I'd also like to request that we not go whole hog and try to 
do everything in a compressed time frame; reviewer availability is already 
stretched thin and I'm afraid of deepening the review queue, especially mine :)

Thanks,

-Jason




On Thu, Mar 22, 2018 at 6:41 AM, Josh McKenzie  wrote:

> > Some of us have big patches in flight, things that actually pay off
> > some technical debt, and dealing with such renames is rebase hell :\
> For sure, but with a code-base this old / organically grown, I expect 
> this will always be the case. If we're talking something as simple as 
> an intellij rename refactor, while menial, couldn't someone with a 
> giant patch just do the same thing on their side and spend half an 
> hour of their life clicking next? ;)
>
> > That said, there is good time for such renames - it’s during those 
> > major refactors and rewrites. When you are changing a subsystem, 
> > might as well do the appropriate renames.
> Does that hold true for a code-base with as much static state and 
> abstraction leaking / bad factoring as we have? (i.e. every 
> non-trivially aged code-base ever) The combined approach takes a 
> change that's already non-trivially dealing with complex subsystem 
> changes and injects a bunch of trivial renaming noise across unrelated 
> subsystems into the signal of an actual logic refactor.
>
> On Thu, Mar 22, 2018 at 9:31 AM, Aleksey Yeshchenko wrote:
> > Poor and out-of-date naming of things is probably the least serious part
> > of our technical debt. Bad factoring, and straight-up poorly written
> > components is where it’s really at.
> >
> > Doing a big rename for rename's sake alone does more harm than good,
> > sometimes. Some of us have big patches in flight, things that actually
> > pay off some technical debt, and dealing with such renames is rebase
> > hell :\
> >
> > That said, there is a good time for such renames - it’s during those
> > major refactors and rewrites. When you are changing a subsystem, might
> > as well do the appropriate renames.
> >
> > —
> > AY
> >
> > On 20 March 2018 at 22:04:48, Jon Haddad (j...@jonhaddad.com) wrote:
> >
> > Whenever I hop around in the codebase, one thing that always manages to
> > slow me down is needing to understand the context of the variable names
> > that I’m looking at. We’ve now removed thrift the transport, but the
> > variables, classes and comments still remain. Personally, I’d like to go
> > in and pay off as much technical debt as possible by refactoring the code
> > to be as close to CQL as possible. Rows should be rows, not partitions,
> > I’d love to see the term column family removed forever in favor of always
> > using tables. That said, it’s a big task. I did a quick refactor in a
> > branch, simply changing the ColumnFamilyStore class to TableStore, and
> > pushed it up to GitHub. [1]
> >
> > Didn’t click on the link? That’s ok. The TL;DR is that it’s almost 2K LOC
> > changed across 275 files. I’ll note that my branch doesn’t change any of
> > the almost 1000 search results of “columnfamilystore” found in the
> > codebase and hundreds of tests failed on my branch in CircleCI, so that
> > 2K LOC change would probably be quite a bit bigger. There is, of course,
> > a lot more than just renaming this one class, there’s thousands of
> > variable names using any manner of “cf”, “cfs”, “columnfamily”, names
> > plus comments and who knows 

Re: Paying off tech debt and correctly naming things

2018-03-22 Thread Jon Haddad
Cool.  I think there’s general agreement that doing this in bites as small as
possible is going to be the best approach.  I have no interest in mega patches.

>  The combined approach takes a
> change that's already non-trivially dealing with complex subsystem
> changes and injects a bunch of trivial renaming noise across unrelated
> subsystems into the signal of an actual logic refactor.

I agree.  This is why I like the idea of proactively working to improve the
readability of the codebase as a specific goal, rather than wrapping it into
some other unrelated patch.  Keeping the scope in check is the challenge.
Simple class and method renames, as several have pointed out, are easy enough
with IDEA.

I’ll start with class renames, as individual patches for each of them.  I’ll be 
sure to call it out on the ML.  First one will be ColumnFamilyStore -> 
TableStore.  

Jon

> On Mar 22, 2018, at 7:13 AM, Jason Brown  wrote:
> 
> Jon,
> 
> Thanks for bringing up this topic. I'll admit that I've been around this
> code base for long enough, and have enough accumulated history, that I
> probably can't fully appreciate the impact for a newcomer wrt naming.
> However, as Josh points out, this situation probably happens to "every
> non-trivially aged code-base ever".
> 
> One thing I'd like to add is that with these types of large refactoring
> changes, the review effort is non-trivial. This is because the review still
> has to ensure that correctness is preserved and it's easy to overlook a
> seemingly innocuous change.
> 
> That being said, I am supportive of this effort. However, I believe it's
> going to be best, for contributor and reviewer, to break it up into
> smaller, more digestible pieces. I'd also like to request that we not go
> whole hog and try to do everything in a compressed time frame; reviewer
> availability is already stretched thin and I'm afraid of deepening the
> review queue, especially mine :)
> 
> Thanks,
> 
> -Jason
> 
> 
> 
> 
> On Thu, Mar 22, 2018 at 6:41 AM, Josh McKenzie  wrote:
> 
>>> Some of us have big patches in flight, things that actually
>>> pay off some technical debt, and dealing with such renames is rebase
>>> hell :\
>> For sure, but with a code-base this old / organically grown, I expect
>> this will always be the case. If we're talking something as simple as
>> an intellij rename refactor, while menial, couldn't someone with a
>> giant patch just do the same thing on their side and spend half an
>> hour of their life clicking next? ;)
>> 
>>> That said, there is good time for such renames - it’s during
>>> those major refactors and rewrites. When you are
>>> changing a subsystem, might as well do the appropriate renames.
>> Does that hold true for a code-base with as much static state and
>> abstraction leaking / bad factoring as we have? (i.e. every
>> non-trivially aged code-base ever) The combined approach takes a
>> change that's already non-trivially dealing with complex subsystem
>> changes and injects a bunch of trivial renaming noise across unrelated
>> subsystems into the signal of an actual logic refactor.
>> 
>> On Thu, Mar 22, 2018 at 9:31 AM, Aleksey Yeshchenko wrote:
>>> Poor and out-of-date naming of things is probably the least serious part
>>> of our technical debt. Bad factoring, and straight-up
>>> poorly written components is where it’s really at.
>>>
>>> Doing a big rename for rename sake alone does more harm than it is good,
>>> sometimes. Some of us have big patches
>>> in flight, things that actually pay off some technical debt, and dealing
>>> with such renames is rebase hell :\
>>>
>>> That said, there is good time for such renames - it’s during those major
>>> refactors and rewrites. When you are
>>> changing a subsystem, might as well do the appropriate renames.
>>> 
>>> —
>>> AY
>>> 
>>> On 20 March 2018 at 22:04:48, Jon Haddad (j...@jonhaddad.com) wrote:
>>> 
>>> Whenever I hop around in the codebase, one thing that always manages to
>>> slow me down is needing to understand the context of the variable names
>>> that I’m looking at. We’ve now removed thrift the transport, but the
>>> variables, classes and comments still remain. Personally, I’d like to go in
>>> and pay off as much technical debt as possible by refactoring the code to
>>> be as close to CQL as possible. Rows should be rows, not partitions, I’d
>>> love to see the term column family removed forever in favor of always using
>>> tables. That said, it’s a big task. I did a quick refactor in a branch,
>>> simply changing the ColumnFamilyStore class to TableStore, and pushed it up
>>> to GitHub. [1]
>>> 
>>> Didn’t click on the link? That’s ok. The TL;DR is that it’s almost 2K
>>> LOC changed across 275 files. I’ll note that my branch doesn’t change any
>>> of the almost 1000 search results of “columnfamilystore” found in the
>>> codebase and hundreds of tests failed on my branch in CircleCI, so that 2K
>>> LOC change 

Re: Paying off tech debt and correctly naming things

2018-03-22 Thread Jason Brown
Jon,

Thanks for bringing up this topic. I'll admit that I've been around this
code base for long enough, and have enough accumulated history, that I
probably can't fully appreciate the impact for a newcomer wrt naming.
However, as Josh points out, this situation probably happens to "every
non-trivially aged code-base ever".

One thing I'd like to add is that with these types of large refactoring
changes, the review effort is non-trivial. This is because the review still
has to ensure that correctness is preserved and it's easy to overlook a
seemingly innocuous change.

That being said, I am supportive of this effort. However, I believe it's
going to be best, for contributor and reviewer, to break it up into
smaller, more digestible pieces. I'd also like to request that we not go
whole hog and try to do everything in a compressed time frame; reviewer
availability is already stretched thin and I'm afraid of deepening the
review queue, especially mine :)

Thanks,

-Jason




On Thu, Mar 22, 2018 at 6:41 AM, Josh McKenzie  wrote:

> > Some of us have big patches in flight, things that actually
> > pay off some technical debt, and dealing with such renames is rebase
> > hell :\
> For sure, but with a code-base this old / organically grown, I expect
> this will always be the case. If we're talking something as simple as
> an intellij rename refactor, while menial, couldn't someone with a
> giant patch just do the same thing on their side and spend half an
> hour of their life clicking next? ;)
>
> > That said, there is good time for such renames - it’s during
> > those major refactors and rewrites. When you are
> > changing a subsystem, might as well do the appropriate renames.
> Does that hold true for a code-base with as much static state and
> abstraction leaking / bad factoring as we have? (i.e. every
> non-trivially aged code-base ever) The combined approach takes a
> change that's already non-trivially dealing with complex subsystem
> changes and injects a bunch of trivial renaming noise across unrelated
> subsystems into the signal of an actual logic refactor.
>
> On Thu, Mar 22, 2018 at 9:31 AM, Aleksey Yeshchenko wrote:
> > Poor and out-of-date naming of things is probably the least serious part
> > of our technical debt. Bad factoring, and straight-up
> > poorly written components is where it’s really at.
> >
> > Doing a big rename for rename sake alone does more harm than it is good,
> > sometimes. Some of us have big patches
> > in flight, things that actually pay off some technical debt, and dealing
> > with such renames is rebase hell :\
> >
> > That said, there is good time for such renames - it’s during those major
> > refactors and rewrites. When you are
> > changing a subsystem, might as well do the appropriate renames.
> >
> > —
> > AY
> >
> > On 20 March 2018 at 22:04:48, Jon Haddad (j...@jonhaddad.com) wrote:
> >
> > Whenever I hop around in the codebase, one thing that always manages to
> > slow me down is needing to understand the context of the variable names
> > that I’m looking at. We’ve now removed thrift the transport, but the
> > variables, classes and comments still remain. Personally, I’d like to go in
> > and pay off as much technical debt as possible by refactoring the code to
> > be as close to CQL as possible. Rows should be rows, not partitions, I’d
> > love to see the term column family removed forever in favor of always using
> > tables. That said, it’s a big task. I did a quick refactor in a branch,
> > simply changing the ColumnFamilyStore class to TableStore, and pushed it up
> > to GitHub. [1]
> >
> > Didn’t click on the link? That’s ok. The TL;DR is that it’s almost 2K
> > LOC changed across 275 files. I’ll note that my branch doesn’t change any
> > of the almost 1000 search results of “columnfamilystore” found in the
> > codebase and hundreds of tests failed on my branch in CircleCI, so that 2K
> > LOC change would probably be quite a bit bigger. There is, of course, a lot
> > more than just renaming this one class, there’s thousands of variable names
> > using any manner of “cf”, “cfs”, “columnfamily”, names plus comments and
> > who knows what else. There’s lots of references in probably every file that
> > would have to get updated.
> >
> > What are people’s thoughts on this? We should be honest with ourselves
> > and know this isn’t going to get any easier over time. It’s only going to
> > get more confusing for new people to the project, and having to figure out
> > “what kind of row am i even looking at” is a waste of time. There’s
> > obviously a much bigger impact than just renaming a bunch of files, there’s
> > any number of patches and branches that would become outdated, plus anyone
> > pulling in Cassandra as a dependency would be affected. I don’t really have
> > a solution for the disruption other than “leave it in place”, but in my
> > mind that’s not a great (or even good) solution.
> >
> > Anyways, enough out of me. My concern for ergonomics and naming might be
> 

Re: Paying off tech debt and correctly naming things

2018-03-22 Thread Josh McKenzie
> Some of us have big patches in flight, things that actually
> pay off some technical debt, and dealing with such renames is rebase hell :\
For sure, but with a code-base this old / organically grown, I expect
this will always be the case. If we're talking something as simple as
an intellij rename refactor, while menial, couldn't someone with a
giant patch just do the same thing on their side and spend half an
hour of their life clicking next? ;)

> That said, there is good time for such renames - it’s during
> those major refactors and rewrites. When you are
> changing a subsystem, might as well do the appropriate renames.
Does that hold true for a code-base with as much static state and
abstraction leaking / bad factoring as we have? (i.e. every
non-trivially aged code-base ever) The combined approach takes a
change that's already non-trivially dealing with complex subsystem
changes and injects a bunch of trivial renaming noise across unrelated
subsystems into the signal of an actual logic refactor.

On Thu, Mar 22, 2018 at 9:31 AM, Aleksey Yeshchenko  wrote:
> Poor and out-of-date naming of things is probably the least serious part of 
> our technical debt. Bad factoring, and straight-up
> poorly written components is where it’s really at.
>
> Doing a big rename for rename sake alone does more harm than it is good, 
> sometimes. Some of us have big patches
> in flight, things that actually pay off some technical debt, and dealing with 
> such renames is rebase hell :\
>
> That said, there is good time for such renames - it’s during those major 
> refactors and rewrites. When you are
> changing a subsystem, might as well do the appropriate renames.
>
> —
> AY
>
> On 20 March 2018 at 22:04:48, Jon Haddad (j...@jonhaddad.com) wrote:
>
> Whenever I hop around in the codebase, one thing that always manages to slow 
> me down is needing to understand the context of the variable names that I’m 
> looking at. We’ve now removed thrift the transport, but the variables, 
> classes and comments still remain. Personally, I’d like to go in and pay off 
> as much technical debt as possible by refactoring the code to be as close to 
> CQL as possible. Rows should be rows, not partitions, I’d love to see the 
> term column family removed forever in favor of always using tables. That 
> said, it’s a big task. I did a quick refactor in a branch, simply changing 
> the ColumnFamilyStore class to TableStore, and pushed it up to GitHub. [1]
>
> Didn’t click on the link? That’s ok. The TL;DR is that it’s almost 2K LOC 
> changed across 275 files. I’ll note that my branch doesn’t change any of the 
> almost 1000 search results of “columnfamilystore” found in the codebase and 
> hundreds of tests failed on my branch in CircleCI, so that 2K LOC change 
> would probably be quite a bit bigger. There is, of course, a lot more than 
> just renaming this one class, there’s thousands of variable names using any 
> manner of “cf”, “cfs”, “columnfamily”, names plus comments and who knows what 
> else. There’s lots of references in probably every file that would have to 
> get updated.
>
> What are people’s thoughts on this? We should be honest with ourselves and 
> know this isn’t going to get any easier over time. It’s only going to get 
> more confusing for new people to the project, and having to figure out “what 
> kind of row am i even looking at” is a waste of time. There’s obviously a 
> much bigger impact than just renaming a bunch of files, there’s any number of 
> patches and branches that would become outdated, plus anyone pulling in 
> Cassandra as a dependency would be affected. I don’t really have a solution 
> for the disruption other than “leave it in place”, but in my mind that’s not 
> a great (or even good) solution.
>
> Anyways, enough out of me. My concern for ergonomics and naming might be 
> significantly higher than the rest of the folks working in the code, and I 
> wanted to put a feeler out there before I decided to dig into this in a more 
> serious manner.
>
> Jon
>
> [1] 
> https://github.com/apache/cassandra/compare/trunk...rustyrazorblade:refactor_column_family_store?expand=1
>  
> 




Re: Paying off tech debt and correctly naming things

2018-03-22 Thread Aleksey Yeshchenko
Poor and out-of-date naming of things is probably the least serious part of our
technical debt. Bad factoring and straight-up
poorly written components are where it’s really at.

Doing a big rename for rename's sake alone sometimes does more harm than
good. Some of us have big patches
in flight, things that actually pay off some technical debt, and dealing with
such renames is rebase hell :\

That said, there is a good time for such renames - it’s during those major
refactors and rewrites. When you are
changing a subsystem, you might as well do the appropriate renames.

—
AY

On 20 March 2018 at 22:04:48, Jon Haddad (j...@jonhaddad.com) wrote:

Whenever I hop around in the codebase, one thing that always manages to slow me 
down is needing to understand the context of the variable names that I’m 
looking at. We’ve now removed thrift the transport, but the variables, classes 
and comments still remain. Personally, I’d like to go in and pay off as much 
technical debt as possible by refactoring the code to be as close to CQL as 
possible. Rows should be rows, not partitions, I’d love to see the term column 
family removed forever in favor of always using tables. That said, it’s a big 
task. I did a quick refactor in a branch, simply changing the ColumnFamilyStore 
class to TableStore, and pushed it up to GitHub. [1]  

Didn’t click on the link? That’s ok. The TL;DR is that it’s almost 2K LOC 
changed across 275 files. I’ll note that my branch doesn’t change any of the 
almost 1000 search results of “columnfamilystore” found in the codebase and 
hundreds of tests failed on my branch in CircleCI, so that 2K LOC change would 
probably be quite a bit bigger. There is, of course, a lot more than just 
renaming this one class, there’s thousands of variable names using any manner 
of “cf”, “cfs”, “columnfamily”, names plus comments and who knows what else. 
There’s lots of references in probably every file that would have to get 
updated.  

What are people’s thoughts on this? We should be honest with ourselves and know 
this isn’t going to get any easier over time. It’s only going to get more 
confusing for new people to the project, and having to figure out “what kind of 
row am i even looking at” is a waste of time. There’s obviously a much bigger 
impact than just renaming a bunch of files, there’s any number of patches and 
branches that would become outdated, plus anyone pulling in Cassandra as a 
dependency would be affected. I don’t really have a solution for the disruption 
other than “leave it in place”, but in my mind that’s not a great (or even 
good) solution.  

Anyways, enough out of me. My concern for ergonomics and naming might be 
significantly higher than the rest of the folks working in the code, and I 
wanted to put a feeler out there before I decided to dig into this in a more 
serious manner.  

Jon  

[1] 
https://github.com/apache/cassandra/compare/trunk...rustyrazorblade:refactor_column_family_store?expand=1