Re: CASSANDRA-13241 lower default chunk_length_in_kb

2018-10-29 Thread Jonathan Haddad
Looks straightforward, I can review today.


Re: CASSANDRA-13241 lower default chunk_length_in_kb

2018-10-29 Thread Ariel Weisberg
Hi,

Seeing too many -'s for changing the representation and essentially no +1s so I 
submitted a patch for just changing the default. I could use a reviewer for 
https://issues.apache.org/jira/browse/CASSANDRA-13241

I created https://issues.apache.org/jira/browse/CASSANDRA-14857 "Use a more 
space efficient representation for compressed chunk offsets" for post 4.0.

Regards,
Ariel
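
[Editor's note: for a sense of the memory at stake in both tickets, here is some back-of-the-envelope arithmetic. It is a hedged sketch with illustrative numbers, not a measurement of Cassandra: roughly one 8-byte offset is kept in memory per compressed chunk per sstable, so shrinking the default chunk size multiplies that overhead.]

```python
# Rough cost of compressed-chunk offsets: about one 8-byte offset per
# chunk kept in memory. Numbers are illustrative, not measured.

def chunk_offset_overhead(data_bytes: int, chunk_kb: int) -> int:
    """Approximate bytes of in-memory offsets for `data_bytes` of data."""
    chunks = -(-data_bytes // (chunk_kb * 1024))  # ceiling division
    return chunks * 8  # one 8-byte offset per chunk

TB = 1 << 40
for chunk_kb in (64, 16, 4):
    mb = chunk_offset_overhead(TB, chunk_kb) / (1 << 20)
    print(f"{chunk_kb:>2} KB chunks -> {mb:,.0f} MB of offsets per TB")
```

Going from the 64 KB default down to 4 KB multiplies the offset overhead 16x (roughly 128 MB to 2 GB per TB of data in this sketch), which is why the default change and the representation change travel together.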

On Tue, Oct 23, 2018, at 11:46 AM, Ariel Weisberg wrote:
> Hi,
> 
> To summarize who we have heard from so far
> 
> WRT changing just the default:
> 
> +1:
> Jon Haddad
> Ben Bromhead
> Alain Rodriguez
> Sankalp Kohli (not explicit)
> 
> -0:
> Sylvain Lebresne 
> Jeff Jirsa
> 
> Not sure:
> Kurt Greaves
> Joshua McKenzie
> Benedict Elliott Smith
> 
> WRT changing the representation:
> 
> +1:
> There are only conditional +1s at this point
> 
> -0:
> Sylvain Lebresne
> 
> -.5:
> Jeff Jirsa
> 
> This 
> (https://github.com/aweisberg/cassandra/commit/a9ae85daa3ede092b9a1cf84879fb1a9f25b9dce)
>  
> is a rough cut of the change for the representation. It needs better 
> naming, unit tests, javadoc etc. but it does implement the change.
> 
> Ariel
> On Fri, Oct 19, 2018, at 3:42 PM, Jonathan Haddad wrote:
> > Sorry, to be clear - I'm +1 on changing the configuration default, but I
> > think changing the compression in-memory representation warrants further
> > discussion and investigation before making a case for or against it.
> > An optimization that reduces in-memory cost by over 50% sounds pretty good,
> > and we never were really explicit that that sort of optimization would be
> > excluded after our feature freeze.  I don't think it should necessarily
> > be excluded at this time, but it depends on the size and risk of the patch.
> > 
> > On Sat, Oct 20, 2018 at 8:38 AM Jonathan Haddad  wrote:
> > 
> > > I think we should try to do the right thing for the most people that we
> > > can.  The number of folks impacted by 64KB is huge.  I've worked on a lot
> > > of clusters created by a lot of different teams, going from brand new to
> > > pretty damn knowledgeable.  I can't think of a single time over the last 2
> > > years that I've seen a cluster use non-default settings for compression.
> > > With only a handful of exceptions, I've lowered the chunk size
> > > considerably (usually to 4 or 8K) and the impact has always been very
> > > noticeable, frequently resulting in hardware reduction and cost savings.
> > > Of all the poorly chosen defaults we have, this is one of the biggest
> > > offenders that I see.  There's a good reason ScyllaDB claims they're so
> > > much faster than Cassandra - we ship a DB that performs poorly for 90+%
> > > of teams because we ship for a specific use case, not a general one
> > > (time series on memory-constrained boxes being the specific use case).
> > >
> > > This doesn't impact existing tables, just new ones.  More and more teams
> > > are using Cassandra as a general-purpose database; we should acknowledge
> > > that and adjust our defaults accordingly.  Yes, we use a little bit more
> > > memory on new tables if we just change this setting, and what we get out
> > > of it is a massive performance win.
> > >
> > > I'm +1 on the change as well.
> > >
> > >
> > >
> > > On Sat, Oct 20, 2018 at 4:21 AM Sankalp Kohli 
> > > wrote:
> > >
> > >> (We should definitely harden the definition for freeze in a separate
> > >> thread)
> > >>
> > >> My thinking is that this is the best time to do this change as we have
> > >> not even cut alpha or beta. All the people involved in the test will
> > >> definitely be testing it again when we have these releases.
> > >>
> > >> > On Oct 19, 2018, at 8:00 AM, Michael Shuler 
> > >> wrote:
> > >> >
> > >> >> On 10/19/18 9:16 AM, Joshua McKenzie wrote:
> > >> >>
> > >> >> At the risk of hijacking this thread, when are we going to transition
> > >> from
> > >> >> "no new features, change whatever else you want including refactoring
> > >> and
> > >> >> changing years-old defaults" to "ok, we think we have something that's
> > >> >> stable, time to start testing"?
> > >> >
> > >> > Creating a cassandra-4.0 branch would allow trunk to, for instance, get
> > >> > a default config value change commit and get more testing. We might
> > >> > forget again, from what I understand of Benedict's last comment :)
> > >> >
> > >> > --
> > >> > Michael
> > >> >
> > >> > -
> > >> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >> >
> > >>
> > >> -
> > >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >> For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >>
> > >>
> > >
> > > --
> > > Jon Haddad
> > > http://www.rustyrazorblade.com
> > > twitter: rustyrazorblade
> > >
> > 
> > 
> > -- 
> 
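
[Editor's note: to make the representation discussion concrete, here is a hedged Python sketch of one way to shrink chunk-offset storage: a periodic 8-byte base plus 4-byte deltas, instead of one 8-byte long per chunk. This illustrates the general idea only, not the approach in Ariel's linked commit; `BASE_EVERY` and the delta width are invented parameters.]

```python
from array import array

class CompactOffsets:
    """Chunk offsets stored as periodic 8-byte bases plus 4-byte deltas,
    instead of one 8-byte long per chunk (~50% smaller for large tables)."""
    BASE_EVERY = 1024  # one absolute base per 1024 chunks (invented parameter)

    def __init__(self, offsets):
        self.bases = array("q")   # 8-byte absolute offsets, one per group
        self.deltas = array("I")  # 4-byte unsigned offsets within a group
        for i, off in enumerate(offsets):
            if i % self.BASE_EVERY == 0:
                self.bases.append(off)
            delta = off - self.bases[-1]
            assert delta < 1 << 32, "a group must span less than 4 GB"
            self.deltas.append(delta)

    def __getitem__(self, i):
        return self.bases[i // self.BASE_EVERY] + self.deltas[i]

    def nbytes(self):
        return 8 * len(self.bases) + 4 * len(self.deltas)

# 100k compressed chunks, ~16 KB apart on disk (illustrative)
offsets = list(range(0, 100_000 * 16_384, 16_384))
compact = CompactOffsets(offsets)
assert all(compact[i] == offsets[i] for i in range(len(offsets)))
print(f"plain longs: {8 * len(offsets)} bytes, compact: {compact.nbytes()} bytes")
```

The tradeoff is one extra array lookup per read; the win grows as chunk sizes shrink and chunk counts rise.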

Re: Deprecating/removing PropertyFileSnitch?

2018-10-29 Thread Alexander Dejanovski
Hi,

I fully agree that PFS is way too dangerous and makes little (if any) sense
compared to GPFS.
We've had numerous customers that ended up with potential data loss and
fairly complex procedures to recover from several nodes jumping into the
default DC.
Misconfigurations also led to sudden changes of topology which changed
token ownership and required a lot of knowledge to recover from (and even
then, with a reasonable level of uncertainty).

+1 on removing PFS.

Cheers,




Re: Deprecating/removing PropertyFileSnitch?

2018-10-29 Thread Jeremy Hanna


> On Oct 29, 2018, at 11:20 AM, Jeff Jirsa  wrote:
> 
> On Mon, Oct 29, 2018 at 8:35 AM Jeremy Hanna 
> wrote:
> 
>> Re-reading this thread, it sounds like the issue is there are times when a
>> field may go missing in gossip and it hasn’t yet been tracked down.  As
>> Jeremiah says, can we get that into a Jira issue with any contextual
>> information (if there is any)?  However as he says, in theory fields going
>> missing from gossip shouldn’t cause problems for users of GPFS and I don’t
>> believe there have been issues raised in that regard for all of the
>> clusters out there (including Jeff’s comment about it in this thread).
>> Testing that more thoroughly could also be a dependent ticket of
>> deprecating/removing PFS.
>> 
>> 
> The problem with opening a JIRA now is that it'll look just like 13700 and
> the others before it - it'll read something like "status goes missing in
> large clusters" and the very next time we find a gossip bug, we'll mark it
> as fixed, and it may or may not be the only cause of that bug.

I’ve created a Jira that CASSANDRA-10745 requires for completion to thoroughly 
test the GPFS under such conditions.  See CASSANDRA-14856 

> 
> 
>> Separately, both Jeff and Sankalp were saying that the fallback was a
>> problem and there was a flurry of tickets back in 2016 that led to the
>> original ticket to deprecate the property file snitch.  However,
>> https://issues.apache.org/jira/browse/CASSANDRA-10745 <
>> https://issues.apache.org/jira/browse/CASSANDRA-10745> discusses what to
>> do when deprecating it.  Would people want the functionality between GPFS
>> completely separate from PFS or would people want a mode to emulate it
>> while using the code for GPFS underneath?
>> 
> 
> Actually, Jeff was guessing that the class of problems that would make you
> want to deprecate PFS is fallback from GPFS to PFS (because beyond that PFS
> is just stupid easy to use and I can't imagine it's causing a lot of
> problems for people who know they're using PFS - yes, if you don't update
> the file, things break, but that's precisely the guarantee of the snitch).

My apologies if I had misrepresented, but I’m glad I checked.

What I was originally saying is that PFS has these sharp edges to it - if you 
don’t sync the files for whatever reason, there are problems.  I saw a case 
recently where a team upgraded their machines in one DC and their addresses 
were new in that DC.  They updated the properties file in the DC where they 
upgraded machines but neglected to update the addresses in the other DC.  In 
that case, the nodes in the other DC saw nodes that didn’t have any 
configuration for them and assigned the default configuration as per the file 
option, which was incorrect.  That caused some difficult to workaround 
problems.  All of this could have been avoided had they been using the GPFS 
instead.

So in order not to invite problems such as this for those new to the project, 
and because there are going to be times when there will be configuration 
mismatches resulting in this sort of behavior (even with 
https://issues.apache.org/jira/browse/CASSANDRA-12681), I was hoping to get 
consensus on deprecating/removing PFS.


Re: Deprecating/removing PropertyFileSnitch?

2018-10-29 Thread J. D. Jordan
The place people get in trouble with PFS is that the example file has a 
“default” setting in it, which people fill out because it is there. Later down 
the road they typo/mess up updating the file when they add nodes in a different 
DC than the default, and oops, stuff is messed up.  That and GPFS fallback.

So can we all agree to rename the PFS example file, so that someone has to 
copy/rename it to make it valid (to fix GPFS fallback issues), and to remove 
the "default" rack/dc example from the file?  If we did those two things I 
think it would go a long way towards fixing PFS issues.

-Jeremiah
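
[Editor's note: the failure mode described above is easy to sketch. The snippet below is a hedged illustration of PropertyFileSnitch-style lookup semantics, with invented addresses and topology rather than Cassandra's actual code: any node missing from cassandra-topology.properties silently inherits the configured default DC/rack.]

```python
def pfs_lookup(topology, node, default=("DC1", "RAC1")):
    """PropertyFileSnitch-style semantics: unknown node -> configured default."""
    return topology.get(node, default)

# File kept in sync for DC1, but never updated after DC2's nodes got new IPs:
topology = {
    "10.0.0.1": ("DC1", "RAC1"),
    "10.1.0.1": ("DC2", "RAC1"),
}
dc, rack = pfs_lookup(topology, "10.1.0.99")  # a replaced DC2 node
print(dc, rack)  # the new DC2 node is silently placed in the default DC
```

Removing the "default" entry from the example file turns this silent misplacement into a loud configuration error.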


Re: Deprecating/removing PropertyFileSnitch?

2018-10-29 Thread Jeff Jirsa
On Mon, Oct 29, 2018 at 8:35 AM Jeremy Hanna 
wrote:

> Re-reading this thread, it sounds like the issue is there are times when a
> field may go missing in gossip and it hasn’t yet been tracked down.  As
> Jeremiah says, can we get that into a Jira issue with any contextual
> information (if there is any)?  However as he says, in theory fields going
> missing from gossip shouldn’t cause problems for users of GPFS and I don’t
> believe there have been issues raised in that regard for all of the
> clusters out there (including Jeff’s comment about it in this thread).
> Testing that more thoroughly could also be a dependent ticket of
> deprecating/removing PFS.
>
>
The problem with opening a JIRA now is that it'll look just like 13700 and
the others before it - it'll read something like "status goes missing in
large clusters" and the very next time we find a gossip bug, we'll mark it
as fixed, and it may or may not be the only cause of that bug.



> Separately, both Jeff and Sankalp were saying that the fallback was a
> problem and there was a flurry of tickets back in 2016 that led to the
> original ticket to deprecate the property file snitch.  However,
> https://issues.apache.org/jira/browse/CASSANDRA-10745 <
> https://issues.apache.org/jira/browse/CASSANDRA-10745> discusses what to
> do when deprecating it.  Would people want the functionality between GPFS
> completely separate from PFS or would people want a mode to emulate it
> while using the code for GPFS underneath?
>

Actually, Jeff was guessing that the class of problems that would make you
want to deprecate PFS is fallback from GPFS to PFS (because beyond that PFS
is just stupid easy to use and I can't imagine it's causing a lot of
problems for people who know they're using PFS - yes, if you don't update
the file, things break, but that's precisely the guarantee of the snitch).



Re: Deprecating/removing PropertyFileSnitch?

2018-10-29 Thread Jeremy Hanna
Re-reading this thread, it sounds like the issue is there are times when a 
field may go missing in gossip and it hasn’t yet been tracked down.  As 
Jeremiah says, can we get that into a Jira issue with any contextual 
information (if there is any)?  However as he says, in theory fields going 
missing from gossip shouldn’t cause problems for users of GPFS and I don’t 
believe there have been issues raised in that regard for all of the clusters 
out there (including Jeff’s comment about it in this thread).  Testing that 
more thoroughly could also be a dependent ticket of deprecating/removing PFS.

Separately, both Jeff and Sankalp were saying that the fallback was a problem 
and there was a flurry of tickets back in 2016 that led to the original ticket 
to deprecate the property file snitch.  However, 
https://issues.apache.org/jira/browse/CASSANDRA-10745 discusses what to do 
when deprecating it.  Would people want the functionality between GPFS 
completely separate from PFS or would people want a mode to emulate it while 
using the code for GPFS underneath?


> On Oct 22, 2018, at 10:33 PM, Jeremiah D Jordan  
> wrote:
> 
> If you guys are still seeing the problem, would be good to have a JIRA 
> written up, as all the ones linked were fixed in 2017 and 2015.  
> CASSANDRA-13700 was found during our testing, and we haven’t seen any other 
> issues since fixing it.
> 
> -Jeremiah
> 
>> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli  wrote:
>> 
>> No worries...I mentioned the issue not the JIRA number 
>> 
>>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan  
>>> wrote:
>>> 
>>> Sorry, maybe my spam filter got them or something, but I have never seen a 
>>> JIRA number mentioned in the thread before this one.  Just looked back 
>>> through again to make sure, and this is the first email I have with one.
>>> 
>>> -Jeremiah
>>> 
 On Oct 22, 2018, at 9:37 PM, sankalp kohli  wrote:
 
 Here are some of the JIRAs which are fixed but actually did not fix the
 issue. We have tried fixing this with several patches. Maybe it will be
 fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create a
 new JIRA as this issue still exists.
 https://issues.apache.org/jira/browse/CASSANDRA-10366
 https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
 
 Also the quote you are using was written as a follow on email. I have
 already said what the bug I was referring to.
 
 "Say you restarted all instances in the cluster and status for some host
 goes missing. Now when you start a host replacement, the new host won’t
 learn about the host whose status is missing and the view of this host will
 be wrong."
 
 - CASSANDRA-10366
 
 
 On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli 
 wrote:
 
> I will send the JIRAs of the bug which we thought we have fixed but it
> still exists.
> 
> Have you done any correctness testing after doing all these tests...have
> you done the tests for 1000 instance clusters?
> 
> It is great you have done these tests and I am hoping the gossiping snitch
> is good. Also, was there any Gossip bug fixed post-3.0? Maybe I am seeing
> a bug which is already fixed.
> 
>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan 
> wrote:
>> 
>> Do you have a specific gossip bug that you have seen recently which
> caused a problem that would make this happen?  Do you have a specific JIRA
> in mind?  “We can’t remove this because what if there is a bug” doesn’t
> seem like a good enough reason to me. If that was a reason we would never
> make any changes to anything.
>> I think many people have seen PFS actually cause real problems, where
> with GPFS the issue being talked about is predicated on some theoretical
> gossip bug happening.
>> In the past year at DataStax we have done a lot of testing on 3.0 and
> 3.11 around adding nodes, adding DC’s, replacing nodes, replacing racks,
> and replacing DC’s, all while using GPFS, and as far as I know we have not
> seen any “lost” rack/DC information during such testing.
>> 
>> -Jeremiah
>> 
>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli 
> wrote:
>>> 
>>> We will have similar issues with Gossip, but this will create more
>>> issues as more things will rely on Gossip.
>>> 
>>> I agree PFS should be removed but I don't see how it can be with issues
> like