Re: Failed Dtest will block cutting releases

2016-12-02 Thread Benjamin Roth
Excuse me for jumping into an old thread, but from my experience I have a
very clear opinion about situations like this, as I have encountered them
before:

Tests are there to give *certainty*.
*Would you like to pass a crossing with a green light if you cannot be sure
if green really means green?*
Do you want to rely on tests that are green, red, green, red? What if a red
is a real red and you miss it because you simply ignore the test as flaky?

IMHO there are only 3 options for dealing with broken/red tests:
- Fix the underlying issue
- Fix the test
- Delete the test

If I cannot trust a test, it is better not to have it at all. Otherwise
people stare at red lights and start to drive anyway.

This causes:
- Uncertainty
- Loss of trust
- Confusion
- More work
- *Less quality*

Just as an example:
A few days ago I created a patch. Then I ran the utests and 1 test failed.
Hmmm, did I break it? I had to check it twice by checking out the former
state and running the tests again, just to realize that it wasn't me who
made it fail. That's annoying.

Sorry again, I'm rather new here, but what I just read reminded me a lot of
situations I was in years ago.
So: +1, John

2016-12-03 7:48 GMT+01:00 sankalp kohli:

> Hi,
> I don't see any update on this thread. We will go ahead and make
> Dtest a blocker for cutting releases for anything after 3.10.
>
> Please respond if anyone has an objection to this.
>
> Thanks,
> Sankalp
>
>
>
> On Mon, Nov 21, 2016 at 11:57 AM, Josh McKenzie wrote:
>
> > Caveat: I'm strongly in favor of us blocking a release on a non-green test
> > board of either utest or dtest.
> >
> >
> > > put something in prod which is known to be broken in obvious ways
> >
> > In my experience the majority of fixes are actually shoring up
> > low-quality / flaky tests or fixing tests that have been invalidated by a
> > commit but do not indicate an underlying bug. Inferring "tests are failing
> > so we know we're asking people to put things in prod that are broken in
> > obvious ways" is hyperbolic. A more correct statement would be: "Tests are
> > failing so we know we're shipping with a test that's failing" which is not
> > helpful.
> >
> > Our signal to noise ratio with tests has been very poor historically;
> > we've been trying to address that through aggressive triage and assigning
> > out test failures; however, we need far more active and widespread
> > community involvement if we want to truly *fix* this problem long-term.
> >
> > On Mon, Nov 21, 2016 at 2:33 PM, Jonathan Haddad wrote:
> >
> > > +1.  Kind of silly to advise people to put something in prod which is
> > > known to be broken in obvious ways
> > >
> > > On Mon, Nov 21, 2016 at 11:31 AM sankalp kohli wrote:
> > >
> > > > Hi,
> > > > We should not cut a release if Dtests are not passing. I won't
> > > > block 3.10 on this since we are just discussing this.
> > > >
> > > > Please provide feedback on this.
> > > >
> > > > Thanks,
> > > > Sankalp
> > > >
> > >
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: Failed Dtest will block cutting releases

2016-12-02 Thread sankalp kohli
Hi,
I don't see any update on this thread. We will go ahead and make
Dtest a blocker for cutting releases for anything after 3.10.

Please respond if anyone has an objection to this.

Thanks,
Sankalp



On Mon, Nov 21, 2016 at 11:57 AM, Josh McKenzie wrote:

> Caveat: I'm strongly in favor of us blocking a release on a non-green test
> board of either utest or dtest.
>
>
> > put something in prod which is known to be broken in obvious ways
>
> In my experience the majority of fixes are actually shoring up low-quality
> / flaky tests or fixing tests that have been invalidated by a commit but do
> not indicate an underlying bug. Inferring "tests are failing so we know
> we're asking people to put things in prod that are broken in obvious ways"
> is hyperbolic. A more correct statement would be: "Tests are failing so we
> know we're shipping with a test that's failing" which is not helpful.
>
> Our signal to noise ratio with tests has been very poor historically; we've
> been trying to address that through aggressive triage and assigning out
> test failures; however, we need far more active and widespread community
> involvement if we want to truly *fix* this problem long-term.
>
> On Mon, Nov 21, 2016 at 2:33 PM, Jonathan Haddad wrote:
>
> > +1.  Kind of silly to advise people to put something in prod which is
> > known to be broken in obvious ways
> >
> > On Mon, Nov 21, 2016 at 11:31 AM sankalp kohli wrote:
> >
> > > Hi,
> > > We should not cut a release if Dtests are not passing. I won't
> > > block 3.10 on this since we are just discussing this.
> > >
> > > Please provide feedback on this.
> > >
> > > Thanks,
> > > Sankalp
> > >
> >
>


Re: Where do I find EverywhereStrategy?

2016-12-02 Thread sankalp kohli
The point Ben is making is that for the auth keyspace, the default of RF=1
is not good for any type of cluster, whether it is small or large.


Re: STCS in L0 behaviour

2016-12-02 Thread Marcus Olsson

Hi,

In reply to Dikang Gu:
For the run where we incorporated the change from CASSANDRA-11571 the 
stack trace was like this (from JMC):

*Stack Trace* (indentation = call depth)                                                           *Sample Count*  *Percentage (%)*
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int)                 229             11.983
- org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates()                          228             11.931
-- org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(int)                             221             11.565
--- org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(SSTableReader, Map)        201             10.518
---- org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(Token, Token, Map)        201             10.518
----- org.apache.cassandra.dht.Bounds.intersects(Bounds)                                                141              7.378
----- java.util.HashSet.add(Object)                                                                      56              2.93


This is for one of the compaction executors during an interval of 1 
minute and 24 seconds, but we saw similar behavior for other compaction 
threads as well. The full flight recording was 10 minutes and was 
started at the same time as the repair. The interval was taken from the 
end of the recording where the number of sstables had increased. During 
this interval this compaction thread used ~10% of the total CPU.


I agree that optimally there shouldn't be many sstables in L0, and except
for when repair is running we don't have that many.


---

In reply to Jeff Jirsa/Nate McCall:
I might have been unclear about the compaction order in my first email; I
meant to say that there is a check for STCS right before L1+, but only if
an L1+ compaction is possible. We used version 2.2.7 for the test run, so
https://issues.apache.org/jira/browse/CASSANDRA-10979 should be included
and should have reduced some of the L0 backlog.


Correct me if I'm wrong, but my interpretation of the scenario that
Sylvain describes in
https://issues.apache.org/jira/browse/CASSANDRA-5371 is that it occurs
when you either almost constantly have 32+ sstables in L0 or are close to
it. My guess is that this could also apply to a constant load over a
certain timespan. So when you get more than 32 sstables you start to do
STCS, which in turn creates larger sstables that might span the whole of
L1. Then, when these sstables are promoted to L1, the whole of L1 is
re-written, which creates a larger backlog in L0. The number of sstables
keeps rising and triggers STCS again, completing the circle.
Based on this interpretation it seems to me that if the write pattern
into L0 is "random" this might happen regardless of whether an STCS
compaction has occurred or not.


If my interpretation is correct it might be better to require a higher
number of sstables before STCS kicks in for L0, and to make that threshold
configurable. With reduced complexity it could be something like this (a
rough sketch follows the list):

1. Perform STCS in L0 if we have above X(1000?) sstables in L0.
2. Check L1+
3. Check for L0->L1
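
Purely to illustrate that ordering, a minimal Java sketch could look like
the following. This is not the actual LeveledManifest code; all names are
made up for illustration, and the threshold is just the X from step 1:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: an "sstable" is reduced to a String and the
    // real candidate selection is stubbed out; the point is the order of the
    // three checks.
    final class ProposedCandidateSelection
    {
        private final int l0StcsThreshold; // the configurable X (e.g. 1000)

        ProposedCandidateSelection(int l0StcsThreshold)
        {
            this.l0StcsThreshold = l0StcsThreshold;
        }

        List<String> nextCandidates(List<List<String>> levels)
        {
            List<String> l0 = levels.get(0);

            // 1. Fall back to STCS in L0 only for a really large backlog, so
            //    the expensive overlap checks are skipped in exactly that case.
            if (l0.size() > l0StcsThreshold)
                return sizeTieredCandidates(l0);

            // 2. Check L1+ for a regular leveled compaction.
            for (int level = 1; level < levels.size(); level++)
            {
                List<String> candidates = leveledCandidatesFor(levels, level);
                if (!candidates.isEmpty())
                    return candidates;
            }

            // 3. Otherwise try to promote L0 -> L1 as usual.
            return leveledCandidatesFor(levels, 0);
        }

        private List<String> sizeTieredCandidates(List<String> l0)
        {
            return new ArrayList<>(l0); // stand-in for real size-tiered grouping
        }

        private List<String> leveledCandidatesFor(List<List<String>> levels, int level)
        {
            return new ArrayList<>(); // stand-in for real leveled selection
        }
    }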

It should be possible to keep the current logic as well and only add a
configurable check before it (step 1) to avoid the overlapping check with
larger backlogs. Another alternative might be
https://issues.apache.org/jira/browse/CASSANDRA-7409, which would allow
overlapping sstables in more levels than L0. If it can quickly push sorted
data to L1 it might remove the need for STCS in LCS. The previously
mentioned potential cost of the overlapping check would still be there if
we have a large backlog, but this approach might reduce the risk of getting
into that situation. I'll try to get some time to run a test with
CASSANDRA-7409 in our test cluster.


BR
Marcus O

On 11/28/2016 06:48 PM, Eric Evans wrote:

On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu  wrote:

Hi Marcus,

Do you have a stack trace that shows which function in
`getNextBackgroundTask` is the most expensive?

Yeah, I think having 15-20K sstables in L0 is very bad; in our heavy-write
cluster I try my best to reduce the impact of repair and keep the number of
sstables in L0 < 100.

Thanks
Dikang.

On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall  wrote:


The reason is described here:

https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679

/Marcus

"...a lot of the work you've done you will redo when you compact your now
bigger L0 sstable against L1."

^ Sylvain's hypothesis (next comment down) is actually something we see
occasionally in practice: having to re-write the contents of L1 too often
when large L0 SSTables are pulled in. Here is an example we took on a
system with pending compaction spikes that was seeing this specific issue
with four LCS-based tables:

https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950

The significant part of this particular workload is a burst of heavy writes
from long-duration scheduled jobs.



--
Dikang






CASSANDRA-12888: Streaming and MVs

2016-12-02 Thread Benjamin Roth
As I haven't received a single reply on this, I went ahead and implemented
and tested it on my own with our production cluster. Bringing up a new node
was a real pain, so I had to move on.

Result:
Works like a charm. I ran many dtests that relate in any way to storage,
streaming, bootstrap, ... with good results.
The bootstrap finished in under 5:30h, without a single error log during
bootstrap. Afterwards, repairs run smoothly and the cluster seems to
operate quite well.

I still need:

   - Reviews (see 12888, 12905, 12984)
   - Some opinion on whether I did the CDC case right. IMHO CDC is not
   required on bootstrap, and we don't need to send the mutations through the
   write path just to write the commit log; doing so would also break
   incremental repairs. Instead, for CDC the sstables are streamed like
   normal but the mutations are additionally written to the commit log (see
   the sketch after this list). The worst case I see is that the node crashes
   and the commit logs for those repair streams are replayed, leading to
   duplicate writes, which is not really crucial and not a regular case. Any
   better ideas?
   - Docs have to be updated (12985) if the patch is accepted
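
To make the CDC point above concrete, here is a minimal, hypothetical
sketch of the receive side I have in mind. None of these names are real
Cassandra classes; the sketch only encodes "add the streamed sstable
directly and, if CDC is enabled, additionally append the mutations to the
commit log":

    import java.util.List;

    // Hypothetical types standing in for Cassandra internals; illustration only.
    final class StreamedSSTableReceiver
    {
        interface SSTableStore { void add(String sstablePath); }
        interface CommitLog { void append(String serializedMutation); }

        private final SSTableStore store;
        private final CommitLog commitLog;
        private final boolean cdcEnabled;

        StreamedSSTableReceiver(SSTableStore store, CommitLog commitLog, boolean cdcEnabled)
        {
            this.store = store;
            this.commitLog = commitLog;
            this.cdcEnabled = cdcEnabled;
        }

        void receive(String sstablePath, List<String> mutations)
        {
            // Stream the sstable "like normal": no write path, no batch log,
            // no read-before-write.
            store.add(sstablePath);

            // For CDC, additionally record the mutations in the commit log so
            // the change feed still sees them. Worst case (as described above):
            // a crash replays these segments and causes duplicate writes.
            if (cdcEnabled)
                for (String mutation : mutations)
                    commitLog.append(mutation);
        }
    }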

I really appreciate ANY feedback. IMHO the impact of these fixes is immense,
and they may be a huge step towards getting MVs production ready.

Thank you very much,
Benjamin


-- Forwarded message --
From: Benjamin Roth 
Date: 2016-11-29 17:04 GMT+01:00
Subject: Streaming and MVs
To: dev@cassandra.apache.org


I don't know where else to discuss this issue, so I am posting it here.

I have been trying to get CS to run stably with MVs since the beginning of
July. Normal reads + writes work as expected, but when it comes to repairs
or bootstrapping it still feels far, far away from what I would call fast
and stable. The other day I just wanted to bootstrap a new node. I tried it
2 times.
The first time the bootstrap failed due to WTEs. I fixed this issue by not
timing out in streams, but then it turned out that the bootstrap (load
roughly 250-300 GB) didn't even finish in 24h. What if I really had a
problem and had to bring up some nodes fast? No way!

I think the root cause of it all is the way streams are handled on tables
with MVs.
Sending them through the regular write path implies so many bottlenecks and
sometimes also redundant writes. Let me explain:

1. Bootstrap
During a bootstrap, all ranges from all KS and all CFs that will belong to
the new node will be streamed. MVs are treated like all other CFs and all
ranges that will move to the new node will also be streamed during
bootstrap.
Sending streams of the base tables through the write path will have the
following negative impacts:

   - Writes are sent to the commit log. Not necessary. When the node is
   stopped during bootstrap, the bootstrap will simply start over. No need to
   recover from commit logs. Non-MV tables won't have a CL anyway.
   - MV mutations will not be applied instantly but sent to the batch log.
   This is of course necessary during range movements (if the PK of the MV
   differs from the base table), but what happens is that the batchlog gets
   completely flooded. This leads to ridiculously large batchlogs (I observed
   BLs of 60GB), zillions of compactions and quadrillions of tombstones. This
   is a pure resource killer, especially because the BL uses a CF as a queue.
   - Applying every mutation separately causes read-before-writes during the
   MV mutation. This is of course an order of magnitude slower than simply
   streaming down an SSTable. This effect becomes even worse as the bootstrap
   progresses and creates more and more (uncompacted) SSTables. Many of them
   won't ever be compacted because the batchlog eats all the resources
   available for compaction.
   - Streaming down the MV tables AND applying the mutations of the base
   tables leads to redundant writes. Redundant writes are local if the PK of
   the MV == the PK of the base table and - even worse - remote if not.
   Remote MV updates will impact nodes that aren't even part of the bootstrap.
   - CDC should also not be necessary during bootstrap, should it? TBD
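
To sum up the list above as a single rule, here is a hedged sketch of the
decision I am proposing for incoming streams. The names are hypothetical
(not Cassandra's actual streaming code), and outside of bootstrap the
sketch simply keeps today's behaviour:

    // Hypothetical illustration of the proposed rule -- for discussion only.
    final class IncomingStreamPolicy
    {
        enum StreamReason { BOOTSTRAP, REPAIR, OTHER }

        /**
         * True if an incoming stream has to be replayed through the normal
         * write path (commit log, batch log, MV read-before-write) instead of
         * simply being added as an sstable.
         */
        static boolean requiresWritePath(StreamReason reason, boolean tableHasViews)
        {
            // Bootstrap: the base table's and the views' ranges are both
            // streamed, so replaying mutations only duplicates (possibly
            // remote) writes, floods the batch log and forces a
            // read-before-write per row.
            if (reason == StreamReason.BOOTSTRAP)
                return false; // stream straight to sstables

            // Everywhere else this sketch keeps the current behaviour:
            // tables with views go through the write path.
            return tableHasViews;
        }
    }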

2. Repair
Negative impact is similar to bootstrap but, ...

   - Sending repairs through the write path will not mark the streamed
   tables as repaired. See CASSANDRA-12888. NOT doing so will instantly solve
   that issue. Much simpler than any other solution.
   - It will change the "repair design" a bit. Repairing a base table will
   not automatically repair the MV. But is this bad at all? To be honest, as
   a newbie it was very hard for me to understand what I had to do to be sure
   that everything is repaired correctly. Recently I was told NOT to repair
   MV CFs but only to repair the base tables. This means one cannot just call
   "nodetool repair $keyspace" - this is complicated, not transparent and it
   sucks. I changed the behaviour in my own branch and ran the dtests for
   MVs. 2 tests failed:
  - base_replica_repair_test of course fails due to the design change
  -