[jira] [Commented] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations

2018-10-05 Thread Andy Tolbert (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640549#comment-16640549
 ] 

Andy Tolbert commented on CASSANDRA-13871:
--

I recently ran into this and was confounded as to why stress had started 
misbehaving so oddly.  I had {{-errors ignore}} specified, so the exception 
wasn't surfaced at all, but after removing that option and updating the code to 
include the full cause trace, I realized I was encountering this issue.  It 
looks like the patch is no longer valid because of recent changes; I will 
attach a follow-on patch sometime this weekend.
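
For anyone skimming the quoted description below: the following is only a 
minimal, hypothetical sketch of the retry shape being discussed, with made-up 
{{KeyGenerator}}/{{RetryingOp}} names rather than the actual stress classes. 
The point is simply that the bind values must be captured once, before the 
retry loop, so a retry re-runs the same statement instead of advancing the key 
iterator again.

{code:java}
import java.util.List;

// Hypothetical stand-ins for the stress plumbing; not the real classes.
interface KeyGenerator { List<Object> nextBindValues(); }
interface StatementRunner { void execute(List<Object> bindValues) throws Exception; }

class RetryingOp
{
    // Capture the bind values once, outside the retry loop, so every retry
    // re-executes the exact same statement. The buggy pattern called
    // keys.nextBindValues() on each attempt, consuming fresh keys and
    // eventually exhausting the iterator (the NoSuchElementException above).
    void runWithRetry(KeyGenerator keys, StatementRunner runner, int maxTries) throws Exception
    {
        List<Object> bindValues = keys.nextBindValues();
        Exception last = null;
        for (int attempt = 0; attempt < maxTries; attempt++)
        {
            try
            {
                runner.execute(bindValues);
                return;
            }
            catch (Exception e)
            {
                last = e; // e.g. a read/write timeout; retry with the same values
            }
        }
        throw last; // assumes maxTries >= 1
    }
}
{code}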

> cassandra-stress user command misbehaves when retrying operations
> -
>
> Key: CASSANDRA-13871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13871
> Project: Cassandra
>  Issue Type: Bug
>  Components: Stress
>Reporter: Daniel Cranford
>Priority: Minor
> Attachments: 0001-Fixing-cassandra-stress-user-operations-retry.patch
>
>
> o.a.c.stress.Operation will retry queries a configurable number of times. 
> When the "user" command is invoked, the o.a.c.stress.operations.userdefined 
> SchemaInsert and SchemaQuery operations are used.
> When SchemaInsert and SchemaQuery are retried (e.g. after a Read/WriteTimeout 
> exception), they advance the PartitionIterator used to generate the keys to 
> insert/query (SchemaInsert.java:85, SchemaQuery.java:129). This means each 
> retry will use a different set of keys.
> The predefined set of operations avoids this problem by packaging up the 
> arguments to bind to the query into the RunOp object, so that retrying the 
> operation results in exactly the same query with the same arguments being run.
> This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the 
> PartitionIterator (Partition.RowIterator before the change) was reinitialized 
> prior to each query retry, thus generating the same set of keys each time.
> This problem is reported rather confusingly. The only error that shows up in 
> a log file (specified with -log file=foo.log) is the unhelpful
> {noformat}
> java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
> (NoSuchElementException)
> at org.apache.cassandra.stress.Operation.error(Operation.java:136)
> at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
> at 
> org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
> at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
> {noformat}
> Standard error is only slightly more helpful, displaying the ignorable 
> initial read/write error and confusing java.util.NoSuchElementException 
> lines (caused by PartitionIterator exhaustion), followed by the above 
> IOException with stack trace, e.g.
> {noformat}
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
> during read query
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
> (NoSuchElementException)
> at org.apache.cassandra.stress.Operation.error(Operation.java:136)
> at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
> at 
> org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
> at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14495) Memory Leak /High Memory usage post 3.11.2 upgrade

2018-10-05 Thread Chris Lohfink (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640544#comment-16640544
 ] 

Chris Lohfink commented on CASSANDRA-14495:
---

> heap memory usage bumps up

This is how the JVM works: objects that are created sit on the heap and build 
up until a GC runs. Heap usage going up is expected, normal behavior.
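
As a trivial, stand-alone illustration of that point (plain Java, unrelated to 
Cassandra internals): heap usage climbs as objects are allocated and only 
drops once a collection actually runs.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class HeapGrowthDemo
{
    private static long usedMb()
    {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args)
    {
        System.out.println("used before allocations: " + usedMb() + " MB");
        List<byte[]> garbage = new ArrayList<>();
        for (int i = 0; i < 200; i++)
            garbage.add(new byte[1024 * 1024]);   // allocate roughly 200 MB
        System.out.println("used after allocations:  " + usedMb() + " MB");
        garbage.clear();                          // make the buffers unreachable
        System.gc();                              // only a request, but usually collects here
        System.out.println("used after GC:           " + usedMb() + " MB");
    }
}
{code}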

> Memory Leak /High Memory usage post 3.11.2 upgrade
> --
>
> Key: CASSANDRA-14495
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14495
> Project: Cassandra
>  Issue Type: Bug
>  Components: Metrics
>Reporter: Abdul Patel
>Priority: Major
> Attachments: cas_heap.txt
>
>
> Hi All,
>  
> I recently upgraded my non-prod Cassandra cluster (4 nodes, single DC) from 
> 3.10 to 3.11.2.
> No issues were reported apart from nodetool info showing more than 80% usage.
> I initially had 16GB of memory on each node; later I bumped it up to 20GB and 
> rebooted all nodes.
> I waited for a week and now I have again seen memory usage of more than 80%, 
> i.e. 16GB+.
> This means some memory is leaking over time.
> Has anyone faced such an issue, or do we have any workaround? My 3.11.2 
> upgrade rollout has been halted because of this bug.
> ===
> ID : 65b64f5a-7fe6-4036-94c8-8da9c57718cc
> Gossip active  : true
> Thrift active  : true
> Native Transport active: true
> Load   : 985.24 MiB
> Generation No  : 1526923117
> Uptime (seconds)   : 1097684
> Heap Memory (MB)   : 16875.64 / 20480.00
> Off Heap Memory (MB)   : 20.42
> Data Center    : DC7
> Rack   : rac1
> Exceptions : 0
> Key Cache  : entries 3569, size 421.44 KiB, capacity 100 MiB, 
> 7931933 hits, 8098632 requests, 0.979 recent hit rate, 14400 save period in 
> seconds
> Row Cache  : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 
> requests, NaN recent hit rate, 0 save period in seconds
> Counter Cache  : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 
> requests, NaN recent hit rate, 7200 save period in seconds
> Chunk Cache    : entries 2361, size 147.56 MiB, capacity 3.97 GiB, 
> 2412803 misses, 72594047 requests, 0.967 recent hit rate, NaN microseconds 
> miss latency
> Percent Repaired   : 99.88086234106282%
> Token  : (invoke with -T/--tokens to see all 256 tokens)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14495) Memory Leak /High Memory usage post 3.11.2 upgrade

2018-10-05 Thread Abdul Patel (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640541#comment-16640541
 ] 

Abdul Patel commented on CASSANDRA-14495:
-

I have seen the same pattern in 3.11.3 as well: it works for 2-3 weeks and then 
suddenly heap memory usage bumps up, and then I get alerts every hour.

The only new thing is that I am also installing Cassandra Reaper with the new 
patch, but even with Reaper down I see the same behavior.

Do we just bump up the max heap, or is it a bug?

> Memory Leak /High Memory usage post 3.11.2 upgrade
> --
>
> Key: CASSANDRA-14495
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14495
> Project: Cassandra
>  Issue Type: Bug
>  Components: Metrics
>Reporter: Abdul Patel
>Priority: Major
> Attachments: cas_heap.txt
>
>
> Hi All,
>  
> I recently upgraded my non-prod Cassandra cluster (4 nodes, single DC) from 
> 3.10 to 3.11.2.
> No issues were reported apart from nodetool info showing more than 80% usage.
> I initially had 16GB of memory on each node; later I bumped it up to 20GB and 
> rebooted all nodes.
> I waited for a week and now I have again seen memory usage of more than 80%, 
> i.e. 16GB+.
> This means some memory is leaking over time.
> Has anyone faced such an issue, or do we have any workaround? My 3.11.2 
> upgrade rollout has been halted because of this bug.
> ===
> ID : 65b64f5a-7fe6-4036-94c8-8da9c57718cc
> Gossip active  : true
> Thrift active  : true
> Native Transport active: true
> Load   : 985.24 MiB
> Generation No  : 1526923117
> Uptime (seconds)   : 1097684
> Heap Memory (MB)   : 16875.64 / 20480.00
> Off Heap Memory (MB)   : 20.42
> Data Center    : DC7
> Rack   : rac1
> Exceptions : 0
> Key Cache  : entries 3569, size 421.44 KiB, capacity 100 MiB, 
> 7931933 hits, 8098632 requests, 0.979 recent hit rate, 14400 save period in 
> seconds
> Row Cache  : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 
> requests, NaN recent hit rate, 0 save period in seconds
> Counter Cache  : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 
> requests, NaN recent hit rate, 7200 save period in seconds
> Chunk Cache    : entries 2361, size 147.56 MiB, capacity 3.97 GiB, 
> 2412803 misses, 72594047 requests, 0.967 recent hit rate, NaN microseconds 
> miss latency
> Percent Repaired   : 99.88086234106282%
> Token  : (invoke with -T/--tokens to see all 256 tokens)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Resolved] (CASSANDRA-14804) Running repair on multiple nodes in parallel could halt entire repair

2018-10-05 Thread Blake Eggleston (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Eggleston resolved CASSANDRA-14804.
-
Resolution: Fixed

No problem, glad you got it figured out

> Running repair on multiple nodes in parallel could halt entire repair 
> --
>
> Key: CASSANDRA-14804
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14804
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Jaydeepkumar Chovatia
>Priority: Major
> Fix For: 3.0.18
>
>
> Possible deadlock if we run repair on multiple nodes at the same time. We 
> have come across a situation in production where, if we repair multiple 
> nodes at the same time, repair hangs forever. Here are the details:
> Time t1
>  {{node-1}} has issued a repair command to {{node-2}}, but for some reason 
> {{node-2}} didn't receive the request, hence {{node-1}} is waiting at 
> [prepareForRepair 
> |https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/ActiveRepairService.java#L333]
>  for 1 hour *with the lock*
> Time t2
>  {{node-2}} sent a prepare repair request to {{node-1}}; some exception 
> occurred on {{node-1}} and it is trying to clean up the parent session 
> [here|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L172]
>  but {{node-1}} cannot get the lock, as the 1 hour above has not yet elapsed
> snippet of jstack on {{node-1}}
> {quote}"Thread-888" #262588 daemon prio=5 os_prio=0 waiting on condition
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for (a java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>  at 
> org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:332)
>  - locked <> (a org.apache.cassandra.service.ActiveRepairService)
>  at 
> org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:214)
>  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown
>  Source)
>  at java.lang.Thread.run(Thread.java:748)
> "AntiEntropyStage:1" #1789 daemon prio=5 os_prio=0 waiting for monitor entry 
> []
>  java.lang.Thread.State: BLOCKED (on object monitor)
>  at 
> org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)
>  - waiting to lock <> (a org.apache.cassandra.service.ActiveRepairService)
>  at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:172)
>  at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown
>  Source)
>  at java.lang.Thread.run(Thread.java:748){quote}
> Time t3:
>  {{node-2}} (and possibly other nodes, {{node-3}}…) sent a [prepare request 
> |https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/ActiveRepairService.java#L333]
>  to {{node-1}}, but {{node-1}}’s AntiEntropyStage thread is busy waiting for the 
> lock at {{ActiveRepairService.removeParentRepairSession}}, hence {{node-2}}, 
> {{node-3}} (and possibly other nodes) will also go into a 1 hour wait *with the 
> lock*. This rolling effect continues and stalls repair in the entire ring.
> If we totally stop triggering repair then the system would recover slowly, but 
> here are the two major problems with this:
>  1. Externally there is no way to decide whether to trigger a new repair or 
> wait for the system to recover
>  2. In this case the system recovers eventu

[jira] [Commented] (CASSANDRA-14776) Transient Replication: Hints on timeout should be disabled for writes to transient nodes

2018-10-05 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640424#comment-16640424
 ] 

Benedict commented on CASSANDRA-14776:
--

My only issue here is that, if we're depending on hints anyway, why don't we 
just rely on them for the non-transient node?

It feels like once we've lost the ability to meet any of our guarantees *and* 
promptness, we may as well avoid polluting the nodes.  Hints delivered after a 
repair, while a node is being brought back online, for instance, will only 
cause unnecessary read repairs until the next repair, despite the data already 
being consistently replicated everywhere.

I agree it's a bit of a grey area though.

> Transient Replication:  Hints on timeout should be disabled for writes to 
> transient nodes
> -
>
> Key: CASSANDRA-14776
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14776
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Benedict
>Priority: Minor
> Fix For: 4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14761) Rename speculative_write_threshold to something more appropriate

2018-10-05 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640423#comment-16640423
 ] 

Benedict commented on CASSANDRA-14761:
--

I thought we had agreed on {{transient_write_threshold}}, although for the 
*counters* (which we also need to rename) {{transient_write}} is ambiguous, as 
we are really counting only those triggered by our threshold.  In that case, 
maybe simply {{transient_threshold}} for the percentile, and 
{{transient_threshold_writes}} for the counts?  We could later add a 
straightforward {{transient_writes}} to include all those triggered by the 
failure detector, perhaps.

> Rename speculative_write_threshold to something more appropriate
> 
>
> Key: CASSANDRA-14761
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14761
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> It's not really speculative. This commit is where it was last named and shows 
> what to update 
> https://github.com/aweisberg/cassandra/commit/e1df8e977d942a1b0da7c2a7554149c781d0e6c3



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14804) Running repair on multiple nodes in parallel could halt entire repair

2018-10-05 Thread Jaydeepkumar Chovatia (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640334#comment-16640334
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-14804:
---

In our branch {{prepareForRepair}} is still *{{synchronized}}*; that was fixed 
in CASSANDRA-13849, which we missed backporting. Let me backport 
CASSANDRA-13849 to our branch; hopefully that will fix the issue.

Thanks a lot [~bdeggleston] for your help!
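
For context, a minimal sketch of the blocking pattern described in this 
ticket, assuming the pre-CASSANDRA-13849 shape where {{prepareForRepair}} is 
{{synchronized}} (a deliberate simplification, not the real 
{{ActiveRepairService}} code): the timed wait parks while holding the monitor, 
so the AntiEntropyStage cleanup path that needs the same monitor is blocked 
for the full hour.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified stand-in, not the real ActiveRepairService.
class RepairServiceSketch
{
    private final CountDownLatch prepareLatch = new CountDownLatch(1);

    // Pre-CASSANDRA-13849 shape: the timed wait happens while holding the
    // object monitor, so if the remote node never answers we park here for
    // the full hour *with the lock held*.
    synchronized void prepareForRepair() throws InterruptedException
    {
        prepareLatch.await(1, TimeUnit.HOURS);
    }

    // Called from AntiEntropyStage to clean up after a failure; it needs the
    // same monitor, so it blocks until the hour above elapses.
    synchronized void removeParentRepairSession(java.util.UUID parentSessionId)
    {
        // cleanup elided
    }
}
{code}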

> Running repair on multiple nodes in parallel could halt entire repair 
> --
>
> Key: CASSANDRA-14804
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14804
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Jaydeepkumar Chovatia
>Priority: Major
> Fix For: 3.0.18
>
>
> Possible deadlock if we run repair on multiple nodes at the same time. We 
> have come across a situation in production where, if we repair multiple 
> nodes at the same time, repair hangs forever. Here are the details:
> Time t1
>  {{node-1}} has issued a repair command to {{node-2}}, but for some reason 
> {{node-2}} didn't receive the request, hence {{node-1}} is waiting at 
> [prepareForRepair 
> |https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/ActiveRepairService.java#L333]
>  for 1 hour *with the lock*
> Time t2
>  {{node-2}} sent a prepare repair request to {{node-1}}; some exception 
> occurred on {{node-1}} and it is trying to clean up the parent session 
> [here|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L172]
>  but {{node-1}} cannot get the lock, as the 1 hour above has not yet elapsed
> snippet of jstack on {{node-1}}
> {quote}"Thread-888" #262588 daemon prio=5 os_prio=0 waiting on condition
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for (a java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>  at 
> org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:332)
>  - locked <> (a org.apache.cassandra.service.ActiveRepairService)
>  at 
> org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:214)
>  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown
>  Source)
>  at java.lang.Thread.run(Thread.java:748)
> "AntiEntropyStage:1" #1789 daemon prio=5 os_prio=0 waiting for monitor entry 
> []
>  java.lang.Thread.State: BLOCKED (on object monitor)
>  at 
> org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)
>  - waiting to lock <> (a org.apache.cassandra.service.ActiveRepairService)
>  at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:172)
>  at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown
>  Source)
>  at java.lang.Thread.run(Thread.java:748){quote}
> Time t3:
>  {{node-2}}(and possibly other nodes {{node-3}}…) sent [prepare request 
> |https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/ActiveRepairService.java#L333]
>  to {{node-1}}, but {{node-1}}’s AntiEntropyStage thread is busy awaiting for 
> lock at {{ActiveRepairService.removeParentRepairSession}}, hence {{node-2}}, 
> {{node-3}} (and possibly other nodes) will also go in 1 hour wait *with 
> lock*. This rolling effect continues and stalls repair in entire rin

[jira] [Commented] (CASSANDRA-14804) Running repair on multiple nodes in parallel could halt entire repair

2018-10-05 Thread Blake Eggleston (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640295#comment-16640295
 ] 

Blake Eggleston commented on CASSANDRA-14804:
-

[~chovatia.jayd...@gmail.com] I’m not sure how we'd get to the state in t2. We 
wait for an hour on a latch we instantiate in {{prepareForRepair}}, and 
{{removeParentRepairSession}} is synchronized on the object monitor. One 
shouldn’t block the other. I think the jstack in the description is missing the 
thread where the {{ActiveRepairService}} monitor is being held.

> Running repair on multiple nodes in parallel could halt entire repair 
> --
>
> Key: CASSANDRA-14804
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14804
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Jaydeepkumar Chovatia
>Priority: Major
> Fix For: 3.0.18
>
>
> Possible deadlock if we run repair on multiple nodes at the same time. We 
> have come across a situation in production where, if we repair multiple 
> nodes at the same time, repair hangs forever. Here are the details:
> Time t1
>  {{node-1}} has issued a repair command to {{node-2}}, but for some reason 
> {{node-2}} didn't receive the request, hence {{node-1}} is waiting at 
> [prepareForRepair 
> |https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/ActiveRepairService.java#L333]
>  for 1 hour *with the lock*
> Time t2
>  {{node-2}} sent a prepare repair request to {{node-1}}; some exception 
> occurred on {{node-1}} and it is trying to clean up the parent session 
> [here|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L172]
>  but {{node-1}} cannot get the lock, as the 1 hour above has not yet elapsed
> snippet of jstack on {{node-1}}
> {quote}"Thread-888" #262588 daemon prio=5 os_prio=0 waiting on condition
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for (a java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>  at 
> org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:332)
>  - locked <> (a org.apache.cassandra.service.ActiveRepairService)
>  at 
> org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:214)
>  at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown
>  Source)
>  at java.lang.Thread.run(Thread.java:748)
> "AntiEntropyStage:1" #1789 daemon prio=5 os_prio=0 waiting for monitor entry 
> []
>  java.lang.Thread.State: BLOCKED (on object monitor)
>  at 
> org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)
>  - waiting to lock <> (a org.apache.cassandra.service.ActiveRepairService)
>  at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:172)
>  at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>  at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown
>  Source)
>  at java.lang.Thread.run(Thread.java:748){quote}
> Time t3:
>  {{node-2}}(and possibly other nodes {{node-3}}…) sent [prepare request 
> |https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/ActiveRepairService.java#L333]
>  to {{node-1}}, but {{node-1}}’s AntiEntropyStage thread is busy awaiting for 
> lock at {{ActiveRepairService.removeParentRepairSession}}, hence {{node-2}}, 
> {{node-3}} (and possibly other nodes) will al

[jira] [Commented] (CASSANDRA-14776) Transient Replication: Hints on timeout should be disabled for writes to transient nodes

2018-10-05 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640274#comment-16640274
 ] 

Ariel Weisberg commented on CASSANDRA-14776:


So I don't think this is totally true if we are trying to achieve EACH_QUORUM 
in every data center?

There are cases where, yes, we already achieved it and hinting is not useful, 
but there are cases where we achieved local quorum but not each quorum, and we 
might like transient replicas to be hinted. Hints are a pretty effective 
mechanism for bringing remote DCs up to date without running repair, and they 
work when repair can't run, such as when nodes are down.

I feel like if we went to the trouble to attempt a write to a transient 
replica, it's OK for us to then hint it?

> Transient Replication:  Hints on timeout should be disabled for writes to 
> transient nodes
> -
>
> Key: CASSANDRA-14776
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14776
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Benedict
>Priority: Minor
> Fix For: 4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14761) Rename speculative_write_threshold to something more appropriate

2018-10-05 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640271#comment-16640271
 ] 

Ariel Weisberg commented on CASSANDRA-14761:


[~benedict] what are we going to change this to? I want to get this done so I 
can reference it in other material.

> Rename speculative_write_threshold to something more appropriate
> 
>
> Key: CASSANDRA-14761
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14761
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> It's not really speculative. This commit is where it was last named and shows 
> what to update 
> https://github.com/aweisberg/cassandra/commit/e1df8e977d942a1b0da7c2a7554149c781d0e6c3



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14807) Avoid querying “self” through messaging service when collecting full data during read repair

2018-10-05 Thread Alex Petrov (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-14807:

Description: 
Currently, when collecting full requests during read-repair, we go through the 
messaging service instead of executing the query locally.

|[patch|https://github.com/apache/cassandra/pull/278]|[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]|[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|

  was:Currently, when collecting full requests during read-repair, we go 
through the messaging service instead of executing the query locally.


> Avoid querying “self” through messaging service when collecting full data 
> during read repair
> 
>
> Key: CASSANDRA-14807
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14807
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Major
>
> Currently, when collecting full requests during read-repair, we go through 
> the messaging service instead of executing the query locally.
> |[patch|https://github.com/apache/cassandra/pull/278]|[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]|[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14807) Avoid querying “self” through messaging service when collecting full data during read repair

2018-10-05 Thread Alex Petrov (JIRA)
Alex Petrov created CASSANDRA-14807:
---

 Summary: Avoid querying “self” through messaging service when 
collecting full data during read repair
 Key: CASSANDRA-14807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14807
 Project: Cassandra
  Issue Type: Bug
Reporter: Alex Petrov
Assignee: Alex Petrov


Currently, when collecting full requests during read-repair, we go through the 
messaging service instead of executing the query locally.
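
A rough sketch of the general short-circuit pattern this describes 
(hypothetical names, not the actual patch): when the target replica is the 
local node, execute the read locally instead of round-tripping through the 
messaging service.

{code:java}
import java.net.InetAddress;

// Hypothetical names; just the general "short-circuit self" shape.
class DataRequestDispatcher
{
    private final InetAddress self;

    DataRequestDispatcher(InetAddress self)
    {
        this.self = self;
    }

    void requestFullData(InetAddress replica, Runnable localRead, Runnable remoteSend)
    {
        if (replica.equals(self))
            localRead.run();   // run the read locally, no serialization or verb handler
        else
            remoteSend.run();  // go through the messaging service as before
    }
}
{code}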



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14807) Avoid querying “self” through messaging service when collecting full data during read repair

2018-10-05 Thread Alex Petrov (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-14807:

Description: 
Currently, when collecting full requests during read-repair, we go through the 
messaging service instead of executing the query locally.

||[patch|https://github.com/apache/cassandra/pull/278]||[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]||

|[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|

  was:
Currently, when collecting full requests during read-repair, we go through the 
messaging service instead of executing the query locally.

||[patch|https://github.com/apache/cassandra/pull/278]||[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]||

[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|


> Avoid querying “self” through messaging service when collecting full data 
> during read repair
> 
>
> Key: CASSANDRA-14807
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14807
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Major
>
> Currently, when collecting full requests during read-repair, we go through 
> the messaging service instead of executing the query locally.
> ||[patch|https://github.com/apache/cassandra/pull/278]||[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]||
> |[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14807) Avoid querying “self” through messaging service when collecting full data during read repair

2018-10-05 Thread Alex Petrov (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-14807:

Description: 
Currently, when collecting full requests during read-repair, we go through the 
messaging service instead of executing the query locally.

||[patch|https://github.com/apache/cassandra/pull/278]||[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]||

[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|

  was:
Currently, when collecting full requests during read-repair, we go through the 
messaging service instead of executing the query locally.

|[patch|https://github.com/apache/cassandra/pull/278]|[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]|[utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|


> Avoid querying “self” through messaging service when collecting full data 
> during read repair
> 
>
> Key: CASSANDRA-14807
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14807
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Major
>
> Currently, when collecting full requests during read-repair, we go through 
> the messaging service instead of executing the query locally.
> ||[patch|https://github.com/apache/cassandra/pull/278]||[dtest-patch|https://github.com/apache/cassandra-dtest/pull/39]||
> [utest|https://circleci.com/gh/ifesdjeen/cassandra/641]|[dtest-vnode|https://circleci.com/gh/ifesdjeen/cassandra/640]|[dtest-novnode|https://circleci.com/gh/ifesdjeen/cassandra/639]|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14373) Allow using custom script for chronicle queue BinLog archival

2018-10-05 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639945#comment-16639945
 ] 

Ariel Weisberg commented on CASSANDRA-14373:


+1

One interesting thing to note is that the retry mechanism is going to reorder 
the things it is archiving when it supplies them to the archive script. It's 
probably fine, but I wonder if people can tell when files are missing? Like are 
they sequentially numbered?

> Allow using custom script for chronicle queue BinLog archival
> -
>
> Key: CASSANDRA-14373
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14373
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Stefan Podkowinski
>Assignee: Pramod K Sivaraju
>Priority: Major
>  Labels: lhf, pull-request-available
> Fix For: 4.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be nice to allow the user to configure an archival script that will 
> be executed in {{BinLog.onReleased(cycle, file)}} for every deleted bin log, 
> just as we do in {{CommitLogArchiver}}. The script should be able to copy the 
> released file to an external location or do whatever the author had in mind. 
> Deleting the log file should be delegated to the script as well.
> See CASSANDRA-13983, CASSANDRA-12151 for use cases.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14373) Allow using custom script for chronicle queue BinLog archival

2018-10-05 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639846#comment-16639846
 ] 

Marcus Eriksson commented on CASSANDRA-14373:
-

Thanks for the review; I pushed a couple of commits to address the comments.

bq. Removing documentation of defaults doesn't seem like a pure win since they 
still seem to be there?
I removed the defaults from the nodetool command to be able to override options 
that are set in cassandra.yaml; do you have a suggestion for a better way to do 
it? We don't know the actual defaults in the nodetool command, and the user 
would have to check cassandra.yaml, which might not be very user friendly.
bq. This isn't just a path right, it's a format specifier of sorts with %path?
Right, changed it to 'command' instead.
bq. This is BinLogOptions but the comments reference Audit log and there is a 
typo in the first sentence here
fixed
bq. Depending on what the archiving script does and why it failed there could 
be unfortunate consequences to retrying repeatedly.
added a new configuration param that allows users to set max retries to 0 to 
avoid this (defaults to 10 retries as retrying forever might also be bad)
bq. exec forks, forking can be slow because of page table copying which in the 
past was slow under things like Xen.. I'm just mentioning it. I don't think you 
need to make it better right now. I don't know offhand how you invoke an 
external command more efficiently from Java.
Yeah, not sure what to do here; a quick search tells me ProcessBuilder seems to 
be the way to do this (see the sketch at the end of this comment).
bq. Is this going to enable it for all tests? Is that a good idea can we only 
enable it for just the unit tests that require it?
yeah, removed, should not be there
bq. Should use execute instead of submit unless consuming the result future
fixed
bq. Same here
here we actually wait for the future

bq. The dtests are good tests but could they be unit tests since they are 
single node. 
In general I agree, but in this case it executes the nodetool command as an end 
user would, against a running cassandra cluster (well, node, but anyway). I 
suppose we could stand up a real cluster in a unit test and execute the 
nodetool script as well, but I assume that would take about as long as doing it 
using ccm.
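
For illustration, a rough {{ProcessBuilder}}-based sketch of the archiving hook 
being discussed (hypothetical names, not the committed implementation), 
assuming a command template with a {{%path}} placeholder and a bounded retry 
count as mentioned above:

{code:java}
import java.io.File;
import java.io.IOException;

// Hypothetical names; assumes a command template with a %path placeholder
// and a bounded retry count, as discussed in the review comments above.
class BinLogArchiverSketch
{
    private final String archiveCommand; // e.g. "/usr/local/bin/archive-binlog %path"
    private final int maxRetries;

    BinLogArchiverSketch(String archiveCommand, int maxRetries)
    {
        this.archiveCommand = archiveCommand;
        this.maxRetries = maxRetries;
    }

    void onReleased(File segment) throws InterruptedException
    {
        String[] cmd = archiveCommand.replace("%path", segment.getAbsolutePath()).split(" ");
        for (int attempt = 0; attempt <= maxRetries; attempt++)
        {
            try
            {
                Process p = new ProcessBuilder(cmd).inheritIO().start();
                if (p.waitFor() == 0)
                    return;   // the script succeeded and is responsible for deleting the file
            }
            catch (IOException e)
            {
                // fall through and retry up to maxRetries, then give up
            }
        }
    }
}
{code}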

> Allow using custom script for chronicle queue BinLog archival
> -
>
> Key: CASSANDRA-14373
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14373
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Stefan Podkowinski
>Assignee: Pramod K Sivaraju
>Priority: Major
>  Labels: lhf, pull-request-available
> Fix For: 4.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be nice to allow the user to configure an archival script that will 
> be executed in {{BinLog.onReleased(cycle, file)}} for every deleted bin log, 
> just as we do in {{CommitLogArchiver}}. The script should be able to copy the 
> released file to an external location or do whatever the author had in mind. 
> Deleting the log file should be delegated to the script as well.
> See CASSANDRA-13983, CASSANDRA-12151 for use cases.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14713) Update docker image used for testing

2018-10-05 Thread Stefan Podkowinski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623200#comment-16623200
 ] 

Stefan Podkowinski edited comment on CASSANDRA-14713 at 10/5/18 1:04 PM:
-

Dtest results from b.a.o when run with "spod/cassandra-testing-ubuntu18-java11" 
image:

||Branch||Results||
|2.1|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=650!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/650/]|
|2.2|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=657!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/657/]|
|3.0|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=648!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/648/]|
|3.11|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=651!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/651/]|
|trunk|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=646!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/646/]|
|Unit 
Tests|[!https://circleci.com/gh/spodkowinski/cassandra/tree/WIP-14713.svg?style=svg!|https://circleci.com/gh/spodkowinski/cassandra/tree/WIP-14713]|




was (Author: spo...@gmail.com):
Dtest results from b.a.o when run with "spod/cassandra-testing-ubuntu18-java11" 
image:

||Branch||Results||
|2.1|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=650!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/650/]|
|2.2|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=654!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/654/]|
|3.0|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=648!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/648/]|
|3.11|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=651!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/651/]|
|trunk|[!https://builds.apache.org/buildStatus/icon?job=Cassandra-devbranch-dtest&build=646!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/646/]|
|Unit 
Tests|[!https://circleci.com/gh/spodkowinski/cassandra/tree/WIP-14713.svg?style=svg!|https://circleci.com/gh/spodkowinski/cassandra/tree/WIP-14713]|



> Update docker image used for testing
> 
>
> Key: CASSANDRA-14713
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14713
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Testing
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Attachments: Dockerfile
>
>
> Tests executed on builds.apache.org ({{docker/jenkins/jenkinscommand.sh}}) 
> and circleCI ({{.circleci/config.yml}}) will currently use the same 
> [cassandra-test|https://hub.docker.com/r/kjellman/cassandra-test/] docker 
> image ([github|https://github.com/mkjellman/cassandra-test-docker]) by 
> [~mkjellman].
> We should manage this image on our own as part of cassandra-builds, to keep 
> it updated. There's also a [Apache 
> user|https://hub.docker.com/u/apache/?page=1] on docker hub for publishing 
> images.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Resolved] (CASSANDRA-14805) Fails on running Cassandra server

2018-10-05 Thread Joshua McKenzie (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua McKenzie resolved CASSANDRA-14805.
-
Resolution: Invalid

Please reach out to the community on #cassandra on freenode or via the [user 
mailing lists|http://cassandra.apache.org/community/].

This Jira is for tracking development of the database.

> Fails on running Cassandra server 
> --
>
> Key: CASSANDRA-14805
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14805
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ravi Gangwar
>Priority: Critical
>
> Full Product Version: Java "1.8.0_181"
> OS Version: Ubuntu 16.04 LTS
> EXTRA RELEVANT SYSTEM CONFIGURATION:
> Just installed Cassandra 2.2.11.
> A DESCRIPTION OF THE PROBLEM:
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGBUS (0x7) at pc=0x7f42fc492e70, pid=12128, tid=0x7f42fc3c9700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # C [liblz4-java8098863625230398555.so+0x5e70] LZ4_decompress_fast+0xd0
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> --- T H R E A D ---
> Current thread (0x006d0800): JavaThread "CompactionExecutor:7" daemon 
> [_thread_in_native, id=5278, stack(0x7f42fc389000,0x7f42fc3ca000)]
> siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 
> 0x7f2026b8b000
> Registers:
> RAX=0x7f432200750a, RBX=0x7f2026b8affc, RCX=0x7f4322005856, 
> RDX=0x7f432200750a
> RSP=0x7f42fc3c7e50, RBP=0x7f2026b8940f, RSI=0x7f4322011d88, 
> RDI=0x0007
> R8 =0x7f4322011d84, R9 =0x7f4322011d90, R10=0x0003, 
> R11=0x
> R12=0x, R13=0x7f1ffe33d40f, R14=0x7f4322011d87, 
> R15=0x7f4322011d8b
> RIP=0x7f42fc492e70, EFLAGS=0x00010287, CSGSFS=0x002b0033, 
> ERR=0x0004
>  TRAPNO=0x000e
> Top of Stack: (sp=0x7f42fc3c7e50)
> 0x7f42fc3c7e50: 4ce9 
> 0x7f42fc3c7e60: 0004 0001
> 0x7f42fc3c7e70: 0002 0001
> 0x7f42fc3c7e80: 0004 0004
> 0x7f42fc3c7e90: 0004 0004
> 0x7f42fc3c7ea0:  
> 0x7f42fc3c7eb0:  
> 0x7f42fc3c7ec0:  0001
> 0x7f42fc3c7ed0: 0002 0003
> 0x7f42fc3c7ee0: 7f42fc3c7fa8 006d09f8
> 0x7f42fc3c7ef0:  
> 0x7f42fc3c7f00: 7f1ffe33d40f 7f4322001d90
> 0x7f42fc3c7f10: 2884c000 7f42fc48f59d
> 0x7f42fc3c7f20: 000733b5fa78 
> 0x7f42fc3c7f30: 7f42fc3c7fc0 
> 0x7f42fc3c7f40: 2884bfff 7f42fc3c7fa8
> 0x7f42fc3c7f50: 006d0800 7f43119e8f1d
> 0x7f42fc3c7f60: 7f42fc3c7f98 
> 0x7f42fc3c7f70: 0001 
> 0x7f42fc3c7f80: 00070764f4d8 
> 0x7f42fc3c7f90:  00075b5937f0
> 0x7f42fc3c7fa0: 7f42fc3c7ff0 0006f91d2e38
> 0x7f42fc3c7fb0: 0006d7e3f210 7f42fc3c8040
> 0x7f42fc3c7fc0: 00070764f4d8 7f4311f9eca8
> 0x7f42fc3c7fd0: 51070001 00072dc9abb0
> 0x7f42fc3c7fe0: 28851103 0006dd2d
> 0x7f42fc3c7ff0: 00022883f40b eb6b26fe2884bffc
> 0x7f42fc3c8000: 00070764f4d8 00012884c000
> 0x7f42fc3c8010: 00075b5937f0 00075b5937f0
> 0x7f42fc3c8020: 00060008 0006d7f67598
> 0x7f42fc3c8030: 0006d7f67348 7f43120080d4
> 0x7f42fc3c8040: 0001 0006d7f676d8
>  
>  
> SYSTEM CONFIGURATION :
>  
> 1. CPU - Intel Core i7-4771 CPU @ 3.50 GHz * 8
> 2. RAM - 16 GB
> 3. STORAGE - 967.6 GB
> 4. OS - Ubuntu 16.04 LTS
> 5. Apache Cassandra - 2.2.11
> 6. CPP Driver - 2.2.1-1
> 7. libuv - 1.4.2-1 
> 8. Java version - "1.8.0_181" 
>  Java (TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java Hotspot (TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>  
> Cassandra was working normally before generating this bug. Now when i am 
> trying to restart my server getting this bu

[jira] [Created] (CASSANDRA-14806) CircleCI workflow improvements and Java 11 support

2018-10-05 Thread Stefan Podkowinski (JIRA)
Stefan Podkowinski created CASSANDRA-14806:
--

 Summary: CircleCI workflow improvements and Java 11 support
 Key: CASSANDRA-14806
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14806
 Project: Cassandra
  Issue Type: Improvement
  Components: Build, Testing
Reporter: Stefan Podkowinski
Assignee: Stefan Podkowinski


The current CircleCI config could use some cleanup and improvements. First of 
all, the config has been made more modular by using the new CircleCI 2.1 
executors and command elements. Based on CASSANDRA-14713, there's now also a 
Java 11 executor that will allow running tests under Java 11. The {{build}} 
step will be done using Java 11 in all cases, so we can catch any regressions 
there and also test the Java 11 multi-jar artifact during dtests, which we'd 
also create during the release process.

The job workflow has also been changed to make use of the [manual job 
approval|https://circleci.com/docs/2.0/workflows/#holding-a-workflow-for-a-manual-approval]
 feature, which allows running dtest jobs only on request rather than 
automatically with every commit. The Java 8 unit tests still run automatically, 
but that could also easily be changed if needed. See this [example 
workflow|https://circleci.com/workflow-run/08ecb879-9aaa-4d75-84d6-b00dc9628425],
 where the start_ jobs are triggers that need manual approval before the actual 
jobs start.

There was some churn in manually editing the config for paid and non-paid 
resource tiers before. This has been mostly mitigated now by using project 
settings instead, for overriding the lower defaults (see below), and by 
scheduling dtests on request, which will only run on paid accounts anyway, so 
we can use high settings for those right away. The only remaining question is 
how we might dynamically adjust the {{resource_class}} and {{parallelism}} 
settings for unit tests. So at this point the CircleCI config will work for 
both paid and non-paid accounts by default, but paid accounts will see slower 
unit test results, as only medium instances are used (i.e. 15 min instead of 
4 min).

Attention CircleCI paid account users: you'll have to add "{{CCM_MAX_HEAP_SIZE: 
2048M}}" and "{{CCM_HEAP_NEWSIZE: 512M}}" to your project's environment 
settings, or create a context, to override the lower defaults for free 
instances!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org