[jira] [Commented] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.

2017-04-05 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956912#comment-15956912
 ] 

Joel Knighton commented on CASSANDRA-13307:
---

There's one you can set through the Edit button, if you scroll down. If you 
don't have permissions to access/edit that somehow, come complain in 
#cassandra-dev on IRC.

Thanks for volunteering to review!

> The specification of protocol version in cqlsh means the python driver 
> doesn't automatically downgrade protocol version.
> 
>
> Key: CASSANDRA-13307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 3.11.x
>
>
> Hi,
> Looks like we've regressed on the issue described in:
> https://issues.apache.org/jira/browse/CASSANDRA-9467
> in that we're no longer able to connect from newer cqlsh versions
> (e.g. trunk) to older versions of Cassandra with a lower protocol version
> (e.g. 2.1 with protocol version 3).
> The problem seems to be that we're relying on the client's ability to 
> automatically downgrade the protocol version, implemented in Cassandra here:
> https://issues.apache.org/jira/browse/CASSANDRA-12838
> and utilised in the python client here:
> https://datastax-oss.atlassian.net/browse/PYTHON-240
> The problem, however, comes from implementing:
> https://datastax-oss.atlassian.net/browse/PYTHON-537
> "Don't downgrade protocol version if explicitly set" 
> (included when we bumped the python driver from 3.5.0 to 3.7.0 as part of 
> fixing https://issues.apache.org/jira/browse/CASSANDRA-11534),
> since we do explicitly specify the protocol version in bin/cqlsh.py.
> I've got a patch which adds an option to explicitly specify the protocol 
> version (for those who want to do that) and otherwise defaults to not 
> setting it, i.e. using the protocol version of the driver we ship, which 
> should by default be the same protocol version as the server's.
> Then it should downgrade gracefully, as was intended. 
> Let me know if that seems reasonable.
> Thanks,
> Matt
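
For readers following along: the difference described above hinges on whether cqlsh passes a protocol_version to the python driver at all. Below is a minimal sketch using the DataStax python driver (driver 3.7+ semantics assumed; the contact point is only a placeholder):

{code}
from cassandra.cluster import Cluster

# Explicit protocol_version: per PYTHON-537 the driver will not downgrade if
# the server only speaks an older protocol; the connection fails instead.
pinned = Cluster(['127.0.0.1'], protocol_version=4)

# No protocol_version: the driver starts at its highest supported version and
# negotiates downward until the server accepts (the PYTHON-240 behaviour the
# patch wants cqlsh to rely on by default).
negotiated = Cluster(['127.0.0.1'])
session = negotiated.connect()
print(negotiated.protocol_version)  # reflects the version actually agreed on
{code}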



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.

2017-04-05 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13307:
--
Reviewer: mck

> The specification of protocol version in cqlsh means the python driver 
> doesn't automatically downgrade protocol version.
> 
>
> Key: CASSANDRA-13307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 3.11.x
>
>
> Hi,
> Looks like we've regressed on the issue described in:
> https://issues.apache.org/jira/browse/CASSANDRA-9467
> in that we're no longer able to connect from newer cqlsh versions
> (e.g. trunk) to older versions of Cassandra with a lower protocol version
> (e.g. 2.1 with protocol version 3).
> The problem seems to be that we're relying on the client's ability to 
> automatically downgrade the protocol version, implemented in Cassandra here:
> https://issues.apache.org/jira/browse/CASSANDRA-12838
> and utilised in the python client here:
> https://datastax-oss.atlassian.net/browse/PYTHON-240
> The problem, however, comes from implementing:
> https://datastax-oss.atlassian.net/browse/PYTHON-537
> "Don't downgrade protocol version if explicitly set" 
> (included when we bumped the python driver from 3.5.0 to 3.7.0 as part of 
> fixing https://issues.apache.org/jira/browse/CASSANDRA-11534),
> since we do explicitly specify the protocol version in bin/cqlsh.py.
> I've got a patch which adds an option to explicitly specify the protocol 
> version (for those who want to do that) and otherwise defaults to not 
> setting it, i.e. using the protocol version of the driver we ship, which 
> should by default be the same protocol version as the server's.
> Then it should downgrade gracefully, as was intended. 
> Let me know if that seems reasonable.
> Thanks,
> Matt



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12929) Fix version check to enable streaming keep-alive

2017-04-04 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955195#comment-15955195
 ] 

Joel Knighton commented on CASSANDRA-12929:
---

Thanks! It happens.

> Fix version check to enable streaming keep-alive
> 
>
> Key: CASSANDRA-12929
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12929
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Shuler
>Assignee: Paulo Motta
>  Labels: dtest, test-failure
> Fix For: 4.0
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_novnode_dtest/494/testReport/bootstrap_test/TestBootstrap/simple_bootstrap_test_small_keepalive_period
> {noformat}
> Error Message
> Expected [['COMPLETED']] from SELECT bootstrapped FROM system.local WHERE 
> key='local', but got [[u'IN_PROGRESS']]
>  >> begin captured logging << 
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-YmnyEI
> dtest: DEBUG: Done setting configuration options:
> {   'num_tokens': None, 'phi_convict_threshold': 5, 'start_rpc': 'true'}
> cassandra.cluster: INFO: New Cassandra host  
> discovered
> - >> end captured logging << -
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/tools/decorators.py", line 46, in 
> wrapped
> f(obj)
>   File "/home/automaton/cassandra-dtest/bootstrap_test.py", line 163, in 
> simple_bootstrap_test_small_keepalive_period
> assert_bootstrap_state(self, node2, 'COMPLETED')
>   File "/home/automaton/cassandra-dtest/tools/assertions.py", line 297, in 
> assert_bootstrap_state
> assert_one(session, "SELECT bootstrapped FROM system.local WHERE 
> key='local'", [expected_bootstrap_state])
>   File "/home/automaton/cassandra-dtest/tools/assertions.py", line 130, in 
> assert_one
> assert list_res == [expected], "Expected {} from {}, but got 
> {}".format([expected], query, list_res)
> "Expected [['COMPLETED']] from SELECT bootstrapped FROM system.local WHERE 
> key='local', but got [[u'IN_PROGRESS']]\n >> begin 
> captured logging << \ndtest: DEBUG: cluster ccm 
> directory: /tmp/dtest-YmnyEI\ndtest: DEBUG: Done setting configuration 
> options:\n{   'num_tokens': None, 'phi_convict_threshold': 5, 'start_rpc': 
> 'true'}\ncassandra.cluster: INFO: New Cassandra host  datacenter1> discovered\n- >> end captured logging << 
> -"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (CASSANDRA-13402) testall failure in org.apache.cassandra.dht.StreamStateStoreTest.testUpdateAndQueryAvailableRanges

2017-04-03 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton resolved CASSANDRA-13402.
---
Resolution: Duplicate

> testall failure in 
> org.apache.cassandra.dht.StreamStateStoreTest.testUpdateAndQueryAvailableRanges
> --
>
> Key: CASSANDRA-13402
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13402
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Sean McCarthy
>  Labels: test-failure, testall
> Attachments: TEST-org.apache.cassandra.dht.StreamStateStoreTest.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_testall/1488/testReport/org.apache.cassandra.dht/StreamStateStoreTest/testUpdateAndQueryAvailableRanges
> {code}
> Stacktrace
> java.lang.NullPointerException
>   at 
> org.apache.cassandra.streaming.StreamSession.isKeepAliveSupported(StreamSession.java:244)
>   at 
> org.apache.cassandra.streaming.StreamSession.<init>(StreamSession.java:196)
>   at 
> org.apache.cassandra.dht.StreamStateStoreTest.testUpdateAndQueryAvailableRanges(StreamStateStoreTest.java:53)
> {code}
> Related failures: (13)
> http://cassci.datastax.com/job/trunk_testall/1488/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Reopened] (CASSANDRA-12929) Fix version check to enable streaming keep-alive

2017-04-03 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton reopened CASSANDRA-12929:
---

It looks like this is causing quite a few test failures on commit.

In dtests, this includes many tests in sstable_generation_loading_test and 
snapshot_test.TestSnapshot.test_basic_snapshot_and_restore.

In testall, this includes 
StreamStateStoreTest.testUpdateAndQueryAvailableRanges, 
LocalSyncTaskTest.testDifference, 
StreamingRepairTaskTest.incrementalStreamPlan, 
StreamingRepairTaskTest.fullStreamPlan, 
StreamTransferTaskTest.testScheduleTimeout, and 
StreamTransferTaskTest.testFailSessionDuringTransferShouldNotReleaseReferences.

There may be others that I missed, but that list should get things pointed in 
the right direction.

> Fix version check to enable streaming keep-alive
> 
>
> Key: CASSANDRA-12929
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12929
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Shuler
>Assignee: Paulo Motta
>  Labels: dtest, test-failure
> Fix For: 4.0
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_novnode_dtest/494/testReport/bootstrap_test/TestBootstrap/simple_bootstrap_test_small_keepalive_period
> {noformat}
> Error Message
> Expected [['COMPLETED']] from SELECT bootstrapped FROM system.local WHERE 
> key='local', but got [[u'IN_PROGRESS']]
>  >> begin captured logging << 
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-YmnyEI
> dtest: DEBUG: Done setting configuration options:
> {   'num_tokens': None, 'phi_convict_threshold': 5, 'start_rpc': 'true'}
> cassandra.cluster: INFO: New Cassandra host  
> discovered
> - >> end captured logging << -
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/tools/decorators.py", line 46, in 
> wrapped
> f(obj)
>   File "/home/automaton/cassandra-dtest/bootstrap_test.py", line 163, in 
> simple_bootstrap_test_small_keepalive_period
> assert_bootstrap_state(self, node2, 'COMPLETED')
>   File "/home/automaton/cassandra-dtest/tools/assertions.py", line 297, in 
> assert_bootstrap_state
> assert_one(session, "SELECT bootstrapped FROM system.local WHERE 
> key='local'", [expected_bootstrap_state])
>   File "/home/automaton/cassandra-dtest/tools/assertions.py", line 130, in 
> assert_one
> assert list_res == [expected], "Expected {} from {}, but got 
> {}".format([expected], query, list_res)
> "Expected [['COMPLETED']] from SELECT bootstrapped FROM system.local WHERE 
> key='local', but got [[u'IN_PROGRESS']]\n >> begin 
> captured logging << \ndtest: DEBUG: cluster ccm 
> directory: /tmp/dtest-YmnyEI\ndtest: DEBUG: Done setting configuration 
> options:\n{   'num_tokens': None, 'phi_convict_threshold': 5, 'start_rpc': 
> 'true'}\ncassandra.cluster: INFO: New Cassandra host  datacenter1> discovered\n- >> end captured logging << 
> -"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-03-22 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Fix Version/s: (was: 3.11.x)
   (was: 4.x)
   (was: 3.0.x)
   (was: 2.2.x)
   4.0
   3.11.0
   3.0.13
   2.2.10

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.10, 3.0.13, 3.11.0, 4.0
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and 
> checking some host IDs or tokens by doing a gossip "shadow round" once 
> before joining the cluster. This is done by sending a gossip SYN to all 
> seeds until we receive a response with the cluster state, from which we can 
> move on in the bootstrap process. Receiving a response marks the shadow 
> round as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the 
> received state again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when 
> this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]
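
The race described above is easier to see in a stripped-down model (an illustrative Python sketch, not the actual Gossiper code; all names here are invented):

{code}
import threading

class ShadowRound:
    """Toy model: the first ACK ends the round; later ACKs must be ignored."""

    def __init__(self, seeds):
        self.seeds = seeds
        self.lock = threading.Lock()
        self.in_shadow_round = True
        self.endpoint_states = {}

    def send_syns(self, send):
        # A SYN goes to every seed, so several ACKs may eventually arrive.
        for seed in self.seeds:
            send(seed)

    def on_ack(self, seed, states):
        with self.lock:
            if not self.in_shadow_round:
                # Late reply from another seed: without this guard it would
                # repopulate state that resetEndpointStateMap() just cleared
                # and trigger follow-up work such as migration tasks.
                return
            self.endpoint_states.update(states)
            self.in_shadow_round = False  # only the first ACK counts
{code}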



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-03-22 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937125#comment-15937125
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

Committed to 2.2 as {{bf0906b92cf65161d828e31bc46436d427bbb4b8}} and merged 
forward through 3.0, 3.11, and trunk. Added Jason Brown as an additional 
reviewer in the commit since his feedback was incorporated in the latest round 
of patches.

Thanks everyone!

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.10, 3.0.13, 3.11.0, 4.0
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and 
> checking some host IDs or tokens by doing a gossip "shadow round" once 
> before joining the cluster. This is done by sending a gossip SYN to all 
> seeds until we receive a response with the cluster state, from which we can 
> move on in the bootstrap process. Receiving a response marks the shadow 
> round as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the 
> received state again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when 
> this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (CASSANDRA-13347) dtest failure in upgrade_tests.upgrade_through_versions_test.TestUpgrade_current_2_2_x_To_indev_3_0_x.rolling_upgrade_test

2017-03-17 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton resolved CASSANDRA-13347.
---
   Resolution: Fixed
Fix Version/s: 3.11.0
   3.0.13

This should be fixed by [CASSANDRA-13320].

> dtest failure in 
> upgrade_tests.upgrade_through_versions_test.TestUpgrade_current_2_2_x_To_indev_3_0_x.rolling_upgrade_test
> --
>
> Key: CASSANDRA-13347
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13347
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Sean McCarthy
>  Labels: dtest, test-failure
> Fix For: 3.0.13, 3.11.0
>
> Attachments: node1_debug.log, node1_gc.log, node1.log, 
> node2_debug.log, node2_gc.log, node2.log, node3_debug.log, node3_gc.log, 
> node3.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.0_large_dtest/58/testReport/upgrade_tests.upgrade_through_versions_test/TestUpgrade_current_2_2_x_To_indev_3_0_x/rolling_upgrade_test
> {code}
> Error Message
> Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', ['upgradesstables', 
> '-a']] exited with non-zero status; exit status: 2; 
> stderr: error: null
> -- StackTrace --
> java.lang.AssertionError
>   at org.apache.cassandra.db.rows.Rows.collectStats(Rows.java:70)
>   at 
> org.apache.cassandra.io.sstable.format.big.BigTableWriter$StatsCollector.applyToRow(BigTableWriter.java:197)
>   at 
> org.apache.cassandra.db.transform.BaseRows.applyOne(BaseRows.java:116)
>   at org.apache.cassandra.db.transform.BaseRows.add(BaseRows.java:107)
>   at 
> org.apache.cassandra.db.transform.UnfilteredRows.add(UnfilteredRows.java:41)
>   at 
> org.apache.cassandra.db.transform.Transformation.add(Transformation.java:156)
>   at 
> org.apache.cassandra.db.transform.Transformation.apply(Transformation.java:122)
>   at 
> org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:147)
>   at 
> org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:125)
>   at 
> org.apache.cassandra.db.compaction.writers.DefaultCompactionWriter.realAppend(DefaultCompactionWriter.java:57)
>   at 
> org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.append(CompactionAwareWriter.java:109)
>   at 
> org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:195)
>   at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>   at 
> org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:89)
>   at 
> org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:61)
>   at 
> org.apache.cassandra.db.compaction.CompactionManager$5.execute(CompactionManager.java:415)
>   at 
> org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:307)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File 
> "/home/automaton/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py",
>  line 279, in rolling_upgrade_test
> self.upgrade_scenario(rolling=True)
>   File 
> "/home/automaton/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py",
>  line 345, in upgrade_scenario
> self.upgrade_to_version(version_meta, partial=True, nodes=(node,))
>   File 
> "/home/automaton/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py",
>  line 446, in upgrade_to_version
> node.nodetool('upgradesstables -a')
>   File 
> "/home/automaton/venv/local/lib/python2.7/site-packages/ccmlib/node.py", line 
> 789, in nodetool
> return handle_external_tool_process(p, ['nodetool', '-h', 'localhost', 
> '-p', str(self.jmx_port), cmd.split()])
>   File 
> "/home/automaton/venv/local/lib/python2.7/site-packages/ccmlib/node.py", line 
> 2002, in handle_external_tool_process
> raise ToolError(cmd_args, rc, out, err)
> {code}
> Related failures:
> 

[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-15 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13306:
--
Reviewer: Dave Brosius

> Builds fetch source jars for build dependencies, not just source dependencies
> -
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
>  Issue Type: Bug
>  Components: Build
>Reporter: Joel Knighton
>Assignee: Joel Knighton
> Fix For: 4.0
>
>
> A recent commit without a linked JIRA cleaned up dead imports and also added 
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This 
> causes ant to fetch source jars for the build deps, but we have an explicit 
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the 
> separate build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-15 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927489#comment-15927489
 ] 

Joel Knighton commented on CASSANDRA-13306:
---

Thanks! I ran a round of CI and tested builds with empty and populated .m2.

Committed to trunk as {{fe08463c3b7135a0f1b121bb0d148c80b8c7e123}}.

> Builds fetch source jars for build dependencies, not just source dependencies
> -
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
>  Issue Type: Bug
>  Components: Build
>Reporter: Joel Knighton
>Assignee: Joel Knighton
> Fix For: 4.0
>
>
> A recent commit without a linked JIRA cleaned up dead imports and also added 
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This 
> causes ant to fetch source jars for the build deps, but we have an explicit 
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the 
> separate build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-15 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13306:
--
   Resolution: Fixed
Fix Version/s: 4.0
   Status: Resolved  (was: Ready to Commit)

> Builds fetch source jars for build dependencies, not just source dependencies
> -
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
>  Issue Type: Bug
>  Components: Build
>Reporter: Joel Knighton
>Assignee: Joel Knighton
> Fix For: 4.0
>
>
> A recent commit without a linked JIRA cleaned up dead imports and also added 
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This 
> causes ant to fetch source jars for the build deps, but we have an explicit 
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the 
> separate build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-15 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13306:
--
Status: Ready to Commit  (was: Patch Available)

> Builds fetch source jars for build dependencies, not just source dependencies
> -
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
>  Issue Type: Bug
>  Components: Build
>Reporter: Joel Knighton
>Assignee: Joel Knighton
>
> A recent commit without a linked JIRA cleaned up dead imports and also added 
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This 
> causes ant to fetch source jars for the build deps, but we have an explicit 
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the 
> separate build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-15 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13306:
--
Status: Patch Available  (was: Open)

> Builds fetch source jars for build dependencies, not just source dependencies
> -
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
>  Issue Type: Bug
>  Components: Build
>Reporter: Joel Knighton
>Assignee: Joel Knighton
>
> A recent commit without a linked JIRA cleaned up dead imports and also added 
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This 
> causes ant to fetch source jars for the build deps, but we have an explicit 
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the 
> separate build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-15 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton reassigned CASSANDRA-13306:
-

Assignee: Joel Knighton

> Builds fetch source jars for build dependencies, not just source dependencies
> -
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
>  Issue Type: Bug
>  Components: Build
>Reporter: Joel Knighton
>Assignee: Joel Knighton
>
> A recent commit without a linked JIRA cleaned up dead imports and also added 
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This 
> causes ant to fetch source jars for the build deps, but we have an explicit 
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the 
> separate build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-03-13 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907535#comment-15907535
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

I planned on leaving that honor to [~spo...@gmail.com] as patch author, but if 
he doesn't, I'm happy to do so.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and 
> checking some host IDs or tokens by doing a gossip "shadow round" once 
> before joining the cluster. This is done by sending a gossip SYN to all 
> seeds until we receive a response with the cluster state, from which we can 
> move on in the bootstrap process. Receiving a response marks the shadow 
> round as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the 
> received state again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when 
> this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-03-09 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903212#comment-15903212
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

Sure - while I'd argue that a need for a change in the future could be 
introduced in the future patch, I agree that this distinction is very minor and 
won't cause any problems. Thanks for the patch and your patience!

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and 
> checking some host IDs or tokens by doing a gossip "shadow round" once 
> before joining the cluster. This is done by sending a gossip SYN to all 
> seeds until we receive a response with the cluster state, from which we can 
> move on in the bootstrap process. Receiving a response marks the shadow 
> round as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the 
> received state again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when 
> this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies

2017-03-07 Thread Joel Knighton (JIRA)
Joel Knighton created CASSANDRA-13306:
-

 Summary: Builds fetch source jars for build dependencies, not just 
source dependencies
 Key: CASSANDRA-13306
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
 Project: Cassandra
  Issue Type: Bug
  Components: Build
Reporter: Joel Knighton


A recent commit without a linked JIRA cleaned up dead imports and also added a 
{{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This causes 
ant to fetch source jars for the build deps, but we have an explicit separate 
build-deps-pom-sources that fetches sources.

This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.

Was this an intentional change, [~dbrosius]? It seems to conflate the separate 
build-deps-pom and build-deps-pom-sources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897963#comment-15897963
 ] 

Joel Knighton commented on CASSANDRA-13303:
---

[CASSANDRA-13038] introduced the regression in 
{{a5ce963117acf5e4cf0a31057551f2f42385c398}}. The regression was fixed in 
{{adbe2cc4df0134955a2c83ae4ebd0086ea5e9164}}.

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in 
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but 
> the conditions are not safe.
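
For context on the check the reporter mentions: a single-SSTable tombstone compaction is only triggered when the estimated droppable-tombstone ratio clears a threshold, so a ratio of 0.0 can never be elected. A hedged sketch of that gate (the 0.2 default is the usual tombstone_threshold, assumed here rather than taken from the test itself):

{code}
def worth_dropping_tombstones(droppable_ratio, tombstone_threshold=0.2):
    # An SSTable is only elected for a single-SSTable compaction when enough
    # of its data is estimated to be droppable (expired) tombstones.
    return droppable_ratio > tombstone_threshold

assert not worth_dropping_tombstones(0.0)  # the flaky case reported above
{code}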



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897953#comment-15897953
 ] 

Joel Knighton commented on CASSANDRA-13303:
---

The exact test output would help diagnose this, but it sounds like the failure 
introduced/fixed in [CASSANDRA-13038], as seen in CI 
[here|http://cassci.datastax.com/job/trunk_testall/1436/testReport/junit/org.apache.cassandra.db.compaction/CompactionsTest/testSingleSSTableCompactionWithSizeTieredCompaction/].

Can you make sure this failure still occurs after fetching latest trunk? If so, 
what's your trunk commit hash?

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in 
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but 
> the conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky

2017-03-06 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897934#comment-15897934
 ] 

Joel Knighton commented on CASSANDRA-13303:
---

Thanks for the report, but there isn't a lot that's actionable here.

Could you provide the branch(es) it is failing on for you? In addition, the 
specific failure (as shown by test output/stacktrace) you see would help 
someone identify the problem, particularly in cases like this when the test 
isn't failing on CI.

This test recently had a regression introduced and fixed in [CASSANDRA-13038], 
but I don't know if it's the same failure you're seeing.

> CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super 
> flaky
> ---
>
> Key: CASSANDRA-13303
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13303
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Benjamin Roth
>
> On my machine, this test succeeds maybe 1 out of 10 times.
> The cause seems to be that the sstable is not elected for compaction in 
> worthDroppingTombstones, as droppableRatio is 0.0.
> I don't know the primary intention of this test, so I didn't touch it, but 
> the conditions are not safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-03-02 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892772#comment-15892772
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

Thanks! The latest changes look good - however, if moving the System.nanoTime() 
call to the comparison site, it seems that the {{firstSynSendAt}} truly does 
reduce to a boolean, since the comparison will now always be true if 
{{firstSynSendAt}} has been set. I don't think the existing patch will cause 
any problems, but it may be more complicated than it needs to be.
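
Spelled out as a sketch (illustrative only, not the patch under review): once the stored send time is only ever compared against a clock reading taken at the comparison site, the comparison can never be false after the field is set, so a plain boolean carries the same information.

{code}
import time

first_syn_send_at = 0  # nanosecond timestamp; 0 means "no SYN sent yet"

def record_first_syn():
    global first_syn_send_at
    first_syn_send_at = time.monotonic_ns()

def reply_acceptable():
    # monotonic_ns() taken here is always >= the stored send time, so this
    # whole expression reduces to: first_syn_send_at != 0
    return first_syn_send_at != 0 and first_syn_send_at <= time.monotonic_ns()
{code}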

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and 
> checking some host IDs or tokens by doing a gossip "shadow round" once 
> before joining the cluster. This is done by sending a gossip SYN to all 
> seeds until we receive a response with the cluster state, from which we can 
> move on in the bootstrap process. Receiving a response marks the shadow 
> round as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the 
> received state again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when 
> this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (CASSANDRA-13281) testall failure in org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization

2017-03-02 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton resolved CASSANDRA-13281.
---
Resolution: Duplicate

Confirmed this is a failure due to a small update needed to the test after 
[CASSANDRA-13038]. Reopened and being fixed there.

> testall failure in 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization
> 
>
> Key: CASSANDRA-13281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Sean McCarthy
>Assignee: Joel Knighton
>  Labels: test-failure, testall
> Attachments: 
> TEST-org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.11_testall/96/testReport/org.apache.cassandra.io.sstable.metadata/MetadataSerializerTest/testSerialization
> {code}
> Error Message
> expected: 
> but was:
> {code}
> {code}
> Stacktrace
> junit.framework.AssertionFailedError: 
> expected: 
> but was:
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization(MetadataSerializerTest.java:72)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (CASSANDRA-13281) testall failure in org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization

2017-03-02 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton reassigned CASSANDRA-13281:
-

Assignee: Joel Knighton

> testall failure in 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization
> 
>
> Key: CASSANDRA-13281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Sean McCarthy
>Assignee: Joel Knighton
>  Labels: test-failure, testall
> Attachments: 
> TEST-org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.11_testall/96/testReport/org.apache.cassandra.io.sstable.metadata/MetadataSerializerTest/testSerialization
> {code}
> Error Message
> expected: 
> but was:
> {code}
> {code}
> Stacktrace
> junit.framework.AssertionFailedError: 
> expected: 
> but was:
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization(MetadataSerializerTest.java:72)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()

2017-03-01 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891229#comment-15891229
 ] 

Joel Knighton commented on CASSANDRA-13038:
---

Nit + 3.0 changes look good. If CI doesn't have any problems, +1.

> 33% of compaction time spent in StreamingHistogram.update()
> ---
>
> Key: CASSANDRA-13038
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Corentin Chary
>Assignee: Jeff Jirsa
> Fix For: 3.0.12, 3.11.0
>
> Attachments: compaction-speedup.patch, 
> compaction-streaminghistrogram.png, profiler-snapshot.nps
>
>
> With the following table, which contains a *lot* of cells:
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
> metric uuid,
> time_start_ms bigint,
> offset smallint,
> count int,
> value double,
> PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
> 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 
> 'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
> Read Count: 1822
> Read Latency: 1.8870054884742042 ms.
> Write Count: 2212271647
> Write Latency: 0.027705127678653473 ms.
> Pending Flushes: 0
> Table: datapoints_11520p_60s
> SSTable count: 47
> Space used (live): 300417555945
> Space used (total): 303147395017
> Space used by snapshots (total): 0
> Off heap memory used (total): 207453042
> SSTable Compression Ratio: 0.4955200053039823
> Number of keys (estimate): 16343723
> Memtable cell count: 220576
> Memtable data size: 17115128
> Memtable off heap memory used: 0
> Memtable switch count: 2872
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 1103167888
> Local write latency: 0.025 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used: 105118296
> Bloom filter off heap memory used: 106547192
> Index summary off heap memory used: 27730962
> Compression metadata off heap memory used: 73174888
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 51012
> Compacted partition mean bytes: 7899
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in 
> StreamingHistogram.update() (which is used to store the estimated tombstone 
> drop times).
> This could be caused by a huge number of different deletion times, which 
> would make the bins huge, but this histogram should be capped to 100 keys. 
> It's more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the 
> fact that this table uses TWCS also gives us an additional hint that 
> sampling deletion times would be fine.
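
A rough sketch of why update() gets hot (illustrative Python, loosely modelled on the streaming-histogram idea rather than Cassandra's actual class): every distinct deletion time adds a bin, and once the cap of 100 is reached each further update pays for a nearest-pair merge, so hundreds of millions of cells with diverse deletion times keep that merge loop busy.

{code}
class TombstoneHistogram:
    def __init__(self, max_bins=100):
        self.max_bins = max_bins
        self.bins = {}  # approximate deletion time -> cell count

    def update(self, point, count=1):
        self.bins[point] = self.bins.get(point, 0) + count
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        keys = sorted(self.bins)
        # Find the two adjacent bins with the smallest gap and merge them into
        # a weighted-average bin; this is the per-update hot path once full.
        i = min(range(len(keys) - 1), key=lambda j: keys[j + 1] - keys[j])
        a, b = keys[i], keys[i + 1]
        ca, cb = self.bins.pop(a), self.bins.pop(b)
        self.bins[(a * ca + b * cb) / (ca + cb)] = ca + cb
{code}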



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()

2017-03-01 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891175#comment-15891175
 ] 

Joel Knighton commented on CASSANDRA-13038:
---

Thanks - on a first skim, both of those look good and fix the tests locally for 
me. One minor nit - if removing the maxSpoolSize from {{equals}} on 
{{StreamingHistogram}}, it seems we should remove it from {{hashCode}} as well 
to respect the method contract.
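
For reference, the contract being respected, shown with Python's analogous methods (an illustrative sketch, not the Java patch): objects that compare equal must produce equal hashes, so a field dropped from equality has to be dropped from the hash as well.

{code}
class HistogramSketch:
    def __init__(self, bins, max_spool_size):
        self.bins = dict(bins)
        self.max_spool_size = max_spool_size  # a tuning knob, not identity

    def __eq__(self, other):
        return isinstance(other, HistogramSketch) and self.bins == other.bins

    def __hash__(self):
        # Must be derived from the same fields as __eq__; hashing
        # max_spool_size too would let two "equal" objects hash differently.
        return hash(tuple(sorted(self.bins.items())))
{code}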

> 33% of compaction time spent in StreamingHistogram.update()
> ---
>
> Key: CASSANDRA-13038
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Corentin Chary
>Assignee: Jeff Jirsa
> Fix For: 3.0.12, 3.11.0
>
> Attachments: compaction-speedup.patch, 
> compaction-streaminghistrogram.png, profiler-snapshot.nps
>
>
> With the following table, which contains a *lot* of cells:
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
> metric uuid,
> time_start_ms bigint,
> offset smallint,
> count int,
> value double,
> PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
> 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 
> 'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
> Read Count: 1822
> Read Latency: 1.8870054884742042 ms.
> Write Count: 2212271647
> Write Latency: 0.027705127678653473 ms.
> Pending Flushes: 0
> Table: datapoints_11520p_60s
> SSTable count: 47
> Space used (live): 300417555945
> Space used (total): 303147395017
> Space used by snapshots (total): 0
> Off heap memory used (total): 207453042
> SSTable Compression Ratio: 0.4955200053039823
> Number of keys (estimate): 16343723
> Memtable cell count: 220576
> Memtable data size: 17115128
> Memtable off heap memory used: 0
> Memtable switch count: 2872
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 1103167888
> Local write latency: 0.025 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used: 105118296
> Bloom filter off heap memory used: 106547192
> Index summary off heap memory used: 27730962
> Compression metadata off heap memory used: 73174888
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 51012
> Compacted partition mean bytes: 7899
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in 
> StreamingHistogram.update() (which is used to store the estimated tombstone 
> drop times).
> This could be caused by a huge number of different deletion times, which 
> would make the bins huge, but this histogram should be capped to 100 keys. 
> It's more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the 
> fact that this table uses TWCS also gives us an additional hint that 
> sampling deletion times would be fine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()

2017-03-01 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891013#comment-15891013
 ] 

Joel Knighton commented on CASSANDRA-13038:
---

Thanks! I can reproduce the {{CompactionsTest}} failure locally, so feel free 
to ping me if I can help diagnose.

> 33% of compaction time spent in StreamingHistogram.update()
> ---
>
> Key: CASSANDRA-13038
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Corentin Chary
>Assignee: Jeff Jirsa
> Fix For: 3.0.12, 3.11.0
>
> Attachments: compaction-speedup.patch, 
> compaction-streaminghistrogram.png, profiler-snapshot.nps
>
>
> With the following table, which contains a *lot* of cells:
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
> metric uuid,
> time_start_ms bigint,
> offset smallint,
> count int,
> value double,
> PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
> 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 
> 'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
> Read Count: 1822
> Read Latency: 1.8870054884742042 ms.
> Write Count: 2212271647
> Write Latency: 0.027705127678653473 ms.
> Pending Flushes: 0
> Table: datapoints_11520p_60s
> SSTable count: 47
> Space used (live): 300417555945
> Space used (total): 303147395017
> Space used by snapshots (total): 0
> Off heap memory used (total): 207453042
> SSTable Compression Ratio: 0.4955200053039823
> Number of keys (estimate): 16343723
> Memtable cell count: 220576
> Memtable data size: 17115128
> Memtable off heap memory used: 0
> Memtable switch count: 2872
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 1103167888
> Local write latency: 0.025 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used: 105118296
> Bloom filter off heap memory used: 106547192
> Index summary off heap memory used: 27730962
> Compression metadata off heap memory used: 73174888
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 51012
> Compacted partition mean bytes: 7899
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in 
> StreamingHistogram.update() (which is used to store the estimated tombstone 
> drop times).
> This could be caused by a huge number of different deletion times, which 
> would make the bins huge, but this histogram should be capped to 100 keys. 
> It's more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the 
> fact that this table uses TWCS also gives us an additional hint that 
> sampling deletion times would be fine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()

2017-03-01 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890941#comment-15890941
 ] 

Joel Knighton edited comment on CASSANDRA-13038 at 3/1/17 8:09 PM:
---

It looks like this ticket introduced a few test failures. 
{{org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization}}
 is consistently failing on 3.11 and trunk after this commit, and 
{{org.apache.cassandra.db.compaction.CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction}}
 is failing nearly 100% of the time after this commit on trunk.

In both cases, these tests are failing on the linked CI above and appear to 
have no historical failures. I don't see any discussion of these CI failures 
for the linked branches on the ticket - are they being resolved elsewhere?

EDIT: In addition, reverting this commit fixes these test failures.



was (Author: jkni):
It looks like this ticket introduced a few test failures. 
{{org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization}}
 is consistently failing on 3.11 and trunk after this commit, and 
{{org.apache.cassandra.db.compaction.CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction}}
 is failing nearly 100% of the time after this commit on trunk.

In both cases, these tests are failing on the linked CI above and appear to 
have no historical failures. I don't see any discussion of these CI failures 
for the linked branches on the ticket - are they being resolved elsewhere?



> 33% of compaction time spent in StreamingHistogram.update()
> ---
>
> Key: CASSANDRA-13038
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Corentin Chary
>Assignee: Jeff Jirsa
> Fix For: 3.0.12, 3.11.0
>
> Attachments: compaction-speedup.patch, 
> compaction-streaminghistrogram.png, profiler-snapshot.nps
>
>
> With the following table, which contains a *lot* of cells:
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
> metric uuid,
> time_start_ms bigint,
> offset smallint,
> count int,
> value double,
> PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
> 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 
> 'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
> Read Count: 1822
> Read Latency: 1.8870054884742042 ms.
> Write Count: 2212271647
> Write Latency: 0.027705127678653473 ms.
> Pending Flushes: 0
> Table: datapoints_11520p_60s
> SSTable count: 47
> Space used (live): 300417555945
> Space used (total): 303147395017
> Space used by snapshots (total): 0
> Off heap memory used (total): 207453042
> SSTable Compression Ratio: 0.4955200053039823
> Number of keys (estimate): 16343723
> Memtable cell count: 220576
> Memtable data size: 17115128
> Memtable off heap memory used: 0
> Memtable switch count: 2872
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 1103167888
> Local write latency: 0.025 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used: 105118296
> Bloom filter off heap memory used: 106547192
> Index summary off heap memory used: 27730962
> Compression metadata off heap memory used: 73174888
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 51012
> Compacted partition mean bytes: 7899
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in 
> StreamingHistogram.update() (which is used to store the estimated tombstone 
> drop times).
> This could be caused by a huge number of different deletion times, which would 
> make the bins huge, but this histogram should be capped to 100 keys. It's 
> more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the 
> fact that this table uses TWCS also gives us an additional hint that sampling 
> deletion times would be fine.

[jira] [Reopened] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()

2017-03-01 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton reopened CASSANDRA-13038:
---

It looks like this ticket introduced a few test failures. 
{{org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization}}
 is consistently failing on 3.11 and trunk after this commit, and 
{{org.apache.cassandra.db.compaction.CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction}}
 is failing nearly 100% of the time after this commit on trunk.

In both cases, these tests are failing on the linked CI above and appear to 
have no historical failures. I don't see any discussion of these CI failures 
for the linked branches on the ticket - are they being resolved elsewhere?



> 33% of compaction time spent in StreamingHistogram.update()
> ---
>
> Key: CASSANDRA-13038
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Corentin Chary
>Assignee: Jeff Jirsa
> Fix For: 3.0.12, 3.11.0
>
> Attachments: compaction-speedup.patch, 
> compaction-streaminghistrogram.png, profiler-snapshot.nps
>
>
> With the following table, that contains a *lot* of cells: 
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
> metric uuid,
> time_start_ms bigint,
> offset smallint,
> count int,
> value double,
> PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
> 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 
> 'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
> Read Count: 1822
> Read Latency: 1.8870054884742042 ms.
> Write Count: 2212271647
> Write Latency: 0.027705127678653473 ms.
> Pending Flushes: 0
> Table: datapoints_11520p_60s
> SSTable count: 47
> Space used (live): 300417555945
> Space used (total): 303147395017
> Space used by snapshots (total): 0
> Off heap memory used (total): 207453042
> SSTable Compression Ratio: 0.4955200053039823
> Number of keys (estimate): 16343723
> Memtable cell count: 220576
> Memtable data size: 17115128
> Memtable off heap memory used: 0
> Memtable switch count: 2872
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 1103167888
> Local write latency: 0.025 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used: 105118296
> Bloom filter off heap memory used: 106547192
> Index summary off heap memory used: 27730962
> Compression metadata off heap memory used: 73174888
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 51012
> Compacted partition mean bytes: 7899
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in 
> StreamingHistogram.update() (which is used to store the estimated tombstone 
> drop times).
> This could be caused by a huge number of different deletion times, which would 
> make the bins huge, but this histogram should be capped to 100 keys. It's 
> more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the 
> fact that this table uses TWCS also gives us an additional hint that sampling 
> deletion times would be fine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-02-24 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883114#comment-15883114
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

Do those answers address your questions well enough, [~jasobrown]? The latest 
patch addressed my concerns, but I don't want to step on your toes.

I had to restart dtests for 2.2, but the latest patch/CI looks good to me 
otherwise.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-02-22 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878461#comment-15878461
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

I think I can answer these - feel free to correct me, [~spo...@gmail.com]

In order,
* Presently, the tests depend on the mock MessagingService, which was added in 
[CASSANDRA-12016] to 3.10+. We'd need new tests for 2.2/3.0+, which is desirable, 
but I have no great ideas how to do it other than fiddly byteman tests. 
* I agree with this. Stefan and I discussed it on the first pass of review, and 
I wouldn't mind eliminating that check altogether and making it a boolean. 
On the one hand, it's cheap to check deserialization time and it excludes the messages 
that were deserialized prior to the check. On the other hand, there's no meaningful distinction 
in correctness-preserving behaviors between that and arbitrarily delayed gossip 
messages, and we need to handle the latter correctly anyway. I'm most concerned 
about this check giving future readers false hope :).
* It also seems to me that it doesn't presently need to be synchronized. 
That said, I assumed it was a defensive choice because the internals are 
definitely not safe to call on multiple threads, and someone may make that 
mistake in the future.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running

2017-02-21 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876697#comment-15876697
 ] 

Joel Knighton edited comment on CASSANDRA-13135 at 2/21/17 8:57 PM:


I think this is definitely worth doing from an organizational/efficiency 
standpoint. I'm not sure if the current behavior will greatly increase repair 
time; termination of the repair session will shut down the task executor used 
by the repair jobs, so the queued repair jobs should fail quickly.

The patches look correct, but I have a few thoughts/questions about the 
approach. It seems more error prone than necessary for {{cleanupJobs}} to take 
a reference to an executor. To me, it looks like we'll only ever want to clean 
up jobs from the executor provided in {{start}} and we could store a reference 
to the executor in {{start}}. It also might make more sense to handle this in 
{{forceShutdown}} rather than in a listener added to the RepairSession if a 
reference to the executor is stored.

There's a small typo in {{ActiveRepairService}} - "cacelled repair jobs" should 
be "cancelled repair jobs".

If you'd rather stay with the approach in the attached patches rather than 
something closer to my questions/comments above, we should remove the comment 
above {{forceShutdown}} saying that it will "clear all RepairJobs". This is 
currently incorrect and will remain incorrect if we continue to clean up the 
jobs in a listener attached to the RepairSession.
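
For clarity, here is a minimal Java sketch of the alternative described above (all names are illustrative; this is not the actual RepairSession code): keep the executor handed to {{start}} and shut it down from {{forceShutdown}}, so queued jobs are dropped rather than left to fail one by one.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only - not the real RepairSession class.
final class RepairSessionSketch
{
    private volatile ExecutorService taskExecutor;

    void start(ExecutorService executor)
    {
        // Keep a reference so shutdown logic doesn't need the executor handed in again.
        this.taskExecutor = executor;
        // ... submit repair jobs to taskExecutor here ...
    }

    void forceShutdown(Throwable reason)
    {
        ExecutorService executor = taskExecutor;
        if (executor != null)
            executor.shutdownNow();   // drops queued repair jobs instead of leaving them to fail one by one
        // ... fail the session future with 'reason' here ...
    }

    public static void main(String[] args)
    {
        RepairSessionSketch session = new RepairSessionSketch();
        session.start(Executors.newFixedThreadPool(2));
        session.forceShutdown(new RuntimeException("terminated by JMX"));
    }
}
{code}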



was (Author: jkni):
I think this is definitely worth doing from an organizational/efficiency 
standpoint. I'm not sure if this will greatly increase repair time; termination 
of the repair session will shut down the task executor used by the repair jobs, 
so the queued repair jobs should fail quickly.

The patches look correct, but I have a few thoughts/questions about the 
approach. It seems more error prone than necessary for {{cleanupJobs}} to take 
a reference to an executor. To me, it looks like we'll only ever want to clean 
up jobs from the executor provided in {{start}} and we could store a reference 
to the executor in {{start}}. It also might make more sense to handle this in 
{{forceShutdown}} rather than in a listener added to the RepairSession if a 
reference to the executor is stored.

There's a small typo in {{ActiveRepairService}} - "cacelled repair jobs" should 
be "cancelled repair jobs".

If you'd rather stay with the approach in the attached patches rather than 
something closer to my questions/comments above, we should remove the comment 
above {{forceShutdown}} saying that it will "clear all RepairJobs". This is 
currently incorrect and will remain incorrect if we continue to clean up the 
jobs in a listener attached to the RepairSession.


> Forced termination of repair session leaves repair jobs running
> ---
>
> Key: CASSANDRA-13135
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13135
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Yuki Morishita
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
>
> Forced termination of a repair session (by the failure detector or JMX) leaves 
> the repair jobs that the session created running after the session is terminated.
> This can increase repair time because of the unnecessary work left in the 
> repair job queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running

2017-02-21 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13135:
--
Status: Awaiting Feedback  (was: Open)

> Forced termination of repair session leaves repair jobs running
> ---
>
> Key: CASSANDRA-13135
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13135
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Yuki Morishita
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
>
> Forced termination of a repair session (by the failure detector or JMX) leaves 
> the repair jobs that the session created running after the session is terminated.
> This can increase repair time because of the unnecessary work left in the 
> repair job queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running

2017-02-21 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13135:
--
Status: Open  (was: Patch Available)

> Forced termination of repair session leaves repair jobs running
> ---
>
> Key: CASSANDRA-13135
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13135
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Yuki Morishita
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
>
> Forced termination of a repair session (by the failure detector or JMX) leaves 
> the repair jobs that the session created running after the session is terminated.
> This can increase repair time because of the unnecessary work left in the 
> repair job queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running

2017-02-21 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876697#comment-15876697
 ] 

Joel Knighton commented on CASSANDRA-13135:
---

I think this is definitely worth doing from an organizational/efficiency 
standpoint. I'm not sure if this will greatly increase repair time; termination 
of the repair session will shut down the task executor used by the repair jobs, 
so the queued repair jobs should fail quickly.

The patches look correct, but I have a few thoughts/questions about the 
approach. It seems more error prone than necessary for {{cleanupJobs}} to take 
a reference to an executor. To me, it looks like we'll only ever want to clean 
up jobs from the executor provided in {{start}} and we could store a reference 
to the executor in {{start}}. It also might make more sense to handle this in 
{{forceShutdown}} rather than in a listener added to the RepairSession if a 
reference to the executor is stored.

There's a small typo in {{ActiveRepairService}} - "cacelled repair jobs" should 
be "cancelled repair jobs".

If you'd rather stay with the approach in the attached patches rather than 
something closer to my questions/comments above, we should remove the comment 
above {{forceShutdown}} saying that it will "clear all RepairJobs". This is 
currently incorrect and will remain incorrect if we continue to clean up the 
jobs in a listener attached to the RepairSession.


> Forced termination of repair session leaves repair jobs running
> ---
>
> Key: CASSANDRA-13135
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13135
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Yuki Morishita
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
>
> Forced termination of a repair session (by the failure detector or JMX) leaves 
> the repair jobs that the session created running after the session is terminated.
> This can increase repair time because of the unnecessary work left in the 
> repair job queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-02-19 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Status: Awaiting Feedback  (was: Open)

A few questions/comments on the latest patches:
* On all versions, a space is missing in the conditional {{if(firstSynSendAt == 
0)}}
* On 2.2, it looks like the patch adds the {{valuesEqual}} method from later 
versions. Was this intentional? It looks unused.
* On all versions, using {{firstSynSendAt == 0}} to check if it has been 
initialized isn't entirely safe. It's entirely legal (although admittedly rare) 
for {{System.nanoTime}} to return 0. If this happened, all acks would be 
rejected.
* Comparisons of two {{System.nanoTime}} values (such as in 
{{GossipDigestAckVerbHandler}}) should not use t1 < t2. Instead, one should 
check the sign of the difference (t1 - t2 < 0), because the {{System.nanoTime}} 
long can overflow (see the sketch after this list).
* In {{maybeFinishShadowRound}}/{{finishShadowRound}}, we should add the states 
to the {{endpointShadowStateMap}} before setting {{inShadowRound}} to false. It 
looks like the current behavior admits a race where {{doShadowRound}} could 
read {{inShadowRound == false}} and exit its loop and copy the 
endpointShadowStateMap before it is filled by shadow round finish.
* I believe {{firstSynSendAt}} is accessed from multiple threads and needs to 
be volatile.
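
The overflow point deserves a concrete example. Below is a small, self-contained Java sketch (illustrative only; the variable names are invented and this is not code from the attached patches) showing why a plain {{t1 < t2}} comparison of {{System.nanoTime}} values breaks once the counter wraps, while checking the sign of the difference still gives the right answer:

{code}
public final class NanoTimeComparisonDemo
{
    // Correct: compare the sign of the difference, as recommended by the
    // System.nanoTime() javadoc, so the check survives numerical overflow.
    static boolean isAfter(long t, long reference)
    {
        return t - reference > 0;
    }

    public static void main(String[] args)
    {
        // Pretend the first SYN was sent just before the nanoTime counter wraps
        // and the ack arrives just after the wrap.
        long firstSynSendAt = Long.MAX_VALUE - 5;
        long ackReceivedAt  = firstSynSendAt + 10;   // overflows into negative territory

        // Naive comparison: the ack looks "older" than the SYN.
        System.out.println("naive  ackReceivedAt > firstSynSendAt : "
                           + (ackReceivedAt > firstSynSendAt));          // false
        // Difference-based comparison still gives the right answer.
        System.out.println("signed ack after SYN                  : "
                           + isAfter(ackReceivedAt, firstSynSendAt));    // true
    }
}
{code}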

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-02-19 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Status: Open  (was: Patch Available)

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-02-13 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863233#comment-15863233
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

[~jjirsa] - yes. That said, I don't anticipate getting to it within the next 
couple days, so feel free to give it the final review if it is higher priority 
for you than that. 

I've given several passes of review to the core concepts and they seem good. I 
think a final code style/details pass is all that remains.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-11479) BatchlogManager unit tests failing on truncate race condition

2017-02-13 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863235#comment-15863235
 ] 

Joel Knighton commented on CASSANDRA-11479:
---

More on my plate than Yuki's, I believe. The patch made sense to me, but I was 
waiting for a chance to dig deeper into the related compaction code before 
giving it a final OK. In the process, it slipped through the cracks quite 
badly. I'd be happy to do that, but it likely wouldn't happen in the next few 
days, so feel free to take it on instead if you're interested.

> BatchlogManager unit tests failing on truncate race condition
> -
>
> Key: CASSANDRA-11479
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11479
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Joel Knighton
>Assignee: Yuki Morishita
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
> Attachments: 
> TEST-org.apache.cassandra.batchlog.BatchlogManagerTest.log
>
>
> Example on CI 
> [here|http://cassci.datastax.com/job/trunk_testall/818/testReport/junit/org.apache.cassandra.batchlog/BatchlogManagerTest/testLegacyReplay_compression/].
>  This seems to have only started happening relatively recently (within the 
> last month or two).
> As far as I can tell, this is only showing up on BatchlogManagerTests purely 
> because it is an aggressive user of truncate. The assertion is hit in the 
> setUp method, so it can happen before any of the test methods. The assertion 
> occurs because a compaction is happening when truncate wants to discard 
> SSTables; trace level logs suggest that this compaction is submitted after 
> the pause on the CompactionStrategyManager.
> This should be reproducible by running BatchlogManagerTest in a loop - it 
> takes up to half an hour in my experience. A trace-level log from such a run 
> is attached - grep for my added log message "SSTABLES COMPACTING WHEN 
> DISCARDING" to find when the assert is hit.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (CASSANDRA-13161) testall failure in org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions

2017-02-01 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13161:
--
Status: Ready to Commit  (was: Patch Available)

> testall failure in 
> org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions
> -
>
> Key: CASSANDRA-13161
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13161
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Sean McCarthy
>Assignee: Benjamin Lerer
>  Labels: test-failure, testall
> Attachments: 
> TEST-org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions
> {code}
> Error Message
> expected:<11> but was:<10>
> {code}{code}
> Stacktrace
> junit.framework.AssertionFailedError: expected:<11> but was:<10>
>   at 
> org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions(CommitLogDescriptorTest.java:84)
> {code}
> Related Failures:
> http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions_compression/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13161) testall failure in org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions

2017-02-01 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848749#comment-15848749
 ] 

Joel Knighton commented on CASSANDRA-13161:
---

+1 - thanks

> testall failure in 
> org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions
> -
>
> Key: CASSANDRA-13161
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13161
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Sean McCarthy
>Assignee: Benjamin Lerer
>  Labels: test-failure, testall
> Attachments: 
> TEST-org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions
> {code}
> Error Message
> expected:<11> but was:<10>
> {code}{code}
> Stacktrace
> junit.framework.AssertionFailedError: expected:<11> but was:<10>
>   at 
> org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions(CommitLogDescriptorTest.java:84)
> {code}
> Related Failures:
> http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions_compression/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13112) test failure in snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum

2017-01-17 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827384#comment-15827384
 ] 

Joel Knighton commented on CASSANDRA-13112:
---

This should be a dtest-only fix. I've PRed a fix at 
[https://github.com/riptano/cassandra-dtest/pull/1425].

> test failure in 
> snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum
> ---
>
> Key: CASSANDRA-13112
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13112
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sean McCarthy
>Assignee: Joel Knighton
>  Labels: dtest, test-failure
> Attachments: node1_debug.log, node1_gc.log, node1.log, 
> node2_debug.log, node2_gc.log, node2.log, node3_debug.log, node3_gc.log, 
> node3.log, node4_debug.log, node4_gc.log, node4.log, node5_debug.log, 
> node5_gc.log, node5.log, node6_debug.log, node6_gc.log, node6.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_large_dtest/48/testReport/snitch_test/TestDynamicEndpointSnitch/test_multidatacenter_local_quorum
> {code}
> Error Message
> 75 != 76
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/tools/decorators.py", line 48, in 
> wrapped
> f(obj)
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 168, in 
> test_multidatacenter_local_quorum
> bad_jmx.read_attribute(read_stage, 'Value'))
>   File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual
> assertion_func(first, second, msg=msg)
>   File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual
> raise self.failureException(msg)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (CASSANDRA-13112) test failure in snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum

2017-01-17 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton reassigned CASSANDRA-13112:
-

Assignee: Joel Knighton

> test failure in 
> snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum
> ---
>
> Key: CASSANDRA-13112
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13112
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sean McCarthy
>Assignee: Joel Knighton
>  Labels: dtest, test-failure
> Attachments: node1_debug.log, node1_gc.log, node1.log, 
> node2_debug.log, node2_gc.log, node2.log, node3_debug.log, node3_gc.log, 
> node3.log, node4_debug.log, node4_gc.log, node4.log, node5_debug.log, 
> node5_gc.log, node5.log, node6_debug.log, node6_gc.log, node6.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_large_dtest/48/testReport/snitch_test/TestDynamicEndpointSnitch/test_multidatacenter_local_quorum
> {code}
> Error Message
> 75 != 76
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/tools/decorators.py", line 48, in 
> wrapped
> f(obj)
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 168, in 
> test_multidatacenter_local_quorum
> bad_jmx.read_attribute(read_stage, 'Value'))
>   File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual
> assertion_func(first, second, msg=msg)
>   File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual
> raise self.failureException(msg)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-01-11 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Fix Version/s: 4.x
   3.x
   3.0.x
   2.2.x

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x, 4.x
>
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-01-11 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Status: Patch Available  (was: Awaiting Feedback)

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2017-01-11 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818707#comment-15818707
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

Thanks for the quick response; I agree with all the points in your message. My 
gut instinct is to make the patch as small as possible since we agree that 
establishing a causal relationship or explicitly separating the shadow gossip 
round is the proper long-term solution, but the patch isn't particularly large 
either way, so I'll move forward with the patch as proposed.

I'll give the patches another review for any small fixes.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-01-10 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Status: Awaiting Feedback  (was: Open)

Thanks for the ping - I gave the patch another skim and have some questions.

The approach of returning endpoint states obtained through gossip seems sound. 
I definitely like this idea because it (mostly) prevents us from needing to 
reason about how shadow rounds affect the proper gossip process. That said, I'm 
not sure it fully accomplishes this goal. At the moment, it should be safe if a 
later response comes back after the gossiper has started, but we need to be 
careful to preserve this property in the future.

I'm not sure how the timestamp check is supposed to work. It initializes the 
field using System.nanoTime() the first time we send a gossip, but in the 
gossip digest ack verb handler, we check timestamps using the endpoint state 
update timestamp, which is not serialized inter-node and also initialized using 
System.nanoTime() by the local JVM. It seems to me that this reduces to a 
boolean check that the gossiper has been properly started at least once, since 
this check will only fail when firstSynSendAt == 0. Am I missing something 
here? It also seems to me that we should initialize the field on starting the 
gossiper rather than checking and possibly initializing it every time we send a 
gossip message.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests

2017-01-10 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12653:
--
Status: Open  (was: Patch Available)

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response will mark the shadow round as done and 
> call {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. gossiper may or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round 
> reply except the first. Tasks might or might not execute based on whether at 
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "wont 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12856) dtest failure in replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack

2017-01-09 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813610#comment-15813610
 ] 

Joel Knighton commented on CASSANDRA-12856:
---

Nice catch! I wasn't able to reproduce either, but I could easily induce the 
race. I also confirmed that all historical instances of this failure were on 
tests that start and then immediately stop a node. I started CI for 2.2, 3.0, 
and 3.X. All CI looks good.

My main concern was that this new method doesn't allow calling serve() again after 
stop(), but I confirmed this doesn't affect our usage, since we recreate the 
object on enabling/disabling Thrift. It also doesn't violate any Thrift 
implementation requirements.

I wouldn't argue for this to go into 2.1; while the looping thread is 
unfortunate, it shouldn't cause data loss or cascading failures, and it should 
only affect instances where a server is started and immediately stopped, which 
already suggests an unusual situation. That said, the change is small enough 
that I wouldn't be concerned about it going into 2.1 either.

+1

> dtest failure in 
> replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack
> ---
>
> Key: CASSANDRA-12856
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12856
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sean McCarthy
>Assignee: Stefania
>  Labels: dtest, test-failure
> Attachments: node1.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/280/testReport/replication_test/SnitchConfigurationUpdateTest/test_cannot_restart_with_different_rack
> {code}
> Error Message
> Problem stopping node node1
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/replication_test.py", line 630, in 
> test_cannot_restart_with_different_rack
> node1.stop(wait_other_notice=True)
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 727, in 
> stop
> raise NodeError("Problem stopping node %s" % self.name)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12856) dtest failure in replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack

2017-01-09 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12856:
--
Status: Ready to Commit  (was: Patch Available)

> dtest failure in 
> replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack
> ---
>
> Key: CASSANDRA-12856
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12856
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sean McCarthy
>Assignee: Stefania
>  Labels: dtest, test-failure
> Attachments: node1.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/280/testReport/replication_test/SnitchConfigurationUpdateTest/test_cannot_restart_with_different_rack
> {code}
> Error Message
> Problem stopping node node1
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/replication_test.py", line 630, in 
> test_cannot_restart_with_different_rack
> node1.stop(wait_other_notice=True)
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 727, in 
> stop
> raise NodeError("Problem stopping node %s" % self.name)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.

2017-01-09 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812898#comment-15812898
 ] 

Joel Knighton commented on CASSANDRA-12792:
---

[~jjordan] - 2.2.9, 3.0.10, and 3.10. I also updated the fixver field. Thanks 
for the reminder.

> delete with timestamp long.MAX_VALUE for the whole key creates tombstone that 
> cannot be removed. 
> -
>
> Key: CASSANDRA-12792
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12792
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Ian Ilsley
>Assignee: Joel Knighton
> Fix For: 2.2.9, 3.0.10, 3.10
>
>
> In db/compaction/LazilyCompactedRow.java 
> we only check for  <  MaxPurgeableTimeStamp  
> eg:
> (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
> this should probably be <= 
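
As a tiny illustration of the off-by-one (hypothetical values, not the actual compaction code): when {{markedForDeleteAt}} is {{Long.MAX_VALUE}}, no purgeable timestamp can ever be strictly greater, so a {{<}} check never allows the tombstone to be dropped, while {{<=}} would:

{code}
public final class TombstonePurgeCheckDemo
{
    public static void main(String[] args)
    {
        long markedForDeleteAt = Long.MAX_VALUE;      // delete issued with timestamp Long.MAX_VALUE
        long maxPurgeableTimestamp = Long.MAX_VALUE;  // best case the purge check can ever see

        System.out.println("purged with '<'  : " + (markedForDeleteAt <  maxPurgeableTimestamp)); // false, tombstone kept forever
        System.out.println("purged with '<=' : " + (markedForDeleteAt <= maxPurgeableTimestamp)); // true
    }
}
{code}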



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.

2017-01-09 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12792:
--
Fix Version/s: 3.0.10
   3.10
   2.2.9

> delete with timestamp long.MAX_VALUE for the whole key creates tombstone that 
> cannot be removed. 
> -
>
> Key: CASSANDRA-12792
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12792
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Ian Ilsley
>Assignee: Joel Knighton
> Fix For: 2.2.9, 3.0.10, 3.10
>
>
> In db/compaction/LazilyCompactedRow.java 
> we only check for  <  MaxPurgeableTimeStamp  
> eg:
> (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
> this should probably be <= 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-13074) DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations

2017-01-03 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795527#comment-15795527
 ] 

Joel Knighton commented on CASSANDRA-13074:
---

[~tjake] - A dtest is a great idea. I've PRed a test to 
[riptano/cassandra-dtest|https://github.com/riptano/cassandra-dtest/pull/1416].

> DynamicEndpointSnitch frequently no-ops through early exit in 
> multi-datacenter situations
> -
>
> Key: CASSANDRA-13074
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13074
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Joel Knighton
>Assignee: Joel Knighton
> Fix For: 2.2.x, 3.0.x, 3.x, 4.x
>
>
> The DynamicEndpointSnitch attempts to use timings from nodes to route reads 
> to better performing nodes.
> In a multi-datacenter situation, timings will likely be empty for nodes 
> outside of the local datacenter, as you'll frequently only be doing 
> local_quorum reads (or a lower consistency level). In this case, the DES 
> exits early and returns the subsnitch ordering. This means poorly performing 
> replicas will never be avoided, no matter how degraded they are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-13074) DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations

2016-12-22 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13074:
--
Status: Patch Available  (was: Open)

||branch||testall||dtest||
|[des-snitch-changes-2.2|https://github.com/jkni/cassandra/tree/des-snitch-changes-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-snitch-changes-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-snitch-changes-2.2-dtest]|
|[des-changes-3.0|https://github.com/jkni/cassandra/tree/des-changes-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.0-dtest]|
|[des-changes-3.11|https://github.com/jkni/cassandra/tree/des-changes-3.11]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.11-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.11-dtest]|
|[des-changes-3.X|https://github.com/jkni/cassandra/tree/des-changes-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.X-dtest]|
|[des-changes-trunk|https://github.com/jkni/cassandra/tree/des-changes-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-trunk-dtest]|

I've attached all branches, but the merge forward from 2.2 is clean except for a 
trivially resolved merge conflict in 3.0 -> 3.11. CI looks clean.

Although the patch is small, there's a fair amount of nuance here. We no longer 
want to seed with a latency of zero - in this case, if you're doing lots of 
local_quorum reads or something similar, populating with zero would mean that 
we no longer get any benefits from stickiness. With this patch, we only 
populate with real latencies.
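
As a rough illustration of those two points (invented names and simplified logic; not 
the real DynamicEndpointSnitch code or the actual patch):

{code}
// Sketch only: 1) record only real latency measurements, never seed with zero;
// 2) fall back to the base ordering only when there is no data at all,
//    rather than whenever some endpoint is missing a score.
import java.net.InetAddress;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

class LatencyRankingSketch
{
    private final ConcurrentHashMap<InetAddress, Double> latencies = new ConcurrentHashMap<>();

    void receiveTiming(InetAddress host, double latencyMillis)
    {
        // Only real measurements populate the map (a trivial running average here).
        latencies.merge(host, latencyMillis, (old, fresh) -> (old + fresh) / 2.0);
    }

    List<InetAddress> order(List<InetAddress> replicas)
    {
        // Nothing measured yet: nothing to reorder, keep the subsnitch/base ordering.
        if (replicas.stream().noneMatch(latencies::containsKey))
            return replicas;
        // Sketch's choice: endpoints without measurements sort after measured ones.
        replicas.sort(Comparator.comparingDouble(
            (InetAddress host) -> latencies.getOrDefault(host, Double.MAX_VALUE)));
        return replicas;
    }
}
{code}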

> DynamicEndpointSnitch frequently no-ops through early exit in 
> multi-datacenter situations
> -
>
> Key: CASSANDRA-13074
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13074
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Joel Knighton
>Assignee: Joel Knighton
> Fix For: 2.2.x, 3.0.x, 3.x, 4.x
>
>
> The DynamicEndpointSnitch attempts to use timings from nodes to route reads 
> to better performing nodes.
> In a multi-datacenter situation, timings will likely be empty for nodes 
> outside of the local datacenter, as you'll frequently only be doing 
> local_quorum reads (or a lower consistency level). In this case, the DES 
> exits early and returns the subsnitch ordering. This means poorly performing 
> replicas will never be avoided, no matter how degraded they are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-13074) DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations

2016-12-22 Thread Joel Knighton (JIRA)
Joel Knighton created CASSANDRA-13074:
-

 Summary: DynamicEndpointSnitch frequently no-ops through early 
exit in multi-datacenter situations
 Key: CASSANDRA-13074
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13074
 Project: Cassandra
  Issue Type: Bug
  Components: Coordination
Reporter: Joel Knighton
Assignee: Joel Knighton
 Fix For: 2.2.x, 3.0.x, 3.x, 4.x


The DynamicEndpointSnitch attempts to use timings from nodes to route reads to 
better performing nodes.

In a multi-datacenter situation, timings will likely be empty for nodes outside 
of the local datacenter, as you'll frequently only be doing local_quorum reads 
(or a lower consistency level). In this case, the DES exits early and returns 
the subsnitch ordering. This means poorly performing replicas will never be 
avoided, no matter how degraded they are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8795) Cassandra (possibly under load) occasionally throws an exception during CQL create table

2016-12-14 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749321#comment-15749321
 ] 

Joel Knighton commented on CASSANDRA-8795:
--

I'm not sure it makes sense to address this in 2.1 any more, as I wouldn't call 
this behavior critical given how schema worked in 2.1. This specific 
issue should not affect 2.2+.

> Cassandra (possibly under load) occasionally throws an exception during CQL 
> create table
> 
>
> Key: CASSANDRA-8795
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8795
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Darren Warner
>Assignee: Joel Knighton
>
> CQLSH will return the following:
> {code}
> { name: 'ResponseError',
>   message: 'java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: java.lang.NullPointerException',
>   info: 'Represents an error message from the server',
>  code: 0,
>  query: 'CREATE TABLE IF NOT EXISTS roles_by_users( userid TIMEUUID, role 
> INT, entityid TIMEUUID, entity_type TEXT, enabled BOOLEAN, PRIMARY KEY 
> (userid, role, entityid, entity_type) );' }
> {code}
> Cassandra system.log shows:
> {code}
> ERROR [MigrationStage:1] 2015-02-11 14:38:48,610 CassandraDaemon.java:153 - 
> Exception in thread Thread[MigrationStage:1,5,main]
> java.lang.NullPointerException: null
> at 
> org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:371) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.service.MigrationManager$2.runMayThrow(MigrationManager.java:393)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_31]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  ~[na:1.8.0_31]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_31]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
> ERROR [SharedPool-Worker-2] 2015-02-11 14:38:48,620 QueryMessage.java:132 - 
> Unexpected error during query
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.NullPointerException
> at 
> org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:398) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:374)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:249)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.statements.CreateTableStatement.announceMigration(CreateTableStatement.java:113)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:80)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:226)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:248) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:119)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439)
>  [apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335)
>  [apache-cassandra-2.1.2.jar:2.1.2]
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> 

[jira] [Resolved] (CASSANDRA-8795) Cassandra (possibly under load) occasionally throws an exception during CQL create table

2016-12-14 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton resolved CASSANDRA-8795.
--
   Resolution: Won't Fix
Fix Version/s: (was: 2.1.x)

> Cassandra (possibly under load) occasionally throws an exception during CQL 
> create table
> 
>
> Key: CASSANDRA-8795
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8795
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Darren Warner
>Assignee: Joel Knighton
>
> CQLSH will return the following:
> {code}
> { name: 'ResponseError',
>   message: 'java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: java.lang.NullPointerException',
>   info: 'Represents an error message from the server',
>  code: 0,
>  query: 'CREATE TABLE IF NOT EXISTS roles_by_users( userid TIMEUUID, role 
> INT, entityid TIMEUUID, entity_type TEXT, enabled BOOLEAN, PRIMARY KEY 
> (userid, role, entityid, entity_type) );' }
> {code}
> Cassandra system.log shows:
> {code}
> ERROR [MigrationStage:1] 2015-02-11 14:38:48,610 CassandraDaemon.java:153 - 
> Exception in thread Thread[MigrationStage:1,5,main]
> java.lang.NullPointerException: null
> at 
> org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:371) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.service.MigrationManager$2.runMayThrow(MigrationManager.java:393)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_31]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  ~[na:1.8.0_31]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_31]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
> ERROR [SharedPool-Worker-2] 2015-02-11 14:38:48,620 QueryMessage.java:132 - 
> Unexpected error during query
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.NullPointerException
> at 
> org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:398) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:374)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:249)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.statements.CreateTableStatement.announceMigration(CreateTableStatement.java:113)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:80)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:226)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:248) 
> ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:119)
>  ~[apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439)
>  [apache-cassandra-2.1.2.jar:2.1.2]
> at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335)
>  [apache-cassandra-2.1.2.jar:2.1.2]
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324)
>  [netty-all-4.0.23.Final.jar:4.0.23.Final]
> at 
> 

[jira] [Commented] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression

2016-12-09 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15735671#comment-15735671
 ] 

Joel Knighton commented on CASSANDRA-12652:
---

Thanks for the update!

> Failure in SASIIndexTest.testStaticIndex-compression
> 
>
> Key: CASSANDRA-12652
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12652
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> Stacktrace:
> {code}
> junit.framework.AssertionFailedError: expected:<1> but was:<0>
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839)
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786)
> {code}
> Example failure:
> http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-13012) Paxos regression from CASSANDRA-12716

2016-12-07 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729275#comment-15729275
 ] 

Joel Knighton commented on CASSANDRA-13012:
---

+1

> Paxos regression from CASSANDRA-12716
> -
>
> Key: CASSANDRA-13012
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13012
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Sylvain Lebresne
>Priority: Minor
>
> I introduced a dumb bug when reading the Paxos state in 
> {{SystemKeyspace.loadPaxosState}} where the new condition on 
> {{proposal_version}} and {{most_recent_commit_version}} is obviously way too 
> strong, and actually entirely unnecessary.
> This is consistently breaking the 
> {{paxos_tests.TestPaxos.contention_test_many_threads}} so I'm not sure why I 
> didn't catch that, sorry. Thanks to [~jkni] who noticed that first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-13012) Paxos regression from CASSANDRA-12716

2016-12-07 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-13012:
--
Status: Ready to Commit  (was: Patch Available)

> Paxos regression from CASSANDRA-12716
> -
>
> Key: CASSANDRA-13012
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13012
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Sylvain Lebresne
>Priority: Minor
>
> I introduced a dumb bug when reading the Paxos state in 
> {{SystemKeyspace.loadPaxosState}} where the new condition on 
> {{proposal_version}} and {{most_recent_commit_version}} is obviously way too 
> strong, and actually entirely unnecessary.
> This is consistently breaking the 
> {{paxos_tests.TestPaxos.contention_test_many_threads}} so I'm not sure why I 
> didn't catch that, sorry. Thanks to [~jkni] who noticed that first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-12987) dtest failure in paxos_tests.TestPaxos.contention_test_many_threads

2016-12-07 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton resolved CASSANDRA-12987.
---
Resolution: Duplicate

> dtest failure in paxos_tests.TestPaxos.contention_test_many_threads
> ---
>
> Key: CASSANDRA-12987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12987
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sean McCarthy
>Assignee: Joel Knighton
>  Labels: dtest, test-failure
> Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, 
> node2_debug.log, node2_gc.log, node3.log, node3_debug.log, node3_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1437/testReport/paxos_tests/TestPaxos/contention_test_many_threads
> {code}
> Error Message
> value=299, errors=0, retries=25559
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/paxos_tests.py", line 88, in 
> contention_test_many_threads
> self._contention_test(300, 1)
>   File "/home/automaton/cassandra-dtest/paxos_tests.py", line 192, in 
> _contention_test
> self.assertTrue((value == threads * iterations) and (errors == 0), 
> "value={}, errors={}, retries={}".format(value, errors, retries))
>   File "/usr/lib/python2.7/unittest/case.py", line 422, in assertTrue
> raise self.failureException(msg)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options

2016-11-29 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706902#comment-15706902
 ] 

Joel Knighton edited comment on CASSANDRA-11107 at 11/29/16 11:28 PM:
--

That's correct - it's true that I could at least piggyback on those since those 
migrations/changes would already be necessary.

EDIT: That is, I would need to make additional changes, but I could time this 
for the same release to prevent the need for additional legacy tables.


was (Author: jkni):
That's correct - it's true that I could at least piggyback on those since those 
migrations/changes would already be necessary.

> Add native_transport_address and native_transport_broadcast_address yaml 
> options
> 
>
> Key: CASSANDRA-11107
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11107
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: n0rad
>Assignee: Joel Knighton
>Priority: Minor
>
> I'm starting cassandra on a container with this /etc/hosts
> {quote}
> 127.0.0.1rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> ::1  rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> {quote}
> I have the default configuration except :
> {quote}
>  - seeds: "10.1.1.1"
> listen_address : 10.1.1.1
> {quote}
> cassandra will start listening on *127.0.0.1:9042*
> if I set *rpc_address:10.1.1.1* , even if *start_rpc: false*, cassandra will 
> listen on 10.1.1.1
> Since rpc is not started, I assumed that *rpc_address* and 
> *broadcast_rpc_address* will be ignored
> It took me a while to figure that. There may be something to do around this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options

2016-11-29 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706902#comment-15706902
 ] 

Joel Knighton commented on CASSANDRA-11107:
---

That's correct - it's true that I could at least piggyback on those since those 
migrations/changes would already be necessary.

> Add native_transport_address and native_transport_broadcast_address yaml 
> options
> 
>
> Key: CASSANDRA-11107
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11107
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: n0rad
>Assignee: Joel Knighton
>Priority: Minor
>
> I'm starting cassandra on a container with this /etc/hosts
> {quote}
> 127.0.0.1rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> ::1  rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> {quote}
> I have the default configuration except :
> {quote}
>  - seeds: "10.1.1.1"
> listen_address : 10.1.1.1
> {quote}
> cassandra will start listening on *127.0.0.1:9042*
> if I set *rpc_address:10.1.1.1* , even if *start_rpc: false*, cassandra will 
> listen on 10.1.1.1
> Since rpc is not started, I assumed that *rpc_address* and 
> *broadcast_rpc_address* will be ignored
> It took me a while to figure that. There may be something to do around this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options

2016-11-29 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706743#comment-15706743
 ] 

Joel Knighton commented on CASSANDRA-11107:
---

I've got a patch in progress that solves the easy parts of this. However, I am 
having second thoughts regarding the costs/benefits of this change.

At this point, to support separate rpc/native_transport configurations, the changes 
would seem to include:
* updating the native protocol so that NEW_NODE events include rpc_address and 
native_transport_address (and other TopologyChangeEvents, since identifiers 
used by drivers might include both address configurations)
* updating the PEERS table to include rpc_address and native_transport_address
* adding an ApplicationState in Gossip for native_transport_address.

Drivers would also need to be updated to query native_transport_address 
appropriately. This seems like a fair amount of work when 4.0 will end up 
negating these changes once Thrift is removed.

The other option that immediately presents itself is to allow these properties 
to be set in a 3.X yaml but require them to match the rpc configurations. I'm 
not sure this is worth it either.

Let me know what you think, [~slebresne].

> Add native_transport_address and native_transport_broadcast_address yaml 
> options
> 
>
> Key: CASSANDRA-11107
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11107
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: n0rad
>Assignee: Joel Knighton
>Priority: Minor
>
> I'm starting cassandra on a container with this /etc/hosts
> {quote}
> 127.0.0.1rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> ::1  rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> {quote}
> I have the default configuration except :
> {quote}
>  - seeds: "10.1.1.1"
> listen_address : 10.1.1.1
> {quote}
> cassandra will start listening on *127.0.0.1:9042*
> if I set *rpc_address:10.1.1.1* , even if *start_rpc: false*, cassandra will 
> listen on 10.1.1.1
> Since rpc is not started, I assumed that *rpc_address* and 
> *broadcast_rpc_address* will be ignored
> It took me a while to figure that. There may be something to do around this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-28 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-11381:
--
Status: Open  (was: Patch Available)

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-3.X.txt, 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> Starting up a node with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, eg PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during the 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}} clients 
> connecting to the node fail with the node throwing
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> QueryOptions(consistencyForUser(username),
>   
>Lists.newArrayList(ByteBufferUtil.bytes(username;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-28 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-11381:
--
Status: Awaiting Feedback  (was: Open)

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-3.X.txt, 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> Starting up a node with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, eg PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during the 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}} clients 
> connecting to the node fail with the node throwing
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> QueryOptions(consistencyForUser(username),
>   
>Lists.newArrayList(ByteBufferUtil.bytes(username;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-28 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704157#comment-15704157
 ] 

Joel Knighton commented on CASSANDRA-11381:
---

Thanks - the patches look good and I put them through CI.

||branch||testall||dtest||
|[CASSANDRA-11381-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-2.2-dtest]|
|[CASSANDRA-11381-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.0-dtest]|
|[CASSANDRA-11381-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.X-dtest]|
|[CASSANDRA-11381-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-trunk-dtest]|

CI looks good for the most part, and I checked that your added dtest passes on 
all branches. CI revealed one small problem - when a fresh node is started with 
join_ring=False and has no tokens for other nodes discovered through gossip 
and no saved tokens, it hits an AssertionError in {{CassandraRoleManager}} 
setup that is not handled and gets logged as an error by a top level error 
handler, as seen 
[here|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-2.2-dtest/1/testReport/junit/topology_test/TestTopology/do_not_join_ring_test/].
 In this specific test case, this behavior is hit because a single node cluster 
is started with join_ring=False. Since this prevents setup from being retried 
within the {{CassandraRoleManager}}, it seems to me that it is probably worth 
checking for an absence of tokens in {{CassandraRoleManager.setupDefaultRole}} 
and throwing a catchable exception/printing a warning so that setup can be 
retried. What do you think? There may be another alternative I haven't 
considered.
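
To make that suggestion concrete, a hypothetical sketch of the guard (helper and 
logger names are invented; this is not the actual CassandraRoleManager code):

{code}
// Hypothetical sketch of the suggested check. The idea: when the node knows of no
// tokens at all (e.g. a fresh single-node cluster started with join_ring=false),
// fail setup with a warning and a catchable exception instead of an unhandled
// AssertionError, so the scheduled setup retry can run again once tokens are known.
void setupDefaultRole()
{
    if (knownTokens().isEmpty())  // hypothetical accessor for the locally known ring tokens
    {
        logger.warn("Skipping default role setup: no known tokens yet; setup will be retried");
        throw new IllegalStateException("No known tokens yet; default role setup deferred");
    }
    // ... create the default superuser role as today ...
}
{code}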

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-3.X.txt, 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> Starting up a node with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, eg PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during the 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}} clients 
> connecting to the node fail with the node throwing
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> 

[jira] [Comment Edited] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression

2016-11-23 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690558#comment-15690558
 ] 

Joel Knighton edited comment on CASSANDRA-12652 at 11/23/16 4:23 PM:
-

Fixed by revert 490c1c27c9b700f14212d9591a516ddb8d0865c7 before release.


was (Author: jkni):
Fixed by revert before release.

> Failure in SASIIndexTest.testStaticIndex-compression
> 
>
> Key: CASSANDRA-12652
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12652
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> Stacktrace:
> {code}
> junit.framework.AssertionFailedError: expected:<1> but was:<0>
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839)
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786)
> {code}
> Example failure:
> http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression

2016-11-23 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12652:
--
Fix Version/s: (was: 4.x)
   (was: 3.x)

> Failure in SASIIndexTest.testStaticIndex-compression
> 
>
> Key: CASSANDRA-12652
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12652
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> Stacktrace:
> {code}
> junit.framework.AssertionFailedError: expected:<1> but was:<0>
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839)
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786)
> {code}
> Example failure:
> http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression

2016-11-23 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12652:
--
Resolution: Fixed
Status: Resolved  (was: Awaiting Feedback)

Fixed by revert before release.

> Failure in SASIIndexTest.testStaticIndex-compression
> 
>
> Key: CASSANDRA-12652
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12652
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> Stacktrace:
> {code}
> junit.framework.AssertionFailedError: expected:<1> but was:<0>
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839)
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786)
> {code}
> Example failure:
> http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression

2016-11-23 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690556#comment-15690556
 ] 

Joel Knighton commented on CASSANDRA-12652:
---

I agree; I'm unable to reproduce this failure on any branch on any machine 
after the revert. I posted this as a comment on [CASSANDRA-11990] to make sure 
this test is considered during the updated implementation. I'm closing this for 
now.

> Failure in SASIIndexTest.testStaticIndex-compression
> 
>
> Key: CASSANDRA-12652
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12652
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
> Fix For: 3.x, 4.x
>
>
> Stacktrace:
> {code}
> junit.framework.AssertionFailedError: expected:<1> but was:<0>
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839)
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786)
> {code}
> Example failure:
> http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-11-23 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690552#comment-15690552
 ] 

Joel Knighton commented on CASSANDRA-11990:
---

The revert here seems to have fixed the test failure in [CASSANDRA-12652] - 
extra attention should be paid to this test when an updated implementation is 
available.

> Address rows rather than partitions in SASI
> ---
>
> Key: CASSANDRA-11990
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11990
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL, sasi
>Reporter: Alex Petrov
>Assignee: Alex Petrov
> Fix For: 3.x
>
> Attachments: perf.pdf, size_comparison.png
>
>
> Currently, the lookup in SASI index would return the key position of the 
> partition. After the partition lookup, the rows are iterated and the 
> operators are applied in order to filter out ones that do not match.
> bq. TokenTree which accepts variable size keys (such would enable different 
> partitioners, collections support, primary key indexing etc.), 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options

2016-11-23 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690433#comment-15690433
 ] 

Joel Knighton commented on CASSANDRA-11107:
---

That shouldn't be a problem - I expect I'll have time to submit a patch here in 
the next week or so.

> Add native_transport_address and native_transport_broadcast_address yaml 
> options
> 
>
> Key: CASSANDRA-11107
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11107
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: n0rad
>Assignee: Joel Knighton
>Priority: Minor
>
> I'm starting cassandra on a container with this /etc/hosts
> {quote}
> 127.0.0.1rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> ::1  rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost
> {quote}
> I have the default configuration except :
> {quote}
>  - seeds: "10.1.1.1"
> listen_address : 10.1.1.1
> {quote}
> cassandra will start listening on *127.0.0.1:9042*
> if I set *rpc_address:10.1.1.1* , even if *start_rpc: false*, cassandra will 
> listen on 10.1.1.1
> Since rpc is not started, I assumed that *rpc_address* and 
> *broadcast_rpc_address* will be ignored
> It took me a while to figure that. There may be something to do around this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12281) Gossip blocks on startup when there are pending range movements

2016-11-18 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12281:
--
Summary: Gossip blocks on startup when there are pending range movements  
(was: Gossip blocks on startup when another node is bootstrapping)

> Gossip blocks on startup when there are pending range movements
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Fix For: 2.2.9, 3.0.11, 3.10
>
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-3.X.patch, 
> 12281-trunk.patch, restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool NameActive   Pending  Completed   Blocked  All 
> time blocked
> MutationStage 0 0   1840 0
>  0
> ReadStage 0 0   2350 0
>  0
> RequestResponseStage  0 0 53 0
>  0
> ReadRepairStage   0 0  1 0
>  0
> CounterMutationStage  0 0  0 0
>  0
> HintedHandoff 0 0 44 0
>  0
> MiscStage 0 0  0 0
>  0
> CompactionExecutor3 3395 0
>  0
> MemtableReclaimMemory 0 0 30 0
>  0
> PendingRangeCalculator1 2 29 0
>  0
> GossipStage   1  5602164 0
>  0
> MigrationStage0 0  0 0
>  0
> MemtablePostFlush 0 0111 0
>  0
> ValidationExecutor0 0  0 0
>  0
> Sampler   0 0  0 0
>  0
> MemtableFlushWriter   0 0 30 0
>  0
> InternalResponseStage 0 0  0 0
>  0
> AntiEntropyStage  0 0  0 0
>  0
> CacheCleanupExecutor  0 0  0 0
>  0
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {noformat}
> A full thread dump is attached, but the relevant bit seems to be here:
> {noformat}
> [ ... ]
> "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 
> nid=0xea9 waiting on condition [0x7fddcf883000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0004c1e922c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160)
>   at 
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023)
>   at 
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682)
>   at 
> org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182)
>   at 

[jira] [Comment Edited] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping

2016-11-18 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669481#comment-15669481
 ] 

Joel Knighton edited comment on CASSANDRA-12281 at 11/18/16 5:58 PM:
-

Thanks - your changes and CI look good. I also ran CI on your 
CASSANDRA-12281-trunk branch.

Note to committer: there are very slight differences in the 2.2/3.0/3.x 
branches (not in substantial content, but in comments and other minor fixes). 
The 3.x branch should merge cleanly into trunk, I believe.

+1


was (Author: jkni):
Thanks - your changes and CI look good. I also ran CI on your 
CASSANDRA-12281-trunk branch.

Note to committer: there are very slight differences in the 2.2/3.0/3.x 
branches (not in substantial comment, but in comments and other minor fixes). 
The 3.x branch should merge cleanly into trunk, I believe.

+1

> Gossip blocks on startup when another node is bootstrapping
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Fix For: 2.2.9, 3.0.11, 3.10
>
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-3.X.patch, 
> 12281-trunk.patch, restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool NameActive   Pending  Completed   Blocked  All 
> time blocked
> MutationStage 0 0   1840 0
>  0
> ReadStage 0 0   2350 0
>  0
> RequestResponseStage  0 0 53 0
>  0
> ReadRepairStage   0 0  1 0
>  0
> CounterMutationStage  0 0  0 0
>  0
> HintedHandoff 0 0 44 0
>  0
> MiscStage 0 0  0 0
>  0
> CompactionExecutor3 3395 0
>  0
> MemtableReclaimMemory 0 0 30 0
>  0
> PendingRangeCalculator1 2 29 0
>  0
> GossipStage   1  5602164 0
>  0
> MigrationStage0 0  0 0
>  0
> MemtablePostFlush 0 0111 0
>  0
> ValidationExecutor0 0  0 0
>  0
> Sampler   0 0  0 0
>  0
> MemtableFlushWriter   0 0 30 0
>  0
> InternalResponseStage 0 0  0 0
>  0
> AntiEntropyStage  0 0  0 0
>  0
> CacheCleanupExecutor  0 0  0 0
>  0
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {noformat}
> A full thread dump is attached, but the relevant bit seems to be here:
> {noformat}
> [ ... ]
> "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 
> nid=0xea9 waiting on condition [0x7fddcf883000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0004c1e922c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at 
> 

[jira] [Commented] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.

2016-11-17 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675890#comment-15675890
 ] 

Joel Knighton commented on CASSANDRA-12792:
---

Good catch - I updated the 2.2 branch above with that change. CI looks good.

> delete with timestamp long.MAX_VALUE for the whole key creates tombstone that 
> cannot be removed. 
> -
>
> Key: CASSANDRA-12792
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12792
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Ian Ilsley
>Assignee: Joel Knighton
>
> In db/compaction/LazilyCompactedRow.java 
> we only check for  <  MaxPurgeableTimeStamp  
> eg:
> (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
> this should probably be <= 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.

2016-11-16 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672811#comment-15672811
 ] 

Joel Knighton edited comment on CASSANDRA-12792 at 11/17/16 5:41 AM:
-

I've pushed rebased, updated branches and ran CI. CI looks clean relative to 
upstream. I made your proposed fixes regarding hasTimestamp, null checking, and 
lambda usage in the 3.0+ branches. While looking at the code, I realized that 
the {{PurgeEvaluator}} interface was no longer necessary after an earlier 
refactor and that comparable internals seem to use {{Predicate}} directly. I 
adopted this approach in 2.2+ and changed the 2.2 branch to use anonymous 
classes, since I thought this made it a little easier to follow. Let me know 
your thoughts on these additional changes.

||branch||testall||dtest||
|[CASSANDRA-12792-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-dtest]|
|[CASSANDRA-12792-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-dtest]|
|[CASSANDRA-12792-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-dtest]|
|[CASSANDRA-12792-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-dtest]|



was (Author: jkni):
I've pushed rebased, updated branches and ran CI. CI looks clean relative to 
upstream. I made your proposed fixes regarding hasTimestamp, null checking, and 
lambda usage in the 3.0+ branches. While looking at the code, I realized that 
the {{PurgeEvaluator}} interface was no longer necessary after an earier 
refactor and that comparable internals seem to use {{Predicate}} directly. I 
adopted this approach in 2.2+ and changed the 2.2 branch to use anonymous 
classes, since I thought this made it a little easier to follow. Let me know 
your thoughts on these additional changes.

||branch||testall||dtest||
|[CASSANDRA-12792-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-dtest]|
|[CASSANDRA-12792-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-dtest]|
|[CASSANDRA-12792-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-dtest]|
|[CASSANDRA-12792-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-dtest]|


> delete with timestamp long.MAX_VALUE for the whole key creates tombstone that 
> cannot be removed. 
> -
>
> Key: CASSANDRA-12792
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12792
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Ian Ilsley
>Assignee: Joel Knighton
>
> In db/compaction/LazilyCompactedRow.java 
> we only check for  <  MaxPurgeableTimeStamp  
> eg:
> (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
> this should probably be <= 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.

2016-11-16 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12792:
--
Status: Patch Available  (was: In Progress)

I've pushed rebased, updated branches and ran CI. CI looks clean relative to 
upstream. I made your proposed fixes regarding hasTimestamp, null checking, and 
lambda usage in the 3.0+ branches. While looking at the code, I realized that 
the {{PurgeEvaluator}} interface was no longer necessary after an earlier 
refactor and that comparable internals seem to use {{Predicate}} directly. I 
adopted this approach in 2.2+ and changed the 2.2 branch to use anonymous 
classes, since I thought this made it a little easier to follow. Let me know 
your thoughts on these additional changes.

||branch||testall||dtest||
|[CASSANDRA-12792-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-dtest]|
|[CASSANDRA-12792-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-dtest]|
|[CASSANDRA-12792-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-dtest]|
|[CASSANDRA-12792-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-dtest]|


> delete with timestamp long.MAX_VALUE for the whole key creates tombstone that 
> cannot be removed. 
> -
>
> Key: CASSANDRA-12792
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12792
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Ian Ilsley
>Assignee: Joel Knighton
>
> In db/compaction/LazilyCompactedRow.java 
> we only check for  <  MaxPurgeableTimeStamp  
> eg:
> (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
> this should probably be <= 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10244) Replace heartbeats with locally recorded metrics for failure detection

2016-11-16 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-10244:
--
Assignee: (was: Joel Knighton)

> Replace heartbeats with locally recorded metrics for failure detection
> --
>
> Key: CASSANDRA-10244
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10244
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>
> In the current implementation, the primary purpose of sending gossip messages 
> is for delivering the updated heartbeat values of each node in a cluster. The 
> other data that is passed in gossip (node metadata such as status, dc, rack, 
> tokens, and so on) changes very infrequently (or rarely), such that the 
> eventual delivery of that data is entirely reasonable. Heartbeats, however, 
> are quite different. A continuous and nearly consistent delivery time of 
> updated heartbeats is critical for the stability of a cluster. It is through 
> the receipt of the updated heartbeat that a node determines the reachability 
> (UP/DOWN status) of all peers in the cluster. The current implementation of 
> FailureDetector measures the time differences between the heartbeat updates 
> received about a peer (Note: I said about a peer, not from the peer directly, 
> as those values are disseminated via gossip). Without a consistent time 
> delivery of those updates, the FD, via its use of the PHI-accrual 
> algorithm, will mark the peer as DOWN (unreachable). The two nodes could be 
> sending all other traffic without problem, but if the heartbeats are not 
> propagated correctly, each of the nodes will mark the other as DOWN, which is 
> clearly suboptimal to cluster health. Further, heartbeat updates are the only 
> mechanism we use to determine reachability (UP/DOWN) of a peer; dynamic 
> snitch measurements, for example, are not included in the determination. 
> To illustrate this, in the current implementation, assume a cluster of nodes: 
> A, B, and C. A partition starts between nodes A and C (no communication 
> succeeds), but both nodes can communicate with B. As B will get the updated 
> heartbeats from both A and C, it will, via gossip, send those over to the 
> other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, due 
> to the partition between them, all communication between A and C will fail, 
> yet neither node will mark the other as down because each is receiving, 
> transitively via B, the updated heartbeat about the other. While it's true 
> that the other node is alive, only having transitive knowledge about a peer, 
> and allowing that to be the sole determinant of UP/DOWN reachability status, 
> is not sufficient for a correct and efficiently operating cluster. 
> This transitive availability is suboptimal, and I propose we drop the 
> heartbeat concept altogether. Instead, the dynamic snitch should become more 
> intelligent, and its measurements ultimately become the input for 
> determining the reachability status of each peer (as fed into a revamped FD). 
> As we already capture latencies in the dsnitch, we can reasonably extend it 
> to include timeouts/missed responses, and make that the basis for the UP/DOWN 
> decisioning. Thus we will have more accurate and relevant peer statuses that 
> are tailored to the local node.
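
For context, a simplified illustration of the phi-accrual check the description 
refers to (assuming an exponential inter-arrival model; this is not Cassandra's 
actual {{FailureDetector}} implementation): a peer is convicted as DOWN once 
heartbeats have been missing long enough for phi to cross a threshold.

{code}
// Simplified phi-accrual sketch - illustrative only, not Cassandra's FailureDetector.
public class PhiAccrualSketch
{
    // phi = -log10( P(no heartbeat for 'elapsed' ms) ) under an exponential model,
    // which reduces to elapsed / (meanInterval * ln 10)
    static double phi(long nowMillis, long lastHeartbeatMillis, double meanIntervalMillis)
    {
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (meanIntervalMillis * Math.log(10));
    }

    // convict the peer as DOWN once phi exceeds the configured conviction threshold
    static boolean convict(double phi, double phiConvictThreshold)
    {
        return phi > phiConvictThreshold;
    }
}
{code}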



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11709) Lock contention when large number of dead nodes come back within short time

2016-11-16 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670769#comment-15670769
 ] 

Joel Knighton commented on CASSANDRA-11709:
---

I've been unable to get much further on this without a comparably large cluster 
to test on. The branch I've linked above does help parts of the issue by 
reducing the invalidation of the cached ring in unnecessary circumstances; I 
think a patch addressing this issue will need that change as well as others.

Unassigning so as to not block progress.

> Lock contention when large number of dead nodes come back within short time
> ---
>
> Key: CASSANDRA-11709
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11709
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Dikang Gu
>Assignee: Joel Knighton
> Fix For: 2.2.x, 3.x
>
> Attachments: lock.jstack
>
>
> We have a few hundred nodes across 3 data centers, and we are doing a few 
> million writes per second into the cluster. 
> We were trying to simulate a data center failure by disabling gossip on 
> all the nodes in one data center. After ~20 minutes, I re-enabled gossip on 
> those nodes, 5 nodes in each batch, sleeping 5 seconds between 
> batches.
> After that, I saw the latency of read/write requests increase a lot, and 
> client requests started to time out.
> On the node, I can see there is a huge number of pending tasks in GossipStage. 
> =
> 2016-05-02_23:55:08.99515 WARN  23:55:08 Gossip stage has 36337 pending 
> tasks; skipping status check (no nodes will be marked down)
> 2016-05-02_23:55:09.36009 INFO  23:55:09 Node 
> /2401:db00:2020:717a:face:0:41:0 state jump to normal
> 2016-05-02_23:55:09.99057 INFO  23:55:09 Node 
> /2401:db00:2020:717a:face:0:43:0 state jump to normal
> 2016-05-02_23:55:10.09742 WARN  23:55:10 Gossip stage has 36421 pending 
> tasks; skipping status check (no nodes will be marked down)
> 2016-05-02_23:55:10.91860 INFO  23:55:10 Node 
> /2401:db00:2020:717a:face:0:45:0 state jump to normal
> 2016-05-02_23:55:11.20100 WARN  23:55:11 Gossip stage has 36558 pending 
> tasks; skipping status check (no nodes will be marked down)
> 2016-05-02_23:55:11.57893 INFO  23:55:11 Node 
> /2401:db00:2030:612a:face:0:49:0 state jump to normal
> 2016-05-02_23:55:12.23405 INFO  23:55:12 Node /2401:db00:2020:7189:face:0:7:0 
> state jump to normal
> 
> I took a jstack of the node and found the read/write threads blocked by 
> a lock:
>  read thread ==
> "Thrift:7994" daemon prio=10 tid=0x7fde91080800 nid=0x5255 waiting for 
> monitor entry [0x7fde6f8a1000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.cassandra.locator.TokenMetadata.cachedOnlyTokenMap(TokenMetadata.java:546)
> - waiting to lock <0x7fe4faef4398> (a 
> org.apache.cassandra.locator.TokenMetadata)
> at 
> org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:111)
> at 
> org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:3155)
> at 
> org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1526)
> at 
> org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1521)
> at 
> org.apache.cassandra.service.AbstractReadExecutor.getReadExecutor(AbstractReadExecutor.java:155)
> at 
> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1328)
> at 
> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1270)
> at 
> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1195)
> at 
> org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:118)
> at 
> org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:275)
> at 
> org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:457)
> at 
> org.apache.cassandra.thrift.CassandraServer.getSliceInternal(CassandraServer.java:346)
> at 
> org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:325)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3659)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3643)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205)
> at 
> 

[jira] [Updated] (CASSANDRA-11709) Lock contention when large number of dead nodes come back within short time

2016-11-16 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-11709:
--
Assignee: Dikang Gu  (was: Joel Knighton)

> Lock contention when large number of dead nodes come back within short time
> ---
>
> Key: CASSANDRA-11709
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11709
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Dikang Gu
>Assignee: Dikang Gu
> Fix For: 2.2.x, 3.x
>
> Attachments: lock.jstack
>
>
> We have a few hundred nodes across 3 data centers, and we are doing a few 
> million writes per second into the cluster. 
> We were trying to simulate a data center failure by disabling gossip on 
> all the nodes in one data center. After ~20 minutes, I re-enabled gossip on 
> those nodes, 5 nodes in each batch, sleeping 5 seconds between 
> batches.
> After that, I saw the latency of read/write requests increase a lot, and 
> client requests started to time out.
> On the node, I can see there is a huge number of pending tasks in GossipStage. 
> =
> 2016-05-02_23:55:08.99515 WARN  23:55:08 Gossip stage has 36337 pending 
> tasks; skipping status check (no nodes will be marked down)
> 2016-05-02_23:55:09.36009 INFO  23:55:09 Node 
> /2401:db00:2020:717a:face:0:41:0 state jump to normal
> 2016-05-02_23:55:09.99057 INFO  23:55:09 Node 
> /2401:db00:2020:717a:face:0:43:0 state jump to normal
> 2016-05-02_23:55:10.09742 WARN  23:55:10 Gossip stage has 36421 pending 
> tasks; skipping status check (no nodes will be marked down)
> 2016-05-02_23:55:10.91860 INFO  23:55:10 Node 
> /2401:db00:2020:717a:face:0:45:0 state jump to normal
> 2016-05-02_23:55:11.20100 WARN  23:55:11 Gossip stage has 36558 pending 
> tasks; skipping status check (no nodes will be marked down)
> 2016-05-02_23:55:11.57893 INFO  23:55:11 Node 
> /2401:db00:2030:612a:face:0:49:0 state jump to normal
> 2016-05-02_23:55:12.23405 INFO  23:55:12 Node /2401:db00:2020:7189:face:0:7:0 
> state jump to normal
> 
> I took a jstack of the node and found the read/write threads blocked by 
> a lock:
>  read thread ==
> "Thrift:7994" daemon prio=10 tid=0x7fde91080800 nid=0x5255 waiting for 
> monitor entry [0x7fde6f8a1000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.cassandra.locator.TokenMetadata.cachedOnlyTokenMap(TokenMetadata.java:546)
> - waiting to lock <0x7fe4faef4398> (a 
> org.apache.cassandra.locator.TokenMetadata)
> at 
> org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:111)
> at 
> org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:3155)
> at 
> org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1526)
> at 
> org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1521)
> at 
> org.apache.cassandra.service.AbstractReadExecutor.getReadExecutor(AbstractReadExecutor.java:155)
> at 
> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1328)
> at 
> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1270)
> at 
> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1195)
> at 
> org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:118)
> at 
> org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:275)
> at 
> org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:457)
> at 
> org.apache.cassandra.thrift.CassandraServer.getSliceInternal(CassandraServer.java:346)
> at 
> org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:325)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3659)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3643)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> =  writer ===
> "Thrift:7668" daemon prio=10 tid=0x7fde90d91000 nid=0x50e9 waiting for 
> monitor entry [0x7fde78bbc000]

[jira] [Updated] (CASSANDRA-9667) strongly consistent membership and ownership

2016-11-16 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-9667:
-
Assignee: (was: Joel Knighton)

> strongly consistent membership and ownership
> 
>
> Key: CASSANDRA-9667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Jason Brown
>  Labels: LWT, membership, ownership
> Fix For: 3.x
>
>
> Currently, there is advice to users to "wait two minutes between adding new 
> nodes" in order for new node tokens, et al, to propagate. Further, as there's 
> no coordination amongst joining nodes wrt token selection, new nodes can end 
> up selecting ranges that overlap with other joining nodes. This causes a lot 
> of duplicate streaming from the existing source nodes as they shovel out the 
> bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent 
> membership and ownership changes in cassandra such that changes are performed 
> in a linearizable and safe manner. The basic idea is to use LWT operations 
> over a global system table, and leverage the linearizability of LWT for 
> ensuring the safety of cluster membership/ownership state changes. This work 
> is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and 
> range move (there may be others I'm not thinking of) will need to be modified 
> to participate in this scheme, as well as changes to nodetool to enable them.
> Note: we distinguish between membership and ownership in the following ways: 
> for membership we mean "a host in this cluster and its state". For 
> ownership, we mean "what tokens (or ranges) does each node own"; these nodes 
> must already be a member to be assigned tokens.
> A rough draft sketch of how the 'add new node' workflow might look is: 
> new nodes would no longer create tokens themselves, but instead contact a 
> member of a Paxos cohort (via a seed). The cohort member will generate the 
> tokens and execute a LWT transaction, ensuring a linearizable change to the 
> membership/ownership state. The updated state will then be disseminated via 
> the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode 
> and manual-mode. Auto-mode is for adding a single new node per LWT operation, 
> and would require no operator intervention (much like today). In manual-mode, 
> however, multiple new nodes could (somehow) signal their intent to join 
> to the cluster, but will wait until an operator executes a nodetool command 
> that will trigger the token generation and LWT operation for all pending new 
> nodes. This will allow us better range partitioning and will make the 
> bootstrap streaming more efficient as we won't have overlapping range 
> requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping

2016-11-15 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669481#comment-15669481
 ] 

Joel Knighton commented on CASSANDRA-12281:
---

Thanks - your changes and CI look good. I also ran CI on your 
CASSANDRA-12281-trunk branch.

Note to committer: there are very slight differences in the 2.2/3.0/3.x 
branches (not in substantive content, but in comments and other minor fixes). 
The 3.x branch should merge cleanly into trunk, I believe.

> Gossip blocks on startup when another node is bootstrapping
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, 
> restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool NameActive   Pending  Completed   Blocked  All 
> time blocked
> MutationStage 0 0   1840 0
>  0
> ReadStage 0 0   2350 0
>  0
> RequestResponseStage  0 0 53 0
>  0
> ReadRepairStage   0 0  1 0
>  0
> CounterMutationStage  0 0  0 0
>  0
> HintedHandoff 0 0 44 0
>  0
> MiscStage 0 0  0 0
>  0
> CompactionExecutor3 3395 0
>  0
> MemtableReclaimMemory 0 0 30 0
>  0
> PendingRangeCalculator1 2 29 0
>  0
> GossipStage   1  5602164 0
>  0
> MigrationStage0 0  0 0
>  0
> MemtablePostFlush 0 0111 0
>  0
> ValidationExecutor0 0  0 0
>  0
> Sampler   0 0  0 0
>  0
> MemtableFlushWriter   0 0 30 0
>  0
> InternalResponseStage 0 0  0 0
>  0
> AntiEntropyStage  0 0  0 0
>  0
> CacheCleanupExecutor  0 0  0 0
>  0
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {noformat}
> A full thread dump is attached, but the relevant bit seems to be here:
> {noformat}
> [ ... ]
> "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 
> nid=0xea9 waiting on condition [0x7fddcf883000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0004c1e922c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160)
>   at 
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023)
>   at 
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682)
>   at 
> 

[jira] [Comment Edited] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping

2016-11-15 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669481#comment-15669481
 ] 

Joel Knighton edited comment on CASSANDRA-12281 at 11/16/16 5:25 AM:
-

Thanks - your changes and CI look good. I also ran CI on your 
CASSANDRA-12281-trunk branch.

Note to committer: there are very slight differences in the 2.2/3.0/3.x 
branches (not in substantive content, but in comments and other minor fixes). 
The 3.x branch should merge cleanly into trunk, I believe.

+1


was (Author: jkni):
Thanks - your changes and CI look good. I also ran CI on your 
CASSANDRA-12281-trunk branch.

Note to committer: there are very slight differences in the 2.2/3.0/3.x 
branches (not in substantive content, but in comments and other minor fixes). 
The 3.x branch should merge cleanly into trunk, I believe.

> Gossip blocks on startup when another node is bootstrapping
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, 
> restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool NameActive   Pending  Completed   Blocked  All 
> time blocked
> MutationStage 0 0   1840 0
>  0
> ReadStage 0 0   2350 0
>  0
> RequestResponseStage  0 0 53 0
>  0
> ReadRepairStage   0 0  1 0
>  0
> CounterMutationStage  0 0  0 0
>  0
> HintedHandoff 0 0 44 0
>  0
> MiscStage 0 0  0 0
>  0
> CompactionExecutor3 3395 0
>  0
> MemtableReclaimMemory 0 0 30 0
>  0
> PendingRangeCalculator1 2 29 0
>  0
> GossipStage   1  5602164 0
>  0
> MigrationStage0 0  0 0
>  0
> MemtablePostFlush 0 0111 0
>  0
> ValidationExecutor0 0  0 0
>  0
> Sampler   0 0  0 0
>  0
> MemtableFlushWriter   0 0 30 0
>  0
> InternalResponseStage 0 0  0 0
>  0
> AntiEntropyStage  0 0  0 0
>  0
> CacheCleanupExecutor  0 0  0 0
>  0
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {noformat}
> A full thread dump is attached, but the relevant bit seems to be here:
> {noformat}
> [ ... ]
> "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 
> nid=0xea9 waiting on condition [0x7fddcf883000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0004c1e922c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   at 
> 

[jira] [Updated] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping

2016-11-15 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12281:
--
Status: Ready to Commit  (was: Patch Available)

> Gossip blocks on startup when another node is bootstrapping
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, 
> restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool NameActive   Pending  Completed   Blocked  All 
> time blocked
> MutationStage 0 0   1840 0
>  0
> ReadStage 0 0   2350 0
>  0
> RequestResponseStage  0 0 53 0
>  0
> ReadRepairStage   0 0  1 0
>  0
> CounterMutationStage  0 0  0 0
>  0
> HintedHandoff 0 0 44 0
>  0
> MiscStage 0 0  0 0
>  0
> CompactionExecutor3 3395 0
>  0
> MemtableReclaimMemory 0 0 30 0
>  0
> PendingRangeCalculator1 2 29 0
>  0
> GossipStage   1  5602164 0
>  0
> MigrationStage0 0  0 0
>  0
> MemtablePostFlush 0 0111 0
>  0
> ValidationExecutor0 0  0 0
>  0
> Sampler   0 0  0 0
>  0
> MemtableFlushWriter   0 0 30 0
>  0
> InternalResponseStage 0 0  0 0
>  0
> AntiEntropyStage  0 0  0 0
>  0
> CacheCleanupExecutor  0 0  0 0
>  0
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {noformat}
> A full thread dump is attached, but the relevant bit seems to be here:
> {noformat}
> [ ... ]
> "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 
> nid=0xea9 waiting on condition [0x7fddcf883000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0004c1e922c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160)
>   at 
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023)
>   at 
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682)
>   at 
> org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182)
>   at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:1165)
>   at 
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1128)
>   at 
> 

[jira] [Comment Edited] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist

2016-11-10 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654367#comment-15654367
 ] 

Joel Knighton edited comment on CASSANDRA-12273 at 11/10/16 3:43 PM:
-

Very, very close - the only change is that the ticket number is included at the 
end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; 
reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the 
end of the commit message. I might suggest changing the message to something 
like "Create log/artifact directories as needed for stress, handling symbolic 
links" to indicate that this changes behavior for the stress tool and not the 
core DB.

{code}
Create log directories as needed, handling symbolic links

patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273
{code}

Thanks again.


was (Author: jkni):
Very, very close - the only change is that the ticket number is included at the 
end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; 
reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the 
end of the commit message.

{code}
Create log directories as needed, handling symbolic links

patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273
{code}

Thanks again.

> Casandra stress graph: option to create directory for graph if it doesn't 
> exist
> ---
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist

2016-11-10 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654367#comment-15654367
 ] 

Joel Knighton edited comment on CASSANDRA-12273 at 11/10/16 3:42 PM:
-

Very, very close - the only change is that the ticket number is included at the 
end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; 
reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the 
end of the commit message.

{code}
Create log directories as needed, handling symbolic links

patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273
{code}

Thanks again.


was (Author: jkni):
Very, very close - the only change is that the ticket number is included at the 
end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; 
reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the 
end of the commit message.

{code}
Create log directories as needed, handling symbolic links

patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273
{code}

> Casandra stress graph: option to create directory for graph if it doesn't 
> exist
> ---
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist

2016-11-10 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654367#comment-15654367
 ] 

Joel Knighton commented on CASSANDRA-12273:
---

Very, very close - the only change is that the ticket number is included at the 
end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; 
reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the 
end of the commit message.

{code}
Create log directories as needed, handling symbolic links

patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273
{code}

> Casandra stress graph: option to create directory for graph if it doesn't 
> exist
> ---
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist

2016-11-09 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12273:
--
Status: Open  (was: Patch Available)

> Casandra stess graph: option to create directory for graph if it doesn't exist
> --
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist

2016-11-09 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12273:
--
Summary: Casandra stress graph: option to create directory for graph if it 
doesn't exist  (was: Casandra stess graph: option to create directory for graph 
if it doesn't exist)

> Casandra stress graph: option to create directory for graph if it doesn't 
> exist
> ---
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist

2016-11-09 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12273:
--
Status: Awaiting Feedback  (was: Open)

> Casandra stess graph: option to create directory for graph if it doesn't exist
> --
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist

2016-11-09 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651814#comment-15651814
 ] 

Joel Knighton commented on CASSANDRA-12273:
---

Thanks for the patch [~muru]! Your approach looks sound.

A very similar issue exists on trunk with hdrfile logging if an {{hdrfile}} is 
specified in {{SettingsLog.java}}. If you're interested, I think it makes a lot 
of sense to also fix that problem as part of this ticket, as people affected by 
this issue will likely also be affected by the fact that hdrfile paths do not 
have their directory created. I also think it makes sense to canonicalize the 
path before {{Files.createDirectories}}, since this would avoid needing to 
special-case symlinks. This could be done by using {{getCanonicalPath}} instead 
of {{toURI}}.
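
As a rough sketch of that suggestion (hypothetical names; not the stress tool's 
actual code), canonicalizing first resolves any symlinked component before the 
missing directories are created:

{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch of the approach suggested above.
public class OutputDirSketch
{
    static File ensureParentDirectories(String outputPath) throws IOException
    {
        // Resolve symlinks and relative segments up front, so no special case is needed.
        File canonical = new File(outputPath).getCanonicalFile();
        File parent = canonical.getParentFile();
        if (parent != null)
            Files.createDirectories(parent.toPath());
        return canonical;
    }
}
{code}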

For future patches, it is easier to accept contributions if they include a 
CHANGES.txt entry and an appropriately formatted commit message in a patch 
created with {{git format-patch}}. The details on this are available in the 
[docs|http://cassandra.apache.org/doc/latest/development/patches.html].

If you aren't interested in updating the patch with these changes, I still 
think this patch is worth merging; I will update this issue with an 
appropriately formatted commit and approve it after CI.

> Casandra stess graph: option to create directory for graph if it doesn't exist
> --
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist

2016-11-08 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12273:
--
Assignee: Murukesh Mohanan

> Casandra stess graph: option to create directory for graph if it doesn't exist
> --
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist

2016-11-08 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-12273:
--
Assignee: (was: Christopher Batey)

> Casandra stess graph: option to create directory for graph if it doesn't exist
> --
>
> Key: CASSANDRA-12273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12273
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Christopher Batey
>Priority: Minor
>  Labels: lhf
> Attachments: 12273.patch
>
>
> I am running it in CI with ephemeral workspace  / build dirs. It would be 
> nice if CS would create the directory so my build tool doesn't have to



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-07 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-11381:
--
Status: Awaiting Feedback  (was: Open)

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> A node started with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, e.g. PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}}, clients 
> connecting to the node fail with the node throwing:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> QueryOptions(consistencyForUser(username),
>   
>Lists.newArrayList(ByteBufferUtil.bytes(username;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-07 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645280#comment-15645280
 ] 

Joel Knighton edited comment on CASSANDRA-11381 at 11/7/16 8:10 PM:


Thanks for pinging me on this - as you suspected, it slipped through the 
cracks. 

On reviewing the final version of this patch, I found one problem with the 2.2+ 
patches. The proposed patch technically breaks the documented {{IRoleManager}}, 
{{IAuthenticator}}, and {{IAuthorizer}} interfaces. With the implementation 
given, {{doAuthSetup}} will be called twice for a node started with 
{{join_ring=False}}, so {{setup()}} will be called twice for the role manager, 
authenticator, and authorizer. In the documentation for these three public 
interfaces, we state that {{setup()}} will only be called once after starting a 
node. I think we should preserve this documented behavior. While slightly less 
elegant, I think we should instead track whether we've run {{doAuthSetup}} and 
not repeat this call for a node started with {{join_ring=False}} that is asked 
to join. This means the parts of the patch implementing idempotency for the 
MigrationManager listener registration become unnecessary.
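
A minimal sketch of such a guard (illustrative only; the actual change would 
live in {{StorageService}} around {{doAuthSetup}}):

{code}
// Illustrative sketch only - not the actual StorageService code.
public class AuthSetupGuard
{
    private volatile boolean authSetupCalled = false;

    // setup() on the role manager, authenticator, and authorizer must run at most once
    public synchronized void maybeDoAuthSetup(Runnable doAuthSetup)
    {
        if (!authSetupCalled)
        {
            doAuthSetup.run();
            authSetupCalled = true;
        }
    }
}
{code}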


was (Author: jkni):
Thanks for pinging me on this - as you suspected, it slipped through the 
cracks. 

On reviewing the final version of this patch, I found one problem with the 2.2+ 
patches. The proposed patch technically breaks the documented {{IRoleManager}}, 
{{IAuthenticator}}, and {{IAuthorizer}} interfaces. With the implementation 
given, {{doAuthSetup}} will be called twice for a node started with 
{{join_ring=False}}, so {{setup()}} will be called twice for the role manager, 
authenticator, and authorizer. In the documentation for these three public 
interfaces, we state that {{setup()}} will only be called once after starting a 
node. I think we should preserve this documented behavior. While slightly less 
elegant, I think we should instead track whether we've run {{doAuthSetup}} and 
not repeat this call for a node started with {{join_ring=False}} that is asked 
to join. 

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> A node started with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, e.g. PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}}, clients 
> connecting to the node fail with the node throwing:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> QueryOptions(consistencyForUser(username),
>   
>Lists.newArrayList(ByteBufferUtil.bytes(username;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-07 Thread Joel Knighton (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Knighton updated CASSANDRA-11381:
--
Status: Open  (was: Patch Available)

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> A node started with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, e.g. PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}}, clients 
> connecting to the node fail with the node throwing:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> QueryOptions(consistencyForUser(username),
>   
>Lists.newArrayList(ByteBufferUtil.bytes(username;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests

2016-11-07 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645280#comment-15645280
 ] 

Joel Knighton commented on CASSANDRA-11381:
---

Thanks for pinging me on this - as you suspected, it slipped through the 
cracks. 

On reviewing the final version of this patch, I found one problem with the 2.2+ 
patches. The proposed patch technically breaks the documented {{IRoleManager}}, 
{{IAuthenticator}}, and {{IAuthorizer}} interfaces. With the implementation 
given, {{doAuthSetup}} will be called twice for a node started with 
{{join_ring=False}}, so {{setup()}} will be called twice for the role manager, 
authenticator, and authorizer. In the documentation for these three public 
interfaces, we state that {{setup()}} will only be called once after starting a 
node. I think we should preserve this documented behavior. While slightly less 
elegant, I think we should instead track whether we've run {{doAuthSetup}} and 
not repeat this call for a node started with {{join_ring=False}} that is asked 
to join. 

> Node running with join_ring=false and authentication can not serve requests
> ---
>
> Key: CASSANDRA-11381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11381
> Project: Cassandra
>  Issue Type: Bug
>Reporter: mck
>Assignee: mck
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, 
> 11381-trunk.txt, dtest-11381-trunk.txt
>
>
> A node started with {{-Dcassandra.join_ring=false}} in a cluster that has 
> authentication configured, e.g. PasswordAuthenticator, won't be able to serve 
> requests. This is because {{Auth.setup()}} never gets called during 
> startup.
> Without {{Auth.setup()}} having been called in {{StorageService}}, clients 
> connecting to the node fail with the node throwing:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119)
> at 
> org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The exception thrown from the 
> [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119]
> {code}
> ResultMessage.Rows rows = 
> authenticateStatement.execute(QueryState.forInternalCalls(), new 
> QueryOptions(consistencyForUser(username),
>   
>Lists.newArrayList(ByteBufferUtil.bytes(username;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping

2016-11-07 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644323#comment-15644323
 ] 

Joel Knighton commented on CASSANDRA-12281:
---

Ah, good catch on the aggregate log message spanning the trace and debug cases. 
That makes a lot of sense - thanks for the explanation. I'll keep this at the 
top of my queue for when CI is available.

> Gossip blocks on startup when another node is bootstrapping
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, 
> restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool NameActive   Pending  Completed   Blocked  All 
> time blocked
> MutationStage 0 0   1840 0
>  0
> ReadStage 0 0   2350 0
>  0
> RequestResponseStage  0 0 53 0
>  0
> ReadRepairStage   0 0  1 0
>  0
> CounterMutationStage  0 0  0 0
>  0
> HintedHandoff 0 0 44 0
>  0
> MiscStage 0 0  0 0
>  0
> CompactionExecutor3 3395 0
>  0
> MemtableReclaimMemory 0 0 30 0
>  0
> PendingRangeCalculator1 2 29 0
>  0
> GossipStage   1  5602164 0
>  0
> MigrationStage0 0  0 0
>  0
> MemtablePostFlush 0 0111 0
>  0
> ValidationExecutor0 0  0 0
>  0
> Sampler   0 0  0 0
>  0
> MemtableFlushWriter   0 0 30 0
>  0
> InternalResponseStage 0 0  0 0
>  0
> AntiEntropyStage  0 0  0 0
>  0
> CacheCleanupExecutor  0 0  0 0
>  0
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {noformat}
> A full thread dump is attached, but the relevant bit seems to be here:
> {noformat}
> [ ... ]
> "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 
> nid=0xea9 waiting on condition [0x7fddcf883000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0004c1e922c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174)
>   at 
> org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160)
>   at 
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023)
>   at 
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682)
>   at 
> org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182)
>   at 

[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests

2016-11-04 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637549#comment-15637549
 ] 

Joel Knighton commented on CASSANDRA-12653:
---

[~spo...@gmail.com] - yes! I sincerely apologize for the delay here. If anyone 
else is interested in reviewing this, they're welcome to pick it up, but it's 
near the top of my list and I hope to get to this soon.

> In-flight shadow round requests
> ---
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Minor
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and checking 
> some host IDs or tokens by doing a gossip "shadow round" once before joining 
> the cluster. This is done by sending a gossip SYN to all seeds until we 
> receive a response with the cluster state, from where we can move on in the 
> bootstrap process. Receiving a response marks the shadow round as done and 
> calls {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might still be other in-flight requests, 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. the gossiper may or may not be enabled yet). 
> One side effect is that a MigrationTask is spawned for each shadow round 
> reply except the first. Tasks might or might not execute depending on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when this 
> happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]
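
To make the described race easier to picture, here is a minimal, hypothetical Java sketch of the flow above. The class, field, and method names are illustrative only and do not mirror the real {{Gossiper}} API; it simply simulates several in-flight shadow round replies, of which only the first is expected.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the in-flight shadow round race (illustrative names,
// not the actual Gossiper implementation).
public class ShadowRoundRaceSketch
{
    // Stand-in for the gossiper's endpoint state map.
    private static final Map<String, String> endpointStateMap = new ConcurrentHashMap<>();
    private static boolean inShadowRound = true;

    public static void main(String[] args) throws InterruptedException
    {
        String[] seeds = { "seed1", "seed2", "seed3" };
        ExecutorService inflightReplies = Executors.newFixedThreadPool(seeds.length);

        // The SYN went to every seed, so several ACKs may already be in flight.
        for (String seed : seeds)
            inflightReplies.execute(() -> onShadowRoundAck(seed));

        inflightReplies.shutdown();
        inflightReplies.awaitTermination(5, TimeUnit.SECONDS);
    }

    private static synchronized void onShadowRoundAck(String seed)
    {
        if (inShadowRound)
        {
            // First reply: the shadow round is considered done and the collected
            // state is wiped again (the resetEndpointStateMap step).
            endpointStateMap.put(seed, "cluster state from " + seed);
            inShadowRound = false;
            endpointStateMap.clear();
            System.out.println("Shadow round finished after reply from " + seed);
        }
        else
        {
            // Late reply: follow-up work (e.g. a migration task) now runs against
            // the cleared state, which is where "unknown endpoint"-style errors
            // come from.
            System.out.println("Late shadow round reply from " + seed
                               + "; endpoint known? " + endpointStateMap.containsKey(seed));
        }
    }
}
{code}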



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping

2016-11-04 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637528#comment-15637528
 ] 

Joel Knighton commented on CASSANDRA-12281:
---

Thanks for the patch and your patience as I get to this for review! I've been 
quite busy lately.

The approach overall seems sound. While calculating pending ranges can be a 
little slow, I don't think we risk falling too far behind, because the huge 
delays here appear to be the result of delays cascading into other tasks. The 
{{PendingRangeCalculatorService}}'s restriction to a single queued task, which 
reflects cluster state at the time it actually executes, helps with this.
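
For readers following along, the general pattern being referred to here, collapsing a burst of triggers into at most one queued recalculation, can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the actual {{PendingRangeCalculatorService}} code.

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical "at most one queued recalculation" executor (illustrative only).
public class SingleQueuedCalculator
{
    // One worker thread, a queue bounded to a single task, and a rejection policy
    // that silently drops submissions while a task is already queued.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>(1),
            new ThreadPoolExecutor.DiscardPolicy());

    public void requestRecalculation(Runnable recalculation)
    {
        // Many gossip events may call this in a burst; at most one run stays queued,
        // and that run sees the cluster state as of the time it actually executes.
        executor.execute(recalculation);
    }

    public void shutdownAndWait() throws InterruptedException
    {
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException
    {
        SingleQueuedCalculator calc = new SingleQueuedCalculator();
        // A burst of 100 triggers results in far fewer actual recalculations.
        for (int i = 0; i < 100; i++)
            calc.requestRecalculation(() -> System.out.println("recalculating pending ranges"));
        calc.shutdownAndWait();
    }
}
{code}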

A few small questions/nits:
- Is there a reason that the test is excluded from the 2.2 branch? Byteman is 
available for tests on the 2.2 branch since [CASSANDRA-12377], and I don't see 
anything else that stops the test from being useful there.
- Generally, the tests are organized as a top-level class for some entity or 
fundamental operation in the codebase, with specific test methods for unit 
tests/regression tests. I think it would make sense to establish a 
{{PendingRangeCalculatorServiceTest}} and introduce the specific test for 
[CASSANDRA-12281] inside that class (a minimal skeleton is sketched after this 
list).
- In the {{PendingRangeCalculatorService}}, I'm not sure we need to move the 
"Finished calculation for ..." log message to trace. Most Gossip/TokenMetadata 
state changes are logged at debug, especially when they reflect some detail 
about the aggregate state of an operation.
- A few minor spelling fixes in the test: "aquire" -> "acquire", "fist" -> 
"first". (Note that I normally wouldn't bother with these, but since the test 
could likely use a few other changes, I think it is worthwhile to fix them.)
- In the test's setUp, the call to {{Keyspace.setInitialized}} is redundant. 
The call to {{SchemaLoader.prepareServer}} will already perform this.
- CI looks good overall. The 3.0-dtest run has a few materialized view dtest 
failures that are likely unrelated, but it would be good if you could retrigger 
CI for at least this branch.
- There's no CI/branch posted for the 3.X series. While this has barely 
diverged from trunk at this point, it'd be nice if you could run CI for this 
branch.
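
For concreteness, here is a minimal, hypothetical skeleton of the suggested test organization; the class and method names are illustrative only and are not taken from the attached patch.

{code}
import org.junit.BeforeClass;
import org.junit.Test;

// Hypothetical skeleton of the suggested test layout (illustrative names only).
public class PendingRangeCalculatorServiceTest
{
    @BeforeClass
    public static void setUpClass()
    {
        // SchemaLoader.prepareServer() would go here; as noted above, an extra
        // Keyspace.setInitialized() call is then redundant.
    }

    @Test
    public void testGossipNotBlockedDuringPendingRangeCalculation() throws Exception
    {
        // The Byteman-driven regression test for CASSANDRA-12281 would live here.
    }
}
{code}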

Thanks again.

> Gossip blocks on startup when another node is bootstrapping
> ---
>
> Key: CASSANDRA-12281
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12281
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Eric Evans
>Assignee: Stefan Podkowinski
> Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, 
> restbase1015-a_jstack.txt
>
>
> In our cluster, normal node startup times (after a drain on shutdown) are 
> less than 1 minute.  However, when another node in the cluster is 
> bootstrapping, the same node startup takes nearly 30 minutes to complete, the 
> apparent result of gossip blocking on pending range calculations.
> {noformat}
> $ nodetool-a tpstats
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         0           1840         0                 0
> ReadStage                         0         0           2350         0                 0
> RequestResponseStage              0         0             53         0                 0
> ReadRepairStage                   0         0              1         0                 0
> CounterMutationStage              0         0              0         0                 0
> HintedHandoff                     0         0             44         0                 0
> MiscStage                         0         0              0         0                 0
> CompactionExecutor                3         3            395         0                 0
> MemtableReclaimMemory             0         0             30         0                 0
> PendingRangeCalculator            1         2             29         0                 0
> GossipStage                       1      5602            164         0                 0
> MigrationStage                    0         0              0         0                 0
> MemtablePostFlush                 0         0            111         0                 0
> ValidationExecutor                0         0              0         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               0         0             30         0                 0
> InternalResponseStage             0         0              0         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> CacheCleanupExecutor              0         0 
