[jira] [Commented] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.
[ https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956912#comment-15956912 ]

Joel Knighton commented on CASSANDRA-13307:
-------------------------------------------

There's one you can set through the Edit button, if you scroll down. If you somehow don't have permission to access/edit that, come complain in #cassandra-dev on IRC. Thanks for volunteering to review!

> The specification of protocol version in cqlsh means the python driver
> doesn't automatically downgrade protocol version.
>
> Key: CASSANDRA-13307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Matt Byrd
> Assignee: Matt Byrd
> Priority: Minor
> Fix For: 3.11.x
>
> Hi,
> It looks like we've regressed on the issue described in
> https://issues.apache.org/jira/browse/CASSANDRA-9467,
> in that we're no longer able to connect from newer cqlsh versions
> (e.g. trunk) to older versions of Cassandra that speak a lower version of
> the protocol (e.g. 2.1 with protocol version 3).
> The problem seems to be that we're relying on the client's ability to
> automatically downgrade the protocol version, implemented in Cassandra in
> https://issues.apache.org/jira/browse/CASSANDRA-12838
> and used in the Python driver in
> https://datastax-oss.atlassian.net/browse/PYTHON-240.
> The trouble comes from
> https://datastax-oss.atlassian.net/browse/PYTHON-537,
> "Don't downgrade protocol version if explicitly set"
> (included when we bumped the Python driver from 3.5.0 to 3.7.0 as part of
> fixing https://issues.apache.org/jira/browse/CASSANDRA-11534),
> since we do explicitly specify the protocol version in bin/cqlsh.py.
> I've got a patch which adds an option to explicitly specify the protocol
> version (for those who want to do that) and otherwise defaults to not
> setting the protocol version, i.e. using the protocol version from the
> client we ship, which should by default be the same protocol as the server.
> Then it should downgrade gracefully, as intended.
> Let me know if that seems reasonable.
> Thanks,
> Matt

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
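The default-vs-explicit behavior the patch describes can be sketched as follows. This is a hypothetical helper, not the actual patch; `connection_kwargs` and its parameter names are illustrative:

```python
def connection_kwargs(contact_points, explicit_protocol_version=None):
    """Build driver connection kwargs, only pinning the protocol version
    when the user explicitly asked for one."""
    kwargs = {"contact_points": contact_points}
    if explicit_protocol_version is not None:
        # Explicitly set: per PYTHON-537 the driver will NOT downgrade this.
        kwargs["protocol_version"] = explicit_protocol_version
    # Otherwise omit protocol_version entirely so the driver negotiates
    # and can downgrade gracefully against an older server (PYTHON-240).
    return kwargs
```

With the DataStax Python driver this would be used as `Cluster(**connection_kwargs(["127.0.0.1"]))`; the key point is that omitting `protocol_version` leaves negotiation to the driver.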
[jira] [Updated] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.
[ https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-13307:
--------------------------------------
    Reviewer: mck
[jira] [Commented] (CASSANDRA-12929) Fix version check to enable streaming keep-alive
[ https://issues.apache.org/jira/browse/CASSANDRA-12929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955195#comment-15955195 ]

Joel Knighton commented on CASSANDRA-12929:
-------------------------------------------

Thanks! It happens.

> Fix version check to enable streaming keep-alive
>
> Key: CASSANDRA-12929
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12929
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michael Shuler
> Assignee: Paulo Motta
> Labels: dtest, test-failure
> Fix For: 4.0
>
> example failure:
> http://cassci.datastax.com/job/trunk_novnode_dtest/494/testReport/bootstrap_test/TestBootstrap/simple_bootstrap_test_small_keepalive_period
> {noformat}
> Error Message
> Expected [['COMPLETED']] from SELECT bootstrapped FROM system.local WHERE key='local', but got [[u'IN_PROGRESS']]
> >> begin captured logging <<
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-YmnyEI
> dtest: DEBUG: Done setting configuration options:
> {'num_tokens': None, 'phi_convict_threshold': 5, 'start_rpc': 'true'}
> cassandra.cluster: INFO: New Cassandra host discovered
> >> end captured logging <<
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/tools/decorators.py", line 46, in wrapped
>     f(obj)
>   File "/home/automaton/cassandra-dtest/bootstrap_test.py", line 163, in simple_bootstrap_test_small_keepalive_period
>     assert_bootstrap_state(self, node2, 'COMPLETED')
>   File "/home/automaton/cassandra-dtest/tools/assertions.py", line 297, in assert_bootstrap_state
>     assert_one(session, "SELECT bootstrapped FROM system.local WHERE key='local'", [expected_bootstrap_state])
>   File "/home/automaton/cassandra-dtest/tools/assertions.py", line 130, in assert_one
>     assert list_res == [expected], "Expected {} from {}, but got {}".format([expected], query, list_res)
> "Expected [['COMPLETED']] from SELECT bootstrapped FROM system.local WHERE key='local', but got [[u'IN_PROGRESS']]"
> {noformat}
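The dtest assertion helper seen in the trace boils down to comparing the query result against a single expected row. A minimal self-contained sketch, simplified from the real `tools/assertions.py` (the signature and `query` parameter here are illustrative):

```python
def assert_one(rows, expected, query="<query>"):
    # Simplified dtest-style helper: the result set must be exactly one
    # row matching `expected` (each row compared as a plain list).
    list_res = [list(r) for r in rows]
    assert list_res == [expected], \
        "Expected {} from {}, but got {}".format([expected], query, list_res)
```

In the failure above, the row still read `IN_PROGRESS` instead of `COMPLETED`, so the assertion fired with the message shown.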
[jira] [Resolved] (CASSANDRA-13402) testall failure in org.apache.cassandra.dht.StreamStateStoreTest.testUpdateAndQueryAvailableRanges
[ https://issues.apache.org/jira/browse/CASSANDRA-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton resolved CASSANDRA-13402.
---------------------------------------
    Resolution: Duplicate

> testall failure in
> org.apache.cassandra.dht.StreamStateStoreTest.testUpdateAndQueryAvailableRanges
>
> Key: CASSANDRA-13402
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13402
> Project: Cassandra
> Issue Type: Bug
> Components: Testing
> Reporter: Sean McCarthy
> Labels: test-failure, testall
> Attachments: TEST-org.apache.cassandra.dht.StreamStateStoreTest.log
>
> example failure:
> http://cassci.datastax.com/job/trunk_testall/1488/testReport/org.apache.cassandra.dht/StreamStateStoreTest/testUpdateAndQueryAvailableRanges
> {code}
> Stacktrace
> java.lang.NullPointerException
>     at org.apache.cassandra.streaming.StreamSession.isKeepAliveSupported(StreamSession.java:244)
>     at org.apache.cassandra.streaming.StreamSession.<init>(StreamSession.java:196)
>     at org.apache.cassandra.dht.StreamStateStoreTest.testUpdateAndQueryAvailableRanges(StreamStateStoreTest.java:53)
> {code}
> Related failures: (13)
> http://cassci.datastax.com/job/trunk_testall/1488/testReport/
[jira] [Reopened] (CASSANDRA-12929) Fix version check to enable streaming keep-alive
[ https://issues.apache.org/jira/browse/CASSANDRA-12929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton reopened CASSANDRA-12929:
---------------------------------------

It looks like this is causing quite a few test failures on commit.

In dtests, this includes many tests in sstable_generation_loading_test and snapshot_test.TestSnapshot.test_basic_snapshot_and_restore. In testall, this includes StreamStateStoreTest.testUpdateAndQueryAvailableRanges, LocalSyncTaskTest.testDifference, StreamingRepairTaskTest.incrementalStreamPlan, StreamingRepairTaskTest.fullStreamPlan, StreamTransferTaskTest.testScheduleTimeout, and StreamTransferTaskTest.testFailSessionDuringTransferShouldNotReleaseReferences.

There may be others that I missed, but that list should get things pointed in the right direction.
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-12653:
--------------------------------------
    Fix Version/s: (was: 3.11.x)
                   (was: 4.x)
                   (was: 3.0.x)
                   (was: 2.2.x)
                   4.0
                   3.11.0
                   3.0.13
                   2.2.10

> In-flight shadow round requests
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata
> Reporter: Stefan Podkowinski
> Assignee: Stefan Podkowinski
> Priority: Minor
> Fix For: 2.2.10, 3.0.13, 3.11.0, 4.0
>
> Bootstrapping or replacing a node in the cluster requires gathering and
> checking some host IDs or tokens by doing a gossip "shadow round" once
> before joining the cluster. This is done by sending a gossip SYN to all
> seeds until we receive a response with the cluster state, from which we can
> move on in the bootstrap process. Receiving a response marks the shadow
> round as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the
> received state again.
> The issue here is that at this point there may be other in-flight requests,
> and it's very likely that shadow round responses from other seeds will be
> received afterwards, while the current state of the bootstrap process
> doesn't expect this to happen (e.g. the gossiper may or may not be enabled).
> One side effect is that MigrationTasks are spawned for each shadow round
> reply except the first. Tasks might or might not execute based on whether
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at
> the start of the task.
> You'll see error log messages like the following when this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this,
> but it would be good to get a second opinion (feel free to close as "won't
> fix").
> /cc [~Stefania] [~thobbs]
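The race described above can be sketched with a toy model (class and method names are hypothetical; the real logic lives in {{Gossiper}}): the first seed reply completes the shadow round, and replies still in flight after that should be dropped rather than spawning MigrationTasks:

```python
class ShadowRound:
    # Toy model of the shadow-round race: only the first seed reply should
    # complete the round; later in-flight replies must be ignored.
    def __init__(self):
        self.in_progress = False
        self.endpoint_state = {}

    def start(self):
        self.in_progress = True

    def on_reply(self, seed, state):
        if not self.in_progress:
            # Late reply from another seed after the round completed:
            # drop it instead of processing it (and spawning tasks).
            return False
        self.endpoint_state.update(state)
        self.in_progress = False     # the first reply completes the round
        self.endpoint_state.clear()  # resetEndpointStateMap analogue
        return True
```

The bug report amounts to the original code lacking the early-return guard, so every late reply was still processed.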
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937125#comment-15937125 ]

Joel Knighton commented on CASSANDRA-12653:
-------------------------------------------

Committed to 2.2 as {{bf0906b92cf65161d828e31bc46436d427bbb4b8}} and merged forward through 3.0, 3.11, and trunk. Added Jason Brown as an additional reviewer in the commit since his feedback was incorporated in the latest round of patches. Thanks everyone!
[jira] [Resolved] (CASSANDRA-13347) dtest failure in upgrade_tests.upgrade_through_versions_test.TestUpgrade_current_2_2_x_To_indev_3_0_x.rolling_upgrade_test
[ https://issues.apache.org/jira/browse/CASSANDRA-13347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton resolved CASSANDRA-13347.
---------------------------------------
    Resolution: Fixed
    Fix Version/s: 3.11.0
                   3.0.13

This should be fixed by [CASSANDRA-13320].

> dtest failure in
> upgrade_tests.upgrade_through_versions_test.TestUpgrade_current_2_2_x_To_indev_3_0_x.rolling_upgrade_test
>
> Key: CASSANDRA-13347
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13347
> Project: Cassandra
> Issue Type: Bug
> Components: Testing
> Reporter: Sean McCarthy
> Labels: dtest, test-failure
> Fix For: 3.0.13, 3.11.0
> Attachments: node1_debug.log, node1_gc.log, node1.log, node2_debug.log, node2_gc.log, node2.log, node3_debug.log, node3_gc.log, node3.log
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.0_large_dtest/58/testReport/upgrade_tests.upgrade_through_versions_test/TestUpgrade_current_2_2_x_To_indev_3_0_x/rolling_upgrade_test
> {code}
> Error Message
> Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', ['upgradesstables', '-a']] exited with non-zero status; exit status: 2;
> stderr: error: null
> -- StackTrace --
> java.lang.AssertionError
>     at org.apache.cassandra.db.rows.Rows.collectStats(Rows.java:70)
>     at org.apache.cassandra.io.sstable.format.big.BigTableWriter$StatsCollector.applyToRow(BigTableWriter.java:197)
>     at org.apache.cassandra.db.transform.BaseRows.applyOne(BaseRows.java:116)
>     at org.apache.cassandra.db.transform.BaseRows.add(BaseRows.java:107)
>     at org.apache.cassandra.db.transform.UnfilteredRows.add(UnfilteredRows.java:41)
>     at org.apache.cassandra.db.transform.Transformation.add(Transformation.java:156)
>     at org.apache.cassandra.db.transform.Transformation.apply(Transformation.java:122)
>     at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:147)
>     at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:125)
>     at org.apache.cassandra.db.compaction.writers.DefaultCompactionWriter.realAppend(DefaultCompactionWriter.java:57)
>     at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.append(CompactionAwareWriter.java:109)
>     at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:195)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:89)
>     at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:61)
>     at org.apache.cassandra.db.compaction.CompactionManager$5.execute(CompactionManager.java:415)
>     at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:307)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py", line 279, in rolling_upgrade_test
>     self.upgrade_scenario(rolling=True)
>   File "/home/automaton/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py", line 345, in upgrade_scenario
>     self.upgrade_to_version(version_meta, partial=True, nodes=(node,))
>   File "/home/automaton/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py", line 446, in upgrade_to_version
>     node.nodetool('upgradesstables -a')
>   File "/home/automaton/venv/local/lib/python2.7/site-packages/ccmlib/node.py", line 789, in nodetool
>     return handle_external_tool_process(p, ['nodetool', '-h', 'localhost', '-p', str(self.jmx_port), cmd.split()])
>   File "/home/automaton/venv/local/lib/python2.7/site-packages/ccmlib/node.py", line 2002, in handle_external_tool_process
>     raise ToolError(cmd_args, rc, out, err)
> {code}
> Related failures:
[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
[ https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-13306:
--------------------------------------
    Reviewer: Dave Brosius

> Builds fetch source jars for build dependencies, not just source dependencies
>
> Key: CASSANDRA-13306
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13306
> Project: Cassandra
> Issue Type: Bug
> Components: Build
> Reporter: Joel Knighton
> Assignee: Joel Knighton
> Fix For: 4.0
>
> A recent commit without a linked JIRA cleaned up dead imports and also added
> a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This
> causes ant to fetch source jars for the build deps, but we have an explicit
> separate build-deps-pom-sources that fetches sources.
> This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}.
> Was this an intentional change, [~dbrosius]? It seems to conflate the
> separate build-deps-pom and build-deps-pom-sources.
[jira] [Commented] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
[ https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927489#comment-15927489 ]

Joel Knighton commented on CASSANDRA-13306:
-------------------------------------------

Thanks! I ran a round of CI and tested builds with empty and populated .m2. Committed to trunk as {{fe08463c3b7135a0f1b121bb0d148c80b8c7e123}}.
[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
[ https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-13306:
--------------------------------------
    Resolution: Fixed
    Fix Version/s: 4.0
    Status: Resolved (was: Ready to Commit)
[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
[ https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-13306:
--------------------------------------
    Status: Ready to Commit (was: Patch Available)
[jira] [Updated] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
[ https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-13306:
--------------------------------------
    Status: Patch Available (was: Open)
[jira] [Assigned] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
[ https://issues.apache.org/jira/browse/CASSANDRA-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton reassigned CASSANDRA-13306:
-----------------------------------------
    Assignee: Joel Knighton
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907535#comment-15907535 ]

Joel Knighton commented on CASSANDRA-12653:
-------------------------------------------

I planned on leaving that honor to [~spo...@gmail.com] as patch author, but if he doesn't, I'm happy to do so.
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903212#comment-15903212 ] Joel Knighton commented on CASSANDRA-12653: --- Sure - while I'd argue that a need for a change in the future could be introduced in that future patch, I agree that this distinction is very minor and won't cause any problems. Thanks for the patch and your patience! > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires gathering and checking > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response marks the shadow round as done and > calls {{Gossiper.resetEndpointStateMap}} to clean up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at > start of the task. 
You'll see error log messages such as the following when this > happens: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "won't > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
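The race in the description above can be reduced to a toy model. All names here are illustrative stand-ins, not the actual {{Gossiper}} or {{MigrationTask}} code: a late shadow-round reply is processed after {{resetEndpointStateMap}} has wiped the received state, so a task started from that reply no longer finds the endpoint, which is what produces the "unknown endpoint" ERROR in the log snippet.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race (illustrative names only): endpoint state recorded
// from a shadow round reply is wiped by resetEndpointStateMap(), so anything
// that processes a later reply no longer finds the endpoint.
public class ShadowRoundRace {
    private final Map<String, String> endpointStates = new ConcurrentHashMap<>();

    // A shadow round reply records gossip state for the replying seed.
    public void onShadowRoundReply(String endpoint, String state) {
        endpointStates.put(endpoint, state);
    }

    // Called once the first reply completes the shadow round.
    public void resetEndpointStateMap() {
        endpointStates.clear();
    }

    // Stand-in for the FailureDetector.isAlive(endpoint) check at task start.
    public boolean isKnown(String endpoint) {
        return endpointStates.containsKey(endpoint);
    }
}
```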
[jira] [Created] (CASSANDRA-13306) Builds fetch source jars for build dependencies, not just source dependencies
Joel Knighton created CASSANDRA-13306: - Summary: Builds fetch source jars for build dependencies, not just source dependencies Key: CASSANDRA-13306 URL: https://issues.apache.org/jira/browse/CASSANDRA-13306 Project: Cassandra Issue Type: Bug Components: Build Reporter: Joel Knighton A recent commit without a linked JIRA cleaned up dead imports and also added a {{sourcesFilesetId}} to artifact fetching for the build-deps-pom. This causes ant to fetch source jars for the build deps, but we have an explicit separate build-deps-pom-sources that fetches sources. This happened in commit {{e96ce6d132129025ff6b923129cb67eed2f97931}}. Was this an intentional change, [~dbrosius]? It seems to conflate the separate build-deps-pom and build-deps-pom-sources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897963#comment-15897963 ] Joel Knighton commented on CASSANDRA-13303: --- [CASSANDRA-13038] introduced the regression in {{a5ce963117acf5e4cf0a31057551f2f42385c398}}. The regression was fixed in {{adbe2cc4df0134955a2c83ae4ebd0086ea5e9164}}. > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > The cause seems to be that the sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it, but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897953#comment-15897953 ] Joel Knighton commented on CASSANDRA-13303: --- The exact test output would help diagnose this, but it sounds like the failure introduced/fixed in [CASSANDRA-13038], as seen in CI [here|http://cassci.datastax.com/job/trunk_testall/1436/testReport/junit/org.apache.cassandra.db.compaction/CompactionsTest/testSingleSSTableCompactionWithSizeTieredCompaction/]. Can you make sure this failure still occurs after fetching latest trunk? If so, what's your trunk commit hash? > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > The cause seems to be that the sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it, but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13303) CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super flaky
[ https://issues.apache.org/jira/browse/CASSANDRA-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897934#comment-15897934 ] Joel Knighton commented on CASSANDRA-13303: --- Thanks for the report, but there isn't a lot that's actionable here. Could you provide the branch(es) it is failing on for you? In addition, the specific failure (as shown by test output/stacktrace) you see would help someone identify the problem, particularly in cases like this when the test isn't failing on CI. This test recently had a regression introduced and fixed in [CASSANDRA-13038], but I don't know if it's the same failure you're seeing. > CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction super > flaky > --- > > Key: CASSANDRA-13303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13303 > Project: Cassandra > Issue Type: Bug >Reporter: Benjamin Roth > > On my machine, this test succeeds maybe 1 out of 10 times. > The cause seems to be that the sstable is not elected for compaction in > worthDroppingTombstones as droppableRatio is 0.0 > I don't know the primary intention of this test, so I didn't touch it, but the > conditions are not safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892772#comment-15892772 ] Joel Knighton commented on CASSANDRA-12653: --- Thanks! The latest changes look good - however, if moving the System.nanoTime() call to the comparison site, it seems that the {{firstSynSendAt}} truly does reduce to a boolean, since the comparison will now always be true if {{firstSynSendAt}} has been set. I don't think the existing patch will cause any problems, but it may be more complicated than it needs to be. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires gathering and checking > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response marks the shadow round as done and > calls {{Gossiper.resetEndpointStateMap}} to clean up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at > start of the task. 
You'll see error log messages such as the following when this > happens: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "won't > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
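The point that {{firstSynSendAt}} reduces to a boolean can be sketched as follows. This is a hypothetical simplification, not the actual patch: because {{System.nanoTime()}} is monotonically non-decreasing, a reading taken later at the comparison site can never be smaller than the stored send timestamp, so the timestamp comparison is always true once the field has been set.

```java
// Illustrative simplification (not the actual Cassandra patch): once
// firstSynSendAt is set, the timestamp comparison degenerates into
// "was a SYN ever sent?".
public class ShadowRoundState {
    private volatile long firstSynSendAt = 0L;

    public void markSynSent() {
        if (firstSynSendAt == 0L)
            firstSynSendAt = System.nanoTime();
    }

    // Timestamp form: always true once markSynSent() has run, because a
    // later System.nanoTime() reading cannot precede the stored one.
    // (Subtraction, not direct comparison, per the nanoTime() contract.)
    public boolean isResponseExpected() {
        return firstSynSendAt != 0L && System.nanoTime() - firstSynSendAt >= 0;
    }

    // Equivalent boolean form.
    public boolean hasSentSyn() {
        return firstSynSendAt != 0L;
    }
}
```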
[jira] [Resolved] (CASSANDRA-13281) testall failure in org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization
[ https://issues.apache.org/jira/browse/CASSANDRA-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton resolved CASSANDRA-13281. --- Resolution: Duplicate Confirmed this is a failure due to a small update needed to the test after [CASSANDRA-13038]. Reopened and being fixed there. > testall failure in > org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization > > > Key: CASSANDRA-13281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13281 > Project: Cassandra > Issue Type: Bug > Components: Testing >Reporter: Sean McCarthy >Assignee: Joel Knighton > Labels: test-failure, testall > Attachments: > TEST-org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.log > > > example failure: > http://cassci.datastax.com/job/cassandra-3.11_testall/96/testReport/org.apache.cassandra.io.sstable.metadata/MetadataSerializerTest/testSerialization > {code} > Error Message > expected:> but was: > {code}{code} > Stacktrace > junit.framework.AssertionFailedError: > expected: > but was: > at > org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization(MetadataSerializerTest.java:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (CASSANDRA-13281) testall failure in org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization
[ https://issues.apache.org/jira/browse/CASSANDRA-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton reassigned CASSANDRA-13281: - Assignee: Joel Knighton > testall failure in > org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization > > > Key: CASSANDRA-13281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13281 > Project: Cassandra > Issue Type: Bug > Components: Testing >Reporter: Sean McCarthy >Assignee: Joel Knighton > Labels: test-failure, testall > Attachments: > TEST-org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.log > > > example failure: > http://cassci.datastax.com/job/cassandra-3.11_testall/96/testReport/org.apache.cassandra.io.sstable.metadata/MetadataSerializerTest/testSerialization > {code} > Error Message > expected:> but was: > {code}{code} > Stacktrace > junit.framework.AssertionFailedError: > expected: > but was: > at > org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization(MetadataSerializerTest.java:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891229#comment-15891229 ] Joel Knighton commented on CASSANDRA-13038: --- Nit + 3.0 changes look good. If CI doesn't have any problems, +1. > 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Fix For: 3.0.12, 3.11.0 > > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, which contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC) > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'}; > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. 
> Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
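The behavior described above, a histogram capped to a fixed number of bins whose {{update()}} merges the closest pair of bins on overflow, can be sketched roughly as below. This is an illustrative simplification with invented names, not Cassandra's {{StreamingHistogram}} implementation. It also shows why many distinct deletion times make {{update()}} hot: every insert that overflows the cap pays a linear scan for the closest pair of bins.

```java
import java.util.TreeMap;

// Minimal sketch (not Cassandra's implementation) of a streaming histogram
// capped at maxBins: each update inserts a point and, if the cap is
// exceeded, merges the two closest bins into a weighted midpoint.
public class CappedStreamingHistogram {
    private final int maxBins;
    private final TreeMap<Double, Long> bins = new TreeMap<>();

    public CappedStreamingHistogram(int maxBins) {
        this.maxBins = maxBins;
    }

    public void update(double point) {
        bins.merge(point, 1L, Long::sum);
        if (bins.size() > maxBins)
            mergeClosestBins();  // linear scan on every overflowing insert
    }

    private void mergeClosestBins() {
        Double prev = null, m1 = null, m2 = null;
        double smallestGap = Double.MAX_VALUE;
        for (Double key : bins.keySet()) {
            if (prev != null && key - prev < smallestGap) {
                smallestGap = key - prev;
                m1 = prev;
                m2 = key;
            }
            prev = key;
        }
        long c1 = bins.remove(m1), c2 = bins.remove(m2);
        // Weighted midpoint keeps the merged bin's mass centered.
        bins.put((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2);
    }

    public int binCount() {
        return bins.size();
    }
}
```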
[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891175#comment-15891175 ] Joel Knighton commented on CASSANDRA-13038: --- Thanks - on a first skim, both of those look good and fix the tests locally for me. One minor nit - if removing the maxSpoolSize from {{equals}} on {{StreamingHistogram}}, it seems we should remove it from {{hashCode}} as well to respect the method contract. > 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Fix For: 3.0.12, 3.11.0 > > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, which contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC) > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'}; > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. 
> Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
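The {{equals}}/{{hashCode}} nit above is the standard Java contract: equal objects must have equal hash codes, so a field excluded from {{equals}} must also be excluded from {{hashCode}}. A toy illustration (hypothetical class, not the actual {{StreamingHistogram}}):

```java
import java.util.Objects;

// Toy illustration of the equals/hashCode contract: maxSpoolSize here stands
// in for a field excluded from equality. Hashing only the fields that
// equals() compares keeps equal objects in the same hash bucket.
public class HistogramKey {
    final int maxBinSize;   // participates in equality
    final int maxSpoolSize; // excluded from equality (illustrative)

    HistogramKey(int maxBinSize, int maxSpoolSize) {
        this.maxBinSize = maxBinSize;
        this.maxSpoolSize = maxSpoolSize;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof HistogramKey))
            return false;
        return maxBinSize == ((HistogramKey) o).maxBinSize;
    }

    @Override
    public int hashCode() {
        // Correct: hash only the fields used by equals(). Including
        // maxSpoolSize would let equal objects hash differently.
        return Objects.hash(maxBinSize);
    }
}
```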
[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891013#comment-15891013 ] Joel Knighton commented on CASSANDRA-13038: --- Thanks! I can reproduce the {{CompactionsTest}} failure locally, so feel free to ping me if I can help diagnose. > 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Fix For: 3.0.12, 3.11.0 > > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, which contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC) > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'}; > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. 
> Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890941#comment-15890941 ] Joel Knighton edited comment on CASSANDRA-13038 at 3/1/17 8:09 PM: --- It looks like this ticket introduced a few test failures. {{org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization}} is consistently failing on 3.11 and trunk after this commit, and {{org.apache.cassandra.db.compaction.CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction}} is failing nearly 100% of the time after this commit on trunk. In both cases, these tests are failing on the linked CI above and appear to have no historical failures. I don't see any discussion of these CI failures for the linked branches on the ticket - are they being resolved elsewhere? EDIT: In addition, reverting this commit fixes these test failures. was (Author: jkni): It looks like this ticket introduced a few test failures. {{org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization}} is consistently failing on 3.11 and trunk after this commit, and {{org.apache.cassandra.db.compaction.CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction}} is failing nearly 100% of the time after this commit on trunk. In both cases, these tests are failing on the linked CI above and appear to have no historical failures. I don't see any discussion of these CI failures for the linked branches on the ticket - are they being resolved elsewhere? 
> 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Fix For: 3.0.12, 3.11.0 > > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, which contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC) > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'}; > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. 
> Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Reopened] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton reopened CASSANDRA-13038: --- It looks like this ticket introduced a few test failures. {{org.apache.cassandra.io.sstable.metadata.MetadataSerializerTest.testSerialization}} is consistently failing on 3.11 and trunk after this commit, and {{org.apache.cassandra.db.compaction.CompactionsTest.testSingleSSTableCompactionWithSizeTieredCompaction}} is failing nearly 100% of the time after this commit on trunk. In both cases, these tests are failing on the linked CI above and appear to have no historical failures. I don't see any discussion of these CI failures for the linked branches on the ticket - are they being resolved elsewhere? > 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Fix For: 3.0.12, 3.11.0 > > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, which contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC) > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'}; > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. 
> Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883114#comment-15883114 ] Joel Knighton commented on CASSANDRA-12653: --- Do those answers address your questions well enough, [~jasobrown]? The latest patch addressed my concerns, but I don't want to step on your toes. I had to restart dtests for 2.2, but the latest patch/CI looks good to me otherwise. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires gathering and checking > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response marks the shadow round as done and > calls {{Gossiper.resetEndpointStateMap}} to clean up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at > start of the task. 
You'll see error log messages such as the following when this > happens: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "won't > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878461#comment-15878461 ] Joel Knighton commented on CASSANDRA-12653: --- I think I can answer these - feel free to correct me, [~spo...@gmail.com] In order, * Presently, the tests depend on the mock MessagingService, which was added in [CASSANDRA-12016] to 3.10+. We'd new tests for 2.2/3.0+, which is desirable, but I have no great ideas how to do it other than fiddly byteman tests. * I agree with this. Stefan and I discussed it on the first pass of review, and I wouldn't mind eliminating that check altogether and making it a boolean. OTOH, it's cheap to check deserialization time and excludes the messages that were deserialized prior to the check. OTOH, there's no meaningful distinction in correctness-preserving behaviors between that and arbitrarily delayed gossip messages, and we need to handle the latter correctly anyway. I'm most concerned about this check giving future readers false hope :). * It also seems to be me that it doesn't presently need to be synchronized. That said, I assumed it was a defensive choice because the internals are definitely not safe to call on multiple threads, and someone may make that mistake in the future. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. 
Receiving a response marks the shadow round as done and > calls {{Gossiper.resetEndpointStateMap}} to clean up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the > start of the task. You'll see error log messages such as the following when this > happens: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "won't > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
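The in-flight response problem described in the ticket boils down to accepting only the first shadow round reply and ignoring stragglers. A minimal sketch of that guard, with hypothetical names (this is not Cassandra's actual Gossiper code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: only the first shadow round reply finishes the round;
// late in-flight replies from other seeds are dropped instead of spawning
// follow-up work (e.g. MigrationTasks) against already-reset state.
public class ShadowRoundSketch {
    private final AtomicBoolean finished = new AtomicBoolean(false);
    private final Map<String, String> endpointStates = new ConcurrentHashMap<>();

    /** Returns true only for the reply that wins the race; later replies are ignored. */
    public boolean onReply(Map<String, String> statesFromSeed) {
        if (!finished.compareAndSet(false, true))
            return false; // a straggler: the round is already done
        endpointStates.putAll(statesFromSeed);
        return true;
    }

    public Map<String, String> states() {
        return endpointStates;
    }

    public static void main(String[] args) {
        ShadowRoundSketch round = new ShadowRoundSketch();
        System.out.println(round.onReply(Map.of("10.0.0.1", "seed-1 state"))); // true
        System.out.println(round.onReply(Map.of("10.0.0.2", "seed-2 state"))); // false: ignored
        System.out.println(round.states().size()); // 1
    }
}
```

The compare-and-set makes the "first reply wins" decision atomic, so concurrent late responses cannot both finish the round.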
[jira] [Comment Edited] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running
[ https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876697#comment-15876697 ] Joel Knighton edited comment on CASSANDRA-13135 at 2/21/17 8:57 PM: I think this is definitely worth doing from an organizational/efficiency standpoint. I'm not sure if the current behavior will greatly increase repair time; termination of the repair session will shut down the task executor used by the repair jobs, so the queued repair jobs should fail quickly. The patches look correct, but I have a few thoughts/questions about the approach. It seems more error-prone than necessary for {{cleanupJobs}} to take a reference to an executor. To me, it looks like we'll only ever want to clean up jobs from the executor provided in {{start}}, and we could store a reference to the executor in {{start}}. It also might make more sense to handle this in {{forceShutdown}} rather than in a listener added to the RepairSession if a reference to the executor is stored. There's a small typo in {{ActiveRepairService}} - "cacelled repair jobs" should be "cancelled repair jobs". If you'd rather stay with the approach in the attached patches rather than something closer to my questions/comments above, we should remove the comment above {{forceShutdown}} saying that it will "clear all RepairJobs". This is currently incorrect and will remain incorrect if we continue to clean up the jobs in a listener attached to the RepairSession. > Forced termination of repair session leaves repair jobs running > --- > > Key: CASSANDRA-13135 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13135 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Yuki Morishita > Fix For: 2.2.x, 3.0.x, 3.11.x > > > Forced termination of a repair session (by failure detector or JMX) leaves the > repair jobs that the session created running after the session is terminated. > This can increase repair time through the unnecessary work left in the > repair job queue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
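The restructuring suggested in the comment above can be sketched. Assuming hypothetical names (this is not the actual RepairSession/ActiveRepairService code), start() remembers its executor so forceShutdown() can stop queued jobs directly, without a separately registered listener:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the review suggestion: store the executor passed to
// start() in a field, and let forceShutdown() shut it down itself so queued
// repair jobs are discarded as soon as the session is terminated.
public class RepairSessionSketch {
    private volatile ExecutorService taskExecutor;

    public void start(ExecutorService executor) {
        this.taskExecutor = executor;
        // ... submit repair jobs to taskExecutor here ...
    }

    /** Terminates the session; returns the jobs that were still queued and will never run. */
    public List<Runnable> forceShutdown() {
        ExecutorService executor = taskExecutor;
        return executor == null ? List.of() : executor.shutdownNow();
    }

    public static void main(String[] args) {
        RepairSessionSketch session = new RepairSessionSketch();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        session.start(executor);
        List<Runnable> discarded = session.forceShutdown();
        System.out.println(discarded.size());      // 0: nothing was queued in this demo
        System.out.println(executor.isShutdown()); // true
    }
}
```

Using {{ExecutorService.shutdownNow}} here both stops the executor and hands back the still-queued tasks, which matches the goal of not leaving repair jobs running after termination.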
[jira] [Updated] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running
[ https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-13135: -- Status: Awaiting Feedback (was: Open) > Forced termination of repair session leaves repair jobs running > --- > > Key: CASSANDRA-13135 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13135 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Yuki Morishita > Fix For: 2.2.x, 3.0.x, 3.11.x > > > Forced termination of repair session (by failure detector or jmx) keeps > repair jobs running that the session created after session is terminated. > This can cause increase in repair time by those unnecessary works left in > repair job queue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running
[ https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-13135: -- Status: Open (was: Patch Available) > Forced termination of repair session leaves repair jobs running > --- > > Key: CASSANDRA-13135 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13135 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Yuki Morishita > Fix For: 2.2.x, 3.0.x, 3.11.x > > > Forced termination of repair session (by failure detector or jmx) keeps > repair jobs running that the session created after session is terminated. > This can cause increase in repair time by those unnecessary works left in > repair job queue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13135) Forced termination of repair session leaves repair jobs running
[ https://issues.apache.org/jira/browse/CASSANDRA-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876697#comment-15876697 ] Joel Knighton commented on CASSANDRA-13135: --- I think this is definitely worth doing from an organizational/efficiency standpoint. I'm not sure if this will greatly increase repair time; termination of the repair session will shut down the task executor used by the repair jobs, so the queued repair jobs should fail quickly. The patches look correct, but I have a few thoughts/questions about the approach. It seems more error prone than necessary for {{cleanupJobs}} to take a reference to an executor. To me, it looks like we'll only ever want to clean up jobs from the executor provided in {{start}} and we could store a reference to the executor in {{start}}. It also might make more sense to handle this in {{forceShutdown}} rather than in a listener added to the RepairSession if a reference to the executor is stored. There's a small typo in {{ActiveRepairService}} - "cacelled repair jobs" should be "cancelled repair jobs". If you'd rather stay with the approach in the attached patches rather than something closer to my questions/comments above, we should remove the comment above {{forceShutdown}} saying that it will "clear all RepairJobs". This is currently incorrect and will remain incorrect if we continue to cleanup the jobs in a listener attached to the RepairSession. > Forced termination of repair session leaves repair jobs running > --- > > Key: CASSANDRA-13135 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13135 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Yuki Morishita > Fix For: 2.2.x, 3.0.x, 3.11.x > > > Forced termination of repair session (by failure detector or jmx) keeps > repair jobs running that the session created after session is terminated. > This can cause increase in repair time by those unnecessary works left in > repair job queue. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Awaiting Feedback (was: Open) A few questions/comments on the latest patches: * On all versions, a space is missing in the conditional {{if(firstSynSendAt == 0)}} * On 2.2, it looks like the patch adds the {{valuesEqual}} method from later versions. Was this intentional? It looks unused. * On all versions, using {{firstSynSendAt == 0}} to check if it has been initialized isn't entirely safe. It's entirely legal (although admittedly rare) for {{System.nanoTime}} to return 0. If this happened, all acks would be rejected. * Comparisons for two {{System.nanoTime}} values (such as in {{GossipDigestAckVerbHandler}}) should not use t1 < t2. Instead, one should check the difference (t1 - t2 < 0) because numerical overflow could occur in the {{System.nanoTime}} long. * In {{maybeFinishShadowRound}}/{{finishShadowRound}}, we should add the states to the {{endpointShadowStateMap}} before setting {{inShadowRound}} to false. It looks like the current behavior admits a race where {{doShadowRound}} could read {{inShadowRound == false}} and exit its loop and copy the endpointShadowStateMap before it is filled by shadow round finish. * I believe {{firstSynSendAt}} is accessed from multiple threads and needs to be volatile. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. 
This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
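The review point above about comparing {{System.nanoTime}} values can be illustrated concretely. This standalone example (not Cassandra code) shows why the signed-difference check survives wrap-around while a plain {{t1 < t2}} does not:

```java
// Standalone illustration of overflow-safe comparison of monotonic timestamps.
// System.nanoTime values may wrap around Long.MAX_VALUE, so only the signed
// difference of two readings is meaningful, never their absolute ordering.
public class NanoTimeCompare {
    /** True if t1 happened before t2, tolerant of nanoTime wrap-around. */
    static boolean isBefore(long t1, long t2) {
        return t1 - t2 < 0; // the signed difference survives overflow; t1 < t2 does not
    }

    public static void main(String[] args) {
        long before = Long.MAX_VALUE - 10; // reading taken just before the clock wraps
        long after  = before + 20;         // 20 ticks later; the value is now negative
        System.out.println(before < after);          // false: naive check is fooled
        System.out.println(isBefore(before, after)); // true: difference-based check is correct
    }
}
```

The same reasoning explains the review comment on {{GossipDigestAckVerbHandler}}: any comparison of two {{System.nanoTime}} readings should be phrased as {{t1 - t2 < 0}}.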
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Open (was: Patch Available) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. 
You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863233#comment-15863233 ] Joel Knighton commented on CASSANDRA-12653: --- [~jjirsa] - yes. That said, I don't anticipate getting to it within the next couple days, so feel free to give it the final review if it is higher priority for you than that. I've given several passes of review to the core concepts and they seem good. I think a final code style/details pass is all that remains. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. 
Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-11479) BatchlogManager unit tests failing on truncate race condition
[ https://issues.apache.org/jira/browse/CASSANDRA-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863235#comment-15863235 ] Joel Knighton commented on CASSANDRA-11479: --- More on my plate than Yuki's, I believe. The patch made sense to me, but I was waiting for a chance to dig deeper into the related compaction code before giving it a final OK. In the process, it slipped through the cracks quite badly. I'd be happy to do that, but it likely wouldn't happen in the next few days if you're interested in taking it on instead. > BatchlogManager unit tests failing on truncate race condition > - > > Key: CASSANDRA-11479 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11479 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Joel Knighton >Assignee: Yuki Morishita > Fix For: 2.2.x, 3.0.x, 3.11.x > > Attachments: > TEST-org.apache.cassandra.batchlog.BatchlogManagerTest.log > > > Example on CI > [here|http://cassci.datastax.com/job/trunk_testall/818/testReport/junit/org.apache.cassandra.batchlog/BatchlogManagerTest/testLegacyReplay_compression/]. > This seems to have only started happening relatively recently (within the > last month or two). > As far as I can tell, this is only showing up on BatchlogManagerTests purely > because it is an aggressive user of truncate. The assertion is hit in the > setUp method, so it can happen before any of the test methods. The assertion > occurs because a compaction is happening when truncate wants to discard > SSTables; trace level logs suggest that this compaction is submitted after > the pause on the CompactionStrategyManager. > This should be reproducible by running BatchlogManagerTest in a loop - it > takes up to half an hour in my experience. A trace-level log from such a run > is attached - grep for my added log message "SSTABLES COMPACTING WHEN > DISCARDING" to find when the assert is hit. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13161) testall failure in org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions
[ https://issues.apache.org/jira/browse/CASSANDRA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-13161: -- Status: Ready to Commit (was: Patch Available) > testall failure in > org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions > - > > Key: CASSANDRA-13161 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13161 > Project: Cassandra > Issue Type: Bug > Components: Testing >Reporter: Sean McCarthy >Assignee: Benjamin Lerer > Labels: test-failure, testall > Attachments: > TEST-org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.log > > > example failure: > http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions > {code} > Error Message > expected:<11> but was:<10> > {code}{code} > Stacktrace > junit.framework.AssertionFailedError: expected:<11> but was:<10> > at > org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions(CommitLogDescriptorTest.java:84) > {code} > Related Failures: > http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions_compression/ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13161) testall failure in org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions
[ https://issues.apache.org/jira/browse/CASSANDRA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848749#comment-15848749 ] Joel Knighton commented on CASSANDRA-13161: --- +1 - thanks > testall failure in > org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions > - > > Key: CASSANDRA-13161 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13161 > Project: Cassandra > Issue Type: Bug > Components: Testing >Reporter: Sean McCarthy >Assignee: Benjamin Lerer > Labels: test-failure, testall > Attachments: > TEST-org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.log > > > example failure: > http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions > {code} > Error Message > expected:<11> but was:<10> > {code}{code} > Stacktrace > junit.framework.AssertionFailedError: expected:<11> but was:<10> > at > org.apache.cassandra.db.commitlog.CommitLogDescriptorTest.testVersions(CommitLogDescriptorTest.java:84) > {code} > Related Failures: > http://cassci.datastax.com/job/trunk_testall/1374/testReport/org.apache.cassandra.db.commitlog/CommitLogDescriptorTest/testVersions_compression/ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13112) test failure in snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum
[ https://issues.apache.org/jira/browse/CASSANDRA-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827384#comment-15827384 ] Joel Knighton commented on CASSANDRA-13112: --- This should be a dtest only fix. I've PRed a fix at [https://github.com/riptano/cassandra-dtest/pull/1425]. > test failure in > snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum > --- > > Key: CASSANDRA-13112 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13112 > Project: Cassandra > Issue Type: Bug >Reporter: Sean McCarthy >Assignee: Joel Knighton > Labels: dtest, test-failure > Attachments: node1_debug.log, node1_gc.log, node1.log, > node2_debug.log, node2_gc.log, node2.log, node3_debug.log, node3_gc.log, > node3.log, node4_debug.log, node4_gc.log, node4.log, node5_debug.log, > node5_gc.log, node5.log, node6_debug.log, node6_gc.log, node6.log > > > example failure: > http://cassci.datastax.com/job/trunk_large_dtest/48/testReport/snitch_test/TestDynamicEndpointSnitch/test_multidatacenter_local_quorum > {code} > Error Message > 75 != 76 > {code}{code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/tools/decorators.py", line 48, in > wrapped > f(obj) > File "/home/automaton/cassandra-dtest/snitch_test.py", line 168, in > test_multidatacenter_local_quorum > bad_jmx.read_attribute(read_stage, 'Value')) > File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual > assertion_func(first, second, msg=msg) > File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual > raise self.failureException(msg) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (CASSANDRA-13112) test failure in snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum
[ https://issues.apache.org/jira/browse/CASSANDRA-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton reassigned CASSANDRA-13112: - Assignee: Joel Knighton > test failure in > snitch_test.TestDynamicEndpointSnitch.test_multidatacenter_local_quorum > --- > > Key: CASSANDRA-13112 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13112 > Project: Cassandra > Issue Type: Bug >Reporter: Sean McCarthy >Assignee: Joel Knighton > Labels: dtest, test-failure > Attachments: node1_debug.log, node1_gc.log, node1.log, > node2_debug.log, node2_gc.log, node2.log, node3_debug.log, node3_gc.log, > node3.log, node4_debug.log, node4_gc.log, node4.log, node5_debug.log, > node5_gc.log, node5.log, node6_debug.log, node6_gc.log, node6.log > > > example failure: > http://cassci.datastax.com/job/trunk_large_dtest/48/testReport/snitch_test/TestDynamicEndpointSnitch/test_multidatacenter_local_quorum > {code} > Error Message > 75 != 76 > {code}{code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/tools/decorators.py", line 48, in > wrapped > f(obj) > File "/home/automaton/cassandra-dtest/snitch_test.py", line 168, in > test_multidatacenter_local_quorum > bad_jmx.read_attribute(read_stage, 'Value')) > File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual > assertion_func(first, second, msg=msg) > File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual > raise self.failureException(msg) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Fix Version/s: 4.x 3.x 3.0.x 2.2.x > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. 
You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Patch Available (was: Awaiting Feedback) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. 
You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818707#comment-15818707 ] Joel Knighton commented on CASSANDRA-12653: --- Thanks for the quick response; I agree with all the points in your message. My gut instinct is to make the patch as small as possible since we agree that establishing a causal relationship or explicitly separating the shadow gossip round is the proper long-term solution, but the patch isn't particularly large either way, so I'll move forward with the patch as proposed. I'll give the patches another review for any small fixes. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. 
Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
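The in-flight reply hazard described in this ticket can be sketched as follows. This is a minimal illustration under assumed names (ShadowRoundSketch, onAck, inShadowRound are invented for the example), not the actual Gossiper code: a shadow round gathers cluster state from seeds, and any reply arriving after the round completes must be discarded rather than processed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch (invented names, not the real Gossiper code) of the hazard:
// the first seed response completes the shadow round, and later responses
// from other seeds must be dropped instead of triggering further processing.
public class ShadowRoundSketch {
    private volatile boolean inShadowRound = true;
    private final Map<String, String> seedStates = new ConcurrentHashMap<>();

    // Handles one gossip ACK from a seed; returns whether it was accepted.
    public boolean onAck(String seed, String clusterState) {
        if (!inShadowRound) {
            // A late reply from another seed: dropping it here avoids spawning
            // per-reply migration tasks against an already-reset state map.
            return false;
        }
        seedStates.put(seed, clusterState);
        inShadowRound = false; // the first response completes the round
        return true;
    }

    public static void main(String[] args) {
        ShadowRoundSketch round = new ShadowRoundSketch();
        System.out.println(round.onAck("seed1", "state")); // true: completes the round
        System.out.println(round.onAck("seed2", "state")); // false: late reply dropped
    }
}
```

The point of the sketch is only the ordering hazard: without the guard, every late ACK would be handled as if the node were still in (or past) its shadow round.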
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Awaiting Feedback (was: Open) Thanks for the ping - I gave the patch another skim and have some questions. The approach of returning endpoint states obtained through gossip seems sound. I definitely like this idea because it (mostly) prevents us from needing to reason about how shadow rounds affect the proper gossip process. That said, I'm not sure it fully accomplishes this goal. At the moment, it should be safe if a later response comes back after the gossiper has started, but we need to be careful to preserve this property in the future. I'm not sure how the timestamp check is supposed to work. It initializes the field using System.nanoTime() the first time we send a gossip message, but in the gossip digest ack verb handler, we check timestamps using the endpoint state update timestamp, which is not serialized inter-node and is also initialized using System.nanoTime() by the local JVM. It seems to me that this reduces to a boolean check that the gossiper has been properly started at least once, since this check will only fail when firstSynSendAt == 0. Am I missing something here? It also seems to me that we should initialize the field on starting the gossiper rather than checking and possibly initializing it every time we send a gossip message. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires gathering and checking > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster.
This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response marks the shadow round as done and > calls {{Gossiper.resetEndpointStateMap}} to clean up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at > start of the task. You'll see error log messages such as the following when this > happens: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "won't > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Open (was: Patch Available) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires gathering and checking > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response marks the shadow round as done and > calls {{Gossiper.resetEndpointStateMap}} to clean up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at > start of the task.
You'll see error log messages such as the following when this > happens: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "won't > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12856) dtest failure in replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack
[ https://issues.apache.org/jira/browse/CASSANDRA-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813610#comment-15813610 ] Joel Knighton commented on CASSANDRA-12856: --- Nice catch! I wasn't able to reproduce either, but I could easily induce the race. I also confirmed that all historical instances of this failure were on tests that start and then immediately stop a node. I started CI for 2.2, 3.0, and 3.X. All CI looks good. My main concern was that this new method doesn't allow a serve() again after stop(), but I confirmed this doesn't affect our usage since we recreate the object on enabling/disabling Thrift. It also doesn't violate any Thrift implementation requirements. I wouldn't argue for this to go into 2.1; while the looping thread is unfortunate, it shouldn't cause data loss or cascading failures, and it should only affect instances where a server is started and immediately stopped, which already suggests an unusual situation. That said, the change is small enough that I wouldn't be concerned about it going into 2.1 either.
+1 > dtest failure in > replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack > --- > > Key: CASSANDRA-12856 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12856 > Project: Cassandra > Issue Type: Bug >Reporter: Sean McCarthy >Assignee: Stefania > Labels: dtest, test-failure > Attachments: node1.log > > > example failure: > http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/280/testReport/replication_test/SnitchConfigurationUpdateTest/test_cannot_restart_with_different_rack > {code} > Error Message > Problem stopping node node1 > {code}{code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/replication_test.py", line 630, in > test_cannot_restart_with_different_rack > node1.stop(wait_other_notice=True) > File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 727, in > stop > raise NodeError("Problem stopping node %s" % self.name) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12856) dtest failure in replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack
[ https://issues.apache.org/jira/browse/CASSANDRA-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12856: -- Status: Ready to Commit (was: Patch Available) > dtest failure in > replication_test.SnitchConfigurationUpdateTest.test_cannot_restart_with_different_rack > --- > > Key: CASSANDRA-12856 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12856 > Project: Cassandra > Issue Type: Bug >Reporter: Sean McCarthy >Assignee: Stefania > Labels: dtest, test-failure > Attachments: node1.log > > > example failure: > http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/280/testReport/replication_test/SnitchConfigurationUpdateTest/test_cannot_restart_with_different_rack > {code} > Error Message > Problem stopping node node1 > {code}{code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/replication_test.py", line 630, in > test_cannot_restart_with_different_rack > node1.stop(wait_other_notice=True) > File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 727, in > stop > raise NodeError("Problem stopping node %s" % self.name) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
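The serve()-after-stop() constraint discussed in the review above can be illustrated with a small single-use server sketch. The class and method names here are invented for the example (the real change is in Cassandra's Thrift server wrapper): once stopped, the object refuses to serve again, which is why re-enabling Thrift must recreate the server object rather than reuse it.

```java
// Hypothetical sketch of a single-use server: stop() is terminal, and a
// subsequent serve() fails fast instead of looping or misbehaving.
public class SingleUseServerSketch {
    private volatile boolean stopped = false;

    // Begins serving; a stopped instance can never be restarted.
    public void serve() {
        if (stopped) {
            throw new IllegalStateException("serve() after stop(); create a new instance");
        }
    }

    public void stop() {
        stopped = true; // signals the serving loop to exit and stay exited
    }

    public static void main(String[] args) {
        SingleUseServerSketch server = new SingleUseServerSketch();
        server.serve();
        server.stop();
        try {
            server.serve();
        } catch (IllegalStateException expected) {
            System.out.println("refused to restart: " + expected.getMessage());
        }
    }
}
```

As the comment notes, this single-use shape is fine as long as every enable/disable cycle constructs a fresh instance.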
[jira] [Commented] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.
[ https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812898#comment-15812898 ] Joel Knighton commented on CASSANDRA-12792: --- [~jjordan] - 2.2.9, 3.0.10, and 3.10. I also updated the fixver field. Thanks for the reminder. > delete with timestamp long.MAX_VALUE for the whole key creates tombstone that > cannot be removed. > - > > Key: CASSANDRA-12792 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12792 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Ian Ilsley >Assignee: Joel Knighton > Fix For: 2.2.9, 3.0.10, 3.10 > > > In db/compaction/LazilyCompactedRow.java > we only check for < MaxPurgeableTimeStamp > eg: > (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp()) > this should probably be <= -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.
[ https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12792: -- Fix Version/s: 3.0.10 3.10 2.2.9 > delete with timestamp long.MAX_VALUE for the whole key creates tombstone that > cannot be removed. > - > > Key: CASSANDRA-12792 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12792 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Ian Ilsley >Assignee: Joel Knighton > Fix For: 2.2.9, 3.0.10, 3.10 > > > In db/compaction/LazilyCompactedRow.java > we only check for < MaxPurgeableTimeStamp > eg: > (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp()) > this should probably be <= -- This message was sent by Atlassian JIRA (v6.3.4#6332)
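The off-by-one in the quoted check can be sketched at the boundary. This is a minimal illustration with invented method names (not the LazilyCompactedRow code): since the maximum purgeable timestamp can itself be at most Long.MAX_VALUE, a tombstone marked for deletion at Long.MAX_VALUE can never satisfy the strict comparison and is never purged, while the inclusive comparison removes it.

```java
public class TombstonePurgeSketch {
    // Buggy check: strict less-than, as in the snippet quoted in the ticket.
    static boolean purgeableStrict(long markedForDeleteAt, long maxPurgeableTimestamp) {
        return markedForDeleteAt < maxPurgeableTimestamp;
    }

    // Proposed fix: <= allows a tombstone written at exactly the max
    // purgeable timestamp (including Long.MAX_VALUE) to be dropped.
    static boolean purgeableInclusive(long markedForDeleteAt, long maxPurgeableTimestamp) {
        return markedForDeleteAt <= maxPurgeableTimestamp;
    }

    public static void main(String[] args) {
        long ts = Long.MAX_VALUE; // e.g. a delete issued with the maximum timestamp
        System.out.println(purgeableStrict(ts, Long.MAX_VALUE));    // false: never purged
        System.out.println(purgeableInclusive(ts, Long.MAX_VALUE)); // true: can be removed
    }
}
```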
[jira] [Commented] (CASSANDRA-13074) DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations
[ https://issues.apache.org/jira/browse/CASSANDRA-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795527#comment-15795527 ] Joel Knighton commented on CASSANDRA-13074: --- [~tjake] - A dtest is a great idea. I've PRed a test to [riptano/cassandra-dtest|https://github.com/riptano/cassandra-dtest/pull/1416]. > DynamicEndpointSnitch frequently no-ops through early exit in > multi-datacenter situations > - > > Key: CASSANDRA-13074 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13074 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Joel Knighton >Assignee: Joel Knighton > Fix For: 2.2.x, 3.0.x, 3.x, 4.x > > > The DynamicEndpointSnitch attempts to use timings from nodes to route reads > to better performing nodes. > In a multi-datacenter situation, timings will likely be empty for nodes > outside of the local datacenter, as you'll frequently only be doing > local_quorum reads (or a lower consistency level). In this case, the DES > exits early and returns the subsnitch ordering. This means poorly performing > replicas will never be avoided, no matter how degraded they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-13074) DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations
[ https://issues.apache.org/jira/browse/CASSANDRA-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-13074: -- Status: Patch Available (was: Open) ||branch||testall||dtest|| |[des-snitch-changes-2.2|https://github.com/jkni/cassandra/tree/des-snitch-changes-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-snitch-changes-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-snitch-changes-2.2-dtest]| |[des-changes-3.0|https://github.com/jkni/cassandra/tree/des-changes-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.0-dtest]| |[des-changes-3.11|https://github.com/jkni/cassandra/tree/des-changes-3.11]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.11-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.11-dtest]| |[des-changes-3.X|https://github.com/jkni/cassandra/tree/des-changes-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-3.X-dtest]| |[des-changes-trunk|https://github.com/jkni/cassandra/tree/des-changes-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-des-changes-trunk-dtest]| I've attached all branches, but the merge forward from 2.2 is clean except for a trivial to resolve merge conflict in 3.0 -> 3.11. CI looks clean. Although the patch is small, there's a fair amount of nuance here. We no longer want to seed with a latency of zero - in this case, if you're doing lots of local_quorum reads or something similar, populating with zero would mean that we no longer get any benefits from stickiness. With this patch, we only populate with real latencies. 
> DynamicEndpointSnitch frequently no-ops through early exit in > multi-datacenter situations > - > > Key: CASSANDRA-13074 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13074 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Joel Knighton >Assignee: Joel Knighton > Fix For: 2.2.x, 3.0.x, 3.x, 4.x > > > The DynamicEndpointSnitch attempts to use timings from nodes to route reads > to better performing nodes. > In a multi-datacenter situation, timings will likely be empty for nodes > outside of the local datacenter, as you'll frequently only be doing > local_quorum reads (or a lower consistency level). In this case, the DES > exits early and returns the subsnitch ordering. This means poorly performing > replicas will never be avoided, no matter how degraded they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-13074) DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations
Joel Knighton created CASSANDRA-13074: - Summary: DynamicEndpointSnitch frequently no-ops through early exit in multi-datacenter situations Key: CASSANDRA-13074 URL: https://issues.apache.org/jira/browse/CASSANDRA-13074 Project: Cassandra Issue Type: Bug Components: Coordination Reporter: Joel Knighton Assignee: Joel Knighton Fix For: 2.2.x, 3.0.x, 3.x, 4.x The DynamicEndpointSnitch attempts to use timings from nodes to route reads to better performing nodes. In a multi-datacenter situation, timings will likely be empty for nodes outside of the local datacenter, as you'll frequently only be doing local_quorum reads (or a lower consistency level). In this case, the DES exits early and returns the subsnitch ordering. This means poorly performing replicas will never be avoided, no matter how degraded they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
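The early-exit behavior described in this ticket can be sketched roughly as follows. This is a simplified model with invented names, not the actual DynamicEndpointSnitch code: when any replica lacks a recorded latency score, the whole dynamic reordering is skipped and the subsnitch order is returned unchanged, so a degraded remote replica is never demoted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class DynamicOrderSketch {
    // Returns the replicas reordered by observed latency, or the subsnitch
    // order unchanged if any replica has no timing yet (the early exit).
    static List<String> sortByProximity(List<String> subsnitchOrder,
                                        Map<String, Double> latencyScores) {
        for (String replica : subsnitchOrder) {
            if (!latencyScores.containsKey(replica)) {
                return subsnitchOrder; // early exit: degraded replicas never avoided
            }
        }
        List<String> sorted = new ArrayList<>(subsnitchOrder);
        sorted.sort(Comparator.comparingDouble(latencyScores::get));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> order = List.of("replica-a", "replica-b");
        // No score for replica-a (typical for remote DCs under local_quorum
        // reads): the subsnitch order comes back untouched.
        System.out.println(sortByProximity(order, Map.of("replica-b", 5.0)));
        // Scores for every replica: the slower replica-a is demoted.
        System.out.println(sortByProximity(order, Map.of("replica-a", 9.0, "replica-b", 1.0)));
    }
}
```

This matches the described fix direction: rather than seeding missing entries with zero latency, only reorder once real latencies exist, so stickiness is preserved for routine local reads.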
[jira] [Commented] (CASSANDRA-8795) Cassandra (possibly under load) occasionally throws an exception during CQL create table
[ https://issues.apache.org/jira/browse/CASSANDRA-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749321#comment-15749321 ] Joel Knighton commented on CASSANDRA-8795: -- I'm not sure it makes sense to address this in 2.1 anymore, as I wouldn't call this behavior critical given the behavior of schema in 2.1. This specific issue should not affect 2.2+. > Cassandra (possibly under load) occasionally throws an exception during CQL > create table > > > Key: CASSANDRA-8795 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8795 > Project: Cassandra > Issue Type: Bug >Reporter: Darren Warner >Assignee: Joel Knighton > > CQLSH will return the following: > {code} > { name: 'ResponseError', > message: 'java.lang.RuntimeException: > java.util.concurrent.ExecutionException: java.lang.NullPointerException', > info: 'Represents an error message from the server', > code: 0, > query: 'CREATE TABLE IF NOT EXISTS roles_by_users( userid TIMEUUID, role > INT, entityid TIMEUUID, entity_type TEXT, enabled BOOLEAN, PRIMARY KEY > (userid, role, entityid, entity_type) );' } > {code} > Cassandra system.log shows: > {code} > ERROR [MigrationStage:1] 2015-02-11 14:38:48,610 CassandraDaemon.java:153 - > Exception in thread Thread[MigrationStage:1,5,main] > java.lang.NullPointerException: null > at > org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:371) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.service.MigrationManager$2.runMayThrow(MigrationManager.java:393) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at >
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_31] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[na:1.8.0_31] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_31] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_31] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31] > ERROR [SharedPool-Worker-2] 2015-02-11 14:38:48,620 QueryMessage.java:132 - > Unexpected error during query > java.lang.RuntimeException: java.util.concurrent.ExecutionException: > java.lang.NullPointerException > at > org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:398) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:374) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:249) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.statements.CreateTableStatement.announceMigration(CreateTableStatement.java:113) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:80) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:226) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:248) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:119) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439) > [apache-cassandra-2.1.2.jar:2.1.2] > at > 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335) > [apache-cassandra-2.1.2.jar:2.1.2] > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > [netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > [netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) > [netty-all-4.0.23.Final.jar:4.0.23.Final] > at >
[jira] [Resolved] (CASSANDRA-8795) Cassandra (possibly under load) occasionally throws an exception during CQL create table
[ https://issues.apache.org/jira/browse/CASSANDRA-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton resolved CASSANDRA-8795. -- Resolution: Won't Fix Fix Version/s: (was: 2.1.x) > Cassandra (possibly under load) occasionally throws an exception during CQL > create table > > > Key: CASSANDRA-8795 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8795 > Project: Cassandra > Issue Type: Bug >Reporter: Darren Warner >Assignee: Joel Knighton > > CQLSH will return the following: > {code} > { name: 'ResponseError', > message: 'java.lang.RuntimeException: > java.util.concurrent.ExecutionException: java.lang.NullPointerException', > info: 'Represents an error message from the server', > code: 0, > query: 'CREATE TABLE IF NOT EXISTS roles_by_users( userid TIMEUUID, role > INT, entityid TIMEUUID, entity_type TEXT, enabled BOOLEAN, PRIMARY KEY > (userid, role, entityid, entity_type) );' } > {code} > Cassandra system.log shows: > {code} > ERROR [MigrationStage:1] 2015-02-11 14:38:48,610 CassandraDaemon.java:153 - > Exception in thread Thread[MigrationStage:1,5,main] > java.lang.NullPointerException: null > at > org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:371) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.service.MigrationManager$2.runMayThrow(MigrationManager.java:393) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_31] > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[na:1.8.0_31] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_31] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_31] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31] > ERROR [SharedPool-Worker-2] 2015-02-11 14:38:48,620 QueryMessage.java:132 - > Unexpected error during query > java.lang.RuntimeException: java.util.concurrent.ExecutionException: > java.lang.NullPointerException > at > org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:398) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:374) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:249) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.statements.CreateTableStatement.announceMigration(CreateTableStatement.java:113) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:80) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:226) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:248) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:119) > ~[apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439) > [apache-cassandra-2.1.2.jar:2.1.2] > at > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335) > [apache-cassandra-2.1.2.jar:2.1.2] > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > 
[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > [netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) > [netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) > [netty-all-4.0.23.Final.jar:4.0.23.Final] > at >
[jira] [Commented] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15735671#comment-15735671 ] Joel Knighton commented on CASSANDRA-12652: --- Thanks for the update! > Failure in SASIIndexTest.testStaticIndex-compression > > > Key: CASSANDRA-12652 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12652 > Project: Cassandra > Issue Type: Bug > Components: Testing >Reporter: Joel Knighton >Assignee: Alex Petrov > > Stacktrace: > {code} > junit.framework.AssertionFailedError: expected:<1> but was:<0> > at > org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839) > at > org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786) > {code} > Example failure: > http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-13012) Paxos regression from CASSANDRA-12716
[ https://issues.apache.org/jira/browse/CASSANDRA-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729275#comment-15729275 ] Joel Knighton commented on CASSANDRA-13012: --- +1 > Paxos regression from CASSANDRA-12716 > - > > Key: CASSANDRA-13012 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13012 > Project: Cassandra > Issue Type: Bug >Reporter: Sylvain Lebresne >Assignee: Sylvain Lebresne >Priority: Minor > > I introduced a dumb bug when reading the Paxos state in > {{SystemKeyspace.loadPaxosState}} where the new condition on > {{proposal_version}} and {{most_recent_commit_version}} is obviously way too > strong, and actually entirely unnecessary. > This is consistently breaking the > {{paxos_tests.TestPaxos.contention_test_many_threads}} so I'm not sure why I > didn't catch that, sorry. Thanks to [~jkni] who noticed that first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-13012) Paxos regression from CASSANDRA-12716
[ https://issues.apache.org/jira/browse/CASSANDRA-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-13012: -- Status: Ready to Commit (was: Patch Available) > Paxos regression from CASSANDRA-12716 > - > > Key: CASSANDRA-13012 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13012 > Project: Cassandra > Issue Type: Bug >Reporter: Sylvain Lebresne >Assignee: Sylvain Lebresne >Priority: Minor > > I introduced a dumb bug when reading the Paxos state in > {{SystemKeyspace.loadPaxosState}} where the new condition on > {{proposal_version}} and {{most_recent_commit_version}} is obviously way too > strong, and actually entirely unnecessary. > This is consistently breaking the > {{paxos_tests.TestPaxos.contention_test_many_threads}} so I'm not sure why I > didn't catch that, sorry. Thanks to [~jkni] who noticed that first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-12987) dtest failure in paxos_tests.TestPaxos.contention_test_many_threads
[ https://issues.apache.org/jira/browse/CASSANDRA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton resolved CASSANDRA-12987. --- Resolution: Duplicate > dtest failure in paxos_tests.TestPaxos.contention_test_many_threads > --- > > Key: CASSANDRA-12987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12987 > Project: Cassandra > Issue Type: Bug >Reporter: Sean McCarthy >Assignee: Joel Knighton > Labels: dtest, test-failure > Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, > node2_debug.log, node2_gc.log, node3.log, node3_debug.log, node3_gc.log > > > example failure: > http://cassci.datastax.com/job/trunk_dtest/1437/testReport/paxos_tests/TestPaxos/contention_test_many_threads > {code} > Error Message > value=299, errors=0, retries=25559 > {code}{code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/paxos_tests.py", line 88, in > contention_test_many_threads > self._contention_test(300, 1) > File "/home/automaton/cassandra-dtest/paxos_tests.py", line 192, in > _contention_test > self.assertTrue((value == threads * iterations) and (errors == 0), > "value={}, errors={}, retries={}".format(value, errors, retries)) > File "/usr/lib/python2.7/unittest/case.py", line 422, in assertTrue > raise self.failureException(msg) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options
[ https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706902#comment-15706902 ] Joel Knighton edited comment on CASSANDRA-11107 at 11/29/16 11:28 PM: -- That's correct - it's true that I could at least piggyback on those since those migrations/changes would already be necessary. EDIT: That is, I would need to make additional changes, but I could time this for the same release to prevent the need for additional legacy tables. was (Author: jkni): That's correct - it's true that I could at least piggyback on those since those migrations/changes would already be necessary. > Add native_transport_address and native_transport_broadcast_address yaml > options > > > Key: CASSANDRA-11107 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11107 > Project: Cassandra > Issue Type: Improvement > Components: Configuration >Reporter: n0rad >Assignee: Joel Knighton >Priority: Minor > > I'm starting cassandra on a container with this /etc/hosts > {quote} > 127.0.0.1 rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost > ::1 rkt-235c219a-f0dc-4958-9e03-5afe2581bbe1 localhost > {quote} > I have the default configuration except : > {quote} > - seeds: "10.1.1.1" > listen_address : 10.1.1.1 > {quote} > cassandra will start listening on *127.0.0.1:9042* > if I set *rpc_address: 10.1.1.1*, even if *start_rpc: false*, cassandra will > listen on 10.1.1.1 > Since rpc is not started, I assumed that *rpc_address* and > *broadcast_rpc_address* would be ignored > It took me a while to figure that out. There may be something to do around this -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options
[ https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706902#comment-15706902 ] Joel Knighton commented on CASSANDRA-11107: --- That's correct - it's true that I could at least piggyback on those since those migrations/changes would already be necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options
[ https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706743#comment-15706743 ] Joel Knighton commented on CASSANDRA-11107: --- I've got a patch in progress that solves the easy parts of this. At this point, however, I am having second thoughts regarding the costs/benefits of this change. As it stands, to support separate rpc/native_transport configurations, changes would seem to include: * updating the native protocol so that NEW_NODE events include rpc_address and native_transport_address (and other TopologyChangeEvents, since identifiers used by drivers might include both address configurations) * updating the PEERS table to include rpc_address and native_transport_address * adding an ApplicationState in Gossip for native_transport_address. Drivers would also need to be updated to query native_transport_address appropriately. This seems like a fair amount of work when 4.0 will end up negating these changes upon removing Thrift. The other option that immediately presents itself is to allow these properties to be set in a 3.X yaml but require them to match the rpc configurations. I'm not sure this is worth it either. Let me know what you think, [~slebresne]. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-11381: -- Status: Open (was: Patch Available) > Node running with join_ring=false and authentication can not serve requests > --- > > Key: CASSANDRA-11381 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11381 > Project: Cassandra > Issue Type: Bug >Reporter: mck >Assignee: mck > Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x > > Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, > 11381-3.X.txt, 11381-trunk.txt, dtest-11381-trunk.txt > > > Starting up a node with {{-Dcassandra.join_ring=false}} in a cluster that has > authentication configured, eg PasswordAuthenticator, won't be able to serve > requests. This is because {{Auth.setup()}} never gets called during the > startup. > Without {{Auth.setup()}} having been called in {{StorageService}} clients > connecting to the node fail with the node throwing > {noformat} > java.lang.NullPointerException > at > org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119) > at > org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471) > at > org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505) > at > org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at com.thinkaurelius.thrift.Message.invoke(Message.java:314) > at > com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90) > at > com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695) > at > com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689) > at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The exception thrown from the > [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119] > {code} > ResultMessage.Rows rows = > authenticateStatement.execute(QueryState.forInternalCalls(), new > QueryOptions(consistencyForUser(username), > >Lists.newArrayList(ByteBufferUtil.bytes(username; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-11381: -- Status: Awaiting Feedback (was: Open) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704157#comment-15704157 ] Joel Knighton commented on CASSANDRA-11381: --- Thanks - the patches look good and I put them through CI. ||branch||testall||dtest|| |[CASSANDRA-11381-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-2.2-dtest]| |[CASSANDRA-11381-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.0-dtest]| |[CASSANDRA-11381-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-3.X-dtest]| |[CASSANDRA-11381-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-11381-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-trunk-dtest]| CI looks good for the most part, and I checked that your added dtest passes on all branches. CI revealed one small problem - when a fresh node is started with join_ring=False that has no tokens for other nodes discovered through gossip and no saved tokens, it hits an AssertionError in {{CassandraRoleManager}} setup that is not handled and gets logged as an error by a top level error handler, as seen [here|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-11381-2.2-dtest/1/testReport/junit/topology_test/TestTopology/do_not_join_ring_test/]. 
In this specific test case, this behavior is hit because a single node cluster is started with join_ring=False. Since this prevents setup from being retried within the {{CassandraRoleManager}}, it seems to me that it is probably worth checking for an absence of tokens in {{CassandraRoleManager.setupDefaultRole}} and throwing a catchable exception/printing a warning so that setup can be retried. What do you think? There may be another alternative I haven't considered.
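The guard proposed in the comment above (checking for an absence of tokens in {{CassandraRoleManager.setupDefaultRole}} and throwing a catchable exception so setup can be retried) could look roughly like the sketch below. This is a hedged illustration with approximate, hypothetical names, not the actual Cassandra source:

```java
import java.util.Collections;
import java.util.Set;

public class RoleSetupSketch
{
    // Hypothetical guard: instead of tripping an unhandled AssertionError
    // when the ring has no tokens yet, throw a catchable exception so the
    // caller can log a warning and schedule a retry of setup.
    static void setupDefaultRole(Set<String> knownTokens)
    {
        if (knownTokens.isEmpty())
            throw new IllegalStateException("no known token owners yet; deferring default role setup");
        // ... proceed to create the default superuser role ...
    }

    public static void main(String[] args)
    {
        try
        {
            // A fresh single-node cluster started with join_ring=false
            // knows no tokens, so setup is deferred rather than fatal.
            setupDefaultRole(Collections.<String>emptySet());
        }
        catch (IllegalStateException e)
        {
            System.out.println("retryable: " + e.getMessage());
        }
    }
}
```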
[jira] [Comment Edited] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690558#comment-15690558 ] Joel Knighton edited comment on CASSANDRA-12652 at 11/23/16 4:23 PM: - Fixed by revert 490c1c27c9b700f14212d9591a516ddb8d0865c7 before release. was (Author: jkni): Fixed by revert before release. > Failure in SASIIndexTest.testStaticIndex-compression > > > Key: CASSANDRA-12652 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12652 > Project: Cassandra > Issue Type: Bug > Components: Testing >Reporter: Joel Knighton >Assignee: Alex Petrov > > Stacktrace: > {code} > junit.framework.AssertionFailedError: expected:<1> but was:<0> > at > org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1839) > at > org.apache.cassandra.index.sasi.SASIIndexTest.testStaticIndex(SASIIndexTest.java:1786) > {code} > Example failure: > http://cassci.datastax.com/job/trunk_testall/1176/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testStaticIndex_compression/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12652: -- Fix Version/s: (was: 4.x) (was: 3.x) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12652: -- Resolution: Fixed Status: Resolved (was: Awaiting Feedback) Fixed by revert before release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12652) Failure in SASIIndexTest.testStaticIndex-compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690556#comment-15690556 ] Joel Knighton commented on CASSANDRA-12652: --- I agree; I'm unable to reproduce this failure on any branch on any machine after the revert. I posted this as a comment on [CASSANDRA-11990] to make sure this test is considered during the updated implementation. I'm closing this for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690552#comment-15690552 ] Joel Knighton commented on CASSANDRA-11990: --- The revert here seems to have fixed the test failure in [CASSANDRA-12652] - extra attention should be paid to this test when an updated implementation is available. > Address rows rather than partitions in SASI > --- > > Key: CASSANDRA-11990 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11990 > Project: Cassandra > Issue Type: Improvement > Components: CQL, sasi >Reporter: Alex Petrov >Assignee: Alex Petrov > Fix For: 3.x > > Attachments: perf.pdf, size_comparison.png > > > Currently, the lookup in SASI index would return the key position of the > partition. After the partition lookup, the rows are iterated and the > operators are applied in order to filter out ones that do not match. > bq. TokenTree which accepts variable size keys (such would enable different > partitioners, collections support, primary key indexing etc.), -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11107) Add native_transport_address and native_transport_broadcast_address yaml options
[ https://issues.apache.org/jira/browse/CASSANDRA-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690433#comment-15690433 ] Joel Knighton commented on CASSANDRA-11107: --- That shouldn't be a problem - I expect I'll have time to submit a patch here in the next week or so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12281) Gossip blocks on startup when there are pending range movements
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12281: -- Summary: Gossip blocks on startup when there are pending range movements (was: Gossip blocks on startup when another node is bootstrapping) > Gossip blocks on startup when there are pending range movements > --- > > Key: CASSANDRA-12281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12281 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Eric Evans >Assignee: Stefan Podkowinski > Fix For: 2.2.9, 3.0.11, 3.10 > > Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-3.X.patch, > 12281-trunk.patch, restbase1015-a_jstack.txt > > > In our cluster, normal node startup times (after a drain on shutdown) are > less than 1 minute. However, when another node in the cluster is > bootstrapping, the same node startup takes nearly 30 minutes to complete, the > apparent result of gossip blocking on pending range calculations. > {noformat} > $ nodetool-a tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 1840 0 > 0 > ReadStage 0 0 2350 0 > 0 > RequestResponseStage 0 0 53 0 > 0 > ReadRepairStage 0 0 1 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > HintedHandoff 0 0 44 0 > 0 > MiscStage 0 0 0 0 > 0 > CompactionExecutor3 3395 0 > 0 > MemtableReclaimMemory 0 0 30 0 > 0 > PendingRangeCalculator1 2 29 0 > 0 > GossipStage 1 5602164 0 > 0 > MigrationStage0 0 0 0 > 0 > MemtablePostFlush 0 0111 0 > 0 > ValidationExecutor0 0 0 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0 30 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {noformat} > A full thread dump is attached, but the relevant bit seems to be here: > {noformat} > [ ... 
] > "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 > nid=0xea9 waiting on condition [0x7fddcf883000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0004c1e922c0> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160) > at > org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023) > at > org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682) > at > org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182) > at
[jira] [Comment Edited] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669481#comment-15669481 ] Joel Knighton edited comment on CASSANDRA-12281 at 11/18/16 5:58 PM: - Thanks - your changes and CI look good. I also ran CI on your CASSANDRA-12281-trunk branch. Note to committer: there are very slight differences in the 2.2/3.0/3.x branches (not in substantial content, but in comments and other minor fixes). The 3.x branch should merge cleanly into trunk, I believe. +1 was (Author: jkni): Thanks - your changes and CI look good. I also ran CI on your CASSANDRA-12281-trunk branch. Note to committer: there are very slight differences in the 2.2/3.0/3.x branches (not in substantial comment, but in comments and other minor fixes). The 3.x branch should merge cleanly into trunk, I believe. +1
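The thread dump referenced in this ticket shows GossipStage parked on a fair ReentrantReadWriteLock in TokenMetadata.updateNormalTokens. The contention pattern can be reproduced in isolation with a minimal sketch (hypothetical code, not Cassandra source): a slow reader holds the read lock, much as a long pending-range calculation would, and a writer blocks behind it:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockContentionSketch
{
    // Fair lock, matching the ReentrantReadWriteLock$FairSync in the dump.
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Returns how long (in ms) a writer had to wait behind a slow reader.
    static long measureWriterWaitMillis() throws InterruptedException
    {
        final CountDownLatch readerHasLock = new CountDownLatch(1);
        Thread slowReader = new Thread(() -> {
            lock.readLock().lock();            // e.g. a long pending-range calculation
            readerHasLock.countDown();
            try { Thread.sleep(400); }         // hold the read lock for a while
            catch (InterruptedException ignored) {}
            finally { lock.readLock().unlock(); }
        });
        slowReader.start();
        readerHasLock.await();                 // ensure the reader holds the lock first

        long start = System.nanoTime();
        lock.writeLock().lock();               // e.g. updateNormalTokens on GossipStage
        lock.writeLock().unlock();
        slowReader.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println("writer blocked for ~" + measureWriterWaitMillis() + " ms");
    }
}
```

In the reported cluster the "reader" work is repeated for every gossip-driven state change while range movements are pending, which is why startup stretches from under a minute to nearly thirty.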
[jira] [Commented] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.
[ https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675890#comment-15675890 ] Joel Knighton commented on CASSANDRA-12792: --- Good catch - I updated the 2.2 branch above with that change. CI looks good. > delete with timestamp long.MAX_VALUE for the whole key creates tombstone that > cannot be removed. > - > > Key: CASSANDRA-12792 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12792 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Ian Ilsley >Assignee: Joel Knighton > > In db/compaction/LazilyCompactedRow.java > we only check for < MaxPurgeableTimeStamp > eg: > (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp()) > this should probably be <= -- This message was sent by Atlassian JIRA (v6.3.4#6332)
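The off-by-one described in the report is easy to see in isolation: with a strict comparison, a tombstone marked for deletion at Long.MAX_VALUE can never satisfy markedForDeleteAt < getMaxPurgeableTimestamp(), since no timestamp exceeds Long.MAX_VALUE. A minimal sketch of the comparison (hypothetical code, not the actual LazilyCompactedRow source):

```java
public class PurgeCheckSketch
{
    // With '<', a tombstone marked at Long.MAX_VALUE is never purgeable,
    // because no maxPurgeableTimestamp can be strictly greater than it.
    static boolean purgeableStrict(long markedForDeleteAt, long maxPurgeableTimestamp)
    {
        return markedForDeleteAt < maxPurgeableTimestamp;
    }

    // With '<=', as proposed in the ticket, the edge case is handled.
    static boolean purgeableInclusive(long markedForDeleteAt, long maxPurgeableTimestamp)
    {
        return markedForDeleteAt <= maxPurgeableTimestamp;
    }

    public static void main(String[] args)
    {
        long marked = Long.MAX_VALUE;
        long maxPurgeable = Long.MAX_VALUE;    // no newer live data anywhere
        System.out.println(purgeableStrict(marked, maxPurgeable));    // false
        System.out.println(purgeableInclusive(marked, maxPurgeable)); // true
    }
}
```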
[jira] [Comment Edited] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.
[ https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672811#comment-15672811 ] Joel Knighton edited comment on CASSANDRA-12792 at 11/17/16 5:41 AM: - I've pushed rebased, updated branches and ran CI. CI looks clean relative to upstream. I made your proposed fixes regarding hasTimestamp, null checking, and lambda usage in the 3.0+ branches. While looking at the code, I realized that the {{PurgeEvaluator}} interface was no longer necessary after an earlier refactor and that comparable internals seem to use {{Predicate}} directly. I adopted this approach in 2.2+ and changed the 2.2 branch to use anonymous classes, since I thought this made it a little easier to follow. Let me know your thoughts on these additional changes. ||branch||testall||dtest|| |[CASSANDRA-12792-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-dtest]| |[CASSANDRA-12792-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-dtest]| |[CASSANDRA-12792-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-dtest]| |[CASSANDRA-12792-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-dtest]| was (Author: jkni): I've pushed rebased, updated branches and ran CI. 
CI looks clean relative to upstream. I made your proposed fixes regarding hasTimestamp, null checking, and lambda usage in the 3.0+ branches. While looking at the code, I realized that the {{PurgeEvaluator}} interface was no longer necessary after an earier refactor and that comparable internals seem to use {{Predicate}} directly. I adopted this approach in 2.2+ and changed the 2.2 branch to use anonymous classes, since I thought this made it a little easier to follow. Let me know your thoughts on these additional changes. ||branch||testall||dtest|| |[CASSANDRA-12792-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-dtest]| |[CASSANDRA-12792-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-dtest]| |[CASSANDRA-12792-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-dtest]| |[CASSANDRA-12792-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-dtest]| > delete with timestamp long.MAX_VALUE for the whole key creates tombstone that > cannot be removed. 
> - > > Key: CASSANDRA-12792 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12792 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Ian Ilsley >Assignee: Joel Knighton > > In db/compaction/LazilyCompactedRow.java > we only check for < MaxPurgeableTimeStamp > eg: > (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp()) > this should probably be <= -- This message was sent by Atlassian JIRA (v6.3.4#6332)
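To make the off-by-one concrete: with the strict comparison, a tombstone whose markedForDeleteAt is Long.MAX_VALUE can never compare less than any possible getMaxPurgeableTimestamp() value, so it can never be purged. Here is a minimal standalone sketch of the two comparisons (the class and method names are hypothetical stand-ins, not the actual Cassandra internals):

```java
// Toy illustration of the purge check; PurgeCheck and its methods are
// hypothetical stand-ins for the logic in LazilyCompactedRow.
public class PurgeCheck {
    // Current behavior: strict comparison, as in
    // maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp()
    static boolean purgeableStrict(long markedForDeleteAt, long maxPurgeableTimestamp) {
        return markedForDeleteAt < maxPurgeableTimestamp;
    }

    // Proposed behavior: inclusive comparison.
    static boolean purgeableInclusive(long markedForDeleteAt, long maxPurgeableTimestamp) {
        return markedForDeleteAt <= maxPurgeableTimestamp;
    }

    public static void main(String[] args) {
        // A tombstone marked at Long.MAX_VALUE: no timestamp can exceed it,
        // so the strict check never allows a purge.
        System.out.println(purgeableStrict(Long.MAX_VALUE, Long.MAX_VALUE));    // false
        System.out.println(purgeableInclusive(Long.MAX_VALUE, Long.MAX_VALUE)); // true
    }
}
```

For any timestamp below Long.MAX_VALUE both checks eventually agree; only the boundary value is affected, which is why the report suggests `<=`.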
[jira] [Updated] (CASSANDRA-12792) delete with timestamp long.MAX_VALUE for the whole key creates tombstone that cannot be removed.
[ https://issues.apache.org/jira/browse/CASSANDRA-12792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12792: -- Status: Patch Available (was: In Progress) I've pushed rebased, updated branches and ran CI. CI looks clean relative to upstream. I made your proposed fixes regarding hasTimestamp, null checking, and lambda usage in the 3.0+ branches. While looking at the code, I realized that the {{PurgeEvaluator}} interface was no longer necessary after an earlier refactor and that comparable internals seem to use {{Predicate}} directly. I adopted this approach in 2.2+ and changed the 2.2 branch to use anonymous classes, since I thought this made it a little easier to follow. Let me know your thoughts on these additional changes. ||branch||testall||dtest|| |[CASSANDRA-12792-2.2|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-2.2-dtest]| |[CASSANDRA-12792-3.0|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.0-dtest]| |[CASSANDRA-12792-3.X|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-3.X]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-3.X-dtest]| |[CASSANDRA-12792-trunk|https://github.com/jkni/cassandra/tree/CASSANDRA-12792-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-CASSANDRA-12792-trunk-dtest]| > delete with timestamp long.MAX_VALUE for the whole key creates tombstone that > cannot be removed. 
> - > > Key: CASSANDRA-12792 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12792 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Ian Ilsley >Assignee: Joel Knighton > > In db/compaction/LazilyCompactedRow.java > we only check for < MaxPurgeableTimeStamp > eg: > (this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp()) > this should probably be <= -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10244) Replace heartbeats with locally recorded metrics for failure detection
[ https://issues.apache.org/jira/browse/CASSANDRA-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-10244: -- Assignee: (was: Joel Knighton) > Replace heartbeats with locally recorded metrics for failure detection > -- > > Key: CASSANDRA-10244 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10244 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown > > In the current implementation, the primary purpose of sending gossip messages > is for delivering the updated heartbeat values of each node in a cluster. The > other data that is passed in gossip (node metadata such as status, dc, rack, > tokens, and so on) changes very infrequently (or rarely), such that the > eventual delivery of that data is entirely reasonable. Heartbeats, however, > are quite different. A continuous and nearly consistent delivery time of > updated heartbeats is critical for the stability of a cluster. It is through > the receipt of the updated heartbeat that a node determines the reachability > (UP/DOWN status) of all peers in the cluster. The current implementation of > FailureDetector measures the time differences between the heartbeat updates > received about a peer (Note: I said about a peer, not from the peer directly, > as those values are disseminated via gossip). Without a consistent time > delivery of those updates, the FD, via its use of the PHI-accrual > algorithm, will mark the peer as DOWN (unreachable). The two nodes could be > sending all other traffic without problem, but if the heartbeats are not > propagated correctly, each of the nodes will mark the other as DOWN, which is > clearly suboptimal to cluster health. Further, heartbeat updates are the only > mechanism we use to determine reachability (UP/DOWN) of a peer; dynamic > snitch measurements, for example, are not included in the determination. > To illustrate this, in the current implementation, assume a cluster of nodes: > A, B, and C. 
A partition starts between nodes A and C (no communication > succeeds), but both nodes can communicate with B. As B will get the updated > heartbeats from both A and C, it will, via gossip, send those over to the > other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, due > to the partition between them, all communication between A and C will fail, > yet neither node will mark the other as down because each is receiving, > transitively via B, the updated heartbeat about the other. While it's true > that the other node is alive, only having transitive knowledge about a peer, > and allowing that to be the sole determinant of UP/DOWN reachability status, > is not sufficient for a correct and efficiently operating cluster. > This transitive availability is suboptimal, and I propose we drop the > heartbeat concept altogether. Instead, the dynamic snitch should become more > intelligent, and its measurements ultimately become the input for > determining the reachability status of each peer (as fed into a revamped FD). > As we already capture latencies in the dynamic snitch, we can reasonably extend it > to include timeouts/missed responses, and make that the basis for the UP/DOWN > decisions. Thus we will have more accurate and relevant peer statuses that > are tailored to the local node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
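The PHI-accrual mechanism referenced above can be sketched in a few lines: record heartbeat inter-arrival times, then convert "time since last heartbeat" into a continuous suspicion level phi. Under a simple exponential inter-arrival assumption, phi = elapsed / (mean * ln 10). This is a toy model only; the class below is hypothetical and Cassandra's actual FailureDetector differs in its distribution estimate, windowing, and thresholds.

```java
import java.util.ArrayDeque;

// Toy phi-accrual sketch (exponential inter-arrival assumption); NOT the
// real FailureDetector, just an illustration of the accrual idea.
public class PhiSketch {
    private final ArrayDeque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;
    private static final int WINDOW = 1000; // bounded sample size

    // Record a heartbeat arrival; keeps a sliding window of intervals.
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW)
                intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // phi = -log10(P(no heartbeat within the elapsed time)). It grows
    // without bound as silence lengthens; a threshold marks the peer DOWN.
    double phi(long nowMillis) {
        if (intervals.isEmpty())
            return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().getAsDouble();
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (mean * Math.log(10));
    }
}
```

The key property, and the reason irregular heartbeat delivery matters so much, is that phi depends on the observed delivery cadence: if gossip delays stretch the elapsed time well past the historical mean, phi crosses the threshold even though the peer itself is healthy.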
[jira] [Commented] (CASSANDRA-11709) Lock contention when large number of dead nodes come back within short time
[ https://issues.apache.org/jira/browse/CASSANDRA-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670769#comment-15670769 ] Joel Knighton commented on CASSANDRA-11709: --- I've been unable to get much further on this without a comparably large cluster to test on. The branch I've linked above does help parts of the issue by reducing the invalidation of the cached ring in unnecessary circumstances; I think a patch addressing this issue will need that change as well as others. Unassigning so as to not block progress. > Lock contention when large number of dead nodes come back within short time > --- > > Key: CASSANDRA-11709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11709 > Project: Cassandra > Issue Type: Improvement >Reporter: Dikang Gu >Assignee: Joel Knighton > Fix For: 2.2.x, 3.x > > Attachments: lock.jstack > > > We have a few hundred nodes across 3 data centers, and we are doing a few > million writes per second into the cluster. > We were trying to simulate a data center failure, by disabling the gossip on > all the nodes in one data center. After ~20mins, I re-enabled the gossip on > those nodes, doing 5 nodes in each batch, and sleeping 5 seconds between > batches. > After that, I saw the latency of read/write requests increased a lot, and > client requests started to time out. > On the node, I can see there is a huge number of pending tasks in GossipStage. 
> = > 2016-05-02_23:55:08.99515 WARN 23:55:08 Gossip stage has 36337 pending > tasks; skipping status check (no nodes will be marked down) > 2016-05-02_23:55:09.36009 INFO 23:55:09 Node > /2401:db00:2020:717a:face:0:41:0 state jump to normal > 2016-05-02_23:55:09.99057 INFO 23:55:09 Node > /2401:db00:2020:717a:face:0:43:0 state jump to normal > 2016-05-02_23:55:10.09742 WARN 23:55:10 Gossip stage has 36421 pending > tasks; skipping status check (no nodes will be marked down) > 2016-05-02_23:55:10.91860 INFO 23:55:10 Node > /2401:db00:2020:717a:face:0:45:0 state jump to normal > 2016-05-02_23:55:11.20100 WARN 23:55:11 Gossip stage has 36558 pending > tasks; skipping status check (no nodes will be marked down) > 2016-05-02_23:55:11.57893 INFO 23:55:11 Node > /2401:db00:2030:612a:face:0:49:0 state jump to normal > 2016-05-02_23:55:12.23405 INFO 23:55:12 Node /2401:db00:2020:7189:face:0:7:0 > state jump to normal > > And I took jstack of the node, I found the read/write threads are blocked by > a lock, > read thread == > "Thrift:7994" daemon prio=10 tid=0x7fde91080800 nid=0x5255 waiting for > monitor entry [0x7fde6f8a1000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.cassandra.locator.TokenMetadata.cachedOnlyTokenMap(TokenMetadata.java:546) > - waiting to lock <0x7fe4faef4398> (a > org.apache.cassandra.locator.TokenMetadata) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:111) > at > org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:3155) > at > org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1526) > at > org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1521) > at > org.apache.cassandra.service.AbstractReadExecutor.getReadExecutor(AbstractReadExecutor.java:155) > at > org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1328) > at > 
org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1270) > at > org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1195) > at > org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:118) > at > org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:275) > at > org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:457) > at > org.apache.cassandra.thrift.CassandraServer.getSliceInternal(CassandraServer.java:346) > at > org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:325) > at > org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3659) > at > org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3643) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205) > at >
[jira] [Updated] (CASSANDRA-11709) Lock contention when large number of dead nodes come back within short time
[ https://issues.apache.org/jira/browse/CASSANDRA-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-11709: -- Assignee: Dikang Gu (was: Joel Knighton) > Lock contention when large number of dead nodes come back within short time > --- > > Key: CASSANDRA-11709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11709 > Project: Cassandra > Issue Type: Improvement >Reporter: Dikang Gu >Assignee: Dikang Gu > Fix For: 2.2.x, 3.x > > Attachments: lock.jstack > > > We have a few hundred nodes across 3 data centers, and we are doing a few > million writes per second into the cluster. > We were trying to simulate a data center failure, by disabling the gossip on > all the nodes in one data center. After ~20mins, I re-enabled the gossip on > those nodes, doing 5 nodes in each batch, and sleeping 5 seconds between > batches. > After that, I saw the latency of read/write requests increased a lot, and > client requests started to time out. > On the node, I can see there is a huge number of pending tasks in GossipStage. 
> = > 2016-05-02_23:55:08.99515 WARN 23:55:08 Gossip stage has 36337 pending > tasks; skipping status check (no nodes will be marked down) > 2016-05-02_23:55:09.36009 INFO 23:55:09 Node > /2401:db00:2020:717a:face:0:41:0 state jump to normal > 2016-05-02_23:55:09.99057 INFO 23:55:09 Node > /2401:db00:2020:717a:face:0:43:0 state jump to normal > 2016-05-02_23:55:10.09742 WARN 23:55:10 Gossip stage has 36421 pending > tasks; skipping status check (no nodes will be marked down) > 2016-05-02_23:55:10.91860 INFO 23:55:10 Node > /2401:db00:2020:717a:face:0:45:0 state jump to normal > 2016-05-02_23:55:11.20100 WARN 23:55:11 Gossip stage has 36558 pending > tasks; skipping status check (no nodes will be marked down) > 2016-05-02_23:55:11.57893 INFO 23:55:11 Node > /2401:db00:2030:612a:face:0:49:0 state jump to normal > 2016-05-02_23:55:12.23405 INFO 23:55:12 Node /2401:db00:2020:7189:face:0:7:0 > state jump to normal > > And I took jstack of the node, I found the read/write threads are blocked by > a lock, > read thread == > "Thrift:7994" daemon prio=10 tid=0x7fde91080800 nid=0x5255 waiting for > monitor entry [0x7fde6f8a1000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.cassandra.locator.TokenMetadata.cachedOnlyTokenMap(TokenMetadata.java:546) > - waiting to lock <0x7fe4faef4398> (a > org.apache.cassandra.locator.TokenMetadata) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:111) > at > org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:3155) > at > org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1526) > at > org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1521) > at > org.apache.cassandra.service.AbstractReadExecutor.getReadExecutor(AbstractReadExecutor.java:155) > at > org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1328) > at > 
org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1270) > at > org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1195) > at > org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:118) > at > org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:275) > at > org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:457) > at > org.apache.cassandra.thrift.CassandraServer.getSliceInternal(CassandraServer.java:346) > at > org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:325) > at > org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3659) > at > org.apache.cassandra.thrift.Cassandra$Processor$get_slice.getResult(Cassandra.java:3643) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > = writer === > "Thrift:7668" daemon prio=10 tid=0x7fde90d91000 nid=0x50e9 waiting for > monitor entry [0x7fde78bbc000]
[jira] [Updated] (CASSANDRA-9667) strongly consistent membership and ownership
[ https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-9667: - Assignee: (was: Joel Knighton) > strongly consistent membership and ownership > > > Key: CASSANDRA-9667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9667 > Project: Cassandra > Issue Type: New Feature >Reporter: Jason Brown > Labels: LWT, membership, ownership > Fix For: 3.x > > > Currently, there is advice to users to "wait two minutes between adding new > nodes" in order for new node tokens, et al, to propagate. Further, as there's > no coordination amongst joining nodes wrt token selection, new nodes can end > up selecting ranges that overlap with other joining nodes. This causes a lot > of duplicate streaming from the existing source nodes as they shovel out the > bootstrap data for those new nodes. > This ticket proposes creating a mechanism that allows strongly consistent > membership and ownership changes in Cassandra such that changes are performed > in a linearizable and safe manner. The basic idea is to use LWT operations > over a global system table, and leverage the linearizability of LWT for > ensuring the safety of cluster membership/ownership state changes. This work > is inspired by Riak's claimant module. > The existing workflows for node join, decommission, remove, replace, and > range move (there may be others I'm not thinking of) will need to be modified > to participate in this scheme, as well as changes to nodetool to enable them. > Note: we distinguish between membership and ownership in the following ways: > for membership we mean "a host in this cluster and its state". For > ownership, we mean "what tokens (or ranges) does each node own"; these nodes > must already be members to be assigned tokens. 
> A rough draft sketch of how the 'add new node' workflow might look is: > new nodes would no longer create tokens themselves, but instead contact a > member of a Paxos cohort (via a seed). The cohort member will generate the > tokens and execute an LWT transaction, ensuring a linearizable change to the > membership/ownership state. The updated state will then be disseminated via > the existing gossip. > As for joining specifically, I think we could support two modes: auto-mode > and manual-mode. Auto-mode is for adding a single new node per LWT operation, > and would require no operator intervention (much like today). In manual-mode, > however, multiple new nodes could (somehow) signal their intent to join > the cluster, but will wait until an operator executes a nodetool command > that will trigger the token generation and LWT operation for all pending new > nodes. This will allow better range partitioning and will make the > bootstrap streaming more efficient as we won't have overlapping range > requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
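The linearizable-claim idea in the proposal can be illustrated in-process, with a compare-and-set on an immutable membership view standing in for an LWT conditional write against a global system table. Everything below is a hypothetical toy model, not a proposed Cassandra API:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Toy model: an AtomicReference CAS plays the role of an LWT conditional
// update, so exactly one of any set of concurrent joins wins each round.
public class ClaimSketch {
    private static final class MembershipView {
        final Set<String> members;
        MembershipView(Set<String> m) { members = Collections.unmodifiableSet(m); }
    }

    private final AtomicReference<MembershipView> view =
            new AtomicReference<>(new MembershipView(new HashSet<>()));

    // Analogous to an "INSERT ... IF NOT EXISTS" claim: fails if the node is
    // already a member, retries if another proposer won this round's race.
    boolean join(String node) {
        while (true) {
            MembershipView cur = view.get();
            if (cur.members.contains(node))
                return false;
            Set<String> next = new HashSet<>(cur.members);
            next.add(node);
            if (view.compareAndSet(cur, new MembershipView(next)))
                return true;
        }
    }
}
```

Because every transition goes through one linearization point, two simultaneously joining nodes can never both claim overlapping state unobserved, which is exactly the property the ticket wants from LWT over a system table.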
[jira] [Commented] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669481#comment-15669481 ] Joel Knighton commented on CASSANDRA-12281: --- Thanks - your changes and CI look good. I also ran CI on your CASSANDRA-12281-trunk branch. Note to committer: there are very slight differences in the 2.2/3.0/3.x branches (not in substantive content, but in comments and other minor fixes). The 3.x branch should merge cleanly into trunk, I believe. > Gossip blocks on startup when another node is bootstrapping > --- > > Key: CASSANDRA-12281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12281 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Eric Evans >Assignee: Stefan Podkowinski > Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, > restbase1015-a_jstack.txt > > > In our cluster, normal node startup times (after a drain on shutdown) are > less than 1 minute. However, when another node in the cluster is > bootstrapping, the same node startup takes nearly 30 minutes to complete, the > apparent result of gossip blocking on pending range calculations. 
> {noformat} > $ nodetool-a tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 1840 0 > 0 > ReadStage 0 0 2350 0 > 0 > RequestResponseStage 0 0 53 0 > 0 > ReadRepairStage 0 0 1 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > HintedHandoff 0 0 44 0 > 0 > MiscStage 0 0 0 0 > 0 > CompactionExecutor3 3395 0 > 0 > MemtableReclaimMemory 0 0 30 0 > 0 > PendingRangeCalculator1 2 29 0 > 0 > GossipStage 1 5602164 0 > 0 > MigrationStage0 0 0 0 > 0 > MemtablePostFlush 0 0111 0 > 0 > ValidationExecutor0 0 0 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0 30 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {noformat} > A full thread dump is attached, but the relevant bit seems to be here: > {noformat} > [ ... ] > "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 > nid=0xea9 waiting on condition [0x7fddcf883000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0004c1e922c0> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160) > at > 
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023) > at > org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682) > at >
[jira] [Comment Edited] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669481#comment-15669481 ] Joel Knighton edited comment on CASSANDRA-12281 at 11/16/16 5:25 AM: - Thanks - your changes and CI look good. I also ran CI on your CASSANDRA-12281-trunk branch. Note to committer: there are very slight differences in the 2.2/3.0/3.x branches (not in substantive content, but in comments and other minor fixes). The 3.x branch should merge cleanly into trunk, I believe. +1 was (Author: jkni): Thanks - your changes and CI look good. I also ran CI on your CASSANDRA-12281-trunk branch. Note to committer: there are very slight differences in the 2.2/3.0/3.x branches (not in substantive content, but in comments and other minor fixes). The 3.x branch should merge cleanly into trunk, I believe. > Gossip blocks on startup when another node is bootstrapping > --- > > Key: CASSANDRA-12281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12281 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Eric Evans >Assignee: Stefan Podkowinski > Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, > restbase1015-a_jstack.txt > > > In our cluster, normal node startup times (after a drain on shutdown) are > less than 1 minute. However, when another node in the cluster is > bootstrapping, the same node startup takes nearly 30 minutes to complete, the > apparent result of gossip blocking on pending range calculations. 
> {noformat} > $ nodetool-a tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 1840 0 > 0 > ReadStage 0 0 2350 0 > 0 > RequestResponseStage 0 0 53 0 > 0 > ReadRepairStage 0 0 1 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > HintedHandoff 0 0 44 0 > 0 > MiscStage 0 0 0 0 > 0 > CompactionExecutor3 3395 0 > 0 > MemtableReclaimMemory 0 0 30 0 > 0 > PendingRangeCalculator1 2 29 0 > 0 > GossipStage 1 5602164 0 > 0 > MigrationStage0 0 0 0 > 0 > MemtablePostFlush 0 0111 0 > 0 > ValidationExecutor0 0 0 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0 30 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {noformat} > A full thread dump is attached, but the relevant bit seems to be here: > {noformat} > [ ... ] > "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 > nid=0xea9 waiting on condition [0x7fddcf883000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0004c1e922c0> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at >
[jira] [Updated] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12281: -- Status: Ready to Commit (was: Patch Available) > Gossip blocks on startup when another node is bootstrapping > --- > > Key: CASSANDRA-12281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12281 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Eric Evans >Assignee: Stefan Podkowinski > Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, > restbase1015-a_jstack.txt > > > In our cluster, normal node startup times (after a drain on shutdown) are > less than 1 minute. However, when another node in the cluster is > bootstrapping, the same node startup takes nearly 30 minutes to complete, the > apparent result of gossip blocking on pending range calculations. > {noformat} > $ nodetool-a tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 1840 0 > 0 > ReadStage 0 0 2350 0 > 0 > RequestResponseStage 0 0 53 0 > 0 > ReadRepairStage 0 0 1 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > HintedHandoff 0 0 44 0 > 0 > MiscStage 0 0 0 0 > 0 > CompactionExecutor3 3395 0 > 0 > MemtableReclaimMemory 0 0 30 0 > 0 > PendingRangeCalculator1 2 29 0 > 0 > GossipStage 1 5602164 0 > 0 > MigrationStage0 0 0 0 > 0 > MemtablePostFlush 0 0111 0 > 0 > ValidationExecutor0 0 0 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0 30 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {noformat} > A full thread dump is attached, but the relevant bit seems to be here: > {noformat} > [ ... 
] > "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 > nid=0xea9 waiting on condition [0x7fddcf883000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0004c1e922c0> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160) > at > org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023) > at > org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682) > at > org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182) > at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:1165) > at > org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1128) > at >
[jira] [Comment Edited] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654367#comment-15654367 ] Joel Knighton edited comment on CASSANDRA-12273 at 11/10/16 3:43 PM: - Very, very close - the only change is that the ticket number is included at the end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the end of the commit message. I might suggest changing the message to something like "Create log/artifact directories as needed for stress, handling symbolic links" to indicate that this changes behavior for the stress tool and not the core DB. {code} Create log directories as needed, handling symbolic links patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273 {code} Thanks again. was (Author: jkni): Very, very close - the only change is that the ticket number is included at the end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the end of the commit message. {code} Create log directories as needed, handling symbolic links patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273 {code} Thanks again. > Casandra stress graph: option to create directory for graph if it doesn't > exist > --- > > Key: CASSANDRA-12273 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12273 > Project: Cassandra > Issue Type: Improvement > Components: Tools >Reporter: Christopher Batey >Assignee: Murukesh Mohanan >Priority: Minor > Labels: lhf > Attachments: 12273.patch > > > I am running it in CI with ephemeral workspace / build dirs. It would be > nice if CS would create the directory so my build tool doesn't have to -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654367#comment-15654367 ] Joel Knighton edited comment on CASSANDRA-12273 at 11/10/16 3:42 PM: - Very, very close - the only change is that the ticket number is included at the end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the end of the commit message. {code} Create log directories as needed, handling symbolic links patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273 {code} Thanks again. was (Author: jkni): Very, very close - the only change is that the ticket number is included at the end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the end of the commit message. {code} Create log directories as needed, handling symbolic links patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273 {code} > Casandra stress graph: option to create directory for graph if it doesn't > exist > --- > > Key: CASSANDRA-12273 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12273 > Project: Cassandra > Issue Type: Improvement > Components: Tools >Reporter: Christopher Batey >Assignee: Murukesh Mohanan >Priority: Minor > Labels: lhf > Attachments: 12273.patch > > > I am running it in CI with ephemeral workspace / build dirs. It would be > nice if CS would create the directory so my build tool doesn't have to -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654367#comment-15654367 ] Joel Knighton commented on CASSANDRA-12273: --- Very, very close - the only change is that the ticket number is included at the end of the patch by ...; reviewed by ... line, like "patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273" instead of including it at the end of the commit message. {code} Create log directories as needed, handling symbolic links patch by Murukesh Mohanan; reviewed by Joel Knighton for CASSANDRA-12273 {code}
[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12273: -- Status: Open (was: Patch Available)
[jira] [Updated] (CASSANDRA-12273) Casandra stress graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12273: -- Summary: Casandra stress graph: option to create directory for graph if it doesn't exist (was: Casandra stess graph: option to create directory for graph if it doesn't exist)
[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12273: -- Status: Awaiting Feedback (was: Open)
[jira] [Commented] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651814#comment-15651814 ] Joel Knighton commented on CASSANDRA-12273: --- Thanks for the patch [~muru]! Your approach looks sound. A very similar issue exists on trunk with hdrfile logging if an {{hdrfile}} is specified in {{SettingsLog.java}}. If you're interested, I think it makes a lot of sense to also fix that problem as part of this ticket, as people affected by this issue will likely also be affected by the fact that hdrfile paths do not have their directory created. I also think it makes sense to canonicalize the path before {{Files.createDirectories}}, since this would avoid needing to special-case symlinks. This could be done by using {{getCanonicalPath}} instead of {{toURI}}. For future patches, it is easier to accept contributions if they include a CHANGES.txt entry and an appropriately formatted commit message in a patch created with {{git format-patch}}. The details on this are available in the [docs|http://cassandra.apache.org/doc/latest/development/patches.html]. If you aren't interested in updating the patch with these changes, I still think this patch is worth merging; I will update this issue with an appropriately formatted commit and approve it after CI.
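The canonicalize-then-create suggestion above can be sketched as follows. This is a minimal standalone illustration, not the actual cassandra-stress code; the class and helper names are hypothetical. Canonicalizing the path first resolves symlinks and relative segments, so {{Files.createDirectories}} needs no symlink special-casing.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class GraphDirExample {
    // Hypothetical helper: ensure the parent directory of an output file
    // exists before writing to it. getCanonicalFile resolves symbolic links
    // and relative path segments up front.
    static File ensureParentDirs(String path) throws IOException {
        File canonical = new File(path).getCanonicalFile();
        File parent = canonical.getParentFile();
        if (parent != null)
            Files.createDirectories(parent.toPath()); // no-op if it already exists
        return canonical;
    }

    public static void main(String[] args) throws IOException {
        File graph = ensureParentDirs("build/stress-graphs/run1.html");
        System.out.println(graph.getParentFile().isDirectory()); // prints true
    }
}
```

This mirrors the CI use case in the ticket: an ephemeral workspace gets its graph directory created on demand instead of the build tool having to pre-create it.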
[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12273: -- Assignee: Murukesh Mohanan
[jira] [Updated] (CASSANDRA-12273) Casandra stess graph: option to create directory for graph if it doesn't exist
[ https://issues.apache.org/jira/browse/CASSANDRA-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12273: -- Assignee: (was: Christopher Batey)
[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-11381: -- Status: Awaiting Feedback (was: Open) > Node running with join_ring=false and authentication can not serve requests > --- > > Key: CASSANDRA-11381 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11381 > Project: Cassandra > Issue Type: Bug >Reporter: mck >Assignee: mck > Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x > > Attachments: 11381-2.1.txt, 11381-2.2.txt, 11381-3.0.txt, > 11381-trunk.txt, dtest-11381-trunk.txt > > > Starting up a node with {{-Dcassandra.join_ring=false}} in a cluster that has > authentication configured, eg PasswordAuthenticator, won't be able to serve > requests. This is because {{Auth.setup()}} never gets called during the > startup. > Without {{Auth.setup()}} having been called in {{StorageService}} clients > connecting to the node fail with the node throwing > {noformat} > java.lang.NullPointerException > at > org.apache.cassandra.auth.PasswordAuthenticator.authenticate(PasswordAuthenticator.java:119) > at > org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1471) > at > org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3505) > at > org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3489) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at com.thinkaurelius.thrift.Message.invoke(Message.java:314) > at > com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90) > at > com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695) > at > com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689) > at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The exception thrown from the > [code|https://github.com/apache/cassandra/blob/cassandra-2.0.16/src/java/org/apache/cassandra/auth/PasswordAuthenticator.java#L119] > {code} > ResultMessage.Rows rows = > authenticateStatement.execute(QueryState.forInternalCalls(), new > QueryOptions(consistencyForUser(username), > >Lists.newArrayList(ByteBufferUtil.bytes(username; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645280#comment-15645280 ] Joel Knighton edited comment on CASSANDRA-11381 at 11/7/16 8:10 PM: Thanks for pinging me on this - as you suspected, it slipped through the cracks. On reviewing the final version of this patch, I found one problem with the 2.2+ patches. The proposed patch technically breaks the documented {{IRoleManager}}, {{IAuthenticator}}, and {{IAuthorizer}} interfaces. With the implementation given, {{doAuthSetup}} will be called twice for a node started with {{join_ring=False}}, so {{setup()}} will be called twice for the role manager, authenticator, and authorizer. In the documentation for these three public interfaces, we state that {{setup()}} will only be called once after starting a node. I think we should preserve this documented behavior. While slightly less elegant, I think we should instead track whether we've run {{doAuthSetup}} and not repeat this call for a node started with {{join_ring=False}} that is asked to join. This means the parts of the patch implementing idempotency for the MigrationManager listener registration become unnecessary. was (Author: jkni): Thanks for pinging me on this - as you suspected, it slipped through the cracks. On reviewing the final version of this patch, I found one problem with the 2.2+ patches. The proposed patch technically breaks the documented {{IRoleManager}}, {{IAuthenticator}}, and {{IAuthorizer}} interfaces. With the implementation given, {{doAuthSetup}} will be called twice for a node started with {{join_ring=False}}, so {{setup()}} will be called twice for the role manager, authenticator, and authorizer. In the documentation for these three public interfaces, we state that {{setup()}} will only be called once after starting a node. I think we should preserve this documented behavior. 
While slightly less elegant, I think we should instead track whether we've run {{doAuthSetup}} and not repeat this call for a node started with {{join_ring=False}} that is asked to join.
[jira] [Updated] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-11381: -- Status: Open (was: Patch Available)
[jira] [Commented] (CASSANDRA-11381) Node running with join_ring=false and authentication can not serve requests
[ https://issues.apache.org/jira/browse/CASSANDRA-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645280#comment-15645280 ] Joel Knighton commented on CASSANDRA-11381: --- Thanks for pinging me on this - as you suspected, it slipped through the cracks. On reviewing the final version of this patch, I found one problem with the 2.2+ patches. The proposed patch technically breaks the documented {{IRoleManager}}, {{IAuthenticator}}, and {{IAuthorizer}} interfaces. With the implementation given, {{doAuthSetup}} will be called twice for a node started with {{join_ring=False}}, so {{setup()}} will be called twice for the role manager, authenticator, and authorizer. In the documentation for these three public interfaces, we state that {{setup()}} will only be called once after starting a node. I think we should preserve this documented behavior. While slightly less elegant, I think we should instead track whether we've run {{doAuthSetup}} and not repeat this call for a node started with {{join_ring=False}} that is asked to join.
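The track-whether-we've-run suggestion for {{doAuthSetup}} can be sketched as a once-only guard. This is a hypothetical standalone illustration (the class and field names are not Cassandra's); it shows how a single atomic flag preserves the documented once-per-start contract of {{setup()}} even when a {{join_ring=false}} node is later asked to join.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AuthSetupGuard {
    // Hypothetical guard: remember whether doAuthSetup has run so it is
    // never repeated, keeping the role manager / authenticator / authorizer
    // setup() calls to exactly one per node start.
    private static final AtomicBoolean authSetupComplete = new AtomicBoolean(false);
    static int setupCalls = 0;

    static void doAuthSetup() {
        // compareAndSet flips false -> true for exactly one caller
        if (!authSetupComplete.compareAndSet(false, true))
            return; // already initialized; skip
        setupCalls++; // stand-in for the actual setup() calls
    }

    public static void main(String[] args) {
        doAuthSetup();                  // performs setup
        doAuthSetup();                  // second call is a no-op
        System.out.println(setupCalls); // prints 1
    }
}
```

With this shape, the idempotency added to the MigrationManager listener registration in the patch becomes unnecessary, as noted above.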
[jira] [Commented] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644323#comment-15644323 ] Joel Knighton commented on CASSANDRA-12281: --- Ah, good catch on the aggregate log message spanning the trace and debug cases. That makes a lot of sense - thanks for the explanation. I'll keep this at the top of my queue for when CI is available. > Gossip blocks on startup when another node is bootstrapping > --- > > Key: CASSANDRA-12281 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12281 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Eric Evans >Assignee: Stefan Podkowinski > Attachments: 12281-2.2.patch, 12281-3.0.patch, 12281-trunk.patch, > restbase1015-a_jstack.txt > > > In our cluster, normal node startup times (after a drain on shutdown) are > less than 1 minute. However, when another node in the cluster is > bootstrapping, the same node startup takes nearly 30 minutes to complete, the > apparent result of gossip blocking on pending range calculations. 
> {noformat} > $ nodetool-a tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 1840 0 > 0 > ReadStage 0 0 2350 0 > 0 > RequestResponseStage 0 0 53 0 > 0 > ReadRepairStage 0 0 1 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > HintedHandoff 0 0 44 0 > 0 > MiscStage 0 0 0 0 > 0 > CompactionExecutor3 3395 0 > 0 > MemtableReclaimMemory 0 0 30 0 > 0 > PendingRangeCalculator1 2 29 0 > 0 > GossipStage 1 5602164 0 > 0 > MigrationStage0 0 0 0 > 0 > MemtablePostFlush 0 0111 0 > 0 > ValidationExecutor0 0 0 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0 30 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {noformat} > A full thread dump is attached, but the relevant bit seems to be here: > {noformat} > [ ... ] > "GossipStage:1" #1801 daemon prio=5 os_prio=0 tid=0x7fe4cd54b000 > nid=0xea9 waiting on condition [0x7fddcf883000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0004c1e922c0> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) > at > java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:174) > at > org.apache.cassandra.locator.TokenMetadata.updateNormalTokens(TokenMetadata.java:160) > at > 
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2023) > at > org.apache.cassandra.service.StorageService.onChange(StorageService.java:1682) > at > org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1182) > at
[jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637549#comment-15637549 ] Joel Knighton commented on CASSANDRA-12653: --- [~spo...@gmail.com] - yes! I sincerely apologize for the delay here. If anyone else is interested in reviewing this, they're welcome to pick it up, but it's near the top of my list and I hope to get to this soon. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > affects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task.
You'll see error log messages such as the following when this > happened: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although it isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs]
[jira] [Commented] (CASSANDRA-12281) Gossip blocks on startup when another node is bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637528#comment-15637528 ] Joel Knighton commented on CASSANDRA-12281: --- Thanks for the patch and your patience as I get to this for review! I've been quite busy lately. The approach overall seems sound. While calculating pending ranges can be a little slow, I don't think we risk falling too far behind, because the huge delays here appear to be a result of cascading delays to other tasks. The PendingRangeCalculatorService's restriction on one queued task that will reflect cluster state at time of execution helps with this. A few small questions/nits: - Is there a reason that the test is excluded from the 2.2 branch? Byteman is available for tests on the 2.2 branch since [CASSANDRA-12377], and I don't see anything else that stops the test from being useful there. - Generally, the tests are organized as a top-level class for some entity or fundamental operation in the codebase and then specific test methods for unit tests/regression tests. I think it would make sense to establish a {{PendingRangeCalculatorServiceTest}} and introduce the specific test for [CASSANDRA-12281] inside that class. - In the {{PendingRangeCalculatorService}}, I'm not sure we need to move the "Finished calculation for ..." log message to trace. Most Gossip/TokenMetadata state changes are logged at debug, especially when they reflect some detail about the aggregate state of an operation. - A few minor spelling fixes in the test "aquire" -> "acquire", "fist" -> "first". (Note that I normally wouldn't bother with these, but since the test could likely use a few other changes, I think it is worthwhile to fix these.) - In the test's setUp, the call to {{Keyspace.setInitialized}} is redundant. The call to {{SchemaLoader.prepareServer}} will already perform this. - CI looks good overall. 
The 3.0-dtest run has a few materialized view dtest failures that are likely unrelated, but it would be good if you could retrigger CI for at least this branch. - There's no CI/branch posted for the 3.X series. While this has barely diverged from trunk at this point, it'd be nice if you could run CI for this branch. Thanks again.
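The review above notes that the PendingRangeCalculatorService restricts itself to at most one queued task, which reads the cluster state at execution time. That coalescing pattern can be sketched as follows; this is a hypothetical simplified illustration, not Cassandra's implementation, and the class and method names are invented.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PendingTaskCoalescer {
    // Hypothetical sketch: keep at most one recalculation queued. Extra
    // update() calls while one is pending are dropped, which is safe because
    // the queued task reads the latest state when it eventually runs.
    private final ExecutorService exec = Executors.newSingleThreadExecutor();
    private final AtomicInteger queued = new AtomicInteger(0);
    final AtomicInteger runs = new AtomicInteger(0);

    void update() {
        if (queued.incrementAndGet() == 1) {
            exec.execute(() -> {
                queued.set(0);          // later updates may enqueue a fresh task
                runs.incrementAndGet(); // stand-in for the pending-range calculation
            });
        } else {
            queued.decrementAndGet();   // a queued task will already see our change
        }
    }

    void shutdownAndAwait() throws InterruptedException {
        exec.shutdown();
        exec.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        PendingTaskCoalescer c = new PendingTaskCoalescer();
        for (int i = 0; i < 100; i++)
            c.update(); // a burst of updates collapses into a few actual runs
        c.shutdownAndAwait();
        System.out.println(c.runs.get() >= 1 && c.runs.get() <= 100); // prints true
    }
}
```

Bounding the queue this way is what keeps pending-range calculation from falling arbitrarily far behind a burst of gossip state changes, as the review argues.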