[jira] [Updated] (CASSANDRA-5025) Schema push/pull race

2012-12-09 Thread Chris Herron (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Herron updated CASSANDRA-5025:


Attachment: 5025-v5.txt

(Following up on IRC discussion)
* My patch 3 incorrectly hardcoded Schema.emptyVersion for the announcement in 
SS.joinTokenRing. For an actual bootstrap scenario, the schema version should be 
Schema.emptyVersion anyway. 
* Since Schema.updateVersion actually reads rows, I wondered whether the result 
will be equivalent to Schema.emptyVersion (perhaps the schema tables themselves 
are already represented by that point in time?). Brandon said that he would 
check this.
* As I had asked in a previous comment on this JIRA, and as Brandon also 
noticed, SS.joinTokenRing had been calling Schema.updateVersionAndAnnounce and 
Schema.passiveAnnounce in quick succession. Brandon said the redundant call 
should be removed.

I'm attaching patch 5 with these changes:
* Reverted my hardcoded Schema.emptyVersion in SS.joinTokenRing (back to 
original Schema.updateVersionAndAnnounce).
* Removed apparently redundant call to Schema.passiveAnnounce.

Brandon, could you please confirm whether it is safe to assume that 
Schema.updateVersionAndAnnounce would emit Schema.emptyVersion in a bootstrap 
scenario?
 

 Schema push/pull race
 ---------------------

 Key: CASSANDRA-5025
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5025
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.0
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 1.1.8

 Attachments: 5025.txt, 5025-v2.txt, 5025-v3.txt, 5025-v4.txt, 
 5025-v5.txt


 When a schema change is made, the coordinator pushes the delta to the other 
 nodes in the cluster.  This is more efficient than sending the entire schema. 
 But the coordinator also announces the new schema version, so the other 
 nodes' reception of the new version races with their processing of the delta, 
 and seeing the new version usually wins the race.  So the other nodes also 
 issue a pull to the coordinator for the entire schema.
 Thus, schema changes tend to become O(n) in the number of KS and CF present.
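The race described above can be sketched with a toy model (the method names here -- onDelta, onVersionAnnounce -- are illustrative, not Cassandra's actual API): if the gossiped version is processed before the delta has been applied, the versions disagree and the node schedules a pull of the entire schema.

```java
import java.util.UUID;

// Toy model of the push/pull race. Names are illustrative only.
public class SchemaRace {
    UUID localVersion;
    boolean pulledFullSchema = false;

    SchemaRace(UUID initial) { localVersion = initial; }

    // The coordinator pushed us the delta; applying it moves us to the
    // version the coordinator is about to announce.
    void onDelta(UUID resultingVersion) {
        localVersion = resultingVersion;
    }

    // Gossip delivered the coordinator's new version. If the delta has
    // not been applied yet, the versions differ and we fall back to
    // pulling the whole schema -- O(n) in the number of KS and CF.
    void onVersionAnnounce(UUID announcedVersion) {
        if (!announcedVersion.equals(localVersion))
            pulledFullSchema = true;
    }
}
```

Whichever message happens to be processed first decides the outcome, which is why the redundant full pull usually wins in practice.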

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-5025) Schema push/pull race

2012-12-09 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527694#comment-13527694
 ] 

Chris Herron commented on CASSANDRA-5025:
-

Thanks [~brandon.williams], [~xedin].



[jira] [Commented] (CASSANDRA-5025) Schema push/pull race

2012-12-07 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526467#comment-13526467
 ] 

Chris Herron commented on CASSANDRA-5025:
-

Could StorageService.joinTokenRing wait max(RING_DELAY, 1min) (the 1 min being 
the delay in MigrationManager.maybeScheduleSchemaPull)? Or could 
MigrationManager.maybeScheduleSchemaPull use some multiple of RING_DELAY?

Related: is it correct that StorageService.joinTokenRing calls 
Schema.instance.updateVersionAndAnnounce and 
MigrationManager.passiveAnnounce(Schema.instance.getVersion()) in quick 
succession?
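The first suggestion above can be sketched as follows (the system property name and 30s default mirror Cassandra's RING_DELAY convention but are assumptions here; the one-minute figure is MigrationManager's schema-pull delay):

```java
import java.util.concurrent.TimeUnit;

// Sketch of the suggested wait in joinTokenRing: sleep for at least the
// migration-pull delay, so a schema pull scheduled during startup has a
// chance to fire before the node finishes joining. Property name and
// default value are assumptions, not Cassandra's verified constants.
public class JoinDelay {
    static final long RING_DELAY_MS =
        Long.getLong("cassandra.ring_delay_ms", 30_000L);
    static final long MIGRATION_DELAY_MS = TimeUnit.MINUTES.toMillis(1);

    // joinTokenRing would wait this long instead of bare RING_DELAY.
    static long joinWaitMs() {
        return Math.max(RING_DELAY_MS, MIGRATION_DELAY_MS);
    }
}
```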



[jira] [Commented] (CASSANDRA-5025) Schema push/pull race

2012-12-07 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526533#comment-13526533
 ] 

Chris Herron commented on CASSANDRA-5025:
-

From discussion on #cassandra-dev with [~brandon.williams], 
StorageService.joinTokenRing could use Schema.emptyVersion as the schema UUID 
in order to allow the maybeScheduleSchemaPull delay to be skipped. Patch to 
follow...
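The idea can be sketched as a decision function (a sketch, not the actual MigrationManager code; EMPTY_VERSION mirrors Schema.emptyVersion, assumed here to be the name-based UUID of an empty byte array):

```java
import java.util.UUID;

// Sketch of the proposed fast path in maybeScheduleSchemaPull (not the
// real code). A node still announcing the empty version has no schema
// at all, so the usual grace period serves no purpose.
public class SchemaPull {
    // Assumed definition of Schema.emptyVersion.
    static final UUID EMPTY_VERSION = UUID.nameUUIDFromBytes(new byte[0]);

    enum Action { IGNORE, PULL_NOW, PULL_AFTER_DELAY }

    static Action maybeSchedulePull(UUID ourVersion, UUID theirVersion) {
        if (theirVersion.equals(ourVersion))
            return Action.IGNORE;         // already in agreement
        if (ourVersion.equals(EMPTY_VERSION))
            return Action.PULL_NOW;       // bootstrapping: skip the delay
        return Action.PULL_AFTER_DELAY;   // normal one-minute grace period
    }
}
```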



[jira] [Updated] (CASSANDRA-5025) Schema push/pull race

2012-12-07 Thread Chris Herron (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Herron updated CASSANDRA-5025:


Attachment: 5025-v3.txt

Attached patch 3, proposing the use of Schema.emptyVersion to differentiate 
StorageService.joinTokenRing from other scenarios so that the migration delay 
can be skipped for bootstrapping.



[jira] [Commented] (CASSANDRA-5025) Schema push/pull race

2012-12-06 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512078#comment-13512078
 ] 

Chris Herron commented on CASSANDRA-5025:
-

[~jbellis]: patch 5025-v2.txt works better. For the same test, after 60s, the 
CF creation time goes from sub-second to a 5-second average. Delayed 
rectifySchema work will still interfere with coincident schema migrations, but 
I think this is the right compromise. Thank you!

Minor: the import for {{Callable}} was dropped, but it is still referenced at 
line 229.

[~xedin]: This test was not endorsing a high rate of CF creation for real world 
use, the goal was to investigate if/why CF creation time was {{O(N)}}.



[jira] [Commented] (CASSANDRA-5025) Schema push/pull race

2012-12-05 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510585#comment-13510585
 ] 

Chris Herron commented on CASSANDRA-5025:
-

Clarifying for anyone else who encounters this issue:
* This problem was introduced in CASSANDRA-3931
* For use cases that involve creation/update/deletion of multiple keyspaces or 
column families, the symptom will be increasingly slow schema migrations as the 
KS/CF population grows. Depending on client RPC timeout config, schema change 
requests may fail. 
* In a test environment running stock C* 1.1.7, a test that creates new CFs in 
sequence shows the following CF creation times:
** Empty cluster: sub-second
** 200+ CFs: 15s average
** 400+ CFs: 30s+, with eventual failure due to the 30s client-side (Hector) 
RPC timeout.
* In the same test environment running 1.1.7 patched with 5025.txt:
** For the first 60s of the test, CF creation times are sub-second.
** At 60s, the delayed rectifySchema migration calls kick in and creation times 
climb to 50s+ (including waits for schema agreement), with eventual failure due 
to the 30s client-side RPC timeout.





[jira] [Commented] (CASSANDRA-3931) gossipers notion of schema differs from reality as reported by the nodes in question

2012-12-05 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510589#comment-13510589
 ] 

Chris Herron commented on CASSANDRA-3931:
-

FYI the fixes for this issue introduced issue CASSANDRA-5025.

 gossipers notion of schema differs from reality as reported by the nodes in 
 question
 

 Key: CASSANDRA-3931
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3931
 Project: Cassandra
  Issue Type: Bug
Reporter: Peter Schuller
Assignee: Brandon Williams
 Fix For: 1.1.0

 Attachments: 3931.txt, 3931-v2.txt


 On a 1.1 cluster we happened to notice that {{nodetool gossipinfo | grep 
 SCHEMA}} reported disagreement:
 {code}
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:b0d7bab7-c13c-37d9-9adb-8ab8a5b7215d
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:bcdbd318-82df-3518-89e3-6b72227b3f66
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:bcdbd318-82df-3518-89e3-6b72227b3f66
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
   SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f
 {code}
 However, the result of a thrift {{describe_ring}} on the cluster claims they 
 all agree and that {{b0d7bab7-c13c-37d9-9adb-8ab8a5b7215d}} is the schema 
 they have.
 The schemas seem to actually propagate; e.g. dropping a keyspace actually 
 drops the keyspace.



[jira] [Commented] (CASSANDRA-5025) Schema push/pull race

2012-12-04 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510087#comment-13510087
 ] 

Chris Herron commented on CASSANDRA-5025:
-

For patch 5025.txt:

A single schema migration will still result in N (number of nodes) gossips of 
the new schema version (as before). Through MigrationManager.onChange() -> 
rectifySchema(), each of those will result in a delayed comparison against the 
value 'theirVersion', but by then that value is one minute old.

Further, if some new schema migration happens to be underway at that point, the 
same redundant repeat RowMutations will occur.

Schema migrations tend to happen in bursts, so this patch seems like it might 
reduce the problem but not eliminate it.

Would it not be better to have DefsTable.mergeSchema call 
Schema.instance.updateVersion instead of 
Schema.instance.updateVersionAndAnnounce and then deal with temporarily 
unavailable nodes by doing a MigrationManager.passiveAnnounce(version) if/when 
we see them come back online?
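The alternative proposed above, as a sketch (names and the onAlive hook are illustrative, not a patch): merge quietly via updateVersion, and re-announce only when a previously-down endpoint returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch of the alternative: DefsTable.mergeSchema would call
// updateVersion (no gossip), and a failure-detector "back online"
// callback would do the passiveAnnounce. Names are illustrative.
public class QuietMerge {
    UUID version;
    final List<UUID> announcements = new ArrayList<>();

    void mergeSchema(UUID newVersion) {
        version = newVersion;          // updateVersion: no announce here
    }

    // Called when a previously-unavailable endpoint is seen alive again:
    // re-announce so it can catch up.
    void onAlive(String endpoint) {
        announcements.add(version);    // passiveAnnounce(version)
    }
}
```

A burst of migrations would then produce no gossip traffic at all until a node actually needs to catch up.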



[jira] [Commented] (CASSANDRA-4906) Avoid flushing other columnfamilies on truncate

2012-11-12 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495524#comment-13495524
 ] 

Chris Herron commented on CASSANDRA-4906:
-

Would it be possible to backport this to Cassandra 1.1?

 Avoid flushing other columnfamilies on truncate
 -----------------------------------------------

 Key: CASSANDRA-4906
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4906
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 1.2.0

 Attachments: 4906.txt, 4906-v2.txt


 Currently truncate flushes *all* columnfamilies so it can get rid of the 
 commitlog segments containing truncated data.  Otherwise, it could be 
 replayed on restart since the replay position is contained in the sstables 
 we're trying to delete.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-23 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482552#comment-13482552
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. I'm really starting to think that CASSANDRA-4071 is likely the main cause 
for this and is very easy to reproduce in that case. The commit log we've 
discussed earlier can also trigger that error, but it's probably much harder to 
trigger.

In our case:
* We haven't made any topology changes
* Our test drops and recreates the affected CFs. No nodes die during the test 
(w.r.t. unclean shutdown and commit log)
* After previous load test runs under different configuration (see below), no 
nodes die, and we use nodetool drain before restarting with updated configs.

Note that in an earlier comment above I said:

bq. In investigating CASSANDRA-4687 we disabled key cache, repeated the 
load+upgradesstables test and these invalid counter shard warnings did not 
appear.

Given that we don't have a topology change, can you think of a scenario where a 
commitlog issue is still contributing?



 invalid counter shard detected 
 -------------------------------

 Key: CASSANDRA-4417
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.1
 Environment: Amazon Linux
Reporter: Senthilvel Rangaswamy

 Seeing errors like these:
 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; 
 (17bfd850-ac52-11e1--6ecd0b5b61e7, 1, 13) and 
 (17bfd850-ac52-11e1--6ecd0b5b61e7, 1, 1) differ only in count; will pick 
 highest to self-heal; this indicates a bug or corruption generated a bad 
 counter shard
 What does it mean ?



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-22 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481729#comment-13481729
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Quick question: do you always increment by the same value by any chance?

No, sorry just happened to pick that example. We have many other log entries 
where both values are higher and don't differ by 1.



[jira] [Updated] (CASSANDRA-4832) AssertionError: keys must not be empty

2012-10-20 Thread Chris Herron (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Herron updated CASSANDRA-4832:


Attachment: FlushWriterKeyAssertionBlock.txt

Came across this while investigating an apparent deadlock in schema migrations.

If this assertion fails on the flushWriter executor, the executor blocks 
indefinitely, and anything upstream waiting on its locks gets stuck as well. 
This was on 1.1.6.

Log output below, thread dump attached.

ERROR [FlushWriter:3] 2012-10-19 22:27:56,948 
org.apache.cassandra.service.AbstractCassandraDaemon Exception in thread 
Thread[FlushWriter:3,5,main]
java.lang.AssertionError: Keys must not be empty
at 
org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:133)
at 
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:176)
at 
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:295)
at org.apache.cassandra.db.Memtable.access$600(Memtable.java:48)
at org.apache.cassandra.db.Memtable$5.runMayThrow(Memtable.java:316)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



 AssertionError: keys must not be empty
 --------------------------------------

 Key: CASSANDRA-4832
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4832
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.6
 Environment: Debian 6.0.5
Reporter: Tristan Seligmann
Assignee: Tristan Seligmann
Priority: Minor
  Labels: indexing
 Fix For: 1.1.7

 Attachments: FlushWriterKeyAssertionBlock.txt


 I'm getting errors like this logged:
  INFO 07:08:32,104 Compacting 
 [SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hf-114-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hf-113-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hf-110-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hd-108-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hd-106-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hd-107-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hf-112-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hf-109-Data.db'),
  
 SSTableReader(path='/var/lib/cassandra/data/Fusion/quoteinfo/Fusion-quoteinfo.quoteinfo_search_value_idx-hf-111-Data.db')]
 ERROR 07:08:32,108 Exception in thread Thread[CompactionExecutor:5,1,main]
 java.lang.AssertionError: Keys must not be empty
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:133)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:154)
 at 
 org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at 
 org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
 at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 I'm not really sure when this started happening; they tend to be logged 
 during a repair but I can't reproduce the error 100% reliably.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-19 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480121#comment-13480121
 ] 

Chris Herron commented on CASSANDRA-4417:
-

Another observation since: in previous runs with key cache disabled we were not 
seeing any errors. However I've since found some invalid counter shard errors 
that are occurring during normal compaction. 

{code}
ERROR [CompactionExecutor:6] 2012-10-19 15:43:50,920 
org.apache.cassandra.db.context.CounterContext invalid counter shard detected; 
(15b843e0-ff7c-11e0--07f4b18563ff, 1, 1) and 
(15b843e0-ff7c-11e0--07f4b18563ff, 1, 2) differ only
 in count; will pick highest to self-heal; this indicates a bug or corruption 
generated a bad counter shard
{code}

So to be clear, this particular scenario is:
* C* 1.1.6 with key cache disabled. 
* Load test ran earlier against this same setup; but no upgradesstables during 
that run; no errors under load during that test run.
* Later, some nightly jobs ran that read from Super CF counters, write to other 
CFs.
* Compaction activity occurs later, after the load test and nightly jobs 
complete. Invalid counter shard errors are seen for some CFs. Gleaning from the 
log output order, the affected CFs:
** *Did* have upgradesstables run upon them under the previous configuration 
(1.1.6, key cache on).
** Have not been written to at all for the purposes of the load test I've been 
mentioning.
** Have been read from by these nightly jobs.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-18 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479561#comment-13479561
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Also, during that test, is there anything involving streaming going on (a 
repair, a node bootstrapping/moving/decommissioning)?

There are definitely no repairs or node bootstrapping/moving/decommissioning 
happening during the test.
Re-ran the test and the JMX stats for StreamStage indicated zero tasks on all 
nodes after the test completed.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-17 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477943#comment-13477943
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Unless you've been able to reproduce on a brand new cluster where the 
commit log was set to batch from the beginning (in which case, if you have an 
easy way to reproduce, that would be interesting to know)

In our test the affected Super CF is completely deleted and recreated - so in 
that sense the commit log was set to batch from the beginning. Is that 
equivalent?

This does reproduce for every test run. Unfortunately our test is non-trivial 
to share. It involves heavy writes and moderate reads to counters, while 
simultaneously running upgradesstables on all nodes upon multiple CF's 
(including the affected one). Interestingly, the symptom does appear even 
before compaction reaches the Super CF that's active during the test.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-17 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478505#comment-13478505
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Probably, how was it deleted/recreated. Did you drop and recreate?

Yes, dropped (the schema migration flavor) and recreated a CF of the same name.

bq. Perform the same test without the upgradesstables part (i.e. only the 
writes and reads). If so, does that change something?

We have already tested that scenario: running the same load test without the 
concurrent upgradesstables compaction activity, the problem does not occur.

bq. during that test, is there anything involving streaming going on (a repair, 
a node bootstrapping/moving/decommissioning)?

Not that I know of. I can test again and monitor for streaming activity to see.

By the way, while testing in preparation for a 1.1.x upgrade, we were seeing 
symptoms of CASSANDRA-4571 and CASSANDRA-4687 as well as this issue on C* 
1.1.6. While investigating CASSANDRA-4687 we disabled the key cache, repeated 
the load+upgradesstables test, and these invalid counter shard warnings did 
not appear.



[jira] [Commented] (CASSANDRA-4571) Strange permament socket descriptors increasing leads to Too many open files

2012-10-16 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477121#comment-13477121
 ] 

Chris Herron commented on CASSANDRA-4571:
-

We are also seeing errors similar to those reported in CASSANDRA-4687.
Could this be a side-effect of that problem? In {{SSTableSliceIterator}} as of 
commit {{e1b10590e84189b92af168e33a63c14c3ca1f5fa}}, if the constructor key 
equality assertion fails, {{fileToClose}} does not get closed.
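The leak pattern described above can be sketched as follows. These are simplified stand-ins, not the actual SSTableSliceIterator code: a file handle opened early in a constructor is never closed when a later check throws, unless the failure path releases it explicitly.

```java
import java.io.Closeable;

// Simplified stand-ins for the pattern described above; not Cassandra code.
public class LeakDemo {
    static final class TrackedFile implements Closeable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Buggy shape: the key check throws before anyone closes the file.
    static void buggyInit(TrackedFile fileToClose, boolean keysMatch) {
        if (!keysMatch)
            throw new IllegalStateException("DecoratedKey mismatch"); // fileToClose leaks
    }

    // Fixed shape: close the handle before propagating the failure.
    static void fixedInit(TrackedFile fileToClose, boolean keysMatch) {
        try {
            if (!keysMatch)
                throw new IllegalStateException("DecoratedKey mismatch");
        } catch (RuntimeException e) {
            fileToClose.close(); // release the descriptor on the error path
            throw e;
        }
    }
}
```

If the constructor can fail after opening the file, every failure path has to close it; otherwise each failed read permanently consumes a descriptor, which would match the FD growth seen in this issue.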

 Strange permament socket descriptors increasing leads to Too many open files
 --

 Key: CASSANDRA-4571
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4571
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.1
 Environment: CentOS 5.8 Linux 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 
 17:10:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux. 
 java version 1.6.0_33
 Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
 Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)
Reporter: Serg Shnerson
Assignee: Jonathan Ellis
Priority: Critical
 Fix For: 1.1.5

 Attachments: 4571.txt


 On a two-node cluster we observed a strange, steady increase in socket 
 descriptors. lsof -n | grep java shows many rows like
 java   8380 cassandra  113r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  114r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  115r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  116r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  117r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  118r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  119r unix 0x8101a374a080
 938348482 socket
 java   8380 cassandra  120r unix 0x8101a374a080
 938348482 socket
  The number of these rows increases constantly; after about 24 hours this 
 leads to an error.
 We use the PHPCassa client. Load is not very high (around ~50 kB/s on write). 



[jira] [Commented] (CASSANDRA-4687) Exception: DecoratedKey(xxx, yyy) != DecoratedKey(zzz, kkk)

2012-10-16 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477136#comment-13477136
 ] 

Chris Herron commented on CASSANDRA-4687:
-

Has anybody experienced Linux socket FD leakage alongside these errors? (see 
CASSANDRA-4571).
To check this, you can run:
 {{watch -n 10 sudo lsof -n | grep java | grep unix | wc -l}}
This number should stay at 1. If you see growth towards your limits 
(/etc/security/limits.conf), then that suggests CASSANDRA-4571 might be a 
side-effect of this problem.
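For a programmatic variant of the same check, a JVM process can count its own open descriptors via /proc on Linux. This is a sketch with made-up names, not part of Cassandra; it is Linux-only, since it relies on the /proc filesystem:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Count this process's open file descriptors via /proc (Linux only).
// A count that grows steadily under constant load suggests an FD leak.
public class FdCount {
    static long openFds() throws IOException {
        // Files.list returns a Stream that must be closed, or this
        // check would itself leak a descriptor.
        try (Stream<Path> fds = Files.list(Paths.get("/proc/self/fd"))) {
            return fds.count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("open fds: " + openFds());
    }
}
```

Sampling this periodically (e.g. from a scheduled task) gives the same trend line as the lsof watch above, without shelling out.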

 Exception: DecoratedKey(xxx, yyy) != DecoratedKey(zzz, kkk)
 ---

 Key: CASSANDRA-4687
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4687
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.5
 Environment: CentOS 6.3 64-bit, Oracle JRE 1.6.0.33 64-bit, single 
 node cluster
Reporter: Leonid Shalupov
Assignee: Pavel Yaskevich
Priority: Critical
 Fix For: 1.1.7

 Attachments: 4687-debugging.txt


 Under heavy write load sometimes cassandra fails with assertion error.
 git bisect leads to commit 295aedb278e7a495213241b66bc46d763fd4ce66.
 works fine if global key/row caches disabled in code.
 {quote}
 java.lang.AssertionError: DecoratedKey(xxx, yyy) != DecoratedKey(zzz, kkk) in 
 /var/lib/cassandra/data/...-he-1-Data.db
   at 
 org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:60)
   at 
 org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:67)
   at 
 org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:79)
   at 
 org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:256)
   at 
 org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
   at 
 org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1345)
   at 
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1207)
   at 
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1142)
   at org.apache.cassandra.db.Table.getRow(Table.java:378)
   at 
 org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
   at 
 org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:819)
   at 
 org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1253)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {quote}



[jira] [Commented] (CASSANDRA-4571) Strange permament socket descriptors increasing leads to Too many open files

2012-10-16 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477176#comment-13477176
 ] 

Chris Herron commented on CASSANDRA-4571:
-

Yes, seeing the key equality AssertionErrors from two SSTable iterators: 
SSTableSliceIterator:60 and SSTableNamesIterator:72.
Also seeing same EOF error reported by [~tjake] in CASSANDRA-4687:
{code}
java.io.IOError: java.io.EOFException: unable to seek to position 61291844 in 
/redacted/cassandra/data/test1/redacted/test1-redacted-hf-1-Data.db (59874704 
bytes) in read-only mode
at 
org.apache.cassandra.io.util.CompressedSegmentedFile.getSegment(CompressedSegmentedFile.java:69)
at 
org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:898)
at 
org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:50)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:67)
at 
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:79)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:256)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1345)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1207)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1142)
at org.apache.cassandra.db.Table.getRow(Table.java:378)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
at 
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException: unable to seek to position 61291844 in 
/redacted/cassandra/data/test1/redacted/test1-redacted-hf-1-Data.db (59874704 
bytes) in read-only mode
at 
org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:253)
at 
org.apache.cassandra.io.util.CompressedSegmentedFile.getSegment(CompressedSegmentedFile.java:64)
... 16 more
{code}




[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-16 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477556#comment-13477556
 ] 

Chris Herron commented on CASSANDRA-4417:
-

We are seeing large volumes of this error on all nodes when running a load test 
while also running upgradesstables on multiple CFs on each node.

After reading Sylvain's comments above, we tried running the same test with 
commitlog_sync: batch; we get a similar volume of the same errors.

(Running a build from branch cassandra-1.1 at commit 
4d2e5e73b127dc0b335176ddc1dec1f0244e7f6d, with Java 6u35 on Amazon Linux 2.6.35.)



[jira] [Commented] (CASSANDRA-4571) Strange permament socket descriptors increasing leads to Too many open files

2012-10-16 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477576#comment-13477576
 ] 

Chris Herron commented on CASSANDRA-4571:
-

Tested this patch: https://gist.github.com/2f10efd3922fab9a095e applied to a 
build from branch cassandra-1.1 at commit 
4d2e5e73b127dc0b335176ddc1dec1f0244e7f6d.

This definitely reduced the growth of socket FD handles, but there must be 
other scenarios like this in the codebase, because the count still grew beyond 
2, which is where I have seen it hold steady under normal conditions.

The AssertionErrors from CASSANDRA-4687 were so frequent that they were pegging 
disk IO. When I ran the same test again with assertions disabled for the 
org.apache.cassandra.db.columniterator package, I saw many errors like those 
described in CASSANDRA-4417 (invalid counter shard detected). See my comments 
in that issue.

Shouldn't CASSANDRA-4571 be re-opened?


[jira] [Commented] (CASSANDRA-4571) Strange permament socket descriptors increasing leads to Too many open files

2012-10-15 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476344#comment-13476344
 ] 

Chris Herron commented on CASSANDRA-4571:
-

For anybody else encountering this unbounded socket growth problem on 1.1.5, 
note that while upgrading to 1.6.0_35 seemed to help, a longer load test still 
reproduced the symptom. FWIW, upgradesstables ran for a period during this 
particular test; it is unclear whether the increased compaction activity 
contributed.



[jira] [Commented] (CASSANDRA-4571) Strange permament socket descriptors increasing leads to Too many open files

2012-10-15 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476710#comment-13476710
 ] 

Chris Herron commented on CASSANDRA-4571:
-

FYI, I was able to reproduce the symptom on Cassandra 1.1.6.
@[~jbellis] Re: CASSANDRA-4740 and whether it relates to this: 
* Haven't looked across all nodes for phantom connections yet.
* Have searched across all logs; found a single instance of Timed out 
replaying hints.
* Mina mentioned that nodes running earlier kernels (2.6.39, 3.0, 3.1) haven't 
exhibited this. We are seeing it on Linux kernel 2.6.35 with Java 1.6.0_35.




[jira] [Commented] (CASSANDRA-3070) counter repair

2011-09-08 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100744#comment-13100744
 ] 

Chris Herron commented on CASSANDRA-3070:
-

I have seen something similar. I recently grew a 0.8.4 test cluster from N 
nodes to N*2 nodes. After running nodetool repair on each node, I found that 
some counters were out of sync (counter values varied depending on which host 
was read).

 counter repair
 --

 Key: CASSANDRA-3070
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3070
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.4
Reporter: ivan
Assignee: Sylvain Lebresne
 Attachments: counter_local_quroum_maybeschedulerepairs.txt, 
 counter_local_quroum_maybeschedulerepairs_2.txt, 
 counter_local_quroum_maybeschedulerepairs_3.txt


 Hi!
 We have some counters out of sync, but repair doesn't sync the values.
 We tried nodetool repair.
 We use LOCAL_QUORUM for reads. A repair row mutation is sent to the other 
 nodes while reading a bad row, but the counters weren't repaired by the 
 mutation. Output from two nodes was uploaded. (Some new debug messages were 
 added.)
