[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2015-04-28 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517366#comment-14517366
 ] 

Benedict commented on CASSANDRA-7275:
-

Since this has come up again a few times, I propose tha

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.15

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch


 In Memtable.FlushRunnable, the CountDownLatch will never be counted down if 
 there are errors, which results in hanging any threads that are waiting for 
 the flush to complete.  For example, an error like this causes the problem:
 {noformat}
 ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 
 198) Exception in thread Thread[FlushWriter:474,5,main]
 java.lang.IllegalArgumentException
 at java.nio.Buffer.position(Unknown Source)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
 at 
 org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
 at 
 org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
 at 
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 {noformat}
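
The hang described above can be illustrated with a small sketch. The attachment name 0001-Move-latch.countDown-into-finally-block.patch suggests the direction; the code below is not that patch. FlushLatchSketch, FlushTask and the writeSortedContents field are hypothetical stand-ins for Memtable.FlushRunnable, used only to show that a countDown() in a finally block releases waiters even when the flush body throws.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class FlushLatchSketch {
    /** Hypothetical stand-in for Memtable.FlushRunnable; not Cassandra's real class. */
    static final class FlushTask implements Runnable {
        private final CountDownLatch latch;
        private final Runnable writeSortedContents; // stand-in for the real flush work

        FlushTask(CountDownLatch latch, Runnable writeSortedContents) {
            this.latch = latch;
            this.writeSortedContents = writeSortedContents;
        }

        @Override
        public void run() {
            try {
                writeSortedContents.run();
            } finally {
                // Runs on success and on any Throwable, so waiters are always released.
                latch.countDown();
            }
        }
    }

    /** Runs a flush that always fails and reports whether the waiting thread was released. */
    static boolean waiterReleasedAfterFailure() {
        CountDownLatch latch = new CountDownLatch(1);
        Thread flusher = new Thread(new FlushTask(latch, () -> {
            throw new IllegalArgumentException("simulated flush failure");
        }));
        // Swallow the simulated failure so the demo does not print a stack trace.
        flusher.setUncaughtExceptionHandler((thread, error) -> { });
        flusher.start();
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter released: " + waiterReleasedAfterFailure());
    }
}
```

This only addresses liveness. The comments in this thread are largely about what a released waiter may safely do afterwards (for example, whether the commit log may be marked clean), which the sketch does not attempt to model.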



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2015-04-06 Thread Jeremiah Jordan (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481349#comment-14481349
 ] 

Jeremiah Jordan commented on CASSANDRA-7275:


Just had a java.io.SyncFailedException cause this.  After the exception 
MemtablePostFlush was stuck.

{noformat}
ERROR [MemtableFlushWriter:6] 2015-04-03 01:57:06,973  CassandraDaemon.java:167 
- Exception in thread Thread[MemtableFlushWriter:6,5,main]
org.apache.cassandra.io.FSWriteError: java.io.SyncFailedException: sync failed
at 
org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:254)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:263)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.util.SequentialWriter.close(SequentialWriter.java:451) 
~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:664)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.sstable.SSTableWriter.close(SSTableWriter.java:495) 
~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.sstable.SSTableWriter.finish(SSTableWriter.java:448) 
~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:440)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:435)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:377)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(Memtable.java:327) 
~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
 ~[guava-16.0.1.jar:na]
at 
org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1097)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
~[na:1.8.0_40]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
~[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
Caused by: java.io.SyncFailedException: sync failed
at java.io.FileDescriptor.sync(Native Method) ~[na:1.8.0_40]
at 
org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:250)
 ~[cassandra-all-2.1.3.329.jar:2.1.3.329]
... 15 common frames omitted
{noformat}

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.15

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2015-01-05 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264623#comment-14264623
 ] 

Benedict commented on CASSANDRA-7275:
-

Accidentally assigned myself this ticket. Sorry if that caused any confusion.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-21 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255125#comment-14255125
 ] 

Benedict commented on CASSANDRA-7275:
-

I think somebody with better knowledge of the bookkeeping tables needs to chime 
in here and give an opinion on whether we can safely do this to any of the ones 
we care about.

We still need to decide what to do about the non-whitelisted tables, though. 
Most likely they should not mark the CL clean, at least for their affected 
segments, and possibly for the affected table indefinitely (until reboot), to 
guarantee no data bugs.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-21 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255133#comment-14255133
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


For non-whitelisted CFs we just follow what disk_failure_policy dictates; if 
we want to work on that in CASSANDRA-8498, we should probably move that 
discussion there.

[~slebresne] What do you think about the idea of whitelisting some of the 
system CFs which are not crucial to normal operation and ignoring flush errors 
in them, with all of the exceptions being logged at ERROR level? I took a look 
at some of them and I do think it's safe to whitelist at least 
compactions_in_progress (a failure in this one just blocks all compactions, 
because finishCompaction blocks on the switchMemtable result which, in case of 
failure, never counts down the latch), sstable_activity and compaction_history.
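
To make the proposal concrete, here is a minimal sketch of the decision it describes. The class FlushFailurePolicySketch, the Action enum and onFlushFailure are hypothetical names invented for illustration; only the three table names and the "log at ERROR, otherwise defer to disk_failure_policy" rule come from the comment above. How disk_failure_policy is then applied is deliberately left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class FlushFailurePolicySketch {
    /** Hypothetical outcome of a failed flush; not a Cassandra type. */
    enum Action { LOG_AND_CONTINUE, APPLY_DISK_FAILURE_POLICY }

    // System tables proposed in the comment as safe to whitelist.
    static final Set<String> WHITELIST = Collections.unmodifiableSet(new HashSet<>(
            Arrays.asList("compactions_in_progress", "sstable_activity", "compaction_history")));

    /** Decides what to do when flushing keyspace.table failed with the given error. */
    static Action onFlushFailure(String keyspace, String table, Throwable error) {
        if ("system".equals(keyspace) && WHITELIST.contains(table)) {
            // Assumption: a real implementation would use the server's logger at ERROR level.
            System.err.println("ERROR flush of system." + table + " failed: " + error);
            return Action.LOG_AND_CONTINUE;
        }
        // Everything else is treated like a user CF and handed to disk_failure_policy.
        return Action.APPLY_DISK_FAILURE_POLICY;
    }

    public static void main(String[] args) {
        Throwable error = new RuntimeException("simulated sync failure");
        System.out.println(onFlushFailure("system", "sstable_activity", error));
        System.out.println(onFlushFailure("system", "peers", error));
    }
}
```

Keeping the whitelist as explicit data makes it easy to review table by table, which matters because the thread leaves open which tables actually qualify.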

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-21 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255136#comment-14255136
 ] 

Benedict commented on CASSANDRA-7275:
-

CASSANDRA-8498 is only about replay. It can, in conjunction with repair, help 
avoid data corruption in a replicated table that did not experience a 
cross-cluster data-induced bug, but it won't help a corrupted system table that 
dropped intervening commit log records. So even with best_effort we need to 
take some remedial action to ensure non-whitelisted system tables are not 
corrupted.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Benedict
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-21 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255139#comment-14255139
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


Sure, I'm just trying to say that non-whitelisted (and most likely essential) 
tables should be no different from user CFs for disk_failure_policy, and 
should probably have a special repair phase dedicated to them. The only CFs I 
know for sure could be repaired properly by just removing data are schema_*; 
right now the operator just needs to trigger that. Other CFs like peers and 
NodeInfo can be auto-regenerated if missing or corrupted, so maybe if we keep 
the exception information from FlushInfo around we can do some of that work 
automatically on a repair request.

Also, I want to point out that running repair at least once within 
gc_grace_seconds is a hard requirement; most (if not all) people who run in 
production do it frequently, e.g. once a week on Sunday night.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Benedict
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-21 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255145#comment-14255145
 ] 

Benedict commented on CASSANDRA-7275:
-

That all sounds eminently reasonable, although I don't know much about the 
system tables. So long as we make sure there aren't any that could be corrupted 
and be non-recoverable, we're good in my book, and I'll let others more 
qualified make the decision about which tables that's true of. We should 
probably introduce some tests to ensure this is indeed safe for each table, 
though.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Benedict
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-20 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254652#comment-14254652
 ] 

Benedict commented on CASSANDRA-7275:
-

bq. WDYT about previous idea of keeping commit logs even for failed flushes?

I'm worried about replaying these records after gc_grace expires. For 
local-only replication this could be non-recoverable, which is why I favour 
taking remedial action and just dropping the records if possible to do so 
safely; if it isn't possible it means corruption has to be avoided, so we would 
likely have to do one of:

# Keep all CL records since the failure for the table (possibly only necessary 
for local-only replication)
## By normal mechanism, but this could retain every CL segment
## By copying relevant CL records out into their own stream, so the rest can be 
expired
## By giving each table its own CL, or perhaps only on failure, or only system 
tables?
# Periodically replay into new memtables, merge with existing data, reinsert 
into CL and reattempt flush logic, to ensure we are never older than gc_grace 
(basically just rolling the problem forward each time)
# Replay CL records only, periodically, sans any deleted items

All of the above are a bit clunky or complicated though, and might have their 
own bugs or encounter the same hardware problems. I'm sure some other actions 
are also possible, but probably equally ugly.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-20 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254694#comment-14254694
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


I'm fine with dropping the records if we only consider system CFs for which a 
failed flush is acceptable; maybe the CL is not that important for those, e.g. 
compactions_in_progress, sstable_activity and compaction_history. I am also 
thinking we could add additional checks to the CLReplayer class, so that when 
it picks up an RM for replay it could drop whatever records are past 
gc_grace (the RM has information about RangeTombstone and DeletedColumn), 
because there is no real point in replaying them anyway, just to be safe.
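
One possible reading of the replay check suggested above is sketched below. ReplayFilterSketch, Record and dropExpiredTombstones are hypothetical and do not reflect the real CLReplayer or RM classes; the sketch assumes each replayed record either is live or carries a deletion time in seconds, and drops only tombstones whose deletion time is older than gc_grace.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplayFilterSketch {
    /** Hypothetical replay record; localDeletionTime is seconds since epoch, or -1 for live data. */
    static final class Record {
        final String name;
        final long localDeletionTime;

        Record(String name, long localDeletionTime) {
            this.name = name;
            this.localDeletionTime = localDeletionTime;
        }

        boolean isTombstone() {
            return localDeletionTime >= 0;
        }
    }

    /** Returns the records worth replaying: live data plus tombstones still within gc_grace. */
    static List<Record> dropExpiredTombstones(List<Record> records, long gcGraceSeconds, long nowSeconds) {
        List<Record> kept = new ArrayList<>();
        for (Record record : records) {
            boolean expired = record.isTombstone()
                    && record.localDeletionTime + gcGraceSeconds < nowSeconds;
            if (!expired) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        records.add(new Record("live", -1));
        records.add(new Record("old-tombstone", 100));
        records.add(new Record("recent-tombstone", 900));
        for (Record record : dropExpiredTombstones(records, 500, 1000)) {
            System.out.println(record.name);
        }
    }
}
```

Whether skipping such records is actually safe is exactly what the rest of the thread debates, so this should be read as an illustration of the idea rather than a recommendation.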

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-20 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254703#comment-14254703
 ] 

Benedict commented on CASSANDRA-7275:
-

Yes, see CASSANDRA-8498, which I filed to address that since it's potentially 
a more general concern. The problem here is that by not replaying you could be 
corrupting your data just as much as you might through replay. The difference 
is that you can (hopefully) fix it through repair, but you cannot do this for 
local-only tables, and if the problem is a data-induced bug it could be broken 
cluster-wide.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch


 In Memtable.FlushRunnable, the CountDownLatch will never be counted down if 
 there are errors, which results in hanging any threads that are waiting for 
 the flush to complete.  For example, an error like this causes the problem:
 {noformat}
 ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 
 198) Exception in thread Thread[FlushWriter:474,5,main]
 java.lang.IllegalArgumentException
 at java.nio.Buffer.position(Unknown Source)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
 at 
 org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
 at 
 org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
 at 
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 {noformat}
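
For illustration, the hang described above and the fix suggested by the name of the first attachment (moving latch.countDown into a finally block) can be sketched as follows. This is a minimal sketch, not the attached patch: LatchedFlushTask and LatchDemo are hypothetical names, and the real Memtable.FlushRunnable is not reproduced here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a flush task; not the actual Memtable.FlushRunnable.
final class LatchedFlushTask implements Runnable {
    private final Runnable body;         // the flush work, which may throw
    private final CountDownLatch latch;  // threads waiting for the flush block on this

    LatchedFlushTask(Runnable body, CountDownLatch latch) {
        this.body = body;
        this.latch = latch;
    }

    @Override
    public void run() {
        try {
            body.run();
        } finally {
            // Runs on success and on failure, so waiters are always released.
            latch.countDown();
        }
    }
}

final class LatchDemo {
    // Returns true if the waiter was released within the timeout.
    static boolean runAndAwait(Runnable body) {
        CountDownLatch latch = new CountDownLatch(1);
        Thread worker = new Thread(new LatchedFlushTask(body, latch));
        // Swallow the simulated error so the demo does not print a stack trace.
        worker.setUncaughtExceptionHandler((thread, error) -> { });
        worker.start();
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // The task throws, yet the waiter is still released.
        System.out.println(runAndAwait(() -> { throw new IllegalArgumentException(); }));
    }
}
```

Releasing the latch only stops the waiters from hanging; it does not tell them that the flush failed, which is the point the comments on this ticket go on to debate.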




[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-20 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255045#comment-14255045
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


OK, so how about we change the CL behavior in CASSANDRA-8498, and in the scope 
of this ticket we just do the failure FlushInfo change (from my previous patch) 
plus a change that whitelists some of the system CFs and otherwise checks 
disk_failure_policy and fails if needed?
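
A minimal sketch of the decision proposed here, under stated assumptions: FlushFailureWhitelist, Action and the contents of the whitelist are hypothetical, and compactions_in_progress is included only because it is the one system table named in this thread. Whether ignoring a failed flush of that table is actually safe is exactly what is being debated.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class FlushFailureWhitelist {
    enum Action { LOG_AND_CONTINUE, APPLY_DISK_FAILURE_POLICY }

    // Hypothetical whitelist of system CFs whose flush failures are tolerated.
    private static final Set<String> TOLERATED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("system.compactions_in_progress")));

    // Whitelisted CFs carry on after logging; everything else defers to
    // whatever disk_failure_policy prescribes.
    static Action onFlushFailure(String keyspace, String columnFamily) {
        return TOLERATED.contains(keyspace + "." + columnFamily)
                ? Action.LOG_AND_CONTINUE
                : Action.APPLY_DISK_FAILURE_POLICY;
    }

    public static void main(String[] args) {
        System.out.println(onFlushFailure("system", "compactions_in_progress"));
        System.out.println(onFlushFailure("ks1", "users"));
    }
}
```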


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-19 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253271#comment-14253271
 ] 

Benedict commented on CASSANDRA-7275:
-

I should clarify, since it sounds like we are not too far apart on this point: 
I'm suggesting only that the failure is reported to the flush call site, not so 
that the call site can specialise on the kind of exception, but so that if this 
specific call site can safely cope with _any_ failure, it can be specialised to 
do so, with remedial action if necessary. A whitelist would be a subset of this 
approach, and hence simpler, but only if it's genuinely safe to just drop the 
problem on the floor. I'm not sufficiently familiar with these system tables to 
say for sure, but I do recollect problems safely starting a node when 
compactions_in_progress was not properly maintained, so I expect _some_ 
remedial action will probably be necessary, perhaps on a case-by-case basis.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-19 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254342#comment-14254342
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


Sure, it sounds like we need something conceptually similar to a multimap of 
(cfId, List<Exception>), which would be checked when an exception is returned 
from the FlushInfo. Also, WDYT about the previous idea of keeping commit logs 
even for failed flushes?
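
The per-column-family failure map described here might look roughly like the sketch below. FlushFailureRegistry and its method names are hypothetical, and using a UUID for cfId is an assumption made for the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical registry: column family id -> exceptions seen while flushing it.
final class FlushFailureRegistry {
    private final Map<UUID, List<Exception>> failures = new ConcurrentHashMap<>();

    // Called by the flush path when a flush of the given CF throws.
    void record(UUID cfId, Exception error) {
        failures.computeIfAbsent(cfId, id -> new CopyOnWriteArrayList<>()).add(error);
    }

    // Called by whoever inspects the flush result; empty if nothing failed.
    List<Exception> failuresFor(UUID cfId) {
        return failures.getOrDefault(cfId, Collections.emptyList());
    }

    boolean hasFailed(UUID cfId) {
        return !failuresFor(cfId).isEmpty();
    }

    public static void main(String[] args) {
        FlushFailureRegistry registry = new FlushFailureRegistry();
        UUID cfId = UUID.randomUUID();
        registry.record(cfId, new IllegalArgumentException("simulated flush error"));
        System.out.println(registry.hasFailed(cfId));
    }
}
```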


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-18 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251586#comment-14251586
 ] 

Benedict commented on CASSANDRA-7275:
-

I agree with Pavel that _if we can do so safely_ we should not crash on failing 
to update internal bookkeeping, but only if we can guarantee that failing to 
keep the bookkeeping up to date won't cause other problems. Which is why I 
suggest:

bq. 2) We can report these exceptions back to the waiter on the Future result, 
and this waiter can choose how to behave. If, say, the memtable of a system 
column family that can be worked-around fails to flush (for instance, 
compactions_in_progress) then instead of retrying, it can simply take some 
other action to ensure the system continues to make safe progress.
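
To make option 2 concrete, here is a minimal sketch of a waiter that receives the flush error through the Future and picks its own reaction. It uses only java.util.concurrent; FlushWaiterSketch, Outcome and the remedial action are hypothetical and do not correspond to Cassandra's flush code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FlushWaiterSketch {
    enum Outcome { FLUSHED, REMEDIATED }

    // Hypothetical call site: waits for the flush and decides how to react to failure.
    static Outcome awaitFlush(Future<?> flush, Runnable remedialAction) {
        try {
            flush.get();
            return Outcome.FLUSHED;
        } catch (ExecutionException e) {
            // The flush task threw; e.getCause() holds the original error.
            remedialAction.run();
            return Outcome.REMEDIATED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for flush", e);
        }
    }

    // Submits a flush that optionally fails and reports what the waiter did.
    static Outcome demo(boolean fail) {
        ExecutorService flushWriter = Executors.newSingleThreadExecutor();
        try {
            Future<?> flush = flushWriter.submit(() -> {
                if (fail) throw new IllegalArgumentException("simulated flush error");
            });
            return awaitFlush(flush, () -> { /* e.g. rebuild the state another way */ });
        } finally {
            flushWriter.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(false) + " " + demo(true));
    }
}
```

The waiter never hangs in either case, because Future.get returns or throws as soon as the task finishes, and each call site can supply a remedial action suited to the table it was flushing.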



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-18 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253086#comment-14253086
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


Benedict, as I mentioned, I agree with #2 with one correction: there is no 
actual way to tell what exactly went wrong from FS\{Read, Write\}Error alone, so 
it probably doesn't make much sense to hand the exception back for processing 
(it was already logged by the thread pool). Instead, we could have a set of 
whitelisted system CFs for which a failure to flush is acceptable... It would 
of course be nice to separate commitlogs so that flushing/recovery is less 
error-prone, but that is way outside the scope of this ticket.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-17 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249718#comment-14249718
 ] 

Benedict commented on CASSANDRA-7275:
-

I've filed CASSANDRA-8496, which would help with this problem in 2.1 only. It 
isn't sufficient to ensure the server stays stable, but it would both avoid 
forward progress being stopped by errors on the post flusher and ensure that 
the affected commit log records are retained indefinitely without resulting in 
infinite commit log growth.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-17 Thread Sylvain Lebresne (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249719#comment-14249719
 ] 

Sylvain Lebresne commented on CASSANDRA-7275:
-

The current behavior is that an unexpected flush error blocks every flush 
from then on. It does seem to me that changing it so that it blocks only flushes 
for the column family on which there was a problem (which is not exactly what 
the patch does, and I do agree with Benedict that it does need to do that) is 
an improvement: if the problem happens for every CF then we're no worse off than 
currently, but if it's a one-time event it might leave time for operators to 
take proper action (of course, we should log a scary error; it's not something 
that should be ignored). So maybe we can start there, since we don't seem to 
agree on whether crashing the node is an even better improvement?

As far as my own opinion goes, I am not in favor of crashing in that case 
because, again, if you hold enough memtables in memory that your node becomes 
unresponsive, you're not really worse off than if you had crashed it right 
away, but if the problem ends up impacting a low-traffic table (for instance a 
system table), you might be able to fix the problem in a way that is less 
impactful for your cluster.

I'll note, however, that I would agree that if the error is an IO one, we should 
respect the disk_failure_policy. And I don't know, maybe we need another 
failure policy (best_effort/crash) for unexpected errors (aka bugs) that have 
the potential of destabilizing a node (I would agree that adding this is 
pushing the problem to our users, but it appears not everyone has the same idea 
of what the best strategy is, and there is maybe not a single good answer).
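
As a rough illustration of the split suggested here (IO errors follow disk_failure_policy, other unexpected errors follow a separate best_effort/crash setting), consider the sketch below. No such setting exists in the source thread beyond this suggestion: UnexpectedErrorPolicy, Reaction and react are hypothetical names, and java.io.IOException and java.io.IOError are used as stand-ins for whatever IO error types the server raises.

```java
import java.io.IOError;
import java.io.IOException;

final class UnexpectedErrorPolicySketch {
    // Hypothetical setting with the two values floated in the comment above.
    enum UnexpectedErrorPolicy { BEST_EFFORT, CRASH }

    enum Reaction { DEFER_TO_DISK_FAILURE_POLICY, LOG_AND_BLOCK_CF_FLUSHES, STOP_NODE }

    // IO errors defer to disk_failure_policy; anything else follows the new policy.
    static Reaction react(Throwable error, UnexpectedErrorPolicy policy) {
        if (isIoError(error)) {
            return Reaction.DEFER_TO_DISK_FAILURE_POLICY;
        }
        return policy == UnexpectedErrorPolicy.CRASH
                ? Reaction.STOP_NODE
                : Reaction.LOG_AND_BLOCK_CF_FLUSHES;
    }

    // Walks the cause chain looking for an IO-related error (stand-in types).
    static boolean isIoError(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof IOException || t instanceof IOError) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(react(new IllegalArgumentException(), UnexpectedErrorPolicy.BEST_EFFORT));
    }
}
```

With BEST_EFFORT, the reaction matches the behavior argued for above: log loudly and stop flushing only the affected column family, leaving the operator time to intervene.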



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-17 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249733#comment-14249733
 ] 

Benedict commented on CASSANDRA-7275:
-

Just to add to what Sylvain says about the size of the memtable, to hopefully 
help target a solution (spoken agnostically): in 2.1 we could become almost 
immediately unusable for writes if the memtable(s) we are retaining after this 
(or multiple exceptions) exceed a certain proportion of memory, as we will stop 
even trying to flush. So for 2.1 at least if we're going to try and stay alive 
we need to consider if we would prefer to drop writes on the floor 
(aggressively, to avoid build-up in the queue) if the set of memtables in limbo 
is too large, or if we drop memtables until we reclaim enough space to proceed, 
or if we introduce some special logic for flushing in this event.

In 2.0, conversely, we may flush millions of tiny sstables in the wrong 
scenario, but this would not prevent function unless it permitted excess heap 
growth, or a compaction death spiral. 


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-17 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250118#comment-14250118
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

bq. if you hold enough memtables in memory that your node become unresponsive, 
you're not really worse off than if you had crashed it right away

I disagree: we have a ton of evidence to date that a node that slowly falls 
over as it OOMs is much worse than a node that dies and gets marked down 
quickly by the FD.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-17 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250519#comment-14250519
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


Just to reiterate, I still don't understand why we would prefer to crash the 
process if an error happens on a system CF flush, e.g. at the end of compaction 
for a table like compactions_in_progress, which is not even essential for 
operation. And there is still no clear answer to how we distinguish between an 
FS{Read, Write}Error generated in response to an FS or system failure and one 
generated in response to an incorrect call that Cassandra made, e.g. a 
duplicate hard-link.

If the failure was in a system CF, I would prefer that we log the message, keep 
the commitlog and let everything carry on instead of crashing, because a crash 
could essentially result in dropping incoming data. The story is different for 
actual user memtables, though: as I mentioned a couple of times in my previous 
comments, I'm totally fine with crashing if a normal memtable flush fails and 
disk_failure_policy says so.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-17 Thread Tupshin Harper (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250554#comment-14250554
 ] 

Tupshin Harper commented on CASSANDRA-7275:
---

Strongly in favor of the opt-in, policy-based approach that [~jbellis] 
mentioned.  There isn't a one-size-fits-all approach to deal with this.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch


 In Memtable.FlushRunnable, the CountDownLatch will never be counted down if 
 there are errors, which results in hanging any threads that are waiting for 
 the flush to complete.  For example, an error like this causes the problem:
 {noformat}
 ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 
 198) Exception in thread Thread[FlushWriter:474,5,main]
 java.lang.IllegalArgumentException
 at java.nio.Buffer.position(Unknown Source)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
 at 
 org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
 at 
 org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
 at 
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 {noformat}
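
For illustration, the hang described above can be avoided at the latch level by counting down in a finally block, which is what the attachment name 0001-Move-latch.countDown-into-finally-block.patch suggests. The following is only a minimal sketch under that assumption: LatchInFinally, flushTask and waiterReleased are hypothetical names, not Cassandra's actual Memtable.FlushRunnable code. Releasing the waiters does not make the flush itself succeed, so what should happen after a failed flush is a separate question.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a flush task; not Cassandra's actual Memtable.FlushRunnable.
final class LatchInFinally {
    // Wraps the (possibly throwing) write step so the latch is always counted down.
    static Runnable flushTask(Runnable writeSortedContents, CountDownLatch latch) {
        return () -> {
            try {
                writeSortedContents.run(); // may throw, e.g. IllegalArgumentException
            } finally {
                latch.countDown(); // release waiters whether or not the write succeeded
            }
        };
    }

    // Runs the task on its own thread and reports whether a waiter was released in time.
    static boolean waiterReleased(Runnable writeSortedContents) {
        CountDownLatch latch = new CountDownLatch(1);
        Thread flusher = new Thread(flushTask(writeSortedContents, latch));
        flusher.setUncaughtExceptionHandler((t, e) -> { /* demo only: ignore the failure */ });
        flusher.start();
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, a waiter is released both when the write step returns normally and when it throws.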




[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248554#comment-14248554
 ] 

Benedict commented on CASSANDRA-7275:
-

bq. This is not going to help if the problem is data driven or external, you 
are just going to trash flusher threads without doing any useful work.

Well, let's try and address each problem independently. A data-induced bug that 
can occur across many nodes simultaneously is likely to occur repeatedly and 
cause the cluster to degrade, probably quite rapidly, and will likely occur on 
all owners of a given token at once. Coupled with the stop-gap measures we're 
discussing, this might well run the risk of actual data loss or data corruption 
cross-cluster. Read repair would _not_ help for such a data bug, since none of 
the nodes would be in a safe state.

However, the transient file system problems you're encountering would be helped 
by reattempting the flush. So an initial and completely safe approach would be 
to retry a few times and _then_ crash the server (possibly with some random 
waiting involved to avoid a disastrous cascade of cluster-wide death). Wasting 
work isn't really a big problem if the system cannot make progress without this 
success, so I don't see a downside on that front. It's possible that, once this 
fails, we could negotiate a safe crash with our peers, so that if there is a 
data bug at most one replica dies, the operator is well aware of the problem, 
but the cluster continues to operate. This is difficult with vnodes, though, 
and perhaps a little contrived for the current state of C*.
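
The "retry a few times, with some random waiting, and then give up" idea could be sketched roughly as follows. This is an illustration only: FlushRetry, retryThenFail and its parameters are hypothetical names, not Cassandra APIs, and the decision to crash the server after the final failure is left to the caller.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical helper; not part of Cassandra.
final class FlushRetry {
    // Calls `attempt` up to maxAttempts times. Between failed tries it sleeps for a
    // random delay in [0, maxJitterMillis] so that nodes hitting the same problem do
    // not all retry (or give up) at the same instant. Rethrows the last failure.
    static <T> T retryThenFail(Supplier<T> attempt, int maxAttempts, long maxJitterMillis) {
        if (maxAttempts < 1 || maxJitterMillis < 0)
            throw new IllegalArgumentException("maxAttempts >= 1 and maxJitterMillis >= 0 required");
        RuntimeException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e;
                if (i < maxAttempts)
                    sleepQuietly(ThreadLocalRandom.current().nextLong(maxJitterMillis + 1));
            }
        }
        throw last; // the caller decides what "then crash" means
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

A transient failure that clears after a couple of attempts returns normally; a persistent one surfaces after maxAttempts tries.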

Separately, we can look into perhaps weakening our constraints in various ways. 
The big issue you raise is that compaction is specifically held up. There seem 
to be two things we can do to help this:

1) We can make the dependency queue for marking commit log records unused 
table-specific, so that compactions only get held up if there has been an error 
on the compaction queue;
2) We can report these exceptions back to the waiter on the Future result, and 
this waiter can choose how to behave. If, say, the memtable of a system column 
family that can be worked around fails to flush (for instance, 
compactions_in_progress), then instead of retrying, it can simply take some 
other action to ensure the system continues to make safe progress. If a data 
table fails to flush, it can attempt to retry. 
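
Option 2 above, reporting the failure to the waiter through the Future, might look something like this. It is a sketch under assumptions: FlushOutcome, Action and onFlushResult are hypothetical names, and java.util.concurrent.CompletableFuture stands in for whatever Future type the flush path actually uses.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical waiter-side handling; not Cassandra's actual flush API.
final class FlushOutcome {
    enum Action { CONTINUE, WORK_AROUND, RETRY }

    // The waiter inspects the flush result and chooses its own reaction:
    // a table that can be worked around is not retried, a data table is.
    static Action onFlushResult(CompletableFuture<Void> flush, boolean canWorkAround) {
        try {
            flush.join(); // rethrows a flush failure wrapped in CompletionException
            return Action.CONTINUE;
        } catch (CompletionException e) {
            return canWorkAround ? Action.WORK_AROUND : Action.RETRY;
        }
    }
}
```

The point of this design is that the flush task only reports what happened, and each waiter applies the policy that is safe for its own table.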

Eventually, if it cannot recover safely, it should die though, as there will 
need to be some operator involvement and the reality is not everybody monitors 
their log files. I am very -1 on introducing a change that knowingly produces a 
complex failure condition that will not be widely known or understood, but I 
may be alone on that.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249033#comment-14249033
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

bq. Killing the C* process is harmful: if we have a code problem in 
writeSortedContents or replaceFlushed, it would potentially result in 
shutdown of the whole cluster, or at least of all the neighbors sharing a 
replica range.

I'm much more comfortable with "things die if something goes catastrophically 
wrong" than "things start returning nonsense on reads," which is what happens 
if we mark something flushed that actually wasn't.

That said, I'd be okay using disk failure policy as a guide.  If people opt 
into best effort behavior and are okay with those implications, so be it.
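
As a rough illustration of using the disk failure policy as a guide, the reaction to a failed flush could be keyed off the configured value. Only the policy names stop, best_effort and ignore come from the cassandra.yaml disk_failure_policy setting; FlushFailurePolicy, Reaction and the mapping itself are assumptions made for this sketch, not shipped behavior.

```java
import java.util.Locale;

// Hypothetical mapping from disk_failure_policy to a flush-failure reaction.
final class FlushFailurePolicy {
    enum Policy { STOP, BEST_EFFORT, IGNORE }
    enum Reaction { STOP_SERVING, CONTINUE_DEGRADED }

    // Parses the configured string, e.g. "best_effort" -> Policy.BEST_EFFORT.
    static Policy parse(String configured) {
        return Policy.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }

    // Assumed mapping: only operators who opted into a lenient policy keep running.
    static Reaction onFlushFailure(Policy policy) {
        switch (policy) {
            case BEST_EFFORT:
            case IGNORE:
                return Reaction.CONTINUE_DEGRADED;
            case STOP:
            default:
                return Reaction.STOP_SERVING;
        }
    }
}
```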


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249053#comment-14249053
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


I understand it might be hard for you, Benedict, but just consider there could 
be a programming error in the flush of the memtable or replacing flushed one, 
which is only triggered when metadata about compaction is written back at the 
end of that compaction, e.g. CompactionTask.runMayThrow() L225. Examples are 
the error mentioned in the description, a duplicate hard-link failure, or 
something similar that has nothing to do with the underlying (file-)system. In 
that case the #1 suggestion is not going to help, because compaction is blocked 
in SystemKeyspace.finishCompaction(), and a flush retry is not going to help 
because it will just fail again and again trying to flush the same data. As an 
end user, I would prefer that nobody except me decides to fail on the floor for 
me, because it means data loss even when the problem is not affecting the 
actual write/read path. I would be fine, though, with failing on 
FS{Read,Write}Error if the user explicitly sets it to fail on I/O errors (e.g. 
disk_failure_policy; it is kind of like your #2, but not exactly). Otherwise I 
would rather get notified in the log and carry on, so I can make an informed 
decision on my next actions.

bq. Eventually, if it cannot recover safely, it should die though, as there 
will need to be some operator involvement and the reality is not everybody 
monitors their log files.

I'm going to ignore this argument until you actually have experience of running 
Cassandra in production, otherwise it's the same as talking to the wall.

bq. I'm much more comfortable with "things die if something goes 
catastrophically wrong" than "things start returning nonsense on reads," which 
is what happens if we mark something flushed that actually wasn't.

I remember it was already the same when the disk was full in DSE; did people 
actually have fun restoring the cluster after it went completely dark? I'm also 
*not* saying that we shouldn't fail on FS{Read,Write}Error if 
disk_failure_policy says otherwise.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249077#comment-14249077
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

bq. consider there could be a programming error in the flush of the memtable or 
replacing flushed one

I don't know that hand waving about potential bugs gets us anywhere.  There 
could be programming errors anywhere, including in a "mark the segment flushed 
when it wasn't" panic mode.  The right solution to bugs is QA, not hoping that 
you can guess where unexpected exceptions will happen and provide a safety net.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Jason Brown (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249148#comment-14249148
 ] 

Jason Brown commented on CASSANDRA-7275:


bq. The right solution to bugs is QA

I kind of agree with this, but QA can only confirm that we've fixed what we've 
discovered as faulty. Strange things will always happen in the real world, and 
unfortunately QA cannot discover (all of) those.

This is a problem we have today - now, in fact. Unfortunately, blindly shutting 
down nodes for us (and, I suspect, most installations) isn't a viable solution 
as it could result in an uncontrolled cascade of shutdowns. I'm not saying we 
shouldn't shut down on real file system problems (especially if the operator 
has set disk_failure_policy properly), but here's our situation: all 
compactions completely shut down when we fail to create the hard link for 
incremental backups, simply on a system CF with only metadata. This could be a 
legit file system problem, that affects the entire system, or it could be 
something minor, but perhaps we can be smarter about the known things that can 
fail that we deem not fatal (and then choose how we want to react to those). In 
our case, while it's unfortunate that some incremental backup data might be 
lost, it would be (and is) much worse to crash the system. If it's a 
programming bug, perhaps we should follow what the operator sets up for the 
disk_failure_policy, but it seems a shame to shutdown on something trivial like 
failing to create a hard link, especially on system metadata CFs. 




[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249160#comment-14249160
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

bq. it seems a shame to shutdown on something trivial like failing to create a 
hard link, especially on system metadata CFs

Well, if the disk is erroring out, failing to create a hard link is only the 
first problem you'll have. :)  But again, disk failure policy covers this, so 
sure, go ahead and take evasive action if you prefer.  I think Pavel's 
suggestion of dropping the memtable but leaving the commitlog marked to-replay 
is reasonable for that scenario.

If it's our bug, then you may need a temporary patch while we figure out the 
cause, but I still don't think that kind of {{// this shouldn't happen}} code 
should be shipped officially.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-16 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249174#comment-14249174
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


The problem is that there is no way to tell whether the hard-link problem is an 
actual fs/disk problem or a programming error. Right now it looks like a 
programming error, because the snapshot tries to create a duplicate hard-link 
to the same file, as I mentioned in CASSANDRA-8476. So if there is no way to 
tell, how reasonable is it to enforce shutdown or any rule from 
disk_failure_policy?

bq. If it's our bug, then you may need a temporary patch while we figure out 
the cause, but I still don't think that kind of // this shouldn't happen code 
should be shipped officially.

If it's your problem it's my problem as well. We can work around it for now (as 
I guess most people do), but my intention in this ticket is to fix this problem 
for good instead of just fixing a symptom of it (that being the aforementioned 
duplicate hard-link problem).


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-15 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246780#comment-14246780
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

([~benedict] to review)


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-15 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246852#comment-14246852
 ] 

Benedict commented on CASSANDRA-7275:
-

I'm not sure this really improves the current state of affairs very much, and 
introduces a bug as well.

If we fail to flush and simply carry on without clearing the CL, then the host 
will retain the memtable it wanted to flush forever, leaving it in a 
potentially severely degraded state (increasing risk of exceeding heap limit, 
or possibly failing to ever accept writes in 2.1 due to insufficient memory). 
If the same table has another flush backed up (or another is later scheduled) 
then we will also end up expiring the commit log records anyway, despite not 
having flushed successfully.

Either we need to reattempt the flush, prevent any further flushes on that 
column family from ever succeeding, or - more simply - kill the C* process.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 2.0.12

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch


 In Memtable.FlushRunnable, the CountDownLatch will never be counted down if 
 there are errors, which results in hanging any threads that are waiting for 
 the flush to complete.  For example, an error like this causes the problem:
 {noformat}
 ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 
 198) Exception in thread Thread[FlushWriter:474,5,main]
 java.lang.IllegalArgumentException
 at java.nio.Buffer.position(Unknown Source)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
 at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
 at 
 org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
 at 
 org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
 at 
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 {noformat}
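
The first attachment's name suggests moving latch.countDown() into a finally 
block. A minimal sketch of that idea follows; the class and method names are 
hypothetical and this is not Cassandra's actual FlushRunnable.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a flush task; not Cassandra's actual FlushRunnable.
final class LatchedFlushSketch implements Runnable {
    private final Runnable writeContents;   // the work that may throw
    private final CountDownLatch latch;     // other threads wait on this

    LatchedFlushSketch(Runnable writeContents, CountDownLatch latch) {
        this.writeContents = writeContents;
        this.latch = latch;
    }

    @Override
    public void run() {
        try {
            writeContents.run();
        } finally {
            // Always release waiters, even if the write threw.
            latch.countDown();
        }
    }

    // Runs a task that always fails and reports whether the waiter was released.
    static boolean waitersReleasedAfterFailure() {
        CountDownLatch latch = new CountDownLatch(1);
        Thread t = new Thread(new LatchedFlushSketch(
                () -> { throw new IllegalArgumentException("simulated flush error"); },
                latch));
        // Keep the simulated error out of stderr for this demo.
        t.setUncaughtExceptionHandler((thread, e) -> { });
        t.start();
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Releasing the latch only unblocks the waiters. As the comments in this thread 
point out, whatever runs after the latch still needs to know whether the flush 
actually succeeded before it discards commit log records.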



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-15 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247074#comment-14247074
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


bq. I'm not sure this really improves the current state of affairs very much, 
and introduces a bug as well.

It improves the current state of affairs in that failures in the flush 
are no longer going to cause a compaction freeze.

bq. If we fail to flush and simply carry on without clearing the CL, then the 
host will retain the memtable it wanted to flush forever, leaving it in a 
potentially severely degraded state (increasing risk of exceeding heap limit, 
or possible failing to ever accept writes in 2.1 due to insufficient memory). 
If the same table has another flush backed up (or another is later scheduled) 
then we will also end up expiring the commit log records anyway, despite not 
having flushed successfully.

I think what we can do to prevent CL expiry is to mark the segment as discarded 
without deleting the actual file on disk. This way it can be replayed on start 
up, and the memtable flushes that follow are not going to delete any 
potentially unflushed data.

bq. Either we need to reattempt the flush, prevent any further flushes on that 
column family from ever succeeding, or - more simply - kill the C* process.

Let's start with reattempting the flush: we don't really have enough 
information to decide whether to re-attempt it, as a flush can fail for a 
number of reasons, an I/O error being just one of them.
Killing the C* process is harmful: if we have a code problem in 
writeSortedContents or replaceFlushed, it could potentially result in shutdown 
of the whole cluster, or at least of all of the neighbors sharing a replica 
range.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-15 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247461#comment-14247461
 ] 

Benedict commented on CASSANDRA-7275:
-

What happens if the process stays up longer than tombstone grace period? Or we 
introduce CASSANDRA-6434? This approach seems like a minefield to me, ignoring 
the operational risks of running a very degraded server without the operators 
realising it. 

Generally we take the approach of dying if a non-recoverable error occurs, and 
while I agree the risk of killing a whole cluster through a bug is suboptimal, 
we already run that risk in a number of places in the codebase (current 
behaviour here included, just with less alacrity). In my opinion this is 
preferable to potentially re-introducing dead data, or having the complexity of 
safely keeping the process alive as a zombie, and ensuring that zombie doesn't 
degrade cluster performance by hobbling instead of dying.

Other than dying, periodically trying to reflush and only keeling over when we 
run out of room or have failed for a long period (possibly random? to avoid the 
tiny risk of bunching) seems like a good idea.
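
The retry idea above could be sketched roughly as follows. This is only an 
illustration under stated assumptions: the class, method and parameter names 
are hypothetical, the "out of room" condition is not modelled, and the 
randomness is applied to the give-up time so that nodes failing together do 
not all stop at the same instant.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Hypothetical retry helper; names and parameters are illustrative only.
final class ReflushRetrySketch {
    /**
     * Calls attempt every retryIntervalMillis until it returns true or the
     * time budget (giveUpAfterMillis plus a random extra of up to
     * jitterMillis) is used up. Returns true on success, false if the caller
     * should give up (for example by stopping the process).
     */
    static boolean retryUntilDeadline(BooleanSupplier attempt,
                                      long retryIntervalMillis,
                                      long giveUpAfterMillis,
                                      long jitterMillis) {
        if (retryIntervalMillis <= 0 || giveUpAfterMillis < 0 || jitterMillis < 0)
            throw new IllegalArgumentException("invalid retry parameters");

        // Randomise the budget so that nodes do not give up in lockstep.
        long budgetMillis = giveUpAfterMillis
                + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;

        while (true) {
            if (attempt.getAsBoolean())
                return true;                      // flush finally succeeded
            if (System.nanoTime() - deadline >= 0)
                return false;                     // failed for too long
            try {
                Thread.sleep(retryIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller would pass the flush attempt as the BooleanSupplier and decide what 
"give up" means when the method returns false.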


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-12-15 Thread Pavel Yaskevich (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247538#comment-14247538
 ] 

Pavel Yaskevich commented on CASSANDRA-7275:


There is an option to die on an I/O error, and I'm happy to make it so that we 
die on FSWriteError or similar if requested by config.
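
A config-driven decision of this kind might look like the sketch below. The 
Policy and Action enums and the method names are hypothetical, and a plain 
java.io.IOException stands in for FSWriteError; this is not Cassandra's actual 
configuration or error-handling code.

```java
import java.io.IOException;

// Hypothetical policy sketch; not Cassandra's actual configuration handling.
final class FlushFailurePolicySketch {
    enum Policy { DIE, IGNORE }                    // assumed config values
    enum Action { STOP_PROCESS, LOG_AND_CONTINUE }

    // Stop only when the failure is I/O related AND the operator asked for it.
    static Action onFlushFailure(Throwable error, Policy policy) {
        boolean ioError = hasCause(error, IOException.class);
        if (ioError && policy == Policy.DIE)
            return Action.STOP_PROCESS;
        return Action.LOG_AND_CONTINUE;
    }

    // Walks the cause chain looking for an exception of the given type.
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause())
            if (type.isInstance(c))
                return true;
        return false;
    }
}
```

With this shape, a non-I/O failure such as the IllegalArgumentException in the 
issue description never stops the process, whatever the configured policy.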

bq. Generally we take the approach of dying if a non-recoverable error occurs, 
and while I agree the risk of killing a whole cluster through a bug is 
suboptimal, we already run that risk in a number of places in the codebase 
(current behaviour here included, just with less alacrity). In my opinion this 
is preferable to potentially re-introducing dead data, or having the complexity 
of safely keeping the process alive as a zombie, and ensuring that zombie 
doesn't degrade cluster performance by hobbling instead of dying.

Here is a real-world scenario for you, one which we hit from time to time. 
Right now, if an I/O error occurs in replaceFlushed (e.g. while trying to 
create a hard link for system.compactions_in_progress), all of the compaction 
threads get blocked and performance gradually degrades until the 
pending-compaction alerts trigger. At that point somebody has to (most likely 
wake up and) figure out what is going on and restart the node, and once it 
starts back up the amount of compaction catching up it has to do is 
substantial. This problem happens on a number of machines at the same time, so 
if we were to kill the nodes right when the aforementioned error occurs 
(although it does not affect the actual flush or compaction), part of the ring 
would go dark and one would just have to pray that those nodes weren't 
neighbors. So in this case, serving some stale reads (which is not even the 
case if the failure is in a bookkeeping CF) with an error in the log is much 
better than losing a portion of the cluster for (possibly tens of) minutes 
without any idea of what is going on.

In this situation I would rather ignore problems with bookkeeping CFs, or save 
the CL segments and forget about it, and/or bump up the read-repair chance at 
the same time.
 
Everybody who is running Cassandra or any other database/system wants peace 
of mind; that's why regular repairs and all sorts of alerting/monitoring 
systems are in place. If there is something in the log which indicates a 
problem, it gives people time to think about their next steps instead of 
chaotically trying to fix whatever mess we left on failure.

bq. Other than dying, periodically trying to re-flush and only keeling over 
when we run out of room or have failed for a long period (possibly random? to 
avoid the tiny risk of bunching) seems like a good idea.

This is not going to help if the problem is data driven or external; you are 
just going to thrash the flusher threads without doing any useful work.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-20 Thread Mikhail Stepura (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038506#comment-14038506
 ] 

Mikhail Stepura commented on CASSANDRA-7275:


{code}
[javac] Compiling 865 source files to /Users/mikhail/Documents/workspace/cassandra/build/classes/main
[javac] /Users/mikhail/Documents/workspace/cassandra/src/java/org/apache/cassandra/db/ColumnFamilyStore.java:29: error: cannot find symbol
[javac] import javax.annotation.Nullable;
[javac]^
[javac]   symbol:   class Nullable
[javac]   location: package javax.annotation
[javac] /Users/mikhail/Documents/workspace/cassandra/src/java/org/apache/cassandra/db/ColumnFamilyStore.java:807: error: cannot find symbol
[javac] public void onSuccess(@Nullable Object result)
[javac]^
[javac]   symbol: class Nullable
{code}
Otherwise +1

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
 Issue Type: Bug
 Components: Core
 Reporter: Tyler Hobbs
 Assignee: Yuki Morishita
 Priority: Minor
 Fix For: 1.2.17, 2.0.9

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
 7252-2.0-v2.txt





--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-20 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038635#comment-14038635
 ] 

Benedict commented on CASSANDRA-7275:
-

This will break the commit log, by causing records to be discarded out of 
order. The future does not get placed on the executor until the onSuccess is 
called now, so run order is simply in order of flush completion and no longer 
in order of submission, but the only point of separating the work onto the 
postFlush is to ensure it is run in submission order (but not before the flush 
is finished - see the comment that is now attached to onSuccess). 

The simplest correct solution is probably to annotate the post flush runnable 
with a state variable indicating success/failure, which is set before the latch 
is triggered.

If you're modifying these parts of the code where correctness is paramount and 
not always obvious, it would be great if you could explicitly run it past a 
third set of eyes, as I only happened to spot this in the commits@ feeds, and 
as it's a concurrency bug could easily have not been spotted. Although we could 
no doubt craft a specific test to look for this scenario, and perhaps we should.
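
A rough sketch of the state-variable approach described above, under stated 
assumptions: the post-flush task is placed on a single-threaded executor at 
submission time (so tasks run in submission order), it waits on the latch, and 
it reads a success flag that the flush sets before counting the latch down. 
All names are hypothetical and the commit log is reduced to a callback.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical post-flush task; not Cassandra's actual implementation.
final class PostFlushSketch implements Runnable {
    private final CountDownLatch flushDone = new CountDownLatch(1);
    private volatile boolean flushSucceeded;   // written before countDown()
    private final Runnable discardCommitLog;   // stand-in for the CL discard

    PostFlushSketch(Runnable discardCommitLog) {
        this.discardCommitLog = discardCommitLog;
    }

    // Called by the flush when it finishes, whether or not it succeeded.
    void flushFinished(boolean success) {
        flushSucceeded = success;   // set the state first...
        flushDone.countDown();      // ...then release the post-flush task
    }

    @Override
    public void run() {
        try {
            flushDone.await();      // never run before the flush is finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        if (flushSucceeded)
            discardCommitLog.run(); // skipped entirely for a failed flush
    }

    // Three flushes submitted in order, completing in reverse order.
    static List<String> demo() {
        List<String> discarded = Collections.synchronizedList(new ArrayList<>());
        ExecutorService postFlush = Executors.newSingleThreadExecutor();
        PostFlushSketch first = new PostFlushSketch(() -> discarded.add("first"));
        PostFlushSketch second = new PostFlushSketch(() -> discarded.add("second"));
        PostFlushSketch third = new PostFlushSketch(() -> discarded.add("third"));
        // Queue the post-flush work at submission time, before any flush ends.
        postFlush.execute(first);
        postFlush.execute(second);
        postFlush.execute(third);
        third.flushFinished(true);
        second.flushFinished(false);   // simulated flush failure
        first.flushFinished(true);
        postFlush.shutdown();
        try {
            postFlush.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(discarded);
    }
}
```

In the demo the discards happen for "first" and then "third", in submission 
order, even though the flushes complete in reverse order, and the failed 
"second" flush never discards anything.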


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-20 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039316#comment-14039316
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

I've reverted the original commit pending a new fix, so we don't block 1.2.17 
or 2.0.9 in the meantime.


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-18 Thread Tyler Hobbs (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036106#comment-14036106
 ] 

Tyler Hobbs commented on CASSANDRA-7275:


[~mishail] yes, 1.2 has the same problem.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
 Issue Type: Bug
 Components: Core
 Reporter: Tyler Hobbs
 Assignee: Mikhail Stepura
 Priority: Minor
 Fix For: 2.0.9

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-18 Thread Yuki Morishita (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036941#comment-14036941
 ] 

Yuki Morishita commented on CASSANDRA-7275:
---

If we count down the latch when writing the SSTable has failed, then we should 
not proceed to discard the commit log in postFlushExecutor. I think we need to 
check for the exception somehow.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
 Issue Type: Bug
 Components: Core
 Reporter: Tyler Hobbs
 Assignee: Mikhail Stepura
 Priority: Minor
 Fix For: 1.2.17, 2.0.9

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-12 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029737#comment-14029737
 ] 

Jonathan Ellis commented on CASSANDRA-7275:
---

Attaching Mikhail's patch.

 Errors in FlushRunnable may leave threads hung
 --

 Key: CASSANDRA-7275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
 Project: Cassandra
 Issue Type: Bug
 Components: Core
 Reporter: Tyler Hobbs
 Priority: Minor
 Fix For: 2.0.9

 Attachments: 0001-Move-latch.countDown-into-finally-block.patch



[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-12 Thread Tyler Hobbs (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029923#comment-14029923
 ] 

Tyler Hobbs commented on CASSANDRA-7275:


+1


[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung

2014-06-12 Thread Mikhail Stepura (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030172#comment-14030172
 ] 

Mikhail Stepura commented on CASSANDRA-7275:


[~jbellis] we need this for 1.2 as well, right?

