[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517366#comment-14517366 ] Benedict commented on CASSANDRA-7275: - Since this has come up again a few times, I propose tha Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.15 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481349#comment-14481349 ] Jeremiah Jordan commented on CASSANDRA-7275: Just had a java.io.SyncFailedException cause this. After the exception MemtablePostFlush was stuck. {noformat} ERROR [MemtableFlushWriter:6] 2015-04-03 01:57:06,973 CassandraDaemon.java:167 - Exception in thread Thread[MemtableFlushWriter:6,5,main] org.apache.cassandra.io.FSWriteError: java.io.SyncFailedException: sync failed at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:254) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:263) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.util.SequentialWriter.close(SequentialWriter.java:451) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:664) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.sstable.SSTableWriter.close(SSTableWriter.java:495) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.sstable.SSTableWriter.finish(SSTableWriter.java:448) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:440) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:435) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:377) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(Memtable.java:327) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:na] at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1097) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40] at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40] Caused by: java.io.SyncFailedException: sync failed at java.io.FileDescriptor.sync(Native Method) ~[na:1.8.0_40] at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:250) ~[cassandra-all-2.1.3.329.jar:2.1.3.329] ... 15 common frames omitted {noformat} Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.15 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264623#comment-14264623 ] Benedict commented on CASSANDRA-7275: - Accidentally assigned myself this ticket. Sorry if that caused any confusion. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255125#comment-14255125 ] Benedict commented on CASSANDRA-7275: - I think somebody with better knowledge of the bookkeeping tables needs to chime in here, to give an opinion on if we can safely do this to any we care about. We still need to decide what to do about the non-whitelisted tables, though. They most likely want to not mark the CL clean at least for their affected segments, but possibly indefinitely for the affected table, until reboot, to guarantee no data bugs. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255133#comment-14255133 ] Pavel Yaskevich commented on CASSANDRA-7275: In case of non-whitelisted CFs we just follow on what disk_failure_policy dictates us to do, if we want to work on that in CASSANDRA-8498 we should probably move that discussion there. [~slebresne] What do you think about the idea of white listing some of the system CFs which are not crucial to normal operation and ignoring flush errors in them with all of the exceptions being tracked in the log on ERROR level? I took a look into what some of them and I do think it's safe to have compactions-in-progress (failure in this one just blocks all of the compactions because finishCompaction blocks on switchMemtable result which, in case of failure, never counts down the latch), sstable_activity and compaction history at least. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255136#comment-14255136 ] Benedict commented on CASSANDRA-7275: - CASSANDRA-8498 is only about replay. It can, in conjunction with repair, help avoid data corruption in a replicated table that did not experience a cross-cluster data-induced bug, but it won't help a corrupted system table that dropped intervening commit log records. So even with best_effort we need to take some remedial action to ensure non-whitelisted system tables are not corrupted. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Benedict Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255139#comment-14255139 ] Pavel Yaskevich commented on CASSANDRA-7275: Sure, I'm just trying to say that non-whitelisted and most luckily essential tables should be no different from user CFs for disk_failure_policy and probably have a special repair phrase dedicated to them, the only CFs I know for sure could be repaired properly by just removing data is schema_* operator just need to trigger that right now, other CFs like peers and NodeInfo can be auto-regenerated if missing or corrupted so maybe if we keep exception information from FlushInfo around we can do some of that work automatically on repair request. Also I want to point out that repair within gc_grace_seconds at least once is a hard requirement, most (if not all of the people) who run in production are doing frequently e.g. once a week on Sunday night. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Benedict Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255145#comment-14255145 ] Benedict commented on CASSANDRA-7275: - That all sounds eminently reasonable, although I don't know much about the system tables. So long as we make sure there aren't any that could be corrupted and be non-recoverable, we're good in my book, and I'll let others more qualified make the decision about which tables that's true of. We should probably introduce some tests to ensure this is indeed safe for each table, though. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Benedict Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254652#comment-14254652 ] Benedict commented on CASSANDRA-7275: - bq. WDYT about previous idea of keeping commit logs even for failed flushes? I'm worried about replaying these records after gc_grace expires. For local-only replication this could be non-recoverable, which is why I favour taking remedial action and just dropping the records if possible to do so safely; if it isn't possible it means corruption has to be avoided, so we would likely have to do one of: # Keep all CL records since the failure for the table (possibly only necessary for local-only replication) ## By normal mechanism, but this could retain every CL segment ## By copying relevant CL records out into their own stream, so the rest can be expired ## By giving each table its own CL, or perhaps only on failure, or only system tables? # Periodically replay into new memtables, merge with existing data, reinsert into CL and reattempt flush logic, to ensure we are never older than gc_grace (basically just rolling the problem forward each time) # Replay CL records only, periodically, sans any deleted items All of the above are a bit clunky or complicated though, and might have their own bugs or encounter the same hardware problems. I'm sure some other actions are also possible, but probably equally ugly. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254694#comment-14254694 ] Pavel Yaskevich commented on CASSANDRA-7275: I'm fine with dropping the records if we only consider system cf which are acceptable to fail to flush, maybe CL for those is not that important e.g. compactions_in_progress, sstable_activity and compaction_history. I am also thinking maybe we could add additional checks to the CLReplayer class, so when it picks up RM for replay it could actually drop what ever records are past gc_grace (RM has information about RangeTombstone and DeletedColumn), because there is no real point of replaying them anyway, just to be safe. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254703#comment-14254703 ] Benedict commented on CASSANDRA-7275: - Yes, see CASSANDRA-8498 I filed to address that since it's potentially a more general concern. The problem here is that by not replaying you could be corrupting your data just as much as you might through replay. The difference is you can (hopefully) fix it through repair, but you cannot do this for local-only tables, and if the problem is a data-induced bug it could be broken cluster wide. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255045#comment-14255045 ] Pavel Yaskevich commented on CASSANDRA-7275: Ok, so how about we change CL behavior in CASSANDRA-8498 and maybe in scope of this ticket we just do failure FlushInfo (from my previous) patch and change which is going to while list some of the system CFs othewise check disk_failure_policy and fail if needed? Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253271#comment-14253271 ] Benedict commented on CASSANDRA-7275: - I should clarify, since it sounds like we are not too far in disagreement on this point: I'm suggesting only that the failure is reported to the flush call site, not so the callsite can specialised on the kind of exception, but so that if this specific callsite can safely cope with _any_ failure, it can be specialised to do so, with remedial action if necessary. A whitelist would be a subset of this approach, and hence simpler - but only if it's genuinely safe to just drop the problem on the floor; I'm not sufficiently familiar with these system tables to say for sure, but I do recollect problems safely starting a node when compactions_in_progress was not properly maintained, so I expect _some_ remedial action will probably be necessary, perhaps on a case-by-case basis. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254342#comment-14254342 ] Pavel Yaskevich commented on CASSANDRA-7275: Sure, it sounds like we need conceptually similar to multimap of (cfId, ListException) which is going to get checked when there is an exception returned from the FlushInfo, also FDYT about previous idea of keeping commit logs even for failed flushes? Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251586#comment-14251586 ] Benedict commented on CASSANDRA-7275: - I agree with Pavel that _if we can do so safely_ we should not crash on failing to update internal book-keeping. But only if we can guarantee that failing to keep the bookkeeping up-to-date won't cause other problems. Which is why I suggest: bq. 2) We can report these exceptions back to the waiter on the Future result, and this waiter can choose how to behave. If, say, the memtable of a system column family that can be worked-around fails to flush (for instance, compactions_in_progress) then instead of retrying, it can simply take some other action to ensure the system continues to make safe progress. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253086#comment-14253086 ] Pavel Yaskevich commented on CASSANDRA-7275: Benedict, as I mentioned - I agree with #2 with one correction - there is no actual way to tell what exactly went wrong from FS\{Read, Write\}Error alone so it's probably not a lot of sense to give back exception (as it was logged already by thread pool) for processing but rather have a set of white listed system CFs fail to flush of which is acceptable... Would be nice of course to separate commitlogs so it's less error prone when it comes to flushing/recovery but this is going to be way off this ticket scope. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249718#comment-14249718 ] Benedict commented on CASSANDRA-7275: - I've filed CASSANDRA-8496, which would help with this problem in 2.1 only. It isn't sufficient to ensure the server stays stable, but would both avoid forward progress being stopped by errors on the post flusher, and that the affected commit log records would be retained indefinitely without resulting in infinite commit log growth. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249719#comment-14249719 ] Sylvain Lebresne commented on CASSANDRA-7275: - The current behavior is that an unexpected flush error blocks any flush thereon. It does seems to me that changing it so that it blocks only flushes for the column family on which there was a problem (which is not exactly what the patch does, and I do agree with Benedict that it does need to do that) is an improvement: if the problem happens for every CF then we're no worst than currently, but if it's a one-time event it might leave time for operators to take proper actions (of course, we should log a scary error, it's not something that should be ignored). So maybe we can start there since we don't seem to agree on whether crashing the node is an even better improvement? As far as my own opinion goes, I do am not in favor of crashing in that case because again, if you hold enough memtables in memory that your node become unresponsive, you're not really worth off that if you had crashed it right away, but if the problem ends up impacting a low traffic table (for instance a system table), you might be able to fix the problem in a way that is less impactful for your cluster. I'll note however that I would agree that if the error is a IO one, we should respect the disk_failure_policy. And I don't know, maybe we need another failure policy (best_effort/crash) for unexpected errors (aka bugs) that have the potential of destabilizing a node (I would agree that adding this is pushing the problem to our users, but it appears not everyone has the same idea on what is the best strategy, and there is maybe not a single good answer). Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249733#comment-14249733 ] Benedict commented on CASSANDRA-7275: - Just to add to what Sylvain says about the size of the memtable, to hopefully help target a solution (spoken agnostically): in 2.1 we could become almost immediately unusable for writes if the memtable(s) we are retaining after this (or multiple exceptions) exceed a certain proportion of memory, as we will stop even trying to flush. So for 2.1 at least if we're going to try and stay alive we need to consider if we would prefer to drop writes on the floor (agressively, to avoid build up in the queue) if the set of memtables in limbo is too large, or if we drop memtables until we reclaim enough space to proceed, or if we introduce some special logic for flushing in this event. In 2.0, conversely, we may flush millions of tiny sstables in the wrong scenario, but this would not prevent function unless it permitted excess heap growth, or a compaction death spiral. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250118#comment-14250118 ] Jonathan Ellis commented on CASSANDRA-7275: --- bq. if you hold enough memtables in memory that your node become unresponsive, you're not really worse off than if you had crashed it right away I disagree: we have a ton of evidence to date that a node that slowly falls over as it OOMs is much worse than a node that dies and gets marked down quickly by the FD. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250519#comment-14250519 ] Pavel Yaskevich commented on CASSANDRA-7275: Just to re-iterate, I still don't understand why we would prefer to crash the process if error happens on the system CF flush e.g. at the end of compaction which is not even essential for the operation like compactions_in_progress and still there is no clear answer how do we distinguish between FS{Read, Write}Error which is generated as a response to FS or system failure and the one which is generated as a response to incorrect call that Cassandra made e.g. duplicate hard-link? I would prefer that if the failure was in the system CF we log the message, leave commitlog and let everything carry on instead of just crashing because it could essentially result in dropping incoming data, the story is different for actual user memtables tho, as I mentioned couple of times in my previous comments, I'm total fine crashing if normal memtable flush fails and disk_failure_policy says so. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250554#comment-14250554 ] Tupshin Harper commented on CASSANDRA-7275: --- Strongly in favor of the opt in policy based approach that [~jbellis] mentioned. There isn't a one size fits all approach to deal with this Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248554#comment-14248554 ] Benedict commented on CASSANDRA-7275: - bq. This is not going to help if the problem data driven or external, you just going to trash flusher threads without doing any useful work. Well, let's try and address each problem independently. A data induced bug that can occur across many nodes simultaneously is likely to occur repeatedly and cause the cluster to degrade probably quite rapidly, and will likely occur on all owners of a given token at once. Coupled with the stop-gap measures we're discussing might well run the risk of actual data loss or data corruption cross-cluster. Read repair would _not_ help for such a data bug, since none of the nodes would be in a safe state. However the transient file system problems you're encountering would be helped by reattempting the flush. So, an initial and completely safe approach would be to retry a few times and _then_ crash the server (possibly with some random waiting involved to avoid a disastrous cascade of cluster-wide death). Wasting work isn't really a big problem if the system cannot make progress without this success, so I don't see a downside on that front. It's possible if, once this fails, we could negotiate a safe crash with our peers, so that if there is a data bug at most one replica dies, the operator is well aware of the problem, but the cluster continues to operate. Although this is difficult with vnodes, and perhaps a little contrived for the current state of c*. Separately, we can look into perhaps weakening our constraints in various ways. The big issue you raise is that compaction is specifically held up. There seem to be two things we can do to help this: 1) We can make the dependency queue for marking commit log records unused table-specific, so that compactions only get held up if there has been an error on the compaction queue; 2) We can report these exceptions back to the waiter on the Future result, and this waiter can choose how to behave. If, say, the memtable of a system column family that can be worked-around fails to flush (for instance, compactions_in_progress) then instead of retrying, it can simply take some other action to ensure the system continues to make safe progress. If a data table fails to flush it can attempt to retry. Eventually, if it cannot recover safely, it should die though, as there will need to be some operator involvement and the reality is not everybody monitors their log files. I am very -1 on introducing a change that knowingly produces a complex failure condition that will not be widely known or understood, but I may be alone on that. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249033#comment-14249033 ] Jonathan Ellis commented on CASSANDRA-7275: --- bq. Killing C* process is harmful as if we have code problem in writeSortedContents or replaceFlushed code it would potentially result in shutdown of the whole cluster or at least of all of the neighbors sharing replica range. I'm much more comfortable with things die if something goes catastrophically wrong than things start returning nonsense on reads which is what happens if we mark something flushed that actually wasn't. That said, I'd be okay using disk failure policy as a guide. If people opt into best effort behavior and are okay with those implications, so be it. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249053#comment-14249053 ] Pavel Yaskevich commented on CASSANDRA-7275: I understand it might be hard for you, Benedict, but just consider there could be a programming error in the flush of the memtable or replacing flushed one, which is only triggered when metadata about compaction is written back at the end of that compaction e.g. CompactionTask.runMayThrow() L225, e.g. error mentioned in the description or duplicate hard-link failure or something similar which has nothing to do with the underlaying (file-)system which means that #1 suggestion is not going to help because compaction is blocked in SystemKeyspace.finishCompaction() and flush retry is not going to help because it will just fail again and again trying to flush the same data. As an end user I would prefer that nobody actually takes a decision to fail on the floor for me except me because it means data loss even when problem is not affecting actual write/read path, I would be fine though to fail on FS\{Read, Write\}Error if user explicitly sets it to fail on I/O errors (e.g. disk_failure_policy, it is like of your #2 but not exactly) otherwise I would rather get notified in the log and carry on so I can take informed decision on my next actions. bq. Eventually, if it cannot recover safely, it should die though, as there will need to be some operator involvement and the reality is not everybody monitors their log files. I'm going to ignore this argument until you actually have experience of running Cassandra in production, otherwise it's the same as talking to the wall. bq. I'm much more comfortable with things die if something goes catastrophically wrong than things start returning nonsense on reads which is what happens if we mark something flushed that actually wasn't. I remember it was already the same when the disk is full in the DSE, did people actually have fun restoring cluster after it went completely dark? I'm also *not* saying that we shouldn't fail on FS\{Read, Write\}Error if disk_failure_policy says otherwise. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249077#comment-14249077 ] Jonathan Ellis commented on CASSANDRA-7275: --- bq. consider there could be a programming error in the flush of the memtable or replacing flushed one I don't know that hand waving about potential bugs gets us anywhere. There could be programming errors anywhere, including in mark the segment flushed when it wasn't panic mode. The right solution to bugs is QA, not hoping that you can guess where unexpected exceptions will happen and provide a safety net. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249148#comment-14249148 ] Jason Brown commented on CASSANDRA-7275: The right solution to bugs is QA I kind of agree with this, but QA can only confirm that we've fixed what we've discovered as faulty. Strange things will always happen in the real world, and unfortunately QA cannot discover (all of) those. This is a problem we have today - now, in fact. Unfortunately, blindly shutting down nodes for us (and, I suspect, most installations) isn't a viable solution as it could result in an uncontrolled cascade of shutdowns. I'm not saying we shouldn't shut down on real file system problems (especially if the operator has set disk_failure_policy properly), but here's our situation: all compactions completely shut down when we fail to create the hard link for incremental backups, simply on a system CF with only metadata. This could be a legit file system problem, that affects the entire system, or it could be something minor, but perhaps we can be smarter about the known things that can fail that we deem not fatal (and then choose how we want to react to those). In our case, while it's unfortunate that some incremental backup data might be lost, it would be (and is) much worse to crash the system. If it's a programming bug, perhaps we should follow what the operator sets up for the disk_failure_policy, but it seems a shame to shutdown on something trivial like failing to create a hard link, especially on system metadata CFs. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249160#comment-14249160 ] Jonathan Ellis commented on CASSANDRA-7275: --- bq. it seems a shame to shutdown on something trivial like failing to create a hard link, especially on system metadata CFs Well, if the disk is erroring out, failing to create a hard link is only the first problem you'll have. :) But again, disk failure policy covers this, so sure, go ahead and take evasive action if you prefer. I think Pavel's suggestion of dropping the memtable but leaving the commitlog marked to-replay is reasonable for that scenario. If it's our bug, then you may need a temporary patch while we figure out the cause, but I still don't think that kind of {{// this shouldn't happen}} code should be shipped officially. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249174#comment-14249174 ] Pavel Yaskevich commented on CASSANDRA-7275: The problem is that there is no way to tell if hard-link problem is a actual fs/disk problem or programming error, right now it looks like a programming error because it snapshot tries to create duplicate hard-link to the same file as I mentioned in the CASSANDRA-8476 so if there is no way to tell how reasonable is it to enforce shutdown or any rule from disk_failure_policy? bq. If it's our bug, then you may need a temporary patch while we figure out the cause, but I still don't think that kind of // this shouldn't happen code should be shipped officially. If it's your problem it's my problem as well, we can work around for now (as I guess most of the people do) but my intention in this ticket to fix this problem for good instead of just fixing the symptom of it (being aforementioned duplicate hard-link problem). Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246780#comment-14246780 ] Jonathan Ellis commented on CASSANDRA-7275: --- ([~benedict] to review) Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246852#comment-14246852 ] Benedict commented on CASSANDRA-7275: - I'm not sure this really improves the current state of affairs very much, and introduces a bug as well. If we fail to flush and simply carry on without clearing the CL, then the host will retain the memtable it wanted to flush forever, leaving it in a potentially severely degraded state (increasing risk of exceeding heap limit, or possible failing to ever accept writes in 2.1 due to insufficient memory). If the same table has another flush backed up (or another is later scheduled) then we will also end up expiring the commit log records anyway, despite not having flushed successfully. Either we need to reattempt the flush, prevent any further flushes on that column family from ever succeeding, or - more simply - kill the C* process. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247074#comment-14247074 ] Pavel Yaskevich commented on CASSANDRA-7275: bq. I'm not sure this really improves the current state of affairs very much, and introduces a bug as well. It improve the current state of affairs in the way that failures in the flush are not going to incur compaction freeze anymore. bq. If we fail to flush and simply carry on without clearing the CL, then the host will retain the memtable it wanted to flush forever, leaving it in a potentially severely degraded state (increasing risk of exceeding heap limit, or possible failing to ever accept writes in 2.1 due to insufficient memory). If the same table has another flush backed up (or another is later scheduled) then we will also end up expiring the commit log records anyway, despite not having flushed successfully. I think what we can do to prevent CL expiry is to mark it as discarded but without deleting actual file on disk, this way it can be replayed on start up and memtable flushes that follow are not going to delete any potentially unflushed data. bq. Either we need to reattempt the flush, prevent any further flushes on that column family from ever succeeding, or - more simply - kill the C* process. Let's start with reattempting flush - we don't really have enough information to make a decision to re-attempt flushing as it can fail for the number of reasons, I/O error is just being one of them. Killing C* process is harmful as if we have code problem in writeSortedContents or replaceFlushed code it would potentially result in shutdown of the whole cluster or at least of all of the neighbors sharing replica range. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247461#comment-14247461 ] Benedict commented on CASSANDRA-7275: - What happens if the process stays up longer than tombstone grace period? Or we introduce CASSANDRA-6434? This approach seems like a minefield to me, ignoring the operational risks of running a very degraded server without the operators realising it. Generally we take the approach of dying if a non-recoverable error occurs, and while I agree the risk of killing a whole cluster through a bug is suboptimal, we already run that risk in a number of places in the codebase (current behaviour here included, just with less alacrity). In my opinion this is preferable to potentially re-introducing dead data, or having the complexity of safely keeping the process alive as a zombie, and ensuring that zombie doesn't degrade cluster performance by hobbling instead of dying. Other than dying, periodically trying to reflush and only keeling over when we run out of room or have failed for a long period (possibly random? to avoid the tiny risk of bunching) seems like a good idea. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247538#comment-14247538 ] Pavel Yaskevich commented on CASSANDRA-7275: There is an option to die on the I/O error and I'm happy to make it so we die if we got FSWriteError or similar if requested by config. bq. Generally we take the approach of dying if a non-recoverable error occurs, and while I agree the risk of killing a whole cluster through a bug is suboptimal, we already run that risk in a number of places in the codebase (current behaviour here included, just with less alacrity). In my opinion this is preferable to potentially re-introducing dead data, or having the complexity of safely keeping the process alive as a zombie, and ensuring that zombie doesn't degrade cluster performance by hobbling instead of dying. Here is your real world scenario, which we are hitting from time to time, right now if I/O error occurs in the replaceFlushed (e.g. trying to create hard-link for system.compactions_in_progress) all of the compaction threads are going to get blocked and performance is going to gradually degrade until it gets to the point when alerts from compaction pending trigger, at that time somebody has to (most luckily wake up) figure out what is going on and restart the node, once it starts back up the amount of catching up it has to do in terms of the compaction is substantial. This problem happens on the number of machines at the same time so if we were to kill the nodes right when aforementioned error occurs (although it's not affecting actual flush or compaction) that would mean that part of the ring just went dark and one just has to pray that those nodes weren't neighbors, so in this case serve some stale reads (which is not even the case if failure in in bookeeping CF) with error in the log is much better than loose portion of the cluster for (possibly tens) minutes without any idea of what is going on. In this situation I would rather ignore problems with book-keeping CFs or save CL segments forget about it and/or bumping up read-repair chance at the same time. Everybody who is running Cassandra or any other database/system wants a peace of mind that's why regular repairs and all sorts of the alerting/monitoring systems are in-place, if there is something in the log which indicates a problem it gives people time to think about their next steps instead of chaotically trying to fix what ever mess we left on failure. bq. Other than dying, periodically trying to re-flush and only keeling over when we run out of room or have failed for a long period (possibly random? to avoid the tiny risk of bunching) seems like a good idea. This is not going to help if the problem data driven or external, you just going to trash flusher threads without doing any useful work. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Pavel Yaskevich Priority: Minor Fix For: 2.0.12 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038506#comment-14038506 ] Mikhail Stepura commented on CASSANDRA-7275: {code} [javac] Compiling 865 source files to /Users/mikhail/Documents/workspace/cassandra/build/classes/main [javac] /Users/mikhail/Documents/workspace/cassandra/src/java/org/apache/cassandra/db/ColumnFamilyStore.java:29: error: cannot find symbol [javac] import javax.annotation.Nullable; [javac]^ [javac] symbol: class Nullable [javac] location: package javax.annotation [javac] /Users/mikhail/Documents/workspace/cassandra/src/java/org/apache/cassandra/db/ColumnFamilyStore.java:807: error: cannot find symbol [javac] public void onSuccess(@Nullable Object result) [javac]^ [javac] symbol: class Nullable {code} Otherwise +1 Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Yuki Morishita Priority: Minor Fix For: 1.2.17, 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038635#comment-14038635 ] Benedict commented on CASSANDRA-7275: - This will break the commit log, by causing records to be discarded out of order. The future does not get placed on the executor until the onSuccess is called now, so run order is simply in order of flush completion and no longer in order of submission, but the only point of separating the work onto the postFlush is to ensure it is run in submission order (but not before the flush is finished - see the comment that is now attached to onSuccess). The simplest correct solution is probably to annotate the post flush runnable with a state variable indicating success/failure, which is set before the latch is triggered. If you're modifying these parts of the code where correctness is paramount and not always obvious, it would be great if you could explicitly run it past a third set of eyes, as I only happened to spot this in the commits@ feeds, and as it's a concurrency bug could easily have not been spotted. Although we could no doubt craft a specific test to look for this scenario, and perhaps we should. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Yuki Morishita Priority: Minor Fix For: 1.2.17, 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039316#comment-14039316 ] Jonathan Ellis commented on CASSANDRA-7275: --- I've reverted the original commit pending a new fix, so we don't block 1.2.17 or 2.0.9 in the meantime. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Yuki Morishita Priority: Minor Fix For: 1.2.17, 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036106#comment-14036106 ] Tyler Hobbs commented on CASSANDRA-7275: [~mishail] yes, 1.2 has the same problem. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Mikhail Stepura Priority: Minor Fix For: 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036941#comment-14036941 ] Yuki Morishita commented on CASSANDRA-7275: --- If we countdown latch when writing SSTable failed, then we shuold not proceed to discard commit log in postFlushExecutor. I think we need to check exception somehow. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Mikhail Stepura Priority: Minor Fix For: 1.2.17, 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029737#comment-14029737 ] Jonathan Ellis commented on CASSANDRA-7275: --- Attaching Mikhail's patch. Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Priority: Minor Fix For: 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029923#comment-14029923 ] Tyler Hobbs commented on CASSANDRA-7275: +1 Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Mikhail Stepura Priority: Minor Fix For: 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030172#comment-14030172 ] Mikhail Stepura commented on CASSANDRA-7275: [~jbellis] we need this for 1.2 as well, right? Errors in FlushRunnable may leave threads hung -- Key: CASSANDRA-7275 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Mikhail Stepura Priority: Minor Fix For: 2.0.9 Attachments: 0001-Move-latch.countDown-into-finally-block.patch In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are errors, which results in hanging any threads that are waiting for the flush to complete. For example, an error like this causes the problem: {noformat} ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main] java.lang.IllegalArgumentException at java.nio.Buffer.position(Unknown Source) at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64) at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72) at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138) at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103) at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)