[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249148#comment-14249148
 ] 

Jason Brown commented on CASSANDRA-7275:
----------------------------------------

>> The right solution to bugs is QA

I kind of agree with this, but QA can only confirm that we've fixed what we've 
discovered as faulty. Strange things will always happen in the real world, and 
unfortunately QA cannot discover (all of) those.

This is a problem we have today - now, in fact. Unfortunately, blindly shutting
down nodes isn't a viable solution for us (and, I suspect, for most
installations), as it could result in an uncontrolled cascade of shutdowns. I'm
not saying we shouldn't shut down on real file system problems (especially if
the operator has set disk_failure_policy properly), but here's our situation:
all compactions stop completely when we fail to create the hard link for
incremental backups, and this happens on a system CF that holds only metadata.
This could be a legitimate file system problem that affects the entire system,
or it could be something minor. Perhaps we can be smarter about the known
things that can fail that we deem not fatal (and then choose how we want to
react to those). In our case, while it's unfortunate that some incremental
backup data might be lost, it would be (and is) much worse to crash the system.
If it's a programming bug, perhaps we should follow what the operator sets up
for the disk_failure_policy, but it seems a shame to shut down on something
trivial like failing to create a hard link, especially on system metadata CFs.
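
To make the idea concrete, here is a rough sketch of treating a failed backup hard link as a known, non-fatal failure unless the operator asks otherwise. This is not Cassandra's code: the class BackupLinkSketch, the FailurePolicy enum, and linkForBackup are hypothetical names for illustration only, and the enum is only loosely inspired by disk_failure_policy. The one real API used is java.nio.file.Files.createLink.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BackupLinkSketch {
    /** Hypothetical policy for illustration; not Cassandra's disk_failure_policy enum. */
    enum FailurePolicy { STOP, BEST_EFFORT }

    /**
     * Tries to hard-link a data file into a backup directory.
     * Returns true if the link was created, false if the failure was tolerated.
     * Under STOP, the failure is rethrown so the caller can escalate.
     */
    static boolean linkForBackup(Path source, Path backupDir, FailurePolicy policy) {
        Path link = backupDir.resolve(source.getFileName());
        try {
            Files.createDirectories(backupDir);
            Files.createLink(link, source); // real JDK call: createLink(link, existing)
            return true;
        } catch (IOException | UnsupportedOperationException e) {
            if (policy == FailurePolicy.STOP) {
                throw new IllegalStateException("backup link failed: " + link, e);
            }
            // Known, non-fatal failure: lose this backup file but keep running.
            System.err.println("WARN: skipping incremental backup of " + source + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("backup-sketch");
        Path missing = dir.resolve("missing-Data.db"); // does not exist, so linking fails
        boolean linked = linkForBackup(missing, dir.resolve("backups"), FailurePolicy.BEST_EFFORT);
        System.out.println("linked: " + linked);
    }
}
```

The point of the sketch is only that the caller, not the failing operation, decides whether a missed hard link is worth stopping for.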



> Errors in FlushRunnable may leave threads hung
> ----------------------------------------------
>
>                 Key: CASSANDRA-7275
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Tyler Hobbs
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 2.0.12
>
>         Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
> 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch
>
>
> In Memtable.FlushRunnable, the CountDownLatch will never be counted down if 
> there are errors, which results in hanging any threads that are waiting for 
> the flush to complete.  For example, an error like this causes the problem:
> {noformat}
> ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main]
> java.lang.IllegalArgumentException
>     at java.nio.Buffer.position(Unknown Source)
>     at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
>     at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
>     at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
>     at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
>     at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
>     at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
>     at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
>     at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
>     at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> {noformat}
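
The hang described above can be illustrated with a minimal sketch. The attachment name suggests the fix moves latch.countDown() into a finally block; the patch contents are not shown here, so the following is an assumption about the general shape, not the actual change. FlushLatchSketch, FlushTask, and waiterReleased are hypothetical names, and FlushTask is only a stand-in for Memtable.FlushRunnable.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FlushLatchSketch {
    /** Hypothetical stand-in for Memtable.FlushRunnable; not the actual Cassandra class. */
    static final class FlushTask implements Runnable {
        private final CountDownLatch latch;
        private final Runnable writeSortedContents;

        FlushTask(CountDownLatch latch, Runnable writeSortedContents) {
            this.latch = latch;
            this.writeSortedContents = writeSortedContents;
        }

        @Override
        public void run() {
            try {
                writeSortedContents.run(); // may throw, e.g. IllegalArgumentException
            } finally {
                latch.countDown(); // always release waiters, even when the flush fails
            }
        }
    }

    /** Runs a flush on another thread; returns true if the waiter is released within the timeout. */
    static boolean waiterReleased(Runnable body) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        Thread t = new Thread(new FlushTask(latch, body), "FlushWriter-sketch");
        t.setUncaughtExceptionHandler((thread, e) -> { /* ignored for the demo */ });
        t.start();
        return latch.await(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        boolean released = waiterReleased(() -> {
            throw new IllegalArgumentException("simulated flush error");
        });
        System.out.println("waiter released after failed flush: " + released);
    }
}
```

If countDown() sat after the write call instead of in finally, the exception would skip it and await() would only return on timeout, which corresponds to the hung threads in the report.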



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
