[ https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249148#comment-14249148 ]
Jason Brown commented on CASSANDRA-7275:
----------------------------------------
>> The right solution to bugs is QA
I kind of agree with this, but QA can only confirm that we've fixed what we've
discovered as faulty. Strange things will always happen in the real world, and
unfortunately QA cannot discover (all of) those.
This is a problem we have today - now, in fact. Unfortunately, blindly shutting
down nodes isn't a viable solution for us (and, I suspect, for most
installations), as it could result in an uncontrolled cascade of shutdowns. I'm
not saying we shouldn't shut down on real file system problems (especially if
the operator has set disk_failure_policy properly), but here's our situation:
all compactions completely shut down when we fail to create the hard link for
incremental backups, even on a system CF that holds only metadata.
This could be a legitimate file system problem that affects the entire system,
or it could be something minor. Perhaps we can be smarter about the known
things that can fail that we deem not fatal (and then choose how we want to
react to those). In our case, while it's unfortunate that some incremental
backup data might be lost, it would be (and is) much worse to crash the system.
If it's a programming bug, perhaps we should follow what the operator has set
for disk_failure_policy, but it seems a shame to shut down on something trivial
like failing to create a hard link, especially on system metadata CFs.
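To make the "known, non-fatal failure" idea concrete, here is a rough sketch of
treating a failed backup hard link as a warning rather than a fatal error. This
is not Cassandra code: the class name BackupLinkSketch and the method
tryCreateBackupLink are hypothetical, and it uses the plain JDK
Files.createLink call rather than whatever Cassandra uses internally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BackupLinkSketch {
    /**
     * Hypothetical helper: tries to hard-link a data file into a backup
     * directory. On failure it logs a warning and returns false instead of
     * propagating the error, so the caller (for example a compaction) can
     * decide to carry on.
     */
    static boolean tryCreateBackupLink(Path source, Path backupDir) {
        try {
            Files.createDirectories(backupDir);
            // Files.createLink(link, existing) creates a hard link.
            Files.createLink(backupDir.resolve(source.getFileName()), source);
            return true;
        } catch (IOException | UnsupportedOperationException e) {
            System.err.println("WARN: could not hard-link " + source
                    + " for incremental backup: " + e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("backup-link-sketch");
        Path data = Files.write(dir.resolve("data.db"), new byte[] {1, 2, 3});
        Path backups = dir.resolve("backups");

        // An existing file: the link is attempted and the result reported.
        System.out.println("linked existing file: " + tryCreateBackupLink(data, backups));
        // A missing file: the failure is logged, not thrown.
        System.out.println("linked missing file: "
                + tryCreateBackupLink(dir.resolve("missing.db"), backups));
    }
}
```

Whether a false return should then be escalated according to
disk_failure_policy is exactly the policy question raised above; the sketch
only shows that the failure can be surfaced to the caller without taking the
whole node down.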
> Errors in FlushRunnable may leave threads hung
> ----------------------------------------------
>
> Key: CASSANDRA-7275
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Tyler Hobbs
> Assignee: Pavel Yaskevich
> Priority: Minor
> Fix For: 2.0.12
>
> Attachments: 0001-Move-latch.countDown-into-finally-block.patch,
> 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch
>
>
> In Memtable.FlushRunnable, the CountDownLatch will never be counted down if
> there are errors, which results in hanging any threads that are waiting for
> the flush to complete. For example, an error like this causes the problem:
> {noformat}
> ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:474,5,main]
> java.lang.IllegalArgumentException
> at java.nio.Buffer.position(Unknown Source)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
> at org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
> at org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
> at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
> at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
> at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
> at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> {noformat}
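For readers unfamiliar with the hang described in the issue above, the
following is a minimal, standalone sketch of the general remedy that the
attachment name 0001-Move-latch.countDown-into-finally-block.patch points at:
counting the latch down in a finally block so waiters are released even when
the task throws. It does not reproduce the actual patch or Cassandra's
FlushRunnable; FlushLatchSketch, flushTask and runAndAwait are hypothetical
names, and the thrown exception is simulated.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FlushLatchSketch {
    /** Hypothetical stand-in for a flush task; not Cassandra's FlushRunnable. */
    static Runnable flushTask(CountDownLatch latch, boolean fail) {
        return () -> {
            try {
                if (fail) {
                    throw new IllegalArgumentException("simulated flush error");
                }
            } finally {
                // Runs on both the success and the error path, so threads
                // blocked in latch.await() are always released.
                latch.countDown();
            }
        };
    }

    /** Runs the task on its own thread; returns true if the latch was released in time. */
    static boolean runAndAwait(boolean fail) {
        CountDownLatch latch = new CountDownLatch(1);
        Thread worker = new Thread(flushTask(latch, fail), "FlushWriter-sketch");
        // Report the simulated error briefly instead of dumping a stack trace.
        worker.setUncaughtExceptionHandler((thread, e) ->
                System.err.println("Exception in thread " + thread.getName() + ": " + e));
        worker.start();
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("released after success: " + runAndAwait(false));
        System.out.println("released after failure: " + runAndAwait(true));
    }
}
```

If the countDown() call sat after the throwing statement instead of in the
finally block, the failing run would leave the waiter blocked until the
timeout, which mirrors the hung threads reported in the issue.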
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)