[ 
https://issues.apache.org/jira/browse/CASSANDRA-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254652#comment-14254652
 ] 

Benedict commented on CASSANDRA-7275:
-------------------------------------

bq. WDYT about previous idea of keeping commit logs even for failed flushes?

I'm worried about replaying these records after gc_grace expires. For 
local-only replication this could be non-recoverable, which is why I favour 
taking remedial action and just dropping the records if possible to do so 
safely; if it isn't possible it means corruption has to be avoided, so we would 
likely have to do one of:

# Keep all CL records since the failure for the table (possibly only necessary 
for local-only replication)
## By normal mechanism, but this could retain every CL segment
## By copying relevant CL records out into their own stream, so the rest can be 
expired
## By giving each table its own CL, or perhaps only on failure, or only system 
tables?
# Periodically replay into new memtables, merge with existing data, reinsert 
into CL and reattempt flush logic, to ensure we are never older than gc_grace 
(basically just rolling the problem forward each time)
# Replay CL records only, periodically, sans any deleted items

All of the above are a bit clunky or complicated though, and might have their 
own bugs or encounter the same hardware problems. I'm sure some other actions 
are also possible, but probably equally ugly.

> Errors in FlushRunnable may leave threads hung
> ----------------------------------------------
>
>                 Key: CASSANDRA-7275
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7275
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Tyler Hobbs
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 2.0.12
>
>         Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 
> 7252-2.0-v2.txt, CASSANDRA-7275-flush-info.patch
>
>
> In Memtable.FlushRunnable, the CountDownLatch will never be counted down if 
> there are errors, which results in hanging any threads that are waiting for 
> the flush to complete.  For example, an error like this causes the problem:
> {noformat}
> ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 CassandraDaemon.java (line 
> 198) Exception in thread Thread[FlushWriter:474,5,main]
> java.lang.IllegalArgumentException
>     at java.nio.Buffer.position(Unknown Source)
>     at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:64)
>     at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
>     at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:138)
>     at 
> org.apache.cassandra.io.sstable.ColumnNameHelper.minComponents(ColumnNameHelper.java:103)
>     at 
> org.apache.cassandra.db.ColumnFamily.getColumnStats(ColumnFamily.java:439)
>     at 
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:194)
>     at 
> org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:397)
>     at 
> org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350)
>     at 
> org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to