[
https://issues.apache.org/jira/browse/CASSANDRA-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000094#comment-15000094
]
Stefania commented on CASSANDRA-10682:
--------------------------------------
The timeout is caused by {{waitOnFuture}} in this method of {{SchemaKeyspace}}
which is flushing schema tables:
{code}
static void flush()
{
if (!Boolean.getBoolean("cassandra.unsafesystem"))
ALL.forEach(table ->
FBUtilities.waitOnFuture(getSchemaCFS(table).forceFlush()));
}
{code}
The flush writer thread terminates due to an uncaught exception, it appears the
table folder for _keyspaces_ does not exist (errno 2):
{code}
WARN [MemtableFlushWriter:1] 2015-11-06 22:57:41,299
open(build/test/cassandra/data:110/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6,
O_RDONLY) failed, errno (2).
ERROR [MemtableFlushWriter:1] 2015-11-06 22:57:41,374 Fatal exception in thread
Thread[MemtableFlushWriter:1,5,main]
java.lang.RuntimeException: java.nio.file.NoSuchFileException:
build/test/cassandra/data:110/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/ma_txn_flush_c88c3670-84d9
-11e5-a504-21b90eeb2985.log
at org.apache.cassandra.io.util.FileUtils.write(FileUtils.java:607)
~[main/:na]
at
org.apache.cassandra.io.util.FileUtils.appendAndSync(FileUtils.java:588)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LogReplica.append(LogReplica.java:66)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LogReplicaSet.lambda$null$151(LogReplicaSet.java:186)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LogReplicaSet$$Lambda$77/463946743.perform(Unknown
Source) ~[na:na]
at org.apache.cassandra.utils.Throwables.perform(Throwables.java:112)
~[main/:na]
at org.apache.cassandra.utils.Throwables.perform(Throwables.java:102)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LogReplicaSet.append(LogReplicaSet.java:186)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LogFile.addRecord(LogFile.java:280)
~[main/:na]
at org.apache.cassandra.db.lifecycle.LogFile.abort(LogFile.java:234)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LogTransaction.doAbort(LogTransaction.java:376)
~[main/:na]
at
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.abort(Transactional.java:144)
~[main/:na]
at
org.apache.cassandra.db.lifecycle.LifecycleTransaction.doAbort(LifecycleTransaction.java:234)
~[main/:na]
at
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.abort(Transactional.java:144)
~[main/:na]
at
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.abort(Transactional.java:193)
~[main/:na]
at
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.close(Transactional.java:158)
~[main/:na]
at
org.apache.cassandra.db.Memtable$FlushRunnable.createFlushWriter(Memtable.java:442)
~[main/:na]
at
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:366)
~[main/:na]
at
org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(Memtable.java:352)
~[main/:na]
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
~[main/:na]
at
com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
~[guava-18.0.jar:na]
at
org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1037)
~[main/:na]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
~[na:1.8.0_45]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
~[na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_45]
{code}
CASSANDRA-10538 will fix the exception in {{LogTransaction.doAbort()}}. However
we then have another uncaught exception in {{Transactional.close()}}, which I
propose to address. Further, we then throw the original exception that caused
the transaction to be aborted. {{FlushRunnable}} does not seem to attempt to
handle exceptions, which means the post flush executor latch is not decreased
and the main thread hangs waiting on the futures. I am not sure if this patch
should try to handle exceptions in {{FlushRunnable}} so that the post flush
executor then runs, I guess the memtable should not be reclaimed either in this
case. cc [~aweisberg] and [~slebresne].
Another interesting thing to observe is that there is a commit log replayed in
the test log, perhaps we are picking up commit logs of unrelated tests?
As for the table folder not existing, because it is created when the CFS is
opened, I am guessing we need to sync the parent folder. I'm going to run the
unit tests on Jenkins a few more times to see if the root cause is solved by
sync-ing the parent folder whenever we create a table folder.
> Fix timeouts in BeforeFirstTest
> -------------------------------
>
> Key: CASSANDRA-10682
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10682
> Project: Cassandra
> Issue Type: Sub-task
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 3.1
>
> Attachments:
> TEST-org.apache.cassandra.db.SinglePartitionSliceCommandTest.log
>
>
> Some unit tests fail with a timeout in {{BeforeFirstTest}}, see for example
> [here|http://cassci.datastax.com/job/cassandra-3.0_testall/242/testReport/org.apache.cassandra.db/SinglePartitionSliceCommandTest/BeforeFirstTest/].
>
> In the corresponding log file, attached, there is a {{NoSuchFileException}}
> which might be the cause.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)