[ 
https://issues.apache.org/jira/browse/CASSANDRA-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000094#comment-15000094
 ] 

Stefania commented on CASSANDRA-10682:
--------------------------------------

The timeout is caused by {{waitOnFuture}} in this method of {{SchemaKeyspace}} 
which is flushing schema tables:

{code}
    static void flush()
    {
        if (!Boolean.getBoolean("cassandra.unsafesystem"))
            ALL.forEach(table -> 
FBUtilities.waitOnFuture(getSchemaCFS(table).forceFlush()));
    }
{code}

The flush writer thread terminates due to an uncaught exception, it appears the 
table folder for _keyspaces_ does not exist (errno 2):

{code}
WARN  [MemtableFlushWriter:1] 2015-11-06 22:57:41,299 
open(build/test/cassandra/data:110/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6,
 O_RDONLY) failed, errno (2).
ERROR [MemtableFlushWriter:1] 2015-11-06 22:57:41,374 Fatal exception in thread 
Thread[MemtableFlushWriter:1,5,main]
java.lang.RuntimeException: java.nio.file.NoSuchFileException: 
build/test/cassandra/data:110/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/ma_txn_flush_c88c3670-84d9
-11e5-a504-21b90eeb2985.log
        at org.apache.cassandra.io.util.FileUtils.write(FileUtils.java:607) 
~[main/:na]
        at 
org.apache.cassandra.io.util.FileUtils.appendAndSync(FileUtils.java:588) 
~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LogReplica.append(LogReplica.java:66) 
~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LogReplicaSet.lambda$null$151(LogReplicaSet.java:186)
 ~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LogReplicaSet$$Lambda$77/463946743.perform(Unknown
 Source) ~[na:na]
        at org.apache.cassandra.utils.Throwables.perform(Throwables.java:112) 
~[main/:na]
        at org.apache.cassandra.utils.Throwables.perform(Throwables.java:102) 
~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LogReplicaSet.append(LogReplicaSet.java:186) 
~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LogFile.addRecord(LogFile.java:280) 
~[main/:na]
        at org.apache.cassandra.db.lifecycle.LogFile.abort(LogFile.java:234) 
~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LogTransaction.doAbort(LogTransaction.java:376)
 ~[main/:na]
        at 
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.abort(Transactional.java:144)
 ~[main/:na]
        at 
org.apache.cassandra.db.lifecycle.LifecycleTransaction.doAbort(LifecycleTransaction.java:234)
 ~[main/:na]
        at 
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.abort(Transactional.java:144)
 ~[main/:na]
        at 
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.abort(Transactional.java:193)
 ~[main/:na]
        at 
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.close(Transactional.java:158)
 ~[main/:na]
        at 
org.apache.cassandra.db.Memtable$FlushRunnable.createFlushWriter(Memtable.java:442)
 ~[main/:na]
        at 
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:366)
 ~[main/:na]
        at 
org.apache.cassandra.db.Memtable$FlushRunnable.runMayThrow(Memtable.java:352) 
~[main/:na]
        at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
~[main/:na]
        at 
com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
 ~[guava-18.0.jar:na]
        at 
org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1037)
 ~[main/:na]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
~[na:1.8.0_45]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
~[na:1.8.0_45]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_45]
{code}

CASSANDRA-10538 will fix the exception in {{LogTransaction.doAbort()}}. However 
we then have another uncaught exception in {{Transactional.close()}}, which I 
propose to address. Further, we then throw the original exception that caused 
the transaction to be aborted. {{FlushRunnable}} does not seem to attempt to 
handle exceptions, which means the post flush executor latch is not decreased 
and the main thread hangs waiting on the futures. I am not sure if this patch 
should try to handle exceptions in {{FlushRunnable}} so that the post flush 
executor then runs, I guess the memtable should not be reclaimed either in this 
case. cc [~aweisberg] and [~slebresne].

Another interesting thing to observe is that there is a commit log replayed in 
the test log, perhaps we are picking up commit logs of unrelated tests? 

As for the table folder not existing, because it is created when the CFS is 
opened, I am guessing we need to sync the parent folder. I'm going to run the 
unit tests on Jenkins a few more times to see if the root cause is solved by 
sync-ing the parent folder whenever we create a table folder.

> Fix timeouts in BeforeFirstTest
> -------------------------------
>
>                 Key: CASSANDRA-10682
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10682
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.1
>
>         Attachments: 
> TEST-org.apache.cassandra.db.SinglePartitionSliceCommandTest.log
>
>
> Some unit tests fail with a timeout in {{BeforeFirstTest}}, see for example 
> [here|http://cassci.datastax.com/job/cassandra-3.0_testall/242/testReport/org.apache.cassandra.db/SinglePartitionSliceCommandTest/BeforeFirstTest/].
>  
> In the corresponding log file, attached, there is a {{NoSuchFileException}} 
> which might be the cause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to