[ 
https://issues.apache.org/jira/browse/CASSANDRA-18963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788537#comment-17788537
 ] 

Jacek Lewandowski commented on CASSANDRA-18963:
-----------------------------------------------

I think I can see how it can happen...

When the sstable writer prepares for committing the transaction, it does the 
following stuff in {{doPrepare}} method:

{code:java}
        protected void doPrepare()
        {
            transactionals.get().forEach(Transactional::prepareToCommit);
            new StatsComponent(finalizeMetadata()).save(descriptor);
...
{code}

When entering this method, there is only data and index components in the 
directory. After the first line, the rest of the components are created - but 
stats component. There is also a transaction log file which seems to say that 
the sstable is in-progress of writing. In the problematic scenario, the 
execution is interrupted between the first and the second line of doPrepare, 
which means that we do not have stats component but we do have transaction log. 

The problem is, those files are moved to a different directory on startup (as 
mentioned in the previous comment) and the transaction log identifies sstables 
by full absolute path, which is not updated on move. That's way the directory 
lister does not recognize this sstable as incomplete and tries to load it, 
which leads to the problem.

It does not look new to me. It seems like the problem is there probably as long 
as we use {{migrateSystemDataIfNeeded}} which was probably introduced in 
CASSANDRA-14793, thus the problem is there for a while.  I don't think it is a 
blocker for any release and the workaround is simple - just delete the affected 
sstables - those data are duplicated in the uncompacted sstables or in the 
commit log because since the transaction is not finished, those should not be 
wiped as well. 

For the solution I think we should use directory lister (or at least the same 
filter) when moving sstables around so that we do not move incomplete ones. /cc 
[~blerer]




> Test Failure: sstableutil_test.TestSSTableUtil
> ----------------------------------------------
>
>                 Key: CASSANDRA-18963
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18963
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: Berenguer Blasi
>            Assignee: Jacek Lewandowski
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc
>
>
> Seems flaky 
> [https://app.circleci.com/pipelines/github/driftx/cassandra/1342/workflows/9554c7b5-dd60-4f08-8db5-c954febc8ad6/jobs/58957/tests]
>  * 
> h4. test_abortedcompaction
> sstableutil_test.TestSSTableUtil 
>  
> {code:java}
> self = <sstableutil_test.TestSSTableUtil object at 0x7fc099ecea58> def 
> test_abortedcompaction(self): """ @jira_ticket CASSANDRA-7066 @jira_ticket 
> CASSANDRA-11497 Check that we can cleanup temporary files after a compaction 
> is aborted. """ log_file_name = 'debug.log' cluster = self.cluster 
> cluster.populate(1).start() node = cluster.nodelist()[0] numrecords = 250000 
> self._create_data(node, KeyspaceName, TableName, numrecords) finalfiles, 
> tmpfiles = self._check_files(node, KeyspaceName, TableName) assert 
> len(finalfiles) > 0, "Expected to find some final files" assert 0 == 
> len(tmpfiles), "Expected no tmp files" t = InterruptCompaction(node, 
> TableName, filename=log_file_name, delay=2) t.start() try: 
> logger.debug("Compacting...") node.compact() except ToolError: pass # 
> expected to fail t.join() finalfiles = 
> _normcase_all(self._invoke_sstableutil(KeyspaceName, TableName, 
> type='final')) tmpfiles = 
> _normcase_all(self._invoke_sstableutil(KeyspaceName, TableName, type='tmp')) 
> # In most cases we should end up with some temporary files to clean up, but 
> it may happen # that no temporary files are created if compaction finishes 
> too early or starts too late # see CASSANDRA-11497 logger.debug("Got {} final 
> files and {} tmp files after compaction was interrupted" 
> .format(len(finalfiles), len(tmpfiles))) 
> self._invoke_sstableutil(KeyspaceName, TableName, cleanup=True) 
> self._check_files(node, KeyspaceName, TableName, finalfiles, []) # restart to 
> make sure not data is lost logger.debug("Restarting node...") > 
> node.start(wait_for_binary_proto=True) sstableutil_test.py:97: _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:914: in start 
> self.wait_for_binary_interface(from_mark=self.mark) 
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:702: in 
> wait_for_binary_interface self.watch_log_for("Starting listening for CQL 
> clients", **kwargs) ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:599: 
> in watch_log_for self.raise_node_error_if_cassandra_process_is_terminated() _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = <ccmlib.node.Node object at 0x7fc08a583be0> def 
> raise_node_error_if_cassandra_process_is_terminated(self): if not 
> self._is_pid_running(): msg = "C* process with {pid} is 
> terminated".format(pid=self.pid) common.debug(msg) > raise NodeError(msg) E 
> ccmlib.node.NodeError: C* process with 19530 is terminated 
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:683: NodeError{code}
> {code:java}
> failed on teardown with "Unexpected error found in node logs (see stdout for 
> full details). Errors: [[node1] 'ERROR [SSTableBatchOpen:1] 2023-10-19 
> 22:25:07,449 DefaultFSErrorHandler.java:129 - Exiting forcefully due to file 
> system exception on startup, disk failure policy 
> "stop"\norg.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-4-big\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:111)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:397)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:353)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.lambda$openAll$4(SSTableReader.java:414)\n\tat
>  org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
>  
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
>  
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat
>  java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: 
> java.lang.IllegalArgumentException: null\n\tat 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:79)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:58)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:92)\n\t...
>  10 common frames omitted', [node1] 'ERROR [SSTableBatchOpen:4] 2023-10-19 
> 22:25:07,449 DefaultFSErrorHandler.java:129 - Exiting forcefully due to file 
> system exception on startup, disk failure policy 
> "stop"\norg.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-5-big\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:111)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:397)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:353)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.lambda$openAll$4(SSTableReader.java:414)\n\tat
>  org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
>  
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
>  
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat
>  java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: 
> java.lang.IllegalArgumentException: null\n\tat 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:79)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:58)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:92)\n\t...
>  10 common frames omitted', [node1] 'ERROR [SSTableBatchOpen:4] 2023-10-19 
> 22:25:07,450 SSTableReader.java:419 - Corrupt sstable 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-5-big=[Index.db,
>  Data.db]; skipping 
> table\norg.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-5-big\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:111)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:397)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:353)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.lambda$openAll$4(SSTableReader.java:414)\n\tat
>  org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
>  
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
>  
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat
>  java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: 
> java.lang.IllegalArgumentException: null\n\tat 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:79)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:58)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:92)\n\t...
>  10 common frames omitted']" Unexpected error found in node logs (see stdout 
> for full details). Errors: [[node1] 'ERROR [SSTableBatchOpen:1] 2023-10-19 
> 22:25:07,449 DefaultFSErrorHandler.java:129 - Exiting forcefully due to file 
> system exception on startup, disk failure policy 
> "stop"\norg.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-4-big\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:111)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:397)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:353)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.lambda$openAll$4(SSTableReader.java:414)\n\tat
>  org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
>  
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
>  
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat
>  java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: 
> java.lang.IllegalArgumentException: null\n\tat 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:79)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:58)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:92)\n\t...
>  10 common frames omitted', [node1] 'ERROR [SSTableBatchOpen:4] 2023-10-19 
> 22:25:07,449 DefaultFSErrorHandler.java:129 - Exiting forcefully due to file 
> system exception on startup, disk failure policy 
> "stop"\norg.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-5-big\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:111)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:397)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:353)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.lambda$openAll$4(SSTableReader.java:414)\n\tat
>  org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
>  
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
>  
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat
>  java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: 
> java.lang.IllegalArgumentException: null\n\tat 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:79)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:58)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:92)\n\t...
>  10 common frames omitted', [node1] 'ERROR [SSTableBatchOpen:4] 2023-10-19 
> 22:25:07,450 SSTableReader.java:419 - Corrupt sstable 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-5-big=[Index.db,
>  Data.db]; skipping 
> table\norg.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /tmp/dtest-ahz16wrh/test/node1/data0/system/prepared_statements-18a9c2576a0c3841ba718cd529849fef/nc-5-big\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:111)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:397)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:353)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReader.lambda$openAll$4(SSTableReader.java:414)\n\tat
>  org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)\n\tat 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
>  
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
>  
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat
>  java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: 
> java.lang.IllegalArgumentException: null\n\tat 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:79)\n\tat
>  
> org.apache.cassandra.io.sstable.format.big.BigSSTableReaderLoadingBuilder.openComponents(BigSSTableReaderLoadingBuilder.java:58)\n\tat
>  
> org.apache.cassandra.io.sstable.format.SSTableReaderLoadingBuilder.build(SSTableReaderLoadingBuilder.java:92)\n\t...
>  10 common frames omitted']{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to