[ 
https://issues.apache.org/jira/browse/CASSANDRA-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362042#comment-17362042
 ] 

Stefania Alborghetti commented on CASSANDRA-12519:
--------------------------------------------------

{quote} 
 I'm not familiar with the lifecycle package so I'm not sure whether 
skipping the temporary sstables when resetting the levels is right, or whether 
the validation error that happens after changing the metadata is caused by a 
deeper problem.
{quote}
I would need to see the full reason why the transaction rejected a record, and 
I wasn't able to find a complete failure log. Most likely it failed the 
checksum verification, because the standalone tools ({{sstablelevelreset}} in 
our case) change the metadata file.

The transaction checks whether anything has tampered with a file that it 
guards. This is done by {{LogFile.verify()}}, and a failure there would also 
prevent a main Cassandra process from starting up, because some automated 
cleanup is done on startup when 
{{LogTransaction.removeUnfinishedLeftovers()}} is called. Since we don't want 
to mistakenly delete files restored by users, for example, we check using a 
checksum calculated from the files that existed when the transaction record 
was created. There are more checks, but this is the main one and the one that 
I believe must have failed.
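
To illustrate the idea only, here is a minimal Python sketch of a checksum 
guard over a set of files. It is not Cassandra's code: the function names 
{{fingerprint}} and {{verify}} are hypothetical, and the inputs to the checksum 
(file count and newest modification time) are an assumption chosen for the 
example.

```python
import os
import tempfile
import zlib


def fingerprint(paths):
    """CRC32 over the number of existing files and their newest mtime.

    Assumption for illustration: the guard only looks at file count and
    modification time, not at file contents.
    """
    existing = [p for p in paths if os.path.exists(p)]
    newest = max((os.stat(p).st_mtime_ns for p in existing), default=0)
    payload = f"{len(existing)},{newest}".encode()
    return zlib.crc32(payload)


def verify(paths, recorded):
    """Return True if the guarded files still match the recorded value."""
    return fingerprint(paths) == recorded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        # Hypothetical file name standing in for an sstable metadata file.
        stats = os.path.join(directory, "example-Statistics.db")
        with open(stats, "w") as handle:
            handle.write("level=3")
        os.utime(stats, ns=(1_000_000_000, 1_000_000_000))

        # Taken when the transaction record is written.
        recorded = fingerprint([stats])
        print("untouched:", verify([stats], recorded))

        # An offline tool rewrites the metadata, which bumps the mtime.
        with open(stats, "w") as handle:
            handle.write("level=0")
        os.utime(stats, ns=(2_000_000_000, 2_000_000_000))
        print("after rewrite:", verify([stats], recorded))
```

In this sketch any later rewrite of a guarded file changes the fingerprint, 
which is the kind of mismatch that makes the transaction refuse to clean up.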

So if anything changes any of these files, temporary or permanent, the 
transaction detects it. These two standalone tools change the sstable metadata 
and hence probably triggered it.

I think it's reasonable to change {{sstablelevelreset}} to skip temporary 
files, because if the transaction did not complete, it's as if these files 
never existed. However, I don't think this is sufficient to fix the problem, 
because changing the old, existing metadata files could also trigger a 
checksum error. So I may be wrong, but it seems to me that the real fix is to 
use the cleanup utility in the test before running {{sstablelevelreset}}, so 
that there are no leftover transactions.
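
As a rough sketch of that ordering in a Python test, the snippet below builds 
and runs the two commands in sequence. It assumes the cleanup utility is 
{{sstableutil --cleanup}} and that both tools are on the PATH; the helper names 
{{offline_level_reset_commands}} and {{run_all}} are hypothetical and are not 
part of the dtest framework.

```python
import subprocess


def offline_level_reset_commands(keyspace, table):
    """Return the commands to run, in order, while the node is stopped."""
    return [
        # Assumed cleanup utility: remove leftover transactions first.
        ["sstableutil", "--cleanup", keyspace, table],
        # Only then reset the levels, which rewrites sstable metadata.
        ["sstablelevelreset", "--really-reset", keyspace, table],
    ]


def run_all(commands, runner=subprocess.check_call):
    """Run each command in order; the runner is injectable for testing."""
    for argv in commands:
        runner(argv)
```

A test would call 
{{run_all(offline_level_reset_commands("keyspace1", "standard1"))}} after 
stopping the node; the keyspace and table names here are placeholders.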

If these two tools are likely to be used directly by users when the process 
is offline, as they seem to be, I believe that they should clean up leftover 
transactions first, or at least issue a warning if there are any. Otherwise the 
main process may refuse to start for the same reason explained above. To 
clean up leftovers we can simply call 
{{LifecycleTransaction.removeUnfinishedLeftovers(cfs)}} from the tool itself, 
before doing any work. We should consider a follow-up to do this, or fix this 
directly in this ticket. If we fix this here, then we don't need to do this in 
the test.

So you can either merge what you have and open a follow-up, or add 
{{LifecycleTransaction.removeUnfinishedLeftovers(cfs)}}, as well as skipping 
the temporary files (which seems more correct to me), and see if this fixes it 
without changing the test.

> dtest failure in 
> offline_tools_test.TestOfflineTools.sstableofflinerelevel_test
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12519
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12519
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Test/dtest/python
>            Reporter: Sean McCarthy
>            Assignee: Andres de la Peña
>            Priority: Normal
>             Fix For: 4.0-rc2, 4.0, 3.0.x, 3.11.x, 4.0-rc, 4.x
>
>         Attachments: node1.log, node1_debug.log, node1_gc.log
>
>
> example failure: 
> http://cassci.datastax.com/job/trunk_offheap_dtest/379/testReport/offline_tools_test/TestOfflineTools/sstableofflinerelevel_test/
> {code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/offline_tools_test.py", line 209, in 
> sstableofflinerelevel_test
>     self.assertGreater(max(final_levels), 1)
>   File "/usr/lib/python2.7/unittest/case.py", line 942, in assertGreater
>     self.fail(self._formatMessage(msg, standardMsg))
>   File "/usr/lib/python2.7/unittest/case.py", line 410, in fail
>     raise self.failureException(msg)
> "1 not greater than 1
> {code}


