[ 
https://issues.apache.org/jira/browse/CASSANDRA-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716257#comment-14716257
 ] 

Benedict edited comment on CASSANDRA-10109 at 8/27/15 8:14 AM:
---------------------------------------------------------------

So, thinking about it, all we really want to do is ensure that clients don't 
see temporary files (i.e. incomplete files, or files that we may yet abort). On 
startup we don't have to worry about this; we shouldn't ever have to retry on 
startup, since the state on disk will not be changing, and we should avoid any 
need to retry on reads. Retries worry me. So, I propose the following:

{noformat}
Online listings:
- List data files
- List txn logs  (must come after listing the data files, to ensure we have 
seen all txn logs covering the files - this step should only be done if 
SecureDirectoryStream is not available)
- Read txn logs
- If the commit/abort record is present, just apply that; don't worry about 
missing files, since we're actively mutating the state and it should be 
expected that some may be involved in later transactions
- Otherwise, if all tracked files are present (or only the last entry is 
missing, but is NEW), txn is in progress (so treat as aborted)
- If some files are missing (note: here we should check disk rather than our 
in-memory listing, iff the listing does not contain a file), that implies the 
transaction has since completed
-- re-read the transaction and look for its current state
-- if the txn log is missing, we can safely do nothing
-- if the commit/abort record is now present, apply it
-- if none of these hold we must have a bug, so throw an exception
{noformat}
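
The per-log decision in the steps above can be sketched roughly as follows. This is only an illustration of the proposed logic, not Cassandra's implementation: the names TxnLog, Outcome, classify, exists_on_disk and reread are hypothetical, as are the "OLD", "COMMIT" and "ABORT" strings (only NEW appears in the proposal). The disk is modelled by plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Set, Tuple


class Outcome(Enum):
    APPLY_FINAL_RECORD = "apply the commit/abort record"
    TREAT_AS_ABORTED = "in progress, so treat as aborted"
    NOTHING_TO_DO = "txn log is gone, nothing to do"


@dataclass(frozen=True)
class TxnLog:
    # (file name, kind) in log order; "NEW" is from the proposal, "OLD" is assumed.
    tracked: Sequence[Tuple[str, str]]
    # Hypothetical "COMMIT" / "ABORT" marker, or None while the txn is still open.
    final_record: Optional[str] = None


def classify(
    log: TxnLog,
    listed: Set[str],                          # data files seen by the listing
    exists_on_disk: Callable[[str], bool],     # direct check against the disk
    reread: Callable[[], Optional[TxnLog]],    # re-read the log; None if it is gone
) -> Outcome:
    # A commit/abort record wins outright; missing files are expected here.
    if log.final_record is not None:
        return Outcome.APPLY_FINAL_RECORD

    # Consult the disk only for files the in-memory listing does not contain.
    missing = [
        i for i, (name, _) in enumerate(log.tracked)
        if name not in listed and not exists_on_disk(name)
    ]
    only_last_new_missing = (
        missing == [len(log.tracked) - 1] and log.tracked[-1][1] == "NEW"
    )
    if not missing or only_last_new_missing:
        return Outcome.TREAT_AS_ABORTED

    # Files are missing, so the txn has completed since we read the log.
    fresh = reread()
    if fresh is None:
        return Outcome.NOTHING_TO_DO
    if fresh.final_record is not None:
        return Outcome.APPLY_FINAL_RECORD
    raise RuntimeError("missing files but txn log is still open: likely a bug")
```

Passing the disk check and the re-read as callables keeps the decision itself free of I/O, so each branch can be exercised without touching a real data directory.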

We can do this because, however we perform listings, we require that clients 
cope safely with missing files: the files are actively being mutated and can 
disappear at any time, whether new or old.

On startup, however, our current logic is fine. But we don't need to retry; we 
should just fail if we encounter an unrecoverable exception. This should 
simplify things.


was (Author: benedict):
So, thinking about it, all we really want to do is ensure that clients don't 
see temporary files (i.e. incomplete files, or files that we may yet abort). On 
startup we don't have to worry about this; we shouldn't ever have to retry on 
startup, since the state on disk will not be changing, and we should avoid any 
necessity on reads. Retries worry me. So, I propose the following:

{noformat}
Online listings:
- List data files
- List txn logs  (must be after to ensure we have seen all txn logs covering 
the files - this step should only be done if not SecureDirectoryStream)
- Read txn logs
- If the commit/abort record is present, just apply that; don't worry about 
missing files, since we're actively mutating the state and it should be 
expected that some may be involved in later transactions
- otherwise, if all tracked files are present, txn is in progress (so treat as 
aborted)
- If some files are missing (note, here we should check disk rather than our in 
memory listing IFF the listing does not contain a file), that implies the 
transaction has since completed
-- re-read the transaction and look for its current state
-- if the txn log is missing, we can safely do nothing
-- if the last record is now present, apply the logic
-- if neither of these two hold we must have a bug, so throw an exception
{noformat}

We can do this because we require that clients safely cope with missing files 
however we perform listings, since they're actively being mutated and can 
disappear at any time; new or old.

On startup, however, our current logic is fine. But we don't need to retry; we 
should just fail if we encounter an unrecoverable exception. This should 
simplify things.

> Windows dtest 3.0: ttl_test.py failures
> ---------------------------------------
>
>                 Key: CASSANDRA-10109
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10109
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Joshua McKenzie
>            Assignee: Stefania
>              Labels: Windows
>             Fix For: 3.0.0 rc1
>
>
> ttl_test.py:TestTTL.update_column_ttl_with_default_ttl_test2
> ttl_test.py:TestTTL.update_multiple_columns_ttl_test
> ttl_test.py:TestTTL.update_single_column_ttl_test
> Errors locally are different from those on CI yesterday. Yesterday on CI we 
> had timeouts and general node hangs. Today on all 3 tests when run locally I 
> see:
> {noformat}
> Traceback (most recent call last):
>   File "c:\src\cassandra-dtest\dtest.py", line 532, in tearDown
>     raise AssertionError('Unexpected error in %s node log: %s' % (node.name, 
> errors))
> AssertionError: Unexpected error in node1 node log: ['ERROR [main] 2015-08-17 
> 16:53:43,120 NoSpamLogger.java:97 - This platform does not support atomic 
> directory streams (SecureDirectoryStream); race conditions when loading 
> sstable files could occurr']
> {noformat}
> This traces back to the commit for CASSANDRA-7066 today by [~Stefania] and 
> [~benedict].  Stefania - care to take this ticket and also look further into 
> whether or not we're going to have issues with 7066 on Windows? That error 
> message certainly *sounds* like it's not a good thing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
