[jira] [Updated] (CASSANDRA-8019) Windows Unit tests and Dtests erroring due to sstable deleting task error

Joshua McKenzie (JIRA) Wed, 01 Oct 2014 14:41:20 -0700

     [ https://issues.apache.org/jira/browse/CASSANDRA-8019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joshua McKenzie updated CASSANDRA-8019:
---------------------------------------
    Attachment: 8019_conservative_v1.txt
                8019_aggressive_v1.txt

We open handles to the index file and data file within an SSTableScanner, and 
those handles don't participate in SSTR refcounting. SSTR.releaseReference 
attempts to delete these two files within the SSTableDeletingTask, which gives 
rise to the error messages we're seeing when there are still SSTableScanners 
with those files open. This doesn't present on trunk on Windows, as 
FILE_SHARE_DELETE w/nio.2 allows files to be deleted while other handles are 
still open to them, transparently, as on Linux.
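
To make the handle/delete interaction concrete, here is a minimal sketch (not 
Cassandra code; the class and method names are hypothetical) that holds a read 
handle opened through NIO.2 and tries to delete the file underneath it. The 
expectation, per the above, is that this succeeds on Linux and on Windows when 
the handle comes from NIO.2; the same attempt against a handle opened through 
the older java.io classes is what would be expected to fail on Windows.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Cassandra.
final class DeleteWhileOpenDemo {
    /**
     * Opens a read handle on the file via NIO.2, then tries to delete the
     * file while that handle is still held.
     * Returns true if the delete succeeded with the handle open.
     */
    static boolean deleteWithOpenHandle(Path file) throws IOException {
        try (FileChannel ignored = FileChannel.open(file, StandardOpenOption.READ)) {
            try {
                Files.delete(file);
                return true;
            } catch (IOException e) {
                // The open handle (or something else) blocked the delete.
                return false;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo-Data", ".db");
        System.out.println("deleted while open: " + deleteWithOpenHandle(file));
        Files.deleteIfExists(file); // clean up if the delete was refused
    }
}
```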

With regard to this particular patch introducing (more) errors related to 
compaction and sstable deletion: it's a timing issue, and there are a few ways 
I've experimented with to address it. All of them work but make different 
compromises.  In order of least to most invasive:
# Roll back the scoped-resource access in CompactionTask and target a 
scanners.close() right before 
{code}cfs.getDataTracker().markCompactedSSTablesReplaced(oldSStables, newSStables, compactionType);{code} 
This partially defeats the purpose of the earlier patch.
# Move the call to markCompactedSSTablesReplaced to the end of 
CompactionTask.runWith, after closing scope on the try block that opened the 
scanners.  This potentially delays said marking until stat calculation and 
logging are completed, but it scopes the resource access.
# Have SSTableScanners call acquireReference() and releaseReference() on their 
parent sstables in the ctor and in close(), respectively.
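
A rough illustration of the shape of option 3, using toy classes rather than 
the real SSTableReader and SSTableScanner (ToySSTable, ToyScanner, and the 
"deleted" flag are assumptions made up for this sketch; the attached patch is 
the actual change). The scanner takes a reference in its constructor and gives 
it back in close(), so the point where deletion would be attempted can't be 
reached while a scanner is still open.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy stand-in for an sstable reader; not the real SSTableReader.
final class ToySSTable {
    // Starts at 1: the owner's own reference.
    private final AtomicInteger references = new AtomicInteger(1);
    private volatile boolean deleted = false;

    /** Returns false if the reference count has already dropped to zero. */
    boolean acquireReference() {
        while (true) {
            int n = references.get();
            if (n <= 0)
                return false;
            if (references.compareAndSet(n, n + 1))
                return true;
        }
    }

    void releaseReference() {
        if (references.decrementAndGet() == 0)
            deleted = true; // stand-in for the point where file deletion would be attempted
    }

    boolean isDeleted() {
        return deleted;
    }
}

// Toy stand-in for a scanner that participates in ref counting.
final class ToyScanner implements AutoCloseable {
    private final ToySSTable sstable;
    private boolean closed = false;

    ToyScanner(ToySSTable sstable) {
        if (!sstable.acquireReference())
            throw new IllegalStateException("sstable already released");
        this.sstable = sstable;
    }

    @Override
    public void close() {
        if (closed)
            return; // idempotent: release the reference only once
        closed = true;
        sstable.releaseReference();
    }
}
```

With this in place, the owner releasing its reference while a scanner is open 
leaves the count at 1, and the delete step only runs once the scanner closes.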

I have mixed feelings about all three options. I feel like option #3 is the 
most "correct" approach, but it's also the most heavyweight by far and the 
most prone to introducing subtle bugs, as it imposes a new constraint on ref 
counting with regard to SSTableScanner lifetimes. Given that we're trying to 
tighten up and reduce manual ref-counting, this approach also feels like a 
significant step back.

For now (in 2.1.X land), I'm comfortable with rolling back the 
try-with-resources usage in CompactionTask.runWith and then thinking about 
this further, perhaps in a separate ticket if we feel it's warranted.  Digging 
into the unit tests on the 2.1 branch, it's apparent that this type of problem 
is pretty widespread on Windows, as I get the same failures in many other unit 
tests (as far back as pre CASSANDRA-6916 days).  We may need to file these 
errors under "Cassandra-2.1 on Windows is beta, this is fixed on trunk", 
keeping in mind that these issues resolve once the scanners are closed.  It 
might be worth considering dropping the logging severity on deletion failure 
from error to warn, as it's going to be common on 2.1 currently.

Having said that, option 3 has a pretty profound impact on the 2.1 branch 
with regard to Windows:
{noformat}
$ grep 'Unable to delete' 2.1_head.testresults | wc -l
7860

$ grep 'Unable to delete' opt3_heavy.testresults | wc -l
2
{noformat}
We go from 9 actual test failures to 8 with this change on Windows, and all 
currently passing unit tests continue to pass as expected on Linux.  I put in 
some logic to check whether we use an SSTableScanner after the corresponding 
SSTR goes through the tidy process and, at least in all our unit tests, we 
don't appear to try to use a scanner after deleting the underlying SSTR.
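
For anyone wanting to picture that check, a sketch of the general idea follows. 
This is not the logic from the patch: TidyFlag and GuardedScanner are 
hypothetical names, and the real check may well be structured differently. The 
parent records that it has been tidied, and the scanner fails loudly if it is 
used afterwards.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical marker shared between a toy sstable and its scanners.
final class TidyFlag {
    private final AtomicBoolean tidied = new AtomicBoolean(false);

    void markTidied() {
        tidied.set(true);
    }

    boolean isTidied() {
        return tidied.get();
    }
}

// Hypothetical scanner that refuses to operate once its parent is tidied.
final class GuardedScanner {
    private final TidyFlag parent;
    private int position = 0;

    GuardedScanner(TidyFlag parent) {
        this.parent = parent;
    }

    /** Returns the next position, or throws if the parent was already tidied. */
    int next() {
        if (parent.isTidied())
            throw new IllegalStateException("scanner used after parent sstable was tidied");
        return position++;
    }
}
```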

Attaching both a conservative (option 1) and an invasive (option 3) patch.  My 
vote's for 1, but I think there's merit in considering the implications of 3 
and our current refcounting (and the meta purpose it serves).

> Windows Unit tests and Dtests erroring due to sstable deleting task error
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8019
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8019
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Windows 7
>            Reporter: Philip Thompson
>            Assignee: Joshua McKenzie
>              Labels: windows
>             Fix For: 2.1.1
>
>         Attachments: 8019_aggressive_v1.txt, 8019_conservative_v1.txt
>
>
> Currently a large number of dtests and unit tests are erroring on windows 
> with the following error in the node log:
> {code}
> ERROR [NonPeriodicTasks:1] 2014-09-29 11:05:04,383 
> SSTableDeletingTask.java:89 - Unable to delete 
> c:\\users\\username\\appdata\\local\\temp\\dtest-vr6qgw\\test\\node1\\data\\system\\local-7ad54392bcdd35a684174e047860b377\\system-local-ka-4-Data.db
>  (it will be removed on server restart; we'll also retry after GC)\n
> {code}
> git bisect points to the following commit:
> {code}
> 0e831007760bffced8687f51b99525b650d7e193 is the first bad commit
> commit 0e831007760bffced8687f51b99525b650d7e193
> Author: Benedict Elliott Smith <[email protected]>
> Date:  Fri Sep 19 18:17:19 2014 +0100
>     Fix resource leak in event of corrupt sstable
>     patch by benedict; review by yukim for CASSANDRA-7932
> :100644 100644 d3ee7d99179dce03307503a8093eb47bd0161681 
> f55e5d27c1c53db3485154cd16201fc5419f32df M      CHANGES.txt
> :040000 040000 194f4c0569b6be9cc9e129c441433c5c14de7249 
> 3c62b53b2b2bd4b212ab6005eab38f8a8e228923 M  src
> :040000 040000 64f49266e328b9fdacc516c52ef1921fe42e994f 
> de2ca38232bee6d2a6a5e068ed9ee0fbbc5aaebe M  test
> {code}
> You can reproduce this by running simple_bootstrap_test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
