[
https://issues.apache.org/jira/browse/CASSANDRA-8019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joshua McKenzie updated CASSANDRA-8019:
---------------------------------------
Attachment: 8019_conservative_v1.txt
8019_aggressive_v1.txt
We open handles to the index file and data file within an SSTableScanner and
aren't participating in SSTR refcounting. SSTR.releaseReference attempts to
delete these two files within the SSTableDeletingTask, giving rise to the error
messages we're seeing when there's still SSTableScanners with those files open.
This doesn't present on trunk in windows as FILE_SHARE_DELETE w/nio.2 allows
files to be deleted w/other handles open to them transparently as in linux.
w/regards to this particular patch introducing (more) errors related to
compaction and sstable deletion, it's a timing issue and there's a few ways
I've experimented with to address this. All of them work but have different
compromises. In order of least to most invasive:
# roll back the scoped-resource access in CompactionTask and target a
scanners.close() right before
{code}cfs.getDataTracker().markCompactedSSTablesReplaced(oldSStables,
newSStables, compactionType);{code} This partially defeats the purpose of the
earlier patch.
# move the call to markCompactedSSTablesReplaced to the end of
CompactionTask.runWith after closing scope on the try block that opened them.
This potentially delays said marking until stat calculation and logging is
completed but scopes the resource access.
# have SSTableScanners acquireReference() and releaseReference() on their
parent sstables on ctor and close() calls respectively
I have mixed feelings about all three options. I feel like option #3 is the
most "correct" approach but it's also the most heavyweight by far and most
prone to introduction of subtle bugs as it's imposing a new constraint on ref
counting w/regards to SSTableScanner lifetimes. With regard to us trying to
tighten up and reduce manual ref-counting this approach also feels like a
significant step back.
For now (in 2.1.X land), I'm comfortable with rolling back the try w/resource
usage on CompactionTask.runWith and then we think about this further, perhaps
in a separate ticket if we feel it warranted. Digging into the unit tests on
the 2.1 branch, it's apparent that this type of problem is pretty widespread on
Windows as I get the same failures in many other unit tests (as far back as pre
CASSANDRA-6916 days). We may need to file these errors under "Cassandra-2.1 on
Windows is beta, this is fixed on trunk", keeping in mind that these issues
resolve once the scanners are closed. Might be worth considering dropping the
logging severity on deletion failure from error to warn as it's going to be
common on 2.1 currently.
Having said that - option 3 has a pretty profound impact on the 2.1 branch
w/regards to Windows:
{noformat}
$ grep 'Unable to delete' 2.1_head.testresults | wc -l
7860
$ grep 'Unable to delete' opt3_heavy.testresults | wc -l
2
{noformat}
We go from 9 actual test failures to 8 w/this change on Windows, and all
currently passing unit tests continue to pass as expected on linux. I put in
some logic to check if we use an SSTableScanner after the corresponding SSTR
goes through the tidy process and, at least in all our unit tests, we don't
appear to try and use a scanner after deleting the underlying SSTR.
Attaching both a conservative (option 1) and invasive (option 3) patch. My
vote's for 1 but think there's merit in considering the implications of 3 and
our current refcounting (and the meta purpose it serves).
> Windows Unit tests and Dtests erroring due to sstable deleting task error
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-8019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8019
> Project: Cassandra
> Issue Type: Bug
> Environment: Windows 7
> Reporter: Philip Thompson
> Assignee: Joshua McKenzie
> Labels: windows
> Fix For: 2.1.1
>
> Attachments: 8019_aggressive_v1.txt, 8019_conservative_v1.txt
>
>
> Currently a large number of dtests and unit tests are erroring on windows
> with the following error in the node log:
> {code}
> ERROR [NonPeriodicTasks:1] 2014-09-29 11:05:04,383
> SSTableDeletingTask.java:89 - Unable to delete
> c:\\users\\username\\appdata\\local\\temp\\dtest-vr6qgw\\test\\node1\\data\\system\\local-7ad54392bcdd35a684174e047860b377\\system-local-ka-4-Data.db
> (it will be removed on server restart; we'll also retry after GC)\n
> {code}
> git bisect points to the following commit:
> {code}
> 0e831007760bffced8687f51b99525b650d7e193 is the first bad commit
> commit 0e831007760bffced8687f51b99525b650d7e193
> Author: Benedict Elliott Smith <[email protected]>
> Date: Fri Sep 19 18:17:19 2014 +0100
> Fix resource leak in event of corrupt sstable
> patch by benedict; review by yukim for CASSANDRA-7932
> :100644 100644 d3ee7d99179dce03307503a8093eb47bd0161681
> f55e5d27c1c53db3485154cd16201fc5419f32df M CHANGES.txt
> :040000 040000 194f4c0569b6be9cc9e129c441433c5c14de7249
> 3c62b53b2b2bd4b212ab6005eab38f8a8e228923 M src
> :040000 040000 64f49266e328b9fdacc516c52ef1921fe42e994f
> de2ca38232bee6d2a6a5e068ed9ee0fbbc5aaebe M test
> {code}
> You can reproduce this by running simple_bootstrap_test.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)