[ 
https://issues.apache.org/jira/browse/CASSANDRA-8019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206856#comment-14206856
 ] 

Joshua McKenzie edited comment on CASSANDRA-8019 at 11/11/14 7:17 PM:
----------------------------------------------------------------------

v3 attached.  It adds refcounting on the SSTR from within SSTableScanner and 
updates SSTableRewriterTest to use try-with-resources for CompactionControllers 
and Scanners.  All unit tests pass on Linux, the dtest failures match those in 
the CI environment, and "Unable to delete" errors in the Windows unit tests on 
the 2.1 branch are greatly reduced.  I still see some "Unable to delete" 
messages at runtime when forcing compaction on a loaded system, but those are 
also reduced and I'll track them down in a separate effort.
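
As a minimal sketch of why try-with-resources helps with ordering: Java closes 
resources declared in one try header in the reverse of their declaration order, 
so a scanner declared after a controller is always closed first.  The Resource 
class and the names "controller" and "scanner" below are hypothetical 
stand-ins, not the actual CompactionController or SSTableScanner classes.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderSketch {
    // Records open/close events so the ordering can be inspected.
    static final List<String> events = new ArrayList<>();

    /** Hypothetical resource; it only logs when it is opened and closed. */
    static final class Resource implements AutoCloseable {
        private final String name;

        Resource(String name) {
            this.name = name;
            events.add("open " + name);
        }

        @Override
        public void close() {
            events.add("close " + name);
        }
    }

    static List<String> run() {
        events.clear();
        // Resources are closed in reverse declaration order: scanner, then controller.
        try (Resource controller = new Resource("controller");
             Resource scanner = new Resource("scanner")) {
            events.add("work");
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        // prints [open controller, open scanner, work, close scanner, close controller]
        System.out.println(run());
    }
}
```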

I chose refcounting over simply changing the ordering in CompactionTask 
because we need to codify the ordering relationship between scanners and 
sstables in order to prevent this type of "error" in the future.

The SSTableScanner relies on internal data structures within the SSTR.  The 
previous code does hold the reference open and prevent GC, both through the 
pointer it keeps internally and through the ifile and dfile references, but our 
previous logical structure, in which there was no relationship between 
SSTableScanners being open and SSTR deletion, was misleading.  We replicate 
some of the references in the scanner, so the SSTR can technically be deleted 
out of order, and we rely on the filesystem to keep the file open while we have 
a handle to it; even so, a clearer relationship between the components is 
preferable IMO.
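
To illustrate the general idea (this is not the code in the attached patch), 
here is a rough sketch of a scanner that takes a reference on its table when it 
is opened and releases it on close, so that deletion is deferred until the last 
reference is gone.  Table, Scanner, acquire(), release() and markObsolete() are 
hypothetical names chosen for the example and are not claimed to match the 
SSTableReader or SSTableScanner APIs.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-ins; these are NOT Cassandra's SSTableReader/SSTableScanner. */
public class RefCountSketch {

    /** Stand-in for a table whose files are deleted once the last reference is released. */
    static final class Table {
        // Starts at 1: the reference held by the owner of the table.
        private final AtomicInteger references = new AtomicInteger(1);
        private volatile boolean markedObsolete = false;
        private volatile boolean deleted = false;

        /** Try to take a reference; fails once the count has already reached zero. */
        boolean acquire() {
            while (true) {
                int n = references.get();
                if (n <= 0) return false;
                if (references.compareAndSet(n, n + 1)) return true;
            }
        }

        /** Drop a reference; the last release of an obsolete table performs the delete. */
        void release() {
            if (references.decrementAndGet() == 0 && markedObsolete) {
                deleted = true; // real code would delete the on-disk files here
            }
        }

        void markObsolete() { markedObsolete = true; }

        boolean isDeleted() { return deleted; }
    }

    /** Stand-in for a scanner that pins its table for as long as it is open. */
    static final class Scanner implements AutoCloseable {
        private final Table table;
        private boolean closed = false;

        Scanner(Table table) {
            if (!table.acquire()) throw new IllegalStateException("table already released");
            this.table = table;
        }

        @Override
        public void close() {
            if (closed) return; // closing twice must not release twice
            closed = true;
            table.release();
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        try (Scanner scanner = new Scanner(table)) {
            table.markObsolete();
            table.release(); // the owner drops its reference while the scanner is open
            System.out.println("deleted while scanner open: " + table.isDeleted());
        }
        System.out.println("deleted after scanner closed: " + table.isDeleted());
    }
}
```

In this sketch the delete cannot run while any scanner is open, regardless of 
the order in which the caller releases the table and closes the scanner.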

[~jbellis]: I put you on this as reviewer when I was leaning towards the 
log-suppression route, as that was a trivial effort.  [~krummas]: would you be 
willing to review this, since you've been in the compaction and SSTableRewriter 
space recently?

Edit: I should note that while this is a symptom we see specifically on Windows 
on the 2.1 branch, it isn't so much a Windows issue as a resource-ordering 
issue centered around the compaction process and SSTableScanners.


> Windows Unit tests and Dtests erroring due to sstable deleting task error
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8019
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8019
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Windows 7
>            Reporter: Philip Thompson
>            Assignee: Joshua McKenzie
>              Labels: windows
>             Fix For: 2.1.3
>
>         Attachments: 8019_aggressive_v1.txt, 8019_conservative_v1.txt, 
> 8019_v2.txt, 8019_v3.txt
>
>
> Currently a large number of dtests and unit tests are erroring on Windows 
> with the following error in the node log:
> {code}
> ERROR [NonPeriodicTasks:1] 2014-09-29 11:05:04,383 
> SSTableDeletingTask.java:89 - Unable to delete 
> c:\\users\\username\\appdata\\local\\temp\\dtest-vr6qgw\\test\\node1\\data\\system\\local-7ad54392bcdd35a684174e047860b377\\system-local-ka-4-Data.db
>  (it will be removed on server restart; we'll also retry after GC)\n
> {code}
> git bisect points to the following commit:
> {code}
> 0e831007760bffced8687f51b99525b650d7e193 is the first bad commit
> commit 0e831007760bffced8687f51b99525b650d7e193
> Author: Benedict Elliott Smith <bened...@apache.org>
> Date:  Fri Sep 19 18:17:19 2014 +0100
>     Fix resource leak in event of corrupt sstable
>     patch by benedict; review by yukim for CASSANDRA-7932
> :100644 100644 d3ee7d99179dce03307503a8093eb47bd0161681 
> f55e5d27c1c53db3485154cd16201fc5419f32df M      CHANGES.txt
> :040000 040000 194f4c0569b6be9cc9e129c441433c5c14de7249 
> 3c62b53b2b2bd4b212ab6005eab38f8a8e228923 M  src
> :040000 040000 64f49266e328b9fdacc516c52ef1921fe42e994f 
> de2ca38232bee6d2a6a5e068ed9ee0fbbc5aaebe M  test
> {code}
> You can reproduce this by running simple_bootstrap_test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
