[ 
https://issues.apache.org/jira/browse/CASSANDRA-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960278#comment-14960278
 ] 

Benedict edited comment on CASSANDRA-10534 at 10/16/15 7:24 AM:
----------------------------------------------------------------

Hi [~sharvanath]: thanks for taking the time to strace this and find our (my) 
mistake. 

[~stefania]: thanks for providing a patch. -LGTM. I'll commit once we have 
clean CI results.-  Why did you put the {{sync()}} call in the finally block 
before the close, instead of in the try block? At the very least we should 
ensure the close is called after, but it seems easiest to place it in the try 
block.


was (Author: benedict):
Hi [~sharvanath]: thanks for taking the time to strace this and find our (my) 
mistake. 

[~stefania]: thanks for providing a patch. LGTM. I'll commit once we have clean 
CI results.

> CompressionInfo not being fsynced on close
> ------------------------------------------
>
>                 Key: CASSANDRA-10534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sharvanath Pathak
>            Assignee: Stefania
>             Fix For: 2.1.x
>
>
> I was seeing SSTable corruption due to a CompressionInfo.db file of size 0, 
> this happened multiple times in our testing with hard node reboots. After 
> some investigation it seems like these file is not being fsynced, and that 
> can potentially lead to data corruption. I am working with version 2.1.9.
> I checked for fsync calls using strace, and found them happening for all but 
> the following components: CompressionInfo, TOC.txt and digest.sha1. All of 
> these but the CompressionInfo seem tolerable. Also a quick look through the 
> code did not reveal any fsync calls. Moreover, I suspect the commit  
> 4e95953f29d89a441dfe06d3f0393ed7dd8586df 
> (https://github.com/apache/cassandra/commit/4e95953f29d89a441dfe06d3f0393ed7dd8586df#diff-b7e48a1398e39a936c11d0397d5d1966R344)
>  has caused the regression, which removed the line
> {noformat}
>  getChannel().force(true);
> {noformat}
> from CompressionMetadata.Writer.close.
> Following is the trace I saw in system.log:
> {noformat}
> INFO  [SSTableBatchOpen:1] 2015-09-29 19:24:39,170 SSTableReader.java:478 - 
> Opening 
> /var/lib/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-ka-13368
>  (79 bytes)
> ERROR [SSTableBatchOpen:1] 2015-09-29 19:24:39,177 FileUtils.java:447 - 
> Exiting forcefully due to file system exception on startup, disk failure 
> policy "stop"
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
>         at 
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:131)
>  ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:85)
>  ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:79)
>  ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:72)
>  ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:168)
>  ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:752) 
> ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:703) 
> ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:491) 
> ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:387) 
> ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:534) 
> ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
> [na:1.7.0_80]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
> [na:1.7.0_80]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_80]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_80]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
> Caused by: java.io.EOFException: null
>         at 
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340) 
> ~[na:1.7.0_80]
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589) 
> ~[na:1.7.0_80]
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564) 
> ~[na:1.7.0_80]
>         at 
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:106)
>  ~[apache-cassandra-2.1.9.jar:2.1.9]
>         ... 14 common frames omitted
> {noformat}
> Following is the result of ls on the data directory of a corrupted SSTable 
> after the hard reboot:
> {noformat}
> $ ls -l 
> /var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/
> total 60
> -rw-r--r-- 1 cassandra cassandra     0 Oct 15 09:31 
> system-sstable_activity-ka-1-CompressionInfo.db
> -rw-r--r-- 1 cassandra cassandra  9740 Oct 15 09:31 
> system-sstable_activity-ka-1-Data.db
> -rw-r--r-- 1 cassandra cassandra     0 Oct 15 09:31 
> system-sstable_activity-ka-1-Digest.sha1
> -rw-r--r-- 1 cassandra cassandra   880 Oct 15 09:31 
> system-sstable_activity-ka-1-Filter.db
> -rw-r--r-- 1 cassandra cassandra 34000 Oct 15 09:31 
> system-sstable_activity-ka-1-Index.db
> -rw-r--r-- 1 cassandra cassandra  7338 Oct 15 09:31 
> system-sstable_activity-ka-1-Statistics.db
> -rw-r--r-- 1 cassandra cassandra     0 Oct 15 09:31 
> system-sstable_activity-ka-1-TOC.txt
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to