[ 
https://issues.apache.org/jira/browse/CASSANDRA-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448022#comment-13448022
 ] 

Aaron Morton commented on CASSANDRA-4601:
-----------------------------------------

Thanks
                
> Ensure unique commit log file names
> -----------------------------------
>
>                 Key: CASSANDRA-4601
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4601
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.0
>         Environment: Sun JVM 1.6.33 / Ubuntu 10.04.4 LTS 
>            Reporter: Aaron Morton
>            Assignee: Aaron Morton
>            Priority: Critical
>             Fix For: 1.1.5
>
>         Attachments: cassandra-1.1-4601.patch
>
>
> The commit log segment name uses System.nanoTime() as part of the file name. 
> There is no guarantee that successive calls to nanoTime() will return 
> different values, and on less than optimal hypervisors repeats happen a lot. 
> I observed the following in the wild:
> {code:java}
> ERROR [COMMIT-LOG-ALLOCATOR] 2012-08-31 15:56:49,815 
> AbstractCassandraDaemon.java (line 134) Exception in thread 
> Thread[COMMIT-LOG-ALLOCATOR,5,main]
> java.lang.AssertionError: attempted to delete non-existing file 
> CommitLog-13926764209796414.log
>         at 
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:68)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogSegment.discard(CommitLogSegment.java:172)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogAllocator$4.run(CommitLogAllocator.java:223)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
>         at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>         at java.lang.Thread.run(Unknown Source)
> {code}
> My _assumption_ is that this was caused by duplicate file names, as this is 
> on a hypervisor that is less than optimal. 
>  
> After a while (about 30 minutes) mutations stopped being processed and the 
> pending count skyrocketed. I _think_ this was because log writing was 
> blocked trying to get a new segment, so writers could not submit to the 
> commit log queue. The only way to stop the affected nodes was kill -9. 
> Over about 24 hours this happened 5 times. I have deployed a patch that has 
> been running for 12 hours without incident and will attach it. 
> The affected nodes could still read, and I'm checking logs to see how the 
> other nodes handled the situation.
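
For illustration only, here is a minimal sketch of one way to keep segment ids unique when the clock repeats a value. It is not taken from the attached patch; the class SegmentIdGenerator and its methods are hypothetical names, and only the "CommitLog-<id>.log" naming pattern comes from the stack trace above. It assumes Java 8 or later for AtomicLong.updateAndGet.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper; an assumption for illustration, not Cassandra's code. */
final class SegmentIdGenerator {
    // Last id handed out by this generator.
    private final AtomicLong lastId;

    SegmentIdGenerator(long seed) {
        this.lastId = new AtomicLong(seed);
    }

    /** Returns an id based on System.nanoTime() that is never repeated. */
    long nextId() {
        return nextId(System.nanoTime());
    }

    /**
     * Uses the clock reading when it is ahead of the last id,
     * otherwise steps one past the last id, so ids strictly increase.
     */
    long nextId(long now) {
        return lastId.updateAndGet(prev -> Math.max(prev + 1, now));
    }

    /** Builds a file name in the pattern seen in the log above. */
    static String fileName(long id) {
        return "CommitLog-" + id + ".log";
    }
}
```

With this approach two calls that see the same clock value still produce different file names, at the cost of the id no longer being an exact timestamp.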
