[ 
https://issues.apache.org/jira/browse/CASSANDRA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992551#comment-12992551
 ] 

Jonathan Ellis commented on CASSANDRA-2128:
-------------------------------------------

Paul sent me one of the CommitLog files exhibiting this problem.  It is 
99614720 bytes long.

I added one more log entry when replaying it:

{code}
                    logger.debug("checksum of " + bytes.length + " successful: 
" + claimedCRC32);
                    /* deserialize the commit log entry */
{code}

The last entries in the log before dying are

{noformat}
DEBUG [main] 2011-02-09 09:39:15,577 CommitLog.java (line 213) Reading mutation 
at 97750254
DEBUG [main] 2011-02-09 09:39:15,577 CommitLog.java (line 240) checksum of 1209 
successful: 2281213824
DEBUG [main] 2011-02-09 09:39:15,578 CommitLog.java (line 244) replaying 
mutation for MonitorApp...
DEBUG [main] 2011-02-09 09:39:15,578 CommitLog.java (line 213) Reading mutation 
at 97751479
DEBUG [main] 2011-02-09 09:39:15,580 CommitLog.java (line 240) checksum of 0 
successful: 0
 INFO [main] 2011-02-09 09:39:15,604 CommitLog.java (line 253) Finished reading 
/var/lib/cassandra/commitlog/CommitLog-1297099954252.log
ERROR [main] 2011-02-09 09:39:15,652 CassandraDaemon.java (line 242) Exception 
encountered during startup.
java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
{noformat}

This is a good 2MB before the end of the file.  So I looked in a hex editor to 
see what was there, starting at offset 97751479.  There's about 8KB of pure 00 
bytes, followed by nonsense that occasionally looks like a RowMutation (some 
appearances of column names cnt, avg, dev that are present in the early, intact 
part of the commitlog) but mostly does not: there are no more appearances of 
the keyspace name, MonitorApp, which is the first part of any real RowMutation 
in the 0.6 serialization format, and in one place there is even the string 
"java.util.BitSet" visible, which should only appear in the commitlog segment 
header at the start of the file (that we rewrite at each flush).

To me it looks like "EC2 can write a bunch of nonsense in your commitlog during 
a hard failure," and the fix is to make the checksum process more robust 
against nonsense that happens to conform to the first couple bytes we read for 
a RowMutation.

> Corrupted Commit logs
> ---------------------
>
>                 Key: CASSANDRA-2128
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2128
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.6.11
>         Environment: cassandra-0.6 @ r1064246 (0.6.11)
> Ubuntu 9.10
> Rackspace Cloud
>            Reporter: Paul Querna
>            Assignee: Gary Dusbabek
>
> Two of our nodes had a hard failure.
> They both came up with a corrupted commit log.
> On startup we get this:
> {quote}
> 011-02-07_19:34:03.95124 INFO - Finished reading 
> /var/lib/cassandra/commitlog/CommitLog-1297099954252.log
> 2011-02-07_19:34:03.95400 ERROR - Exception encountered during startup.
> 2011-02-07_19:34:03.95403 java.io.EOFException
> 2011-02-07_19:34:03.95403     at 
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
> 2011-02-07_19:34:03.95404     at 
> java.io.DataInputStream.readUTF(DataInputStream.java:572)
> 2011-02-07_19:34:03.95405     at 
> java.io.DataInputStream.readUTF(DataInputStream.java:547)
> 2011-02-07_19:34:03.95406     at 
> org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:363)
> 2011-02-07_19:34:03.95407     at 
> org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:318)
> 2011-02-07_19:34:03.95408     at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:240)
> 2011-02-07_19:34:03.95409     at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:172)
> 2011-02-07_19:34:03.95409     at 
> org.apache.cassandra.thrift.CassandraDaemon.setup(CassandraDaemon.java:115)
> 2011-02-07_19:34:03.95410     at 
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:224)
> 2011-02-07_19:34:03.95422 Exception encountered during startup.
> 2011-02-07_19:34:03.95436 java.io.EOFException
> 2011-02-07_19:34:03.95447     at 
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
> 2011-02-07_19:34:03.95458     at 
> java.io.DataInputStream.readUTF(DataInputStream.java:572)
> 2011-02-07_19:34:03.95468     at 
> java.io.DataInputStream.readUTF(DataInputStream.java:547)
> 2011-02-07_19:34:03.95478     at 
> org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:363)
> 2011-02-07_19:34:03.95489     at 
> org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:318)
> 2011-02-07_19:34:03.95499     at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:240)
> 2011-02-07_19:34:03.95510     at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:172)
> 2011-02-07_19:34:03.95521     at 
> org.apache.cassandra.thrift.CassandraDaemon.setup(CassandraDaemon.java:115)
> 2011-02-07_19:34:03.95531     at 
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:224)
> {quote}
> On node A, the commit log in question is 100mb.
> On node B, the commit log in question is 60mb.
> An ideal resolution would be if EOF is hit early, log something, but don't 
> stop the startup.  Instead process everything that we have done so far, and 
> keep going.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to