column.name.remaining() <= 0 deep in mutation stage (+ maybe other effects)
---------------------------------------------------------------------------
Key: CASSANDRA-1860
URL: https://issues.apache.org/jira/browse/CASSANDRA-1860
Project: Cassandra
Issue Type: Bug
Reporter: Peter Schuller
I suspect there is a lingering ByteBuffer consumption issue somewhere. I'll
start with the most direct indication of this, which is the following assertion
failure (seen on yesterday's 0.7):
2010-12-14T08:11:43.656+00:00 127.0.0.1 java.lang.AssertionError
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:44)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:35)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:367)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:378)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:356)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.RowMutation.makeRowMutationMessage(RowMutation.java:223)
2010-12-14T08:11:43.656+00:00 127.0.0.1 at org.apache.cassandra.db.RowMutation.makeRowMutationMessage(RowMutation.java:216)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:136)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:442)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:379)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:2952)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
2010-12-14T08:11:43.657+00:00 127.0.0.1 at java.lang.Thread.run(Thread.java:662)
That assertion failure is:
assert column.name().remaining() > 0;
While I have not absolutely confirmed that we are not accidentally sending
empty column names, I don't believe that is the case; regardless, Cassandra
does try to validate RowMutations, and part of that validation is that column
names are non-empty (ThriftValidation.validateColumns()).
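To illustrate the kind of ByteBuffer consumption bug I suspect, here is a
minimal, self-contained sketch (not the actual Cassandra code path; the class
and helper names are made up for illustration): a name buffer that passes a
non-empty validation check can still show remaining() == 0 by the time it is
serialized, if some intermediate step reads it with relative get()s instead of
working on a duplicate().

import java.nio.ByteBuffer;

public class BufferConsumptionDemo
{
    // Stand-in for a validation step like ThriftValidation.validateColumns():
    // at this point the name is clearly non-empty.
    static void validate(ByteBuffer name)
    {
        if (name.remaining() == 0)
            throw new IllegalArgumentException("empty column name");
    }

    // Stand-in for some intermediate step that reads the shared buffer with
    // relative get()s and never restores (or duplicates) it.
    static void inspect(ByteBuffer name)
    {
        while (name.hasRemaining())
            name.get(); // each relative get() advances position, "consuming" the buffer
    }

    public static void main(String[] args)
    {
        ByteBuffer name = ByteBuffer.wrap("colname".getBytes());
        validate(name);              // passes: remaining() == 7
        inspect(name);               // side effect: position == limit
        assert name.remaining() > 0; // fails (with -ea), like ColumnSerializer.serialize()
        // Safe pattern: hand out name.duplicate() so the shared buffer's position
        // is never mutated by readers.
    }
}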
Some more information:
This is on a three-node test cluster. We have a small number of processes
doing nothing but insertions via insert() (through pycassa). As far as I can
tell, the column validation code triggers regardless of whether you use
batch_mutate() or insert() (pycassa optimizes the single-column case by using
insert()).
The interesting thing is that the three nodes in the cluster trigger different
issues. On the other two nodes we had already observed two different stack
traces:
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=798579
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:388)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:398)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
And the other one:
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:265)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:95)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:388)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:398)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
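Purely as an illustration of how a sender-side buffer handling problem could
surface as a receiver-side EOFException like the one above (a guess, not a
confirmed code path; the class name and flow below are made up), here is a
sketch where the length prefix is taken before the buffer gets partially
consumed, so readFully() on the other end runs out of bytes:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class TruncatedPayloadDemo
{
    public static void main(String[] args) throws IOException
    {
        ByteBuffer value = ByteBuffer.wrap("somevalue".getBytes()); // 9 bytes

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // "Sender": record the length while the buffer is still intact...
        out.writeInt(value.remaining()); // declares 9 bytes
        // ...then something consumes part of the buffer before it is written out...
        value.get(new byte[4]); // accidental relative read; remaining() is now 5
        // ...so fewer bytes than declared end up on the wire.
        out.write(value.array(), value.position(), value.remaining());

        // "Receiver": trusts the declared length (as readByteArray()/readFully() does)
        // and hits end-of-stream after only 5 of the 9 expected bytes.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        byte[] received = new byte[in.readInt()];
        in.readFully(received); // throws java.io.EOFException
    }
}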
Originally I was only looking at these latter two stack traces, wondering
whether they might be the result of a schema mismatch caused by a bootstrapping
mistake w.r.t. seed nodes joining the cluster. However, the schema matches on
all nodes in the cluster, and we keep seeing those exceptions even after a full
repair/compact on the entire cluster (which completes without failures, so all
data can be read without problems) and after removing the hints sstables to
clear out any lingering mutations.
Today I noticed that the third node, which does NOT see the above two issues,
reports quite a number of exceptions of the assertion-failure kind. I am not
sure why the behavior is inconsistent; we have three hosts running the
python/pycassa clients spread over several processes, and tcpdump indicates
Thrift RPC calls are definitely flowing to hosts other than the third one (the
one that sees the assertion failure). All nodes are running the same version of
Cassandra (same Debian package, and definitely restarted since the latest
upgrade - I know that because I restarted the nodes while testing the schema
change issue I have filed separately).
At this moment I don't have better information than the above, and I have not
reproduced it elsewhere, e.g. on an independent cluster.