Re: unnecessary tombstone's transmission during repair process
Guys, we've found the cause. It was a bug in Cassandra, but it has already been fixed in Cassandra 1.1.6.
Commit with the problem: 2c69e2ea757be9492a095aa22b5d51234c4b4102 (you can see it at https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch)
Commit with the fix: 988ea81d409968614d84dacb3a022dcb156172c3
There is no JIRA ticket for that commit (at least I couldn't find one).
Also, our client node's clock was not synchronized with the Cassandra nodes: the client was running a few minutes in the future. So that is the cause of the streams we described during the repair process. Thanks all for the discussion!
Re: unnecessary tombstone's transmission during repair process
Sylvain, I've looked at the code. Yes, you're right about the local deletion time. But that contradicts the test results. Do you have any thoughts on how to explain the result of the second test after applying the patch?

Our patch:

diff --git a/src/java/org/apache/cassandra/db/DeletedColumn.java b/src/java/org/apache/cassandra/db/DeletedColumn.java
index 18faeef..31744f6 100644
--- a/src/java/org/apache/cassandra/db/DeletedColumn.java
+++ b/src/java/org/apache/cassandra/db/DeletedColumn.java
@@ -17,10 +17,13 @@
  */
 package org.apache.cassandra.db;
 
+import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.security.MessageDigest;
 
 import org.apache.cassandra.config.CFMetaData;
 import org.apache.cassandra.db.marshal.MarshalException;
+import org.apache.cassandra.io.util.DataOutputBuffer;
 import org.apache.cassandra.utils.Allocator;
 import org.apache.cassandra.utils.ByteBufferUtil;
 import org.apache.cassandra.utils.HeapAllocator;
@@ -46,6 +49,25 @@ public class DeletedColumn extends Column
     }
 
     @Override
+    public void updateDigest(MessageDigest digest) {
+        digest.update(name.duplicate());
+        // it's commented to prevent consideration of the localDeletionTime in Merkle Tree construction
+        //digest.update(value.duplicate());
+
+        DataOutputBuffer buffer = new DataOutputBuffer();
+        try
+        {
+            buffer.writeLong(timestamp);
+            buffer.writeByte(serializationFlags());
+        }
+        catch (IOException e)
+        {
+            throw new RuntimeException(e);
+        }
+        digest.update(buffer.getData(), 0, buffer.getLength());
+    }
+
+    @Override
     public long getMarkedForDeleteAt()
     {
         return timestamp;

--
Best regards,
Zotov Alexey
Grid Dynamics
Skype: azotcsit
Re: unnecessary tombstone's transmission during repair process
+1 I want to see how this plays out as well. Does anyone know the answer?
Dean

From: Alexey Zotov <azo...@griddynamics.com>
Reply-To: user@cassandra.apache.org
Date: Friday, October 12, 2012 1:33 AM
To: user@cassandra.apache.org
Subject: Re: unnecessary tombstone's transmission during repair process
unnecessary tombstone's transmission during repair process
Hi Guys, I have a question about Merkle tree construction and the repair process. When a Merkle tree is constructed, it calculates hashes. For a DeletedColumn it calculates the hash using the value. The value of a DeletedColumn is a serialized local deletion time. We know that the local deletion time can differ between nodes for the same tombstone, so the hashes of the same tombstone on different nodes will differ. Is that true? I think the local deletion time shouldn't be considered in the hash calculation.

We've run several tests (we have 3 nodes, RF=2, CL=QUORUM, so we have strong consistency):
1. Populate data on all nodes. Run repair. No streams were transmitted. This is the expected behaviour.
2. Then we removed some columns from some rows. No nodes were down, and all writes succeeded. We ran repair. There were some streams. That seems strange to me, because all the data should be consistent.

We then created a patch and applied it:
1. The result of the first test is the same.
2. The result of the second test: there were no unnecessary streams, as I expected.

My question is: is the transmission of equal tombstones during repair a feature? :) Or is it a bug? If it's a bug, I'll create a ticket and attach the patch to it.
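To make the concern in this message concrete, here is a minimal Java sketch of how a per-column digest behaves when a node-specific value is folded into it. It is a simplified model under stated assumptions, not Cassandra's code: the class `TombstoneDigestSketch`, its method, the sample values, and the choice of SHA-256 are all hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical, simplified model of a tombstone digest; not Cassandra's DeletedColumn.
final class TombstoneDigestSketch
{
    // Hashes the column name and write timestamp, and optionally the
    // local deletion time (seconds since the epoch).
    static byte[] digest(String name, long timestamp, int localDeletionTime, boolean includeLocalDeletionTime)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update(ByteBuffer.allocate(8).putLong(0, timestamp));
            if (includeLocalDeletionTime)
                md.update(ByteBuffer.allocate(4).putInt(0, localDeletionTime));
            return md.digest();
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args)
    {
        // Same tombstone (name, timestamp), but deletion times 3 seconds apart.
        int onNodeA = 1349940000;
        int onNodeB = 1349940003;

        boolean sameWithTime = Arrays.equals(digest("col1", 42L, onNodeA, true),
                                             digest("col1", 42L, onNodeB, true));
        boolean sameWithoutTime = Arrays.equals(digest("col1", 42L, onNodeA, false),
                                                digest("col1", 42L, onNodeB, false));

        System.out.println("digests match when deletion time is hashed: " + sameWithTime);
        System.out.println("digests match when deletion time is skipped: " + sameWithoutTime);
    }
}
```

In this model the digests only diverge if the deletion times really do differ between nodes, which is exactly the premise under question in this thread.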
Re: unnecessary tombstone's transmission during repair process
On Thu, Oct 11, 2012 at 8:41 AM, Alexey Zotov azo...@griddynamics.com wrote:
> Value of DeletedColumn is a serialized local deletion time. We know that local deletion time can be different on different nodes for the same tombstone. So hashes of the same tombstone on different nodes will be different. Is it true?

Yes, this seems correct based on my understanding of the process of writing tombstones.

> I think that local deletion time shouldn't be considered in hash's calculation.

I think you are correct; the only thing that matters is whether the tombstone exists or not. There may be something I am missing about why the very-unlikely-to-be-identical value should be considered a merkle tree failure.

https://issues.apache.org/jira/browse/CASSANDRA-2279 seems related to this issue, fwiw.

> Is transmission of the equals tombstones during repair process a feature? :) or is it a bug?

I think it's a bug.

> If it's a bug, I'll create ticket and attach patch to it.

Yay!

=Rob

--
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
Re: unnecessary tombstone's transmission during repair process
> I have a question about merkle tree construction and repair process. When mercle tree is constructing it calculates hashes. For DeletedColumn it calculates hash using value. Value of DeletedColumn is a serialized local deletion time.

The deletion time is not local to each replica; it's computed only once, by the coordinator node that initially received the deletion.

> We know that local deletion time can be different on different nodes for the same tombstone.

Given the above, no, it cannot.

--
Sylvain
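The point made in this reply can be sketched as follows: if the deletion time is computed once on the coordinator and travels inside the replicated bytes, every replica ends up with the same value regardless of its own clock. This is a hypothetical illustration only; `CoordinatorStampSketch`, its methods, and the 12-byte layout are assumptions for the example and do not reflect Cassandra's actual serialization.

```java
import java.nio.ByteBuffer;

// Hypothetical model of "stamp once on the coordinator, replicate the same bytes".
final class CoordinatorStampSketch
{
    // Coordinator side: compute the deletion time (in seconds) exactly once and
    // pack it, together with the write timestamp, into the bytes sent to replicas.
    static byte[] stampOnCoordinator(long timestamp, long coordinatorNowMillis)
    {
        int localDeletionTime = (int) (coordinatorNowMillis / 1000);
        return ByteBuffer.allocate(12).putLong(timestamp).putInt(localDeletionTime).array();
    }

    // Replica side: read the deletion time out of the received bytes.
    // The replica's own clock is never consulted.
    static int deletionTimeOnReplica(byte[] received)
    {
        return ByteBuffer.wrap(received).getInt(8);
    }

    public static void main(String[] args)
    {
        byte[] message = stampOnCoordinator(1349940000000000L, 1349940000123L);

        // Each replica gets its own copy of the same bytes.
        int onReplicaA = deletionTimeOnReplica(message.clone());
        int onReplicaB = deletionTimeOnReplica(message.clone());

        System.out.println(onReplicaA == onReplicaB); // prints "true"
    }
}
```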