Re: unnecessary tombstone's transmission during repair process

2012-10-19 Thread Alexey Zotov
Gus, we've found the cause.

It was a problem in Cassandra, but it has already been fixed in Cassandra
1.1.6.

Commit with the problem:
2c69e2ea757be9492a095aa22b5d51234c4b4102
You can see it at
https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch

Commit with the fix:
988ea81d409968614d84dacb3a022dcb156172c3
There is no JIRA ticket for that commit (at least I couldn't find
one).

Also, the clock on our client node was not synchronized with the clocks on
the Cassandra nodes. The client node was running a few minutes in the future.

So that's the cause of the streams we described during the repair process.

Thanks all for the discussion!


Re: unnecessary tombstone's transmission during repair process

2012-10-12 Thread Alexey Zotov
Sylvain,

I've looked at the code. Yes, you are right about the local deletion time. But
that contradicts our test results.

Do you have any thoughts on how to explain the result of the second test after
the patch was applied?


Our patch:

diff --git a/src/java/org/apache/cassandra/db/DeletedColumn.java b/src/java/org/apache/cassandra/db/DeletedColumn.java
index 18faeef..31744f6 100644
--- a/src/java/org/apache/cassandra/db/DeletedColumn.java
+++ b/src/java/org/apache/cassandra/db/DeletedColumn.java
@@ -17,10 +17,13 @@
  */
 package org.apache.cassandra.db;

+import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.security.MessageDigest;

 import org.apache.cassandra.config.CFMetaData;
 import org.apache.cassandra.db.marshal.MarshalException;
+import org.apache.cassandra.io.util.DataOutputBuffer;
 import org.apache.cassandra.utils.Allocator;
 import org.apache.cassandra.utils.ByteBufferUtil;
 import org.apache.cassandra.utils.HeapAllocator;
@@ -46,6 +49,25 @@ public class DeletedColumn extends Column
     }
 
     @Override
+    public void updateDigest(MessageDigest digest) {
+        digest.update(name.duplicate());
+        // it's commented to prevent consideration of the localDeletionTime in Merkle Tree construction
+        //digest.update(value.duplicate());
+
+        DataOutputBuffer buffer = new DataOutputBuffer();
+        try
+        {
+            buffer.writeLong(timestamp);
+            buffer.writeByte(serializationFlags());
+        }
+        catch (IOException e)
+        {
+            throw new RuntimeException(e);
+        }
+        digest.update(buffer.getData(), 0, buffer.getLength());
+    }
+
+    @Override
     public long getMarkedForDeleteAt()
     {
         return timestamp;




-- 

Best regards,

Zotov Alexey
Grid Dynamics
Skype: azotcsit


Re: unnecessary tombstone's transmission during repair process

2012-10-12 Thread Hiller, Dean
+1

I want to see how this plays out as well.  Anyone know the answer?

Dean

From: Alexey Zotov <azo...@griddynamics.com>
Reply-To: user@cassandra.apache.org
Date: Friday, October 12, 2012 1:33 AM
To: user@cassandra.apache.org
Subject: Re: unnecessary tombstone's transmission during repair process

diff --git a/src/java/org/apache/cassandra/db/DeletedColumn.java b/src/java/org/apache/cassandra/db/DeletedColumn.java
index 18faeef..31744f6 100644
--- a/src/java/org/apache/cassandra/db/DeletedColumn.java
+++ b/src/java/org/apache/cassandra/db/DeletedColumn.java
@@ -17,10 +17,13 @@
  */
 package org.apache.cassandra.db;

+import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.security.MessageDigest;

 import org.apache.cassandra.config.CFMetaData;
 import org.apache.cassandra.db.marshal.MarshalException;
+import org.apache.cassandra.io.util.DataOutputBuffer;
 import org.apache.cassandra.utils.Allocator;


unnecessary tombstone's transmission during repair process

2012-10-11 Thread Alexey Zotov
Hi Guys,

I have a question about Merkle tree construction and the repair process. When
a Merkle tree is built, hashes are calculated. For a DeletedColumn the hash
is calculated from its value, and the value of a DeletedColumn is the
serialized local deletion time. As far as we know, the local deletion time
can differ between nodes for the same tombstone, so the hashes of the same
tombstone on different nodes would differ. Is that true? I think the local
deletion time shouldn't be included in the hash calculation.
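
To make the concern concrete, here is a minimal standalone sketch, not Cassandra code. The Tombstone class, the digest() helper and the sample values are all hypothetical, and it assumes the local deletion time is a 4-byte int of seconds stored as the column value. It only shows that if two nodes did hold different local deletion times for the same tombstone, hashing the value would make their digests differ, while hashing only the name and timestamp would not.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class TombstoneDigestSketch {
    // Hypothetical stand-in for a tombstone; this is NOT Cassandra's DeletedColumn.
    static final class Tombstone {
        final byte[] name;
        final long timestamp;        // write timestamp of the delete
        final int localDeletionTime; // seconds; assumed to be the column value

        Tombstone(String name, long timestamp, int localDeletionTime) {
            this.name = name.getBytes(StandardCharsets.UTF_8);
            this.timestamp = timestamp;
            this.localDeletionTime = localDeletionTime;
        }
    }

    // Hashes name and timestamp, and optionally the serialized local deletion time.
    static byte[] digest(Tombstone t, boolean includeValue) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(t.name);
            if (includeValue) {
                md.update(ByteBuffer.allocate(4).putInt(t.localDeletionTime).array());
            }
            md.update(ByteBuffer.allocate(8).putLong(t.timestamp).array());
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same tombstone, but with deletion times that differ by 3 seconds (made-up values).
        Tombstone onNodeA = new Tombstone("col1", 1349942400000000L, 1349942400);
        Tombstone onNodeB = new Tombstone("col1", 1349942400000000L, 1349942403);

        System.out.println("equal with value:    "
                + Arrays.equals(digest(onNodeA, true), digest(onNodeB, true)));
        System.out.println("equal without value: "
                + Arrays.equals(digest(onNodeA, false), digest(onNodeB, false)));
    }
}
```

Whether the deletion times really differ between replicas is exactly the open question here; the sketch only covers the "if they do" case.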

We've run several tests:
// We have 3 nodes, RF=2, CL=QUORUM, so we have strong consistency.
1. Populate data on all nodes and run repair. No streams were
transmitted, which is the expected behaviour.
2. Then we removed some columns from some rows. No nodes were down and all
writes completed successfully. We ran repair and some streams were
transmitted. That seems strange to me, because all data should be consistent.

We've created a patch and applied it.
1. The result of the first test is the same.
2. The result of the second test: there were no unnecessary streams, as I
expected.


My question is:
Is the transmission of identical tombstones during repair a feature :)
or is it a bug?
If it's a bug, I'll create a ticket and attach the patch to it.


Re: unnecessary tombstone's transmission during repair process

2012-10-11 Thread Rob Coli
On Thu, Oct 11, 2012 at 8:41 AM, Alexey Zotov <azo...@griddynamics.com> wrote:
> Value of DeletedColumn is a serialized local
> deletion time. We know that local deletion time can be different on
> different nodes for the same tombstone. So hashes of the same tombstone on
> different nodes will be different. Is it true?

Yes, this seems correct based on my understanding of the process of
writing tombstones.

> I think that local deletion time shouldn't be considered in hash's
> calculation.

I think you are correct; the only thing that matters is whether the
tombstone exists or not. There may be something I am missing about why
the very-unlikely-to-be-identical value should be considered a merkle
tree failure.

https://issues.apache.org/jira/browse/CASSANDRA-2279

Seems related to this issue, fwiw.

> Is transmission of the equals tombstones during repair process a feature? :)
> or is it a bug?

I think it's a bug.

> If it's a bug, I'll create ticket and attach patch to it.

Yay!

=Rob

-- 
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: unnecessary tombstone's transmission during repair process

2012-10-11 Thread Sylvain Lebresne
> I have a question about merkle tree construction and repair process. When
> merkle tree is constructing it calculates hashes. For DeletedColumn it
> calculates hash using value. Value of DeletedColumn is a serialized local
> deletion time.

The deletion time is not local to each replica; it's computed
only once by the coordinator node that received the deletion
initially.

> We know that local deletion time can be different on
> different nodes for the same tombstone.

Given the above, no it cannot.
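
As a rough illustration of that point, here is a toy sketch and not Cassandra's actual write path. The DeleteMutation and Replica classes and the coordinate() helper are hypothetical names made up for this example. The coordinator reads its clock once, puts the resulting deletion time into the mutation, and each replica stores what it receives without consulting its own clock.

```java
import java.util.ArrayList;
import java.util.List;

class CoordinatorDeletionTimeSketch {
    // Hypothetical message sent from the coordinator to every replica.
    static final class DeleteMutation {
        final String columnName;
        final long timestampMicros;
        final int localDeletionTime; // seconds, fixed by the coordinator

        DeleteMutation(String columnName, long timestampMicros, int localDeletionTime) {
            this.columnName = columnName;
            this.timestampMicros = timestampMicros;
            this.localDeletionTime = localDeletionTime;
        }
    }

    // Hypothetical replica: stores the mutation as received, never reads its own clock.
    static final class Replica {
        DeleteMutation stored;

        void apply(DeleteMutation m) {
            stored = m;
        }
    }

    // The coordinator computes the deletion time exactly once per delete.
    static DeleteMutation coordinate(String columnName, long timestampMicros,
                                     long coordinatorClockMillis) {
        int localDeletionTime = (int) (coordinatorClockMillis / 1000L);
        return new DeleteMutation(columnName, timestampMicros, localDeletionTime);
    }

    public static void main(String[] args) {
        List<Replica> replicas = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            replicas.add(new Replica()); // RF=2, as in the test setup
        }

        DeleteMutation m = coordinate("col1", 1349942400000000L, System.currentTimeMillis());
        for (Replica r : replicas) {
            r.apply(m);
        }

        // Both replicas hold the same deletion time, whatever their own clocks say.
        System.out.println(replicas.get(0).stored.localDeletionTime
                == replicas.get(1).stored.localDeletionTime);
    }
}
```

Under this model the tombstone value is byte-identical on every replica, so hashing it would not by itself make the Merkle trees differ.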

--
Sylvain