How to change the seed node Cassandra 1.0.11

2012-10-23 Thread Roshan
Hi

In our production, we have 3 Cassandra 1.0.11 nodes.

For a particular reason, I want to move the seed role to another node, and once
the seed has been changed, remove the previous node from the cluster.

How can I do that?

Thanks. 






Re: What does ReadRepair exactly do?

2012-10-23 Thread aaron morton
Yes, all this starts because of the call to filter.collateColumns()…

The ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer; the 
methods to add columns on that interface pass through to an implementation of 
ISortedColumns. 

The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will call 
reconcile() on the IColumn if they need to. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 4:45 AM, Manu Zhang owenzhang1...@gmail.com wrote:

 Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE) and 
 then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what 
 happens after that. How exactly does reconcile() get called?
 
 On Mon, Oct 22, 2012 at 6:49 AM, aaron morton aa...@thelastpickle.com wrote:
 There are two processes in cassandra that trigger Read Repair like behaviour. 
 
 During a read, a DigestMismatchException is raised if the responses from the replicas 
 do not match. In this case another read is run that involves reading all the 
 data. This is the CL agreement kicking in. 
 
 The other Read Repair is the one controlled by the read_repair_chance. 
 When RR is active on a request ALL up replicas are involved in the read. When 
 RR is not active only CL replicas are involved. The test for CL agreement 
 occurs synchronously with the request; the RR check waits, asynchronously to the 
 request, for all nodes in the request to return. It then checks for 
 consistency and repairs differences. 
 
 From looking at the source code, I do not understand how this set is built 
 and I do not understand how the reconciliation is executed.
 When a DigestMismatch is detected a read is run using RepairCallback. The 
 callback will call the RowRepairResolver.resolve() when enough responses have 
 been collected. 
 
 resolveSuperset() picks one response as the baseline, and then calls delete() 
 to apply row level deletes from the other responses (ColumnFamilies). It 
 collects the other CFs into an iterator with a filter that returns all 
 columns. The columns are then applied to the baseline CF, which may result in 
 reconcile() being called. 
 
 reconcile() is used when an AbstractColumnContainer has two versions of a 
 column and wants to keep only one. 
 
 RowRepairResolver.scheduleRepairs() works out the delta for each node by 
 calling ColumnFamily.diff(). The delta is then sent to the appropriate node.
 
 
 Hope that helps. 
 
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 19/10/2012, at 6:33 AM, Markus Klems markuskl...@gmail.com wrote:
 
 Hi guys,
 
 I am looking through the Cassandra source code in the github trunk to better 
 understand how Cassandra's fault-tolerance mechanisms work. Most things make 
 sense. I am also aware of the wiki and DataStax documentation. However, I do 
 not understand what read repair does in detail. The method 
 RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to 
 do the trick of merging conflicting versions of column family replicas and 
 builds the set of columns that need to be repaired. From looking at the 
 source code, I do not understand how this set is built and I do not 
 understand how the reconciliation is executed. ReadRepair does not seem to 
 trigger a Column.reconcile() to reconcile conflicting column versions on 
 different servers. Does it?
 
 If this is not what read repair does, then: What kind of inconsistencies are 
 resolved by read repair? And: How are the inconsistencies resolved?
 
 Could someone give me a hint?
 
 Thanks so much,
 
 -Markus
 
 



Re: Node Dead/Up

2012-10-23 Thread Jason Wee
Check 10.50.10.21 to see what the system load is.

On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote:

 Hello,

 I'm on version 1.0.11.

 I'm seeing this in my system log with occasional frequency:

 INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818)
 InetAddress /10.50.10.21 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804)
 InetAddress /10.50.10.21 is now UP


 INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java
 (line 228) Streaming to /10.50.10.25 --this line included for context
 INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818)
 InetAddress /10.50.10.25 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804)
 InetAddress /10.50.10.25 is now UP
 INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249
 AntiEntropyService.java (line 233) [repair
 #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree
 to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included
 for context

 What is this telling me? Is my network dropping for less than a
 second? Are my nodes really dead and then up? Can someone shed some
 light on this for me?

 cheers,
 Jason



Re: tuning for read performance

2012-10-23 Thread aaron morton
 and nodetool tpstats always shows pending tasks in the ReadStage.
Are clients reading a single row at a time or multiple rows? Each row 
requested in a multi-get becomes a task in the read stage. 

Also look at the type of query you are sending. I talked a little about the 
performance of different query techniques at Cassandra SF: 
http://www.datastax.com/events/cassandrasummit2012/presentations

 
 1. Consider Leveled compaction instead of Size Tiered.  LCS improves
 read performance at the cost of more writes.
I would look at other options first. 
If you want to know how many SSTables a read is hitting, look at nodetool 
cfhistograms.
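
For example (keyspace and column family names are placeholders):

    # shows, per column family, read/write latency histograms and how many
    # SSTables each recent read had to touch
    nodetool -h localhost cfhistograms MyKeyspace MyColumnFamily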

 2. You said skinny column family which I took to mean not a lot of
 columns/row.  See if you can organize your data into wider rows which
 allow reading fewer rows and thus fewer queries/disk seeks.

Wide rows take longer to read than narrow ones, so artificially wide rows may 
end up slower than reading several narrow ones. 


 4. Splitting your data from your MetaData could definitely help.  I
 like separating my read heavy from write heavy CF's because generally
 speaking they benefit from different compaction methods.  But don't go
 crazy creating 1000's of CF's either.

+1
25 ms read latency is high. 

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 9:06 AM, Aaron Turner synfina...@gmail.com wrote:

 On Mon, Oct 22, 2012 at 11:05 AM, feedly team feedly...@gmail.com wrote:
 Hi,
I have a small 2 node cassandra cluster that seems to be constrained by
 read throughput. There are about 100 writes/s and 60 reads/s mostly against
 a skinny column family. Here's the cfstats for that family:
 
 SSTable count: 13
 Space used (live): 231920026568
 Space used (total): 231920026568
 Number of Keys (estimate): 356899200
 Memtable Columns Count: 1385568
 Memtable Data Size: 359155691
 Memtable Switch Count: 26
 Read Count: 40705879
 Read Latency: 25.010 ms.
 Write Count: 9680958
 Write Latency: 0.036 ms.
 Pending Tasks: 0
 Bloom Filter False Postives: 28380
 Bloom Filter False Ratio: 0.00360
 Bloom Filter Space Used: 874173664
 Compacted row minimum size: 61
 Compacted row maximum size: 152321
 Compacted row mean size: 1445
 
 iostat shows almost no write activity, here's a typical line:
 
 Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
 sdb               0.00     0.00  312.87    0.00     6.61     0.00    43.27    23.35  105.06   2.28  71.19
 
 and nodetool tpstats always shows pending tasks in the ReadStage. The data
 set has grown beyond physical memory (250GB/node w/64GB of RAM) so I know
 disk access is required, but are there particular settings I should
 experiment with that could help relieve some read i/o pressure? I already
 put memcached in front of cassandra so the row cache probably won't help
 much.
 
 Also this column family stores smallish documents (usually 1-100K) along
 with metadata. The document is only occasionally accessed, usually only the
 metadata is read/written. Would splitting out the document into a separate
 column family help?
 
 
 Some un-expert advice:
 
 1. Consider Leveled compaction instead of Size Tiered.  LCS improves
 read performance at the cost of more writes.
 
 2. You said skinny column family which I took to mean not a lot of
 columns/row.  See if you can organize your data into wider rows which
 allow reading fewer rows and thus fewer queries/disk seeks.
 
 3. Enable compression if you haven't already.
 
 4. Splitting your data from your MetaData could definitely help.  I
 like separating my read heavy from write heavy CF's because generally
 speaking they benefit from different compaction methods.  But don't go
 crazy creating 1000's of CF's either.
 
 Hope that gives you some ideas to investigate further!
 
 
 -- 
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & 
 Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
 carpe diem quam minimum credula postero



Re: Strange row expiration behavior

2012-10-23 Thread aaron morton
 Performing these steps results in the rows still being present using 
 cassandra-cli list. 
I assume you are saying the row key is listed without any columns. aka a ghost 
row. 

  What gets really odd is if I add these steps it works
That's working as designed. 

gc_grace_seconds does not specify when tombstones must be purged, rather it 
specifies the minimum duration the tombstone must be stored. It's really saying 
if you compact this column X seconds after the delete you can purge the 
tombstone.

Minor / automatic compaction will kick in if there are (by default) 4 SSTables 
of the same size, and will only purge tombstones if all fragments of the row 
exist in the SSTables being compacted. 

Major / manual compaction compacts all the sstables, and so purges the 
tombstones IF gc_grace_seconds has expired. 

In your first example compaction had not run so the tombstones stayed on disk. 
In the second the major compaction purged expired tombstones. 

Hope that helps. 
  
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com wrote:

 Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm 
 trying to test removing rows after all columns have expired. I've read the 
 following:
 http://wiki.apache.org/cassandra/DistributedDeletes
 http://wiki.apache.org/cassandra/MemtableSSTable
 https://issues.apache.org/jira/browse/CASSANDRA-2795
 
 And came up with a test to demonstrate the empty row removal that does the 
 following:
 1. create a keyspace
 2. create a column family with gc_seconds=10 (arbitrary small number)
 3. insert a couple rows with ttl=5 (again, just a small number)
 4. use nodetool to flush the column family
 5. sleep 10 seconds
 6. ensure the columns are removed with cassandra-cli list 
 7. use nodetool to compact the keyspace
 Performing these steps results in the rows still being present using 
 cassandra-cli list. What gets really odd is if I add these steps it works:
 1. sleep 5 seconds
 2. use cassandra-cli to del mycf[arow]
 3. use nodetool to flush the column family
 4. use nodetool to compact the keyspace
 I don't understand why the first set of steps (1-7) don't work to remove the 
 empty row, nor do I understand why the explicit row delete somehow makes this 
 work. I have all this in a script that I could attach if that's appropriate. 
 Is there something wrong with the steps that I have?
 
 Thanks,
 Stephen



Re: nodetool cleanup

2012-10-23 Thread aaron morton
 what is the internal memory model used? It sounds like it doesn't have a page 
 manager?
Nodetool cleanup is a maintenance process to remove data on disk that the node 
is no longer a replica for. It is typically used after the token assignments 
have been changed. 
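
For example (host and keyspace are placeholders; run it on each node whose ranges shrank):

    # rewrites the node's SSTables, dropping keys the node no longer owns
    nodetool -h 10.0.0.1 cleanup MyKeyspace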

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 6:42 PM, Will @ SOHO w...@voodoolunchbox.com wrote:

 On 10/23/2012 01:25 AM, Peter Schuller wrote:
 
 On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote:
 
  does nodetool cleanup perform a major compaction in the process of
  removing unwanted data?
 
 No.
 what is the internal memory model used? It sounds like it doesn't have a page 
 manager?



Re: How to change the seed node Cassandra 1.0.11

2012-10-23 Thread aaron morton
Just change the yaml and restart. The seed list is not persisted in the System 
KS (as the token assignment is). 

I would suggest running 2 or 3 seeds in your cluster, even if you only have 3 
nodes. 
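
A rough sketch of the whole procedure, assuming the default SimpleSeedProvider and
made-up addresses (10.0.0.1 is the old seed being retired, 10.0.0.2/10.0.0.3 are the
remaining nodes); adjust paths and service commands to your install:

    # 1. on each remaining node, edit cassandra.yaml so the seed list no longer
    #    contains the node you want to retire, e.g.:
    #      seed_provider:
    #          - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    #            parameters:
    #                - seeds: "10.0.0.2,10.0.0.3"
    # 2. restart the nodes one at a time so the new seed list takes effect
    sudo service cassandra restart

    # 3. once nothing references the old node as a seed, retire it from the ring;
    #    run this on the node being removed (it streams its data to the new owners)
    nodetool -h 10.0.0.1 decommission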

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 7:13 PM, Roshan codeva...@gmail.com wrote:

 Hi
 
 In our production, we have 3 Cassandra 1.0.11 nodes.
 
 For a particular reason, I want to move the seed role to another node, and
 once the seed has been changed, remove the previous node from the cluster.
 
 How can I do that?
 
 Thanks. 
 
 
 
 



Re: Node Dead/Up

2012-10-23 Thread aaron morton
 Check 10.50.10.21 to see what the system load is.
+1

And take a look in the logs on 10.21. 

10.21 is being seen as down by the other nodes. It could be:

* 10.21 failing to gossip fast enough, say because it is overloaded or in long 
ParNew GC pauses. 
* This node failing to process gossip fast enough, say because it is overloaded or 
in long ParNew GC pauses. 
* Problems with the tubes used to connect the nodes. 

(It's probably the first one.)
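
A couple of quick things to check on the node being marked down (the log path varies by install):

    # look for pending/blocked stages and dropped messages
    nodetool -h 10.50.10.21 tpstats
    # long ParNew/CMS pauses show up as GCInspector lines in the Cassandra log
    grep GCInspector /var/log/cassandra/system.log | tail -n 20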
 
Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 8:19 PM, Jason Wee peich...@gmail.com wrote:

 Check 10.50.10.21 to see what the system load is.
 
 On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote:
 Hello,
 
 I'm on version 1.0.11.
 
 I'm seeing this in my system log with occasional frequency:
 
 INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818)
 InetAddress /10.50.10.21 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804)
 InetAddress /10.50.10.21 is now UP
 
 
 INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java
 (line 228) Streaming to /10.50.10.25 --this line included for context
 INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818)
 InetAddress /10.50.10.25 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804)
 InetAddress /10.50.10.25 is now UP
 INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249
 AntiEntropyService.java (line 233) [repair
 #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree
 to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included
 for context
 
 What is this telling me? Is my network dropping for less than a
 second? Are my nodes really dead and then up? Can someone shed some
 light on this for me?
 
 cheers,
 Jason
 



Re: Strange row expiration behavior

2012-10-23 Thread Stephen Mullins
Thanks Aaron, my reply is inline below:

On Tue, Oct 23, 2012 at 2:38 AM, aaron morton aa...@thelastpickle.comwrote:

 Performing these steps results in the rows still being present using 
 *cassandra-cli
 list*.

 I assume you are saying the row key is listed without any columns. aka a
 ghost row.

Correct.


  What gets really odd is if I add these steps it works

 That's working as designed.

 gc_grace_seconds does not specify when tombstones must be purged, rather
 it specifies the minimum duration the tombstone must be stored. It's really
 saying if you compact this column X seconds after the delete you can purge
 the tombstone.

 Minor / automatic compaction will kick in if there are (by default) 4
 SSTables of the same size, and will only purge tombstones if all fragments
 of the row exist in the SSTables being compacted.

 Major / manual compaction compacts all the sstables, and so purges the
 tombstones IF gc_grace_seconds has expired.

 In your first example compaction had not run so the tombstones stayed on
 disk. In the second the major compaction purged expired tombstones.

In the first example, I am running compaction at step 7 through nodetool,
after gc_grace_seconds has expired. Additionally, if I do not perform the
manual delete of the row in the second example, the ghost rows are not
cleaned up. I want to know that in our production environment, I don't have
to manually delete empty rows after the columns expire. But I can't get an
example working to that effect.


 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com
 wrote:

 Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm
 trying to test removing rows after all columns have expired. I've read the
 following:
 http://wiki.apache.org/cassandra/DistributedDeletes
 http://wiki.apache.org/cassandra/MemtableSSTable
 https://issues.apache.org/jira/browse/CASSANDRA-2795

 And came up with a test to demonstrate the empty row removal that does the
 following:

1. create a keyspace
2. create a column family with gc_seconds=10 (arbitrary small number)
3. insert a couple rows with ttl=5 (again, just a small number)
4. use nodetool to flush the column family
5. sleep 10 seconds
6. ensure the columns are removed with *cassandra-cli list *
7. use nodetool to compact the keyspace

 Performing these steps results in the rows still being present using 
 *cassandra-cli
 list*. What gets really odd is if I add these steps it works:

1. sleep 5 seconds
2. use cassandra-cli to *del mycf[arow]*
3. use nodetool to flush the column family
4. use nodetool to compact the keyspace

 I don't understand why the first set of steps (1-7) don't work to remove
 the empty row, nor do I understand why the explicit row delete somehow
 makes this work. I have all this in a script that I could attach if that's
 appropriate. Is there something wrong with the steps that I have?

 Thanks,
 Stephen





Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
These GC settings are the default (recommended?) settings from
cassandra-env.  I added the UseCompressedOops.

-Bryan


On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO w...@voodoolunchbox.comwrote:

  On 10/22/2012 09:05 PM, aaron morton wrote:

  # GC tuning options
 JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
 JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
 JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
 JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
 JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
 JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
 JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
 JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

  You are too far behind the reference JVMs. Parallel GC is the preferred
 and highest-performing form in the current Security Baseline version of the
 JVMs.




-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


Re: nodetool cleanup

2012-10-23 Thread B. Todd Burruss
Since SSTables are immutable, it must create new SSTables without the
data that the node is no longer a replica for ... but it doesn't
remove deleted data.  Seems like a possible optimization to also
remove deleted data and clean up tombstones ... but I guess cleanup
shouldn't really be used that much.

thx

On Tue, Oct 23, 2012 at 12:40 AM, aaron morton aa...@thelastpickle.com wrote:
 what is the internal memory model used? It sounds like it doesn't have a
 page manager?

 Nodetool cleanup is a maintenance process to remove data on disk that the
 node is no longer a replica for. It is typically used after the token
 assignments have been changed.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/10/2012, at 6:42 PM, Will @ SOHO w...@voodoolunchbox.com wrote:

 On 10/23/2012 01:25 AM, Peter Schuller wrote:


 On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote:

 does nodetool cleanup perform a major compaction in the process of
 removing unwanted data?

 No.

 what is the internal memory model used? It sounds like it doesn't have a
 page manager?




Re: What does ReadRepair exactly do?

2012-10-23 Thread Shankaranarayanan P N
Hello,

This conversation precisely targets a question that I had been having for a
while - would be grateful if someone could clarify it a little further:

Considering the case of a repair created due to a consistency constraint
(first case in the discussion above), would the following interpretation be
correct ?

1. A digest mismatch exception is raised if even one among the many
responses differs (even if consistency is met on an out-of-date value, say by
virtue of timestamps).
2. A read is initiated by the callback to fetch data from all replicas
3. Resolve() is invoked to find the deltas for each replica that was out of
date.
4. ReadRepair is scheduled to the above replicas.
5. Perform a normal read and check if this meets the consistency
constraints. Mismatches would trigger a repair again.

Assuming the above is true, would the mutations in step 4 and the read in
step 5 happen in parallel ? In other words, would the time taken by the
read correction be the round trip between the coordinator and its farthest
replica that meets the consistency constraint.

Thanks,
Shankar


On Tue, Oct 23, 2012 at 3:17 AM, aaron morton aa...@thelastpickle.comwrote:

 Yes, all this starts because of the call to filter.collateColumns()…

 The ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer;
 the methods to add columns on that interface pass through to an
 implementation of ISortedColumns.

 The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will
 call reconcile() on the IColumn if they need to.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/10/2012, at 4:45 AM, Manu Zhang owenzhang1...@gmail.com wrote:

 Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE)
 and then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what
 happens after that. How exactly does reconcile() get called?

 On Mon, Oct 22, 2012 at 6:49 AM, aaron morton aa...@thelastpickle.comwrote:

 There are two processes in cassandra that trigger Read Repair like
 behaviour.

 During a read, a DigestMismatchException is raised if the responses from the
 replicas do not match. In this case another read is run that involves
 reading all the data. This is the CL agreement kicking in.

 The other Read Repair is the one controlled by the
 read_repair_chance. When RR is active on a request ALL up replicas are
 involved in the read. When RR is not active only CL replicas are involved.
 The test for CL agreement occurs synchronously with the request; the RR
 check waits, asynchronously to the request, for all nodes in the request to
 return. It then checks for consistency and repairs differences.

 From looking at the source code, I do not understand how this set is
 built and I do not understand how the reconciliation is executed.

 When a DigestMismatch is detected a read is run using RepairCallback. The
 callback will call the RowRepairResolver.resolve() when enough responses
 have been collected.

 resolveSuperset() picks one response as the baseline, and then calls
 delete() to apply row level deletes from the other responses
 (ColumnFamilies). It collects the other CFs into an iterator with a filter
 that returns all columns. The columns are then applied to the baseline CF,
 which may result in reconcile() being called.

 reconcile() is used when an AbstractColumnContainer has two versions of a
 column and wants to keep only one.

 RowRepairResolver.scheduleRepairs() works out the delta for each node by
 calling ColumnFamily.diff(). The delta is then sent to the appropriate node.


 Hope that helps.


   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 19/10/2012, at 6:33 AM, Markus Klems markuskl...@gmail.com wrote:

 Hi guys,

 I am looking through the Cassandra source code in the github trunk to
 better understand how Cassandra's fault-tolerance mechanisms work. Most
 things make sense. I am also aware of the wiki and DataStax documentation.
 However, I do not understand what read repair does in detail. The method
 RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to
 do the trick of merging conflicting versions of column family replicas and
 builds the set of columns that need to be repaired. From looking at the
 source code, I do not understand how this set is built and I do not
 understand how the reconciliation is executed. ReadRepair does not seem to
 trigger a Column.reconcile() to reconcile conflicting column versions on
 different servers. Does it?

 If this is not what read repair does, then: What kind of inconsistencies
 are resolved by read repair? And: How are the inconsistencies resolved?

 Could someone give me a hint?

 Thanks so much,

 -Markus







Re: Strange row expiration behavior

2012-10-23 Thread aaron morton
 In the first example, I am running compaction at step 7 through nodetool,
Sorry missed that. 

 insert a couple rows with ttl=5 (again, just a small number)
 

ExpiringColumns are only purged if their TTL has expired AND their absolute 
(node local) expiry time occurred before the current gcBefore time. 
This may explain why the columns were not purged in the first 
compaction. 
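
As a rough worked example using the numbers from the test (and assuming gcBefore is approximately now minus gc_grace_seconds): the columns expire at about t=5s, but a compaction run at t=10s uses gcBefore of roughly 10 - 10 = 0s, and an expiry time of 5s is not before 0s, so nothing is purged. A compaction run any time after t=15s (TTL plus gc_grace_seconds) would be able to purge them.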

Can you try your first steps again, and then for the second set of steps add a 
new row, flush, and compact? The expired rows should be removed.
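
A minimal sketch of that second pass, assuming the CF name from the test script and a made-up keyspace name (adjust hosts and names to match the script):

    sleep 15    # comfortably past ttl (5s) plus gc_grace_seconds (10s)
    echo "use MyKeyspace; set mycf['newrow']['col'] = 'x';" | cassandra-cli -h localhost
    nodetool -h localhost flush MyKeyspace mycf
    nodetool -h localhost compact MyKeyspace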

 I don't have to manually delete empty rows after the columns expire. 

Rows are automatically purged when all columns are purged. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 3:05 AM, Stephen Mullins smull...@thebrighttag.com wrote:

 Thanks Aaron, my reply is inline below:
 
 On Tue, Oct 23, 2012 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote:
 Performing these steps results in the rows still being present using 
 cassandra-cli list. 
 I assume you are saying the row key is listed without any columns. aka a 
 ghost row. 
 Correct. 
 
  What gets really odd is if I add these steps it works
 That's working as designed. 
 
 gc_grace_seconds does not specify when tombstones must be purged, rather it 
 specifies the minimum duration the tombstone must be stored. It's really 
 saying if you compact this column X seconds after the delete you can purge 
 the tombstone.
 
 Minor / automatic compaction will kick in if there are (by default) 4 
 SSTables of the same size, and will only purge tombstones if all fragments of 
 the row exist in the SSTables being compacted. 
 
 Major / manual compaction compacts all the sstables, and so purges the 
 tombstones IF gc_grace_seconds has expired. 
 
 In your first example compaction had not run so the tombstones stayed on 
 disk. In the second the major compaction purged expired tombstones. 
 In the first example, I am running compaction at step 7 through nodetool, 
 after gc_grace_seconds has expired. Additionally, if I do not perform the 
 manual delete of the row in the second example, the ghost rows are not 
 cleaned up. I want to know that in our production environment, I don't have 
 to manually delete empty rows after the columns expire. But I can't get an 
 example working to that effect.
 
 Hope that helps. 
   
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com wrote:
 
 Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm 
 trying to test removing rows after all columns have expired. I've read the 
 following:
 http://wiki.apache.org/cassandra/DistributedDeletes
 http://wiki.apache.org/cassandra/MemtableSSTable
 https://issues.apache.org/jira/browse/CASSANDRA-2795
 
 And came up with a test to demonstrate the empty row removal that does the 
 following:
 1. create a keyspace
 2. create a column family with gc_seconds=10 (arbitrary small number)
 3. insert a couple rows with ttl=5 (again, just a small number)
 4. use nodetool to flush the column family
 5. sleep 10 seconds
 6. ensure the columns are removed with cassandra-cli list 
 7. use nodetool to compact the keyspace
 Performing these steps results in the rows still being present using 
 cassandra-cli list. What gets really odd is if I add these steps it works:
 1. sleep 5 seconds
 2. use cassandra-cli to del mycf[arow]
 3. use nodetool to flush the column family
 4. use nodetool to compact the keyspace
 I don't understand why the first set of steps (1-7) don't work to remove the 
 empty row, nor do I understand why the explicit row delete somehow makes 
 this work. I have all this in a script that I could attach if that's 
 appropriate. Is there something wrong with the steps that I have?
 
 Thanks,
 Stephen
 
 



Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.comwrote:

 The GC was on-going even when the nodes were not compacting or running a
 heavy application load -- even when the main app was paused constant the GC
 continued.

 If you restart a node is the onset of GC activity correlated to some event?


Yes and no.  When the nodes were generally under the
.75 occupancy threshold a weekly repair -pr job would cause them to go
over the threshold and then stay there even after the repair had completed
and there were no ongoing compactions.  It acts as though at least some
substantial amount of memory used during repair was never dereferenced once
the repair was complete.

Once one CF in particular grew larger the constant GC would start up pretty
soon (less than 90 minutes) after a node restart even without a repair.






 As a test we dropped the largest CF and the memory
 usage immediately dropped to acceptable levels and the constant GC stopped.
  So it's definitely related to data load.  memtable size is 1 GB, row cache
 is disabled and key cache is small (default).

 How many keys did the CF have per node?
 I dismissed the memory used to  hold bloom filters and index sampling.
 That memory is not considered part of the memtable size, and will end up in
 the tenured heap. It is generally only a problem with very large key counts
 per node.


I've changed the app to retain less data for that CF but I think that it
was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
are write-once, never updated, and rarely read.  There are no secondary
indexes for this particular CF.




  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
 the default bloom_filter_fp_chance defaults to 0.0

 The default should be 0.000744.

 If the chance is zero or null this code should run when a new SSTable is
 written
    // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
    logger.error("Bloom filter FP chance of zero isn't supposed to happen");

 Were the CF's migrated from an old version ?


Yes, the CFs were created in 1.0.9, then migrated to 1.0.11 and finally to
1.1.5 with an upgradesstables run at each upgrade along the way.

I could not find a way to view the current bloom_filter_fp_chance settings
when they are at a default value.  JMX reports the actual fp rate and if a
specific rate is set for a CF that shows up in describe table but I
couldn't find out how to tell what the default was.  I didn't inspect the
source.



 Is there any way to predict how much memory the bloom filters will consume
 if the size of the row keys, the number of rows, and the fp chance are
 known?


 See o.a.c.utils.BloomFilter.getFilter() in the code
 This http://hur.st/bloomfilter appears to give similar results.




Ahh, very helpful.  This indicates that 714MB would be used for the bloom
filter for that one CF.
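
For anyone wanting to sanity-check that number, a back-of-envelope using the 
standard Bloom filter sizing formula (an approximation, not Cassandra's exact 
sizing code) lands in the same place:

    # bits = -n * ln(p) / (ln 2)^2, with n = 400M keys and p = 0.000744
    awk 'BEGIN { n = 4e8; p = 0.000744;
                 bits = -n * log(p) / (log(2) ^ 2);
                 printf "%.0f MB\n", bits / 8 / 1024 / 1024 }'
    # prints roughly 715 MB, in line with the 714MB figure above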

JMX / cfstats reports Bloom Filter Space Used but the MBean method name
(getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If
on-disk and in-memory space used is similar then summing up all the Bloom
Filter Space Used says they're currently consuming 1-2 GB of the heap
which is substantial.

If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?  It
just means more trips to SSTable indexes for a read correct?  Trade RAM for
time (disk I/O).
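
Whatever value you settle on, the per-CF setting can be changed from the CLI (names here are placeholders, and this is a sketch rather than a tested recipe); note that new filters are only built when SSTables are rewritten, e.g. by compaction or upgradesstables:

    echo "use MyKeyspace; update column family mycf with bloom_filter_fp_chance = 0.1;" | cassandra-cli -h localhost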

-Bryan


Re: What does ReadRepair exactly do?

2012-10-23 Thread shankarpnsn
Hello, 

This conversation precisely targets a question that I had been having for a
while - would be grateful if someone could clarify it a little further: 

Considering the case of a repair created due to a consistency constraint
(first case in the discussion above), would the following interpretation be
correct ?

1. A digest mismatch exception is raised if even one among the many
responses differs (even if consistency is met on an out-of-date value, say by virtue
of timestamps).
2. A read is initiated by the callback to fetch data from all replicas
3. Resolve() is invoked to find the deltas for each replica that was out of
date. 
4. ReadRepair is scheduled to the above replicas. 
5. Perform a normal read and check if this meets the consistency
constraints. Mismatches would trigger a repair again. 

Assuming the above is true, would the mutations in step 4 and the read in
step 5 happen in parallel ? In other words, would the time taken by the read
correction be the round trip between the coordinator and its farthest
replica that meets the consistency constraint.  

Thanks,
Shankar





Re: Node Dead/Up

2012-10-23 Thread Jason Hill
thanks for the replies.

I'll check the load on the node that is reported as DOWN/UP. At first
glance it does not appear to be overloaded. But, I will dig in deeper;
is there a specific indicator on an Ubuntu server that would be useful
to me?

Also, I didn't make it clear, but in my original post, there are logs
from 2 different nodes: 10.21 and 10.25. They are each reporting that
the other is DOWN/UP at the same time. Would that still point me to
the suggestions you made? I don't see errors in the logs, but I do see
a lot of dropped mutations and reads. Any correlation?

thanks again,
Jason

On Tue, Oct 23, 2012 at 12:49 AM, aaron morton aa...@thelastpickle.com wrote:
 Check 10.50.10.21 to see what the system load is.

 +1

 And take a look in the logs on 10.21.

 10.21 is being seen as down by the other nodes. It could be:

 * 10.21 failing to gossip fast enough, say because it is overloaded or in long
 ParNew GC pauses.
 * This node failing to process gossip fast enough, say because it is overloaded
 or in long ParNew GC pauses.
 * Problems with the tubes used to connect the nodes.

 (It's probably the first one.)

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/10/2012, at 8:19 PM, Jason Wee peich...@gmail.com wrote:

 Check 10.50.10.21 to see what the system load is.

 On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote:

 Hello,

 I'm on version 1.0.11.

 I'm seeing this in my system log with occasional frequency:

 INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818)
 InetAddress /10.50.10.21 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804)
 InetAddress /10.50.10.21 is now UP


 INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java
 (line 228) Streaming to /10.50.10.25 --this line included for context
 INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818)
 InetAddress /10.50.10.25 is now dead.
 INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804)
 InetAddress /10.50.10.25 is now UP
 INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249
 AntiEntropyService.java (line 233) [repair
 #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree
 to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included
 for context

 What is this telling me? Is my network dropping for less than a
 second? Are my nodes really dead and then up? Can someone shed some
 light on this for me?

 cheers,
 Jason





Re: What does ReadRepair exactly do?

2012-10-23 Thread Manu Zhang
why repair again? We block until the consistency constraint is met. Then
the latest version is returned and repair is done asynchronously if any
mismatch. We may retry read if fewer columns than required are returned.

On Wed, Oct 24, 2012 at 6:10 AM, shankarpnsn shankarp...@gmail.com wrote:

 Hello,

 This conversation precisely targets a question that I had been having for a
 while - would be grateful if someone could clarify it a little further:

 Considering the case of a repair created due to a consistency constraint
 (first case in the discussion above), would the following interpretation be
 correct ?

 1. A digest mismatch exception is raised if even one among the many
 responses differs (even if consistency is met on an out-of-date value, say
 by virtue of timestamps).
 2. A read is initiated by the callback to fetch data from all replicas
 3. Resolve() is invoked to find the deltas for each replica that was out of
 date.
 4. ReadRepair is scheduled to the above replicas.
 5. Perform a normal read and check if this meets the consistency
 constraints. Mismatches would trigger a repair again.

 Assuming the above is true, would the mutations in step 4 and the read in
 step 5 happen in parallel ? In other words, would the time taken by the
 read
 correction be the round trip between the coordinator and its farthest
 replica that meets the consistency constraint.

 Thanks,
 Shankar






Re: What does ReadRepair exactly do?

2012-10-23 Thread shankarpnsn
manuzhang wrote
 why repair again? We block until the consistency constraint is met. Then
 the latest version is returned and repair is done asynchronously if any
 mismatch. We may retry read if fewer columns than required are returned.

Just to make sure I understand you correctly, consider the case when a read
repair is in flight and a subsequent write affects one or more of the
replicas that were scheduled to receive the repair mutations. In this case,
are you saying that we return the older version to the user rather than the
latest version that was affected by the write?





Re: What does ReadRepair exactly do?

2012-10-23 Thread Manu Zhang
I think so. Otherwise, we may never complete a read if writes come in
continuously.

On Wed, Oct 24, 2012 at 9:04 AM, shankarpnsn shankarp...@gmail.com wrote:

 manuzhang wrote
  why repair again? We block until the consistency constraint is met. Then
  the latest version is returned and repair is done asynchronously if any
  mismatch. We may retry read if fewer columns than required are returned.

 Just to make sure I understand you correctly, consider the case when a read
 repair is in flight and a subsequent write affects one or more of the
 replicas that were scheduled to receive the repair mutations. In this case,
 are you saying that we return the older version to the user rather than the
 latest version that was affected by the write?






Re: constant CMS GC using CPU time

2012-10-23 Thread B. Todd Burruss
Regarding memory usage after a repair ... Are the merkle trees kept around?
On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote:

 On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.comwrote:

 The GC was on-going even when the nodes were not compacting or running a
 heavy application load -- even when the main app was paused constant the GC
 continued.

 If you restart a node is the onset of GC activity correlated to some
 event?


 Yes and no.  When the nodes were generally under the
 .75 occupancy threshold a weekly repair -pr job would cause them to go
 over the threshold and then stay there even after the repair had completed
 and there were no ongoing compactions.  It acts as though at least some
 substantial amount of memory used during repair was never dereferenced once
 the repair was complete.

 Once one CF in particular grew larger the constant GC would start up
 pretty soon (less than 90 minutes) after a node restart even without a
 repair.






 As a test we dropped the largest CF and the memory
 usage immediately dropped to acceptable levels and the constant GC stopped.
  So it's definitely related to data load.  memtable size is 1 GB, row cache
 is disabled and key cache is small (default).

 How many keys did the CF have per node?
 I dismissed the memory used to  hold bloom filters and index sampling.
 That memory is not considered part of the memtable size, and will end up in
 the tenured heap. It is generally only a problem with very large key counts
 per node.


 I've changed the app to retain less data for that CF but I think that it
 was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
 are write-once, never updated, and rarely read.  There are no secondary
 indexes for this particular CF.




  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
 the default bloom_filter_fp_chance defaults to 0.0

 The default should be 0.000744.

 If the chance is zero or null this code should run when a new SSTable is
 written
     // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
     logger.error("Bloom filter FP chance of zero isn't supposed to happen");

 Were the CF's migrated from an old version ?


  Yes, the CFs were created in 1.0.9, then migrated to 1.0.11 and finally to
  1.1.5 with an upgradesstables run at each upgrade along the way.

 I could not find a way to view the current bloom_filter_fp_chance settings
 when they are at a default value.  JMX reports the actual fp rate and if a
 specific rate is set for a CF that shows up in describe table but I
 couldn't find out how to tell what the default was.  I didn't inspect the
 source.



  Is there any way to predict how much memory the bloom filters will
  consume if the size of the row keys, the number of rows, and the fp chance
  are known?


 See o.a.c.utils.BloomFilter.getFilter() in the code
 This http://hur.st/bloomfilter appears to give similar results.




 Ahh, very helpful.  This indicates that 714MB would be used for the bloom
 filter for that one CF.

 JMX / cfstats reports Bloom Filter Space Used but the MBean method name
 (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If
 on-disk and in-memory space used is similar then summing up all the Bloom
 Filter Space Used says they're currently consuming 1-2 GB of the heap
 which is substantial.

 If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?
  It just means more trips to SSTable indexes for a read correct?  Trade RAM
 for time (disk I/O).

 -Bryan