Re: OOM while reading key cache

2013-11-14 Thread olek.stas...@gmail.com
Yes, as I wrote in my first e-mail: when I removed the key cache file,
Cassandra started without further problems.
regards
Olek

2013/11/13 Robert Coli rc...@eventbrite.com:

 On Wed, Nov 13, 2013 at 12:35 AM, Tom van den Berge t...@drillster.com
 wrote:

 I'm having the same problem, after upgrading from 1.2.3 to 1.2.10.

 I can remember this was a bug that was solved in the 1.0 or 1.1 version
 some time ago, but apparently it has come back.
 A workaround is to delete the contents of the saved_caches directory
 before starting up.


 Yours is not the first report of this I've heard resulting from a 1.2.x to
 1.2.x upgrade. Reports are of the form "I had to nuke my saved_caches or I
 couldn't start my node, it OOMed", etc.

 https://issues.apache.org/jira/browse/CASSANDRA-6325

 Exists, but doesn't seem to be the same issue.

 https://issues.apache.org/jira/browse/CASSANDRA-5986

 Similar, but doesn't seem to be an issue triggered by upgrade.

 If I were one of the posters on this thread, I would strongly consider
 filing a JIRA on point.

 @OP (olek) : did removing the saved_caches also fix your problem?

 =Rob
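
For anyone hitting the same startup OOM, here is a minimal sketch of the
workaround described above (it assumes the packaged default cache location;
verify saved_caches_directory in cassandra.yaml before deleting anything):

# Stop the node, clear the saved caches, then start it again.
# Path is the packaged default; adjust to your saved_caches_directory.
sudo service cassandra stop
rm -f /var/lib/cassandra/saved_caches/*
sudo service cassandra start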




Cassandra is holding too many deleted file descriptors

2013-11-14 Thread Murthy Chelankuri
I see lots of these deleted file descriptors being held by Cassandra; in my
case, out of 90K file descriptors, 80.5K are in this state.

Because of this, Cassandra is not performing well.

Can someone please tell me what I am doing wrong?


lr-x------ 1 root root 64 Nov 14 08:25 10875 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10876 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10877 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-133-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10878 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-124-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10879 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:11 1088 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10880 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-133-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10881 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10882 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-124-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10883 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10884 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-133-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10885 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10886 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-124-Data.db
(deleted)
lr-x------ 1 root root 64 Nov 14 08:25 10887 ->
/var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
(deleted)
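
One way to quantify this is to count the deleted-but-still-open descriptors
directly; a sketch, assuming lsof is available and a single Cassandra process:

# Count file descriptors held by the Cassandra JVM, and how many of
# them point at files that have already been deleted.
pid=$(pgrep -f CassandraDaemon)
echo "total open fds: $(ls /proc/$pid/fd | wc -l)"
echo "deleted fds: $(lsof -p $pid 2>/dev/null | grep -c '(deleted)')"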


Re: OOM while reading key cache

2013-11-14 Thread Fabien Rousseau
A few months ago, we had a similar issue on 1.2.6:
https://issues.apache.org/jira/browse/CASSANDRA-5706

But it has been fixed, and we have not encountered this issue since (we're
also on 1.2.10).


2013/11/14 olek.stas...@gmail.com olek.stas...@gmail.com

 Yes, as I wrote in my first e-mail: when I removed the key cache file,
 Cassandra started without further problems.
 regards
 Olek

 2013/11/13 Robert Coli rc...@eventbrite.com:
 
  On Wed, Nov 13, 2013 at 12:35 AM, Tom van den Berge t...@drillster.com
  wrote:
 
  I'm having the same problem, after upgrading from 1.2.3 to 1.2.10.
 
  I can remember this was a bug that was solved in the 1.0 or 1.1 version
  some time ago, but apparently it has come back.
  A workaround is to delete the contents of the saved_caches directory
  before starting up.
 
 
  Yours is not the first report of this I've heard resulting from a 1.2.x to
  1.2.x upgrade. Reports are of the form "I had to nuke my saved_caches or I
  couldn't start my node, it OOMed", etc.
 
  https://issues.apache.org/jira/browse/CASSANDRA-6325
 
  Exists, but doesn't seem to be the same issue.
 
  https://issues.apache.org/jira/browse/CASSANDRA-5986
 
  Similar, but doesn't seem to be an issue triggered by upgrade.
 
  If I were one of the posters on this thread, I would strongly consider
  filing a JIRA on point.
 
  @OP (olek) : did removing the saved_caches also fix your problem?
 
  =Rob
 
 




-- 
Fabien Rousseau


 aur...@yakaz.com
 www.yakaz.com


Re: Cassandra is holding too many deleted file descriptors

2013-11-14 Thread Marcus Eriksson
yeah this is known, and we are looking for a fix

https://issues.apache.org/jira/browse/CASSANDRA-6275

if you have a simple way of reproducing, please add a comment


On Thu, Nov 14, 2013 at 10:53 AM, Murthy Chelankuri kmurt...@gmail.com wrote:

 I see lots of these deleted file descriptors being held by Cassandra; in my
 case, out of 90K file descriptors, 80.5K are in this state.

 Because of this, Cassandra is not performing well.

 Can someone please tell me what I am doing wrong?


 lr-x------ 1 root root 64 Nov 14 08:25 10875 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10876 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10877 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-133-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10878 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-124-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10879 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:11 1088 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10880 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-133-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10881 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10882 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-124-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10883 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10884 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-133-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10885 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-110-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10886 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-124-Data.db
 (deleted)
 lr-x------ 1 root root 64 Nov 14 08:25 10887 ->
 /var/lib/cassandra/data/tests/sample_data/sample_data_points-jb-119-Data.db
 (deleted)







Re: Modeling multi-tenanted Cassandra schema

2013-11-14 Thread Ben Hood
OK, so in the end I elected to go for option (c), which makes my table
definition look like this:

create table tenanted_foo_table (
    tenant ascii,
    application_key bigint,
    timestamp timestamp,
    -- other non-key columns ...
    PRIMARY KEY ((tenant, application_key), timestamp)
);

such that on disk the row keys are effectively tenant:application_key
concatenations.
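
For illustration, a sketch of how reads look against this layout (the table is
from above; the tenant and key values are made up): every query must bind the
full partition key, i.e. both the tenant and the application key, optionally
with a range on the clustering timestamp:

# Hypothetical query; with a composite partition key, tenant and
# application_key must both be supplied.
echo "SELECT * FROM tenanted_foo_table \
  WHERE tenant = 'acme' AND application_key = 42 \
  AND timestamp >= '2013-11-01' AND timestamp < '2013-11-15';" | cqlsh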

Thanks for your input,

Ben

On Wed, Nov 13, 2013 at 2:43 PM, Nate McCall n...@thelastpickle.com wrote:
 Astyanax and/or the DS Java client, depending on your use case. (Emphasis on
 the "and" - really no reason you can't use both - even on the same schema -
 depending on what you are doing, as they both have their strengths and
 weaknesses.)

 To be clear, Hector is not going away. We are still accepting patches and
 updates, but there is no active feature development.

 Any other hector specific questions, please start a thread over on
 hector-us...@googlegroups.com


 On Wed, Nov 13, 2013 at 8:35 AM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

 Nate,

 (slightly OT), what client API/library is recommended now that Hector is
 sunsetting? Thanks.

 Regards,
 Shahab


 On Wed, Nov 13, 2013 at 9:28 AM, Nate McCall n...@thelastpickle.com
 wrote:

  You basically want option (c). Option (d) might work, but you would be
  bending the paradigm a bit, IMO. Certainly do not use dedicated column
  families or keyspaces per tenant. That never works. The list history will
  show that with a few Google searches, and we've seen it fail badly with
  several clients.

  Overall, option (c) would be difficult to do in CQL without some very
  well thought out abstractions and/or a deep hack on the Java driver (not
  inelegant or impossible, just lots of moving parts to get your head around
  if you are new to such). That said, depending on the size of your project
  and skill of your team, this direction might be worth considering.

 Usergrid (just accepted for incubation at Apache) functions this way via
 the Thrift API: https://github.com/apigee/usergrid-stack

  The commercial version of Usergrid has tens of thousands of active
  tenants on a single cluster (same code base at the service layer as the
  open source version). It uses Hector's built-in virtual keyspaces:
  https://github.com/hector-client/hector/wiki/Virtual-Keyspaces (NOTE: though
  Hector is sunsetting/in patch maintenance, the approach is certainly
  legitimate - but I'd recommend you *not* start a new project on Hector).

 In short, Usergrid is the only project I know of that has a well-proven
 tenant model that functions at scale, though I'm sure there are others
 around, just not open sourced or actually running large deployments.

 Astyanax can do this as well albeit with a little more work required:

 https://github.com/Netflix/astyanax/wiki/Composite-columns#how-to-use-the-prefixedserializer-but-you-really-should-use-composite-columns

 Happy to clarify any of the above.


 On Tue, Nov 12, 2013 at 3:19 AM, Ben Hood 0x6e6...@gmail.com wrote:

 Hi,

 I've just received a requirement to make a Cassandra app
 multi-tenanted, where we'll have up to 100 tenants.

 Most of the tables are timestamped wide row tables with a natural
 application key for the partitioning key and a timestamp key as a
 cluster key.

 So I was considering the options:

 (a) Add a tenant column to each table and stick a secondary index on
 that column;
 (b) Add a tenant column to each table and maintain index tables that
 use the tenant id as a partitioning key;
  (c) Decompose the partitioning key of each table and add the tenant
  as the leading component of the key;
 (d) Add the tenant as a separate clustering key;
 (e) Replicate the schema in separate tenant specific key spaces;
 (f) Something I may have missed;

 Option (a) seems the easiest, but I'm wary of just adding secondary
 indexes without thinking about it.

 Option (b) seems to have the least impact of the layout of the
 storage, but a cost of maintaining each index table, both code wise
 and in terms of performance.

 Option (c) seems quite straight forward, but I feel it might have a
 significant effect on the distribution of the rows, if the cardinality
 of the tenants is low.

 Option (d) seems simple enough, but it would mean that you couldn't
 query for a range of tenants without supplying a range of natural
 application keys, through which you would need to iterate (under the
 assumption that you don't use an ordered partitioner).

 Option (e) appears relatively straight forward, but it does mean that
 the application CQL client needs to maintain separate cluster
 connections for each tenant. Also I'm not sure to what extent key
 spaces were designed to partition identically structured data.

 Does anybody have any experience with running a multi-tenanted
 Cassandra app, or does this just depend too much on the specifics of
 the application?

 Cheers,

 Ben




 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder & Sr. Technical 

Hints still exist for a removed node

2013-11-14 Thread Cyril Scetbon
Hi,

I saw on http://www.datastax.com/dev/blog/modern-hinted-handoff (written in
December 2012) that hints targeting a removed node (our case) are automatically
removed. However, a compaction has been done on our cf and hints for the
removed node are still stored. We're using version 1.2.2 (February 2013). Do
you mean by "automatically" that they will be removed after a period of time but
not after a compaction? I see a TTL of 10 days added to each row in the hints
data file.
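
A quick way to see what is still sitting in the hints CF, sketched against the
1.2-era system.hints schema (target_id is the host ID of the hint's target;
compare against the host IDs in nodetool gossipinfo):

# Sample the target host IDs of stored hints; LIMIT keeps it cheap.
echo "SELECT target_id FROM system.hints LIMIT 100;" | cqlsh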

Another question is about "Finished hinted handoff of 0 rows to endpoint" info
messages. The CASSANDRA-5068 patch included in our version is supposed to fix a
bad behaviour which was the cause of similar messages. We don't have hints
stored for the endpoints concerned by these messages, but they appear in our
log files. I don't know if it's related, but I have a compaction of hints at the
same time: http://pastebin.com/71nw2Uqh . Can anyone explain to us what's
happening, and whether it's expected behaviour?

thanks
-- 
Cyril SCETBON



Risk of not doing repair

2013-11-14 Thread olek.stas...@gmail.com
Hello,
I'm facing bug https://issues.apache.org/jira/browse/CASSANDRA-6277.
After migration to 2.0.2 I can't perform repair on my cluster (six
nodes). Repair on the biggest CF breaks with the error described in the Jira.
I know that a fix is probably in the repository, but it's not
included in any release. I estimate that 2.0.3 with this fix will
be released in December. If it's not really necessary, I would avoid
building an unstable version of Cassandra from sources and installing it in a
prod environment; I would rather use an rpm-based distribution to keep the
system in a consistent state.
So this is my question: what is the risk of not
doing repair for a month, assuming that gc_grace is 10 days? Should I
really worry? Or maybe I should use the repo version of Cassandra?
best regards
Olek


Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-14 Thread J. Ryan Earl
First off, I'm curious what hardware (system specs) you're running this on?

Secondly, here are some observations:
* You're not running the newest JDK7; I can tell by your stack size.
 Consider getting the newest.

* Cassandra 2.0.2 has a lot of improvements, consider upgrading.  We
noticed improved heap usage compared to 2.0.0.

* Have you simply tried decreasing the size of your row cache?  Tried 256MB?

* Do you have JNA installed?  Otherwise, you're not getting off-heap usage
for these caches, which seems likely.  Check your cassandra.log to verify
JNA operation.

* Your NewGen is too small.  See your heap peaks?  This is because
short-lived memory is being put into OldGen, which only gets cleaned up
during full GC.  You should set your NewGen to about 25-30% of your total
heap size.  Many objects are short-lived, and CMS GC is significantly more
efficient if the shorter-lived objects never get promoted to OldGen; you'll
get more concurrent, non-blocking GC.  If you're not using JNA (per above),
row-cache and key-cache are still on-heap, so you want your NewGen to be >=
twice as large as the size of these combined caches.  You should never see
those crazy heap spikes; your caches are essentially overflowing into
OldGen (without JNA).
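
To make the sizing advice concrete, a sketch against the 10G heap shown in the
quoted message below (the -Xmn value simply reflects the 25-30% guideline
above; the exact number is illustrative, not a recommendation):

# Hypothetical sizing for a 10G heap, e.g. in cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -Xms10G -Xmx10G -Xmn2560M"   # NewGen ~25% of heap
# Verify JNA actually loaded, since without it the caches stay on-heap:
grep -i jna /var/log/cassandra/system.log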



On Tue, Nov 5, 2013 at 3:04 AM, Jiri Horky ho...@avast.com wrote:

 Hi there,

 we are seeing extensive memory allocation leading to quite long and
 frequent GC pauses when using row cache. This is on a Cassandra 2.0.0
 cluster with the JNA 4.0 library, with the following settings:

 key_cache_size_in_mb: 300
 key_cache_save_period: 14400
 row_cache_size_in_mb: 1024
 row_cache_save_period: 14400
 commitlog_sync: periodic
 commitlog_sync_period_in_ms: 1
 commitlog_segment_size_in_mb: 32

 -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G
 -Xmn1024M -XX:+HeapDumpOnOutOfMemoryError

 -XX:HeapDumpPath=/data2/cassandra-work/instance-1/cassandra-1383566283-pid1893.hprof
 -Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark

 We have disabled row cache on one node to see the difference. Please
 see the attached plots from VisualVM; I think the effect is quite
 visible. I have also taken 10x jmap -histo after 5s on an affected
 server and plotted the result, attached as well.

 I have taken a dump of the application when the heap size was 10GB, most
 of the memory was unreachable, which was expected. The majority was used
 by 55-59M objects of HeapByteBuffer, byte[] and
 org.apache.cassandra.db.Column classes. I also include a list of inbound
 references to the HeapByteBuffer objects from which it should be visible
 where they are being allocated. This was acquired using Eclipse MAT.

 Here is the comparison of GC times when row cache enabled and disabled:

 prg01 - row cache enabled
   - uptime 20h45m
   - ConcurrentMarkSweep - 11494686ms
   - ParNew - 14690885 ms
   - time spent in GC: 35%
 prg02 - row cache disabled
   - uptime 23h45m
   - ConcurrentMarkSweep - 251ms
   - ParNew - 230791 ms
   - time spent in GC: 0.27%

 I would be grateful for any hints. Please let me know if you need any
 further information. For now, we are going to disable the row cache.

 Regards
 Jiri Horky



db file missing error

2013-11-14 Thread Langston, Jim
Hi all,

When I run nodetool repair, I'm getting an error that indicates
that several of the Data.db files are missing. Is there a way to
correct this error? The files that the error message is referencing
are indeed missing, I'm not sure why it is looking for them to begin
with. AFAIK nothing has been deleted, but there are several apps
that run against Cass.

Caused by: java.io.FileNotFoundException: 
/raid0/cassandra/data/OTester/OTester_one/OTester-OTester_one-ic-46-Data.db (No 
such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at 
org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:67)
at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:75)
at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:42)
... 20 more


Thanks,

Jim


Re: db file missing error

2013-11-14 Thread Langston, Jim
Found it, had a second repair running which was generating the
error.

Jim

From: Langston, Jim jim.langs...@compuware.com
Reply-To: user@cassandra.apache.org
Date: Thu, 14 Nov 2013 18:34:19 +
To: user@cassandra.apache.org
Subject: db file missing error

Hi all,

When I run nodetool repair, I'm getting an error that indicates
that several of the Data.db files are missing. Is there a way to
correct this error? The files that the error message is referencing
are indeed missing, I'm not sure why it is looking for them to begin
with. AFAIK nothing has been deleted, but there are several apps
that run against Cass.

Caused by: java.io.FileNotFoundException: 
/raid0/cassandra/data/OTester/OTester_one/OTester-OTester_one-ic-46-Data.db (No 
such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at 
org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:67)
at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:75)
at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:42)
... 20 more


Thanks,

Jim
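
For anyone else hitting this, a quick sketch for spotting an already-running
repair before launching another one (repair work shows up in the AntiEntropy
thread pools and as validation compactions):

# Check for in-flight repair activity before starting a new repair.
nodetool tpstats | grep -i antientropy
nodetool compactionstats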


Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-14 Thread Robert Coli
On Thu, Nov 14, 2013 at 10:05 AM, J. Ryan Earl o...@jryanearl.us wrote:

 * Cassandra 2.0.2 has a lot of improvements, consider upgrading.  We
 noticed improved heap usage compared to 2.0.0.


https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

And especially if you're using Leveled Compaction / LCS:

https://issues.apache.org/jira/browse/CASSANDRA-6284
Wrong tracking of minLevel in Leveled Compaction Strategy causing serious
performance problems

tl;dr - don't upgrade to 2.0.2 in production.

=Rob


Re: Risk of not doing repair

2013-11-14 Thread Robert Coli
On Thu, Nov 14, 2013 at 6:25 AM, olek.stas...@gmail.com 
olek.stas...@gmail.com wrote:

 After migration to 2.0.2 I can't perform repair on my cluster (six
 nodes).

...

  If it's not really necessary, I would avoid
 building an unstable version of Cassandra from sources and installing it in a
 prod environment


You've already installed an unstable version of cassandra in prod; moving
up to an unreleased version is unlikely to make things that much less
stable.


 So this is my question: what is the risk of not
 doing repair for a month, assuming that gc_grace is 10 days? Should I
 really worry? Or maybe I should use the repo version of Cassandra?


Do you do delete or CQL3-delete like operations?

If so, you have a risk of exposure to zombie data.

You should probably increase your gc_grace_seconds to 34 days anyway, so
why not use this experience as an opportunity to do so?

https://issues.apache.org/jira/browse/CASSANDRA-5850
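
As a sketch of that change (keyspace and table names are placeholders; 34 days
is 34 * 86400 = 2,937,600 seconds, applied per column family):

# Hypothetical example: raise gc_grace_seconds to 34 days on one table.
echo "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 2937600;" | cqlsh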

=Rob


Re: Hints still exist for a removed node

2013-11-14 Thread Robert Coli
On Thu, Nov 14, 2013 at 6:08 AM, Cyril Scetbon cyril.scet...@free.fr wrote:

 I saw on http://www.datastax.com/dev/blog/modern-hinted-handoff (written in
 December 2012) that hints targeting a removed node (our case) are
 automatically removed. However, a compaction has been done on our cf and
 hints for the removed node are still stored. We're using version 1.2.2
 (February 2013). Do you mean by "automatically" that they will be removed
 after a period of time but not after a compaction? I see a TTL of 10 days
 added to each row in the hints data file.


gc_grace_seconds


 We're using version 1.2.2 (February 2013).


1.2.2 contains serious bugs, upgrade ASAP.

Finished hinted handoff of 0 rows to endpoint


Doesn't have any meaningful impact, is probably fixed upstream.

=Rob


Read inconsistency after backup and restore to different cluster

2013-11-14 Thread David Laube
Hi All,

After running through our backup and restore process FROM our test production 
TO our staging environment, we are seeing inconsistent reads from the cluster 
we restored to. We have the same number of nodes in both clusters. For example, 
we will select data from a column family on the newly restored cluster but 
sometimes the expected data is returned and other times it is not. These 
selects are carried out one after another with very little delay. It is almost 
as if the data only exists on some of the nodes, or perhaps the token ranges 
are dramatically different --again, we are using vnodes so I am not exactly 
sure how this plays into the equation.

We are running Cassandra 2.0.2 with vnodes and deploying via chef. The backup 
and restore process is currently orchestrated using bash scripts and chef's 
distributed SSH. I have outlined the process below for review. 


(I) Backup cluster-A (with existing prod data):
1. Run nodetool flush on each of the nodes in a 5 node ring.
2. Run nodetool snapshot keyspace_name on each of the nodes in a 5 node ring.
3. Archive the snapshot data from the snapshots directory in each node, 
creating a single archive of the snapshot.
4. Copy the snapshot data archive for each of the nodes to s3.
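
A condensed sketch of steps 1-3 as shell (the snapshot tag is illustrative and
the paths assume the default data directory layout):

# Per node: flush, snapshot with a tag, then archive the snapshot dirs.
nodetool flush keyspace_name
nodetool snapshot keyspace_name -t backup_tag
tar czf snapshot_backup_tag.tar.gz \
    /var/lib/cassandra/data/keyspace_name/*/snapshots/backup_tag
# Step 4: copy the resulting archive to s3 with your tool of choice.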


(II) Restore backup FROM cluster-A  TO  cluster-B:
*NOTE: cluster-B is a freshly deployed ring with no data, but a different 
cluster-name used for staging.

1. Deploy 5 nodes as part of the cluster-B ring. 
2. Create keyspace_name keyspace and column families on cluster-B.
3. Stop Cassandra on all 5 nodes in the cluster-B ring.
4. Clear commit logs on cluster-B with:  rm -f /var/lib/cassandra/commitlog/*
5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes 
in the new cluster-B ring.
6. Extract the archives to /var/lib/cassandra/data/keyspace_name ensuring that 
the column family directories and associated .db files are in place under 
/var/lib/cassandra/data/keyspace_name/columnfamily1/, etc.
7. Start Cassandra on each of the nodes in cluster-B.
8. Run nodetool repair on each of the nodes in cluster-B.


Please let me know if you see any major errors or deviation from best practices 
which could be contributing to our read inconsistencies. I'll be happy to 
answer any specific question you may have regarding our configuration. Thank 
you in advance!


Best regards,
-David Laube

Re: Read inconsistency after backup and restore to different cluster

2013-11-14 Thread Robert Coli
On Thu, Nov 14, 2013 at 12:37 PM, David Laube d...@stormpath.com wrote:

 It is almost as if the data only exists on some of the nodes, or perhaps
 the token ranges are dramatically different --again, we are using vnodes so
 I am not exactly sure how this plays into the equation.


The token ranges are dramatically different, due to vnode random token
selection from not setting initial_token, and setting num_tokens.

You can verify this by listing the tokens per physical node in nodetool
gossipinfo or (iirc) nodetool status.


 5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five
 nodes in the new cluster-B ring.


I don't understand this at all. Do you mean that you are using one source
node's data to load each of the target nodes? Or are you just saying
there's a 1:1 relationship between source snapshots and target nodes to
load into? Unless you have RF=N, using one source for 5 target nodes won't
work.

To do what I think you're attempting to do, you have basically two options.

1) don't use vnodes and do a 1:1 copy of snapshots
2) use vnodes and
   a) get a list of tokens per node from the source cluster
   b) put a comma delimited list of these in initial_token in
cassandra.yaml on target nodes
   c) probably have to un-set num_tokens (this part is unclear to me, you
will have to test..)
   d) set auto_bootstrap:false in cassandra.yaml
   e) start target nodes, they will not-bootstrap into the same ranges as
the source cluster
   f) load schema / copy data into datadir (being careful of
https://issues.apache.org/jira/browse/CASSANDRA-6245)
   g) restart node or use nodetool refresh (I'd probably restart the node
to avoid the bulk rename that refresh does) to pick up sstables
   h) remove auto_bootstrap:false from cassandra.yaml

I *believe* this *should* work, but have never tried it as I do not
currently run with vnodes. It should work because it basically makes
implicit vnode tokens explicit in the conf file. If it *does* work, I'd
greatly appreciate you sharing details of your experience with the list.
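
For steps (a)-(b), a hedged sketch of turning one source node's vnode tokens
into an initial_token line (it assumes nodetool ring prints the owning address
in the first column and the token in the last; verify against your version's
output):

# Collect all tokens owned by one source node, comma-separated, ready
# to paste into initial_token on the matching target node.
SRC_IP=10.0.0.1   # hypothetical source node address
nodetool -h "$SRC_IP" ring | awk -v ip="$SRC_IP" '$1 == ip {print $NF}' | paste -sd, -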

General reference on tasks of this nature (does not consider vnodes, but
treat vnodes as just a lot of physical nodes and it is mostly relevant) :
http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

=Rob


Re: Read inconsistency after backup and restore to different cluster

2013-11-14 Thread David Laube
Thank you for the detailed reply, Rob! I have replied to your comments in-line
below:

On Nov 14, 2013, at 1:15 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Nov 14, 2013 at 12:37 PM, David Laube d...@stormpath.com wrote:
 It is almost as if the data only exists on some of the nodes, or perhaps the 
 token ranges are dramatically different --again, we are using vnodes so I am 
 not exactly sure how this plays into the equation.
 
 The token ranges are dramatically different, due to vnode random token 
 selection from not setting initial_token, and setting num_tokens.
 
 You can verify this by listing the tokens per physical node in nodetool 
 gossipinfo or (iirc) nodetool status.
  
 5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes 
 in the new cluster-B ring.
 
  I don't understand this at all. Do you mean that you are using one source 
  node's data to load each of the target nodes? Or are you just saying 
  there's a 1:1 relationship between source snapshots and target nodes to load 
  into? Unless you have RF=N, using one source for 5 target nodes won't work.

We have configured RF=3 for the keyspace in question. Also, from a client 
perspective, we read with CL=1 and write with CL=QUORUM. Since we have 5 nodes 
total in cluster-A, we snapshot keyspace_name on each of the five nodes which 
results in a snapshot directory on each of the five nodes that we archive and 
ship off to s3. We then take the snapshot archive generated FROM 
cluster-A_node1 and copy/extract/restore TO cluster-B_node1,  then we take the 
snapshot archive FROM cluster-A_node2 and copy/extract/restore TO 
cluster-B_node2 and so on and so forth.

 
 To do what I think you're attempting to do, you have basically two options.
 
 1) don't use vnodes and do a 1:1 copy of snapshots
 2) use vnodes and
a) get a list of tokens per node from the source cluster
b) put a comma delimited list of these in initial_token in cassandra.yaml 
 on target nodes
c) probably have to un-set num_tokens (this part is unclear to me, you 
 will have to test..)
d) set auto_bootstrap:false in cassandra.yaml
e) start target nodes, they will not-bootstrap into the same ranges as the 
 source cluster
f) load schema / copy data into datadir (being careful of 
 https://issues.apache.org/jira/browse/CASSANDRA-6245)
g) restart node or use nodetool refresh (I'd probably restart the node to 
 avoid the bulk rename that refresh does) to pick up sstables
h) remove auto_bootstrap:false from cassandra.yaml

 I *believe* this *should* work, but have never tried it as I do not currently 
 run with vnodes. It should work because it basically makes implicit vnode 
 tokens explicit in the conf file. If it *does* work, I'd greatly appreciate 
 you sharing details of your experience with the list. 

I'll start with parsing out the token ranges that our vnode config ends up 
assigning in cluster-A, and doing some creative config work on the target 
cluster-B we are trying to restore to as you have suggested. Depending on what 
additional comments/recommendation you or another member of the list may have 
(if any) based on the clarification I've made above, I will absolutely report 
back my findings here.


 
 General reference on tasks of this nature (does not consider vnodes, but 
 treat vnodes as just a lot of physical nodes and it is mostly relevant) : 
 http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra
 
 =Rob



making sense of output from Eclipse Memory Analyzer tool taken from .hprof file

2013-11-14 Thread Mike Koh
I am investigating Java out-of-memory heap errors, so I created an .hprof 
file and loaded it into the Eclipse Memory Analyzer Tool, which gave some 
"Problem Suspects."


First one looks like:

One instance of org.apache.cassandra.db.ColumnFamilyStore loaded by 
sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8 occupies 984,094,664 
(11.64%) bytes. The memory is accumulated in one instance of 
org.apache.cassandra.db.DataTracker$View loaded by 
sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8.



If I click around in the verbiage, I believe I can pick out the name of 
a column family, but that is about it. Can someone explain what the above 
means in more detail and whether it is indicative of a problem?



Next one looks like:
-
•java.lang.Thread @ 0x73e1f74c8 CompactionExecutor:158 - 839,225,000 
(9.92%) bytes.

•java.lang.Thread @ 0x717f08178 MutationStage:31 - 809,909,192 (9.58%) bytes.
•java.lang.Thread @ 0x717f082c8 MutationStage:5 - 649,667,472 (7.68%) bytes.
•java.lang.Thread @ 0x717f083a8 MutationStage:21 - 498,081,544 (5.89%) bytes.
•java.lang.Thread @ 0x71b357e70 MutationStage:11 - 444,931,288 (5.26%) bytes.
--
If I click into the verbiage, the above Compaction and Mutation threads all 
seem to be referencing the same column family. Are they related? Is there 
a way to tell more specifically what is being compacted and/or mutated, 
beyond just which column family?
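
Two live-side checks may help correlate suspects like these with what the node
is actually doing (a sketch; the jmap line mirrors the 10x jmap -histo approach
mentioned in the row-cache thread above):

# What is being compacted right now, per column family:
nodetool compactionstats
# Mutation backlog by thread pool:
nodetool tpstats
# Top live heap consumers, for comparison with MAT:
jmap -histo:live $(pgrep -f CassandraDaemon) | head -25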