Re: JBOD device space allocation?

2016-02-23 Thread Marcus Eriksson
If you don't use RandomPartitioner/Murmur3Partitioner you will get the old
behavior.
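
A quick way to confirm which partitioner a cluster is actually running (and
hence which allocation path applies) is something like the following sketch;
the cassandra.yaml path is an assumption and varies by install:

# Partitioner is reported by nodetool and set in cassandra.yaml
nodetool describecluster | grep -i partitioner
grep '^partitioner:' /etc/cassandra/cassandra.yaml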

On Wed, Feb 24, 2016 at 2:47 AM, Jack Krupansky 
wrote:

> I just wanted to confirm whether my understanding of how JBOD allocates
> device space is correct or not...
>
> Pre-3.2:
> On each memtable flush Cassandra will select the directory (device) which
> has the most available space as a percentage of the total available space
> on all of the listed directories/devices. A random weighted value is used
> so it won't always pick the same directory/device with the most space, the
> goal being to balance writes for performance.
>
> As of 3.2:
> The ranges of tokens stored on the local node will be evenly distributed
> among the configured storage devices - even by token range, even if that
> may be uneven by actual partition sizes. The code presumes that each of the
> configured local storage devices has the same capacity.
>
> The relevant change in 3.2 appears to be:
> Make sure tokens don't exist in several data directories (CASSANDRA-6696)
>
> The code for the pre-3.2 model is still in 3.x - is there some other code
> path which will cause the pre-3.2 behavior even when running 3.2 or later?
>
> I see this code which seems to allow for at least some cases where the
> pre-3.2 behavior would still be invoked, but I'm not sure what user-level
> cases that might be:
>
> if (!cfs.getPartitioner().splitter().isPresent() || localRanges.isEmpty())
>   return Collections.singletonList(new FlushRunnable(lastReplayPosition.get(), txn));
>
> return createFlushRunnables(localRanges, txn);
>
> IOW, if the partitioner does not have a splitter present or the
> localRanges for the node cannot be determined. But... what exactly would a
> user do to cause that?
>
> There is no doc for this stuff - can a committer (or adventurous user!)
> confirm what is actually implemented, both pre and post 3.2? (I already
> pinged docs on this.)
>
> Or if anybody is actually using JBOD, what behavior they are seeing for
> device space utilization.
>
> Thanks!
>
> -- Jack Krupansky
>


Re: JBOD device space allocation?

2016-02-23 Thread Marcus Eriksson
It is mentioned here btw: http://www.datastax.com/dev/blog/improving-jbod

On Wed, Feb 24, 2016 at 8:14 AM, Marcus Eriksson  wrote:

> If you don't use RandomPartitioner/Murmur3Partitioner you will get the old
> behavior.
>
> On Wed, Feb 24, 2016 at 2:47 AM, Jack Krupansky 
> wrote:
>
>> I just wanted to confirm whether my understanding of how JBOD allocates
>> device space is correct or not...
>>
>> Pre-3.2:
>> On each memtable flush Cassandra will select the directory (device) which
>> has the most available space as a percentage of the total available space
>> on all of the listed directories/devices. A random weighted value is used
>> so it won't always pick the same directory/device with the most space, the
>> goal being to balance writes for performance.
>>
>> As of 3.2:
>> The ranges of tokens stored on the local node will be evenly distributed
>> among the configured storage devices - even by token range, even if that
>> may be uneven by actual partition sizes. The code presumes that each of the
>> configured local storage devices has the same capacity.
>>
>> The relevant change in 3.2 appears to be:
>> Make sure tokens don't exist in several data directories (CASSANDRA-6696)
>>
>> The code for the pre-3.2 model is still in 3.x - is there some other code
>> path which will cause the pre-3.2 behavior even when running 3.2 or later?
>>
>> I see this code which seems to allow for at least some cases where the
>> pre-3.2 behavior would still be invoked, but I'm not sure what user-level
>> cases that might be:
>>
>> if (!cfs.getPartitioner().splitter().isPresent() || localRanges.isEmpty())
>>   return Collections.singletonList(new FlushRunnable(lastReplayPosition.get(), txn));
>>
>> return createFlushRunnables(localRanges, txn);
>>
>> IOW, if the partitioner does not have a splitter present or the
>> localRanges for the node cannot be determined. But... what exactly would a
>> user do to cause that?
>>
>> There is no doc for this stuff - can a committer (or adventurous user!)
>> confirm what is actually implemented, both pre and post 3.2? (I already
>> pinged docs on this.)
>>
>> Or if anybody is actually using JBOD, what behavior they are seeing for
>> device space utilization.
>>
>> Thanks!
>>
>> -- Jack Krupansky
>>
>
>


Re: CRT

2016-02-23 Thread Chris Lohfink
Check out
http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen. You
can run it yourself to test as well.

Chris

On Tue, Feb 23, 2016 at 7:02 PM, Rakesh Kumar  wrote:

> https://www.aphyr.com/posts/294-jepsen-cassandra
>
> How much of this is still valid in ver 3.0? The above seems to have been
> written for ver 1.0.
>
> thanks.
>


Reenable data access after temporarily moving data out of data directory

2016-02-23 Thread Jason Kania
Hi,
I encountered an error in Cassandra or the latest Oracle JVM that causes the 
JVM to terminate during compaction in my situation (CASSANDRA-11200). In trying 
to work around the problem and access the data, I moved the data files, e.g. 
ma-NNN-big-Filter.db, ma-367-big-Data.db, etc., out of the data directory and 
ran some cleanup commands, which allowed the overall compactions to proceed.

Now I am wondering how I can get Cassandra to re-access the data when it is put 
back into place. Right now, a SELECT * query on the table returns no results 
even though the files are back in place.
Also, are there any tools to actually repair the data rather than copying it 
from a replica elsewhere? With the JVM error, the database JVMs are not staying 
up.

Suggestions would be appreciated.
Thanks,
Jason
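
A minimal sketch of one way to make re-placed sstables visible again, assuming
the files are back under the table's data directory with their original names
(keyspace and table names below are placeholders):

# Load newly placed sstables without restarting the node
nodetool refresh my_keyspace my_table

# For suspect sstables, an offline scrub (run with Cassandra stopped on that
# node) can sometimes salvage the readable rows
sstablescrub my_keyspace my_table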


Re: Nodes go down periodically

2016-02-23 Thread Joel Samuelsson
"Is it only one node at a time that goes down, and at widely dispersed
times?"
It is a two node cluster so both nodes consider the other node down at the
same time.

These are the times the latest few days:
INFO [GossipTasks:1] 2016-02-19 05:06:21,087 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-19 14:33:38,424 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 07:21:25,626 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-20 11:34:46,766 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 08:00:07,518 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-21 10:36:58,788 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 07:10:40,304 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 08:59:05,392 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN
INFO [GossipTasks:1] 2016-02-23 12:22:59,562 Gossiper.java (line 992)
InetAddress /x.x.x.x is now DOWN


2016-02-23 18:01 GMT+01:00 daemeon reiydelle :

> If you can, do a few (short, maybe 10m records, delete the default schema
> between executions) run of Cassandra Stress test against your production
> cluster (replication=3, force quorum to 3). Look for latency max in the 10s
> of SECONDS. If your devops team is running a monitoring tool that looks at
> the network, look for timeout/retries/errors/lost packets, etc. during the
> run (worst case you need to do netstats runs against the relevant nic e.g.
> every 10 seconds on the CassStress node, look for jumps in this count (if
> monitoring is enabled, look at the monitor's results for ALL of your nodes.
> At least one is having some issues.
>
>
> *...*
>
>
>
> *Daemeon C.M. Reiydelle*
> *USA (+1) 415.501.0198*
> *London (+44) (0) 20 8144 9872*
>
> On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky 
> wrote:
>
>> The reality of modern distributed systems is that connectivity between
>> nodes is never guaranteed and distributed software must be able to cope
>> with occasional absence of connectivity. GC and network connectivity are
>> the two issues that a lot of us are most familiar with. There may be others
>> - but most technical problems on a node would be clearly logged on that
>> node. If you see a lapse of connectivity no more than once or twice a day,
>> consider yourselves lucky.
>>
>> Is it only one node at a time that goes down, and at widely dispersed
>> times?
>>
>> How many nodes?
>>
>> -- Jack Krupansky
>>
>> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <
>> samuelsson.j...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Version is 2.0.17.
>>> Yes, these are VMs in the cloud though I'm fairly certain they are on a
>>> LAN rather than WAN. They are both in the same data centre physically. The
>>> phi_convict_threshold is set to default. I'd rather find the root cause of
>>> the problem than just hiding it by not convicting a node if it isn't
>>> responding though. If pings are <2 ms without a single ping missed in
>>> several days, I highly doubt that network is the reason for the downtime.
>>>
>>> Best regards,
>>> Joel
>>>
>>> 2016-02-23 16:39 GMT+01:00 :
>>>
 You didn’t mention version, but I saw this kind of thing very often in
 the 1.1 line. Often this is connected to network flakiness. Are these VMs?
 In the cloud? Connected over a WAN? You mention that ping seems fine. Take
 a look at the phi_convict_threshold in cassandra.yaml. You may need to
 increase it to reduce the UP/DOWN flapping behavior.





 Sean Durity



 *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
 *Sent:* Tuesday, February 23, 2016 9:41 AM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Nodes go down periodically



 Hi,



 Thanks for your reply.



 I have debug logging on and see no GC pauses that are that long. GC
 pauses are all well below 1s and 99 times out of 100 below 100ms.

 Do I need to enable GC log options to see the pauses?

 I see plenty of these lines:
 DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
 118) GC for ParNew: 24 ms for 1 collections

 as well as a few CMS GC log lines.



 Best regards,

 Joel



 2016-02-23 15:14 GMT+01:00 Hannu Kröger :

 Hi,



 Those are probably GC pauses. Memory tuning is probably needed. Check
 the parameters that you already have customised if they make sense.



 

Cassandra Data Audit

2016-02-23 Thread Charulata Sharma (charshar)
To all Cassandra experts out there,
Can you please let me know if there is any inbuilt Cassandra feature that 
allows audits on column family data?

When I change any data in a CF, I want to record that change. Probably store 
the old value as well as the changed one.
One way of doing this is to create new CFs, but I wanted to know if there is 
any standard C* feature that could be used.
Any guidance in this and implementation approaches would really help.

Thanks,
Charu
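
For the "create new CFs" approach, a rough sketch of an application-maintained
audit table (all names here are made up for illustration); the application
writes one audit row, with the old and new values, alongside every change to
the base table:

cqlsh <<'CQL'
CREATE TABLE IF NOT EXISTS myks.cf_audit (
  table_name  text,
  row_key     text,
  changed_at  timeuuid,
  column_name text,
  old_value   text,
  new_value   text,
  changed_by  text,
  PRIMARY KEY ((table_name, row_key), changed_at)
) WITH CLUSTERING ORDER BY (changed_at DESC);
CQL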


JBOD device space allocation?

2016-02-23 Thread Jack Krupansky
I just wanted to confirm whether my understanding of how JBOD allocates
device space is correct or not...

Pre-3.2:
On each memtable flush Cassandra will select the directory (device) which
has the most available space as a percentage of the total available space
on all of the listed directories/devices. A random weighted value is used
so it won't always pick the same directory/device with the most space, the
goal being to balance writes for performance.

As of 3.2:
The ranges of tokens stored on the local node will be evenly distributed
among the configured storage devices - even by token range, even if that
may be uneven by actual partition sizes. The code presumes that each of the
configured local storage devices has the same capacity.

The relevant change in 3.2 appears to be:
Make sure tokens don't exist in several data directories (CASSANDRA-6696)

The code for the pre-3.2 model is still in 3.x - is there some other code
path which will cause the pre-3.2 behavior even when running 3.2 or later?

I see this code which seems to allow for at least some cases where the
pre-3.2 behavior would still be invoked, but I'm not sure what user-level
cases that might be:

if (!cfs.getPartitioner().splitter().isPresent() || localRanges.isEmpty())
  return Collections.singletonList(new FlushRunnable(lastReplayPosition.get(), txn));

return createFlushRunnables(localRanges, txn);

IOW, if the partitioner does not have a splitter present or the localRanges
for the node cannot be determined. But... what exactly would a user do to
cause that?

There is no doc for this stuff - can a committer (or adventurous user!)
confirm what is actually implemented, both pre and post 3.2? (I already
pinged docs on this.)

Or if anybody is actually using JBOD, what behavior they are seeing for
device space utilization.
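
For anyone who wants to report numbers, a quick way to see per-device
utilization across the configured JBOD directories (the paths below are
placeholders for whatever is listed under data_file_directories):

du -sh /data/disk1/cassandra /data/disk2/cassandra /data/disk3/cassandra
df -h /data/disk1 /data/disk2 /data/disk3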

Thanks!

-- Jack Krupansky


CRT

2016-02-23 Thread Rakesh Kumar

https://www.aphyr.com/posts/294-jepsen-cassandra


How much of this is still valid in ver 3.0? The above seems to have been 
written for ver 1.0.


thanks.


IF NOT EXISTS with multiple static columns confusion

2016-02-23 Thread Nimi Wariboko Jr
I have a table with 2 static columns, and I write to either one of them, if
I then write to the other one using IF NOT EXISTS, it fails even though it
has never been written to before. Is it the case that all static columns
share the same "written to" marker?

Given a table like so:

CREATE TABLE test (
  id timeuuid,
  foo int static,
  bar int static,
  baz int,
  baq int,
  PRIMARY KEY (id, baz)
)

I'm seeing some confusing behavior; see the statements below:

"""
INSERT INTO cmpayments.report_payments (id, foo) VALUES (NOW(), 1) IF NOT
EXISTS; // succeeds
TRUNCATE test;
INSERT INTO cmpayments.report_payments (id, baq) VALUES
(99c3-b01a-11e5-b170-0242ac110002, 1);
UPDATE cmpayments.report_payments SET foo = 1 WHERE
id=99c3-b01a-11e5-b170-0242ac110002 IF foo=null; // fails, even though
foo=null
TRUNCATE test;
INSERT INTO cmpayments.report_payments (id, bar) VALUES
(99c3-b01a-11e5-b170-0242ac110002, 1); // succeeds
INSERT INTO cmpayments.report_payments (id, foo) VALUES (NOW(), 1) IF NOT
EXISTS; // fails, even though foo=null, and has never been written to
"""

Nimi


Re: copy and rename sstable files as keyspace migration approach

2016-02-23 Thread Jarod Guertin
Great info about the Summary.db files, thanks Tyler.

On Tue, Feb 23, 2016 at 2:27 PM, Tyler Hobbs  wrote:

>
> On Tue, Feb 23, 2016 at 12:36 PM, Robert Coli 
> wrote:
>
>> [1] In some very new versions of Cassandra, this may not be safe to do
>> with certain meta information files which are sadly no longer immutable.
>
>
> I presume you're referring to the index summary (i.e Summary.db files).
> These just contain a sampling of the (immutable) Index.db files, and are
> safe to hardlink in the way that you've described.  The sampling level of
> the summary (which is what can change over time) is serialized at the start
> of the Summary.db file.
>
> If you're truly paranoid, you can skip the Summary.db files and they'll be
> rebuilt on startup.
>
> --
> Tyler Hobbs
> DataStax 
>


Re: copy and rename sstable files as keyspace migration approach

2016-02-23 Thread Jarod Guertin
Yes, 1) was just for safety but if cassandra is stopped locally, it's
probably not needed.

3) thanks for the note, will add
3) we were thinking of copying, and later (silent 7, as you mentioned,
after we drop the old keyspaces\CFs we would delete the original files)

6) good to know!

Thanks Rob


On Tue, Feb 23, 2016 at 1:36 PM, Robert Coli  wrote:

>
> On Tue, Feb 23, 2016 at 6:44 AM, Jarod Guertin <
> jarod.guer...@sparkpost.com> wrote:
>
>> Being fairly new to Cassandra, I'd like to run the following with the
>> experts to make sure it's an ok thing to do.
>>
>> We have a particular case where we have multiple keyspaces with multiple
>> tables each and we want to migrate to a new unique keyspace on the same
>> cluster.
>>
>> The approach envisioned is:
>> 1. take snapshots on all the nodes
>> 2. create the new keyspace and all the tables with identical schema
>> settings (just a different name and keyspace location)
>> 3. one node at a time, stop cassandra, copy the db files from the old
>> keyspace\table locations to the new keyspace\table locations and rename the
>> db filename to use the new keyspace name; then restart cassandra
>> 4. verify cassandra is running, then repeat step 3 for each other node
>> 5. once all done switch our application calls to use the new keyspace \
>> tables
>> 6. run node repair on each node, one node at a time
>>
>> It is understood that between the snapshots (1) and using the new
>> keyspace (5) that any changes would not be included in the migration, it
>> would be done during a maintenance window when only read operations would
>> be permitted.  I should also mention that our number of cassandra nodes is
>> greater than the replication factor (3).
>>
>
> This is essentially the same operation as renaming a columnfamily, which I
> described (and someone provided some useful details regarding) in this Jira
> :
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-1585?focusedCommentId=13488959&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488959
>
> It's similar to the "copy-the-sstables" method here as well :
>
> https://www.pythian.com/blog/bulk-loading-options-for-cassandra/
>
> Notes on your variant :
>
> - 1) why snapshot? just for safety?
> - 3) add nodetool drain before stopping
> - 3) if you're "copying" you should strongly consider hard linking
> instead. that way you keep the (immutable) files in both places but only
> use the disk space once. [1]
> - 6) is un-necessary if you've done things properly, which you could
> verify by having a representative known set of data that you read before
> and after
> - Presumably there is a silent 7) where you drop the old keyspaces/CFs?
>
> =Rob
>
> [1] In some very new versions of Cassandra, this may not be safe to do
> with certain meta information files which are sadly no longer immutable.
>
>


Re: copy and rename sstable files as keyspace migration approach

2016-02-23 Thread Tyler Hobbs
On Tue, Feb 23, 2016 at 12:36 PM, Robert Coli  wrote:

> [1] In some very new versions of Cassandra, this may not be safe to do
> with certain meta information files which are sadly no longer immutable.


I presume you're referring to the index summary (i.e Summary.db files).
These just contain a sampling of the (immutable) Index.db files, and are
safe to hardlink in the way that you've described.  The sampling level of
the summary (which is what can change over time) is serialized at the start
of the Summary.db file.

If you're truly paranoid, you can skip the Summary.db files and they'll be
rebuilt on startup.
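
If you do skip them, a minimal sketch of hard-linking every component except
the Summary.db files (paths are placeholders):

find /path/to/old/table -maxdepth 1 -type f ! -name '*-Summary.db' \
  -exec ln {} /path/to/new/table/ \;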

-- 
Tyler Hobbs
DataStax 


Re: copy and rename sstable files as keyspace migration approach

2016-02-23 Thread Robert Coli
On Tue, Feb 23, 2016 at 6:44 AM, Jarod Guertin 
wrote:

> Being fairly new to Cassandra, I'd like to run the following with the
> experts to make sure it's an ok thing to do.
>
> We have a particular case where we have multiple keyspaces with multiple
> tables each and we want to migrate to a new unique keyspace on the same
> cluster.
>
> The approach envisioned is:
> 1. take snapshots on all the nodes
> 2. create the new keyspace and all the tables with identical schema
> settings (just a different name and keyspace location)
> 3. one node at a time, stop cassandra, copy the db files from the old
> keyspace\table locations to the new keyspace\table locations and rename the
> db filename to use the new keyspace name; then restart cassandra
> 4. verify cassandra is running, then repeat step 3 for each other node
> 5. once all done switch our application calls to use the new keyspace \
> tables
> 6. run node repair on each node, one node at a time
>
> It is understood that between the snapshots (1) and using the new keyspace
> (5) that any changes would not be included in the migration, it would be
> done during a maintenance window when only read operations would be
> permitted.  I should also mention that our number of cassandra nodes is
> greater than the replication factor (3).
>

This is essentially the same operation as renaming a columnfamily, which I
described (and someone provided some useful details regarding) in this Jira
:

https://issues.apache.org/jira/browse/CASSANDRA-1585?focusedCommentId=13488959&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488959

It's similar to the "copy-the-sstables" method here as well :

https://www.pythian.com/blog/bulk-loading-options-for-cassandra/

Notes on your variant :

- 1) why snapshot? just for safety?
- 3) add nodetool drain before stopping
- 3) if you're "copying" you should strongly consider hard linking instead.
that way you keep the (immutable) files in both places but only use the
disk space once. [1]
- 6) is un-necessary if you've done things properly, which you could verify
by having a representative known set of data that you read before and after
- Presumably there is a silent 7) where you drop the old keyspaces/CFs?

=Rob

[1] In some very new versions of Cassandra, this may not be safe to do with
certain meta information files which are sadly no longer immutable.
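
A rough sketch of the hard-link-and-rename step from note 3), assuming
2.0/2.1-style "<keyspace>-<table>-..." file names and default data paths (both
of which vary by version, so adjust accordingly):

OLD_KS=old_keyspace
NEW_KS=new_keyspace
TABLE=my_table
SRC=/var/lib/cassandra/data/$OLD_KS/$TABLE
DST=/var/lib/cassandra/data/$NEW_KS/$TABLE

# Hard link each component into the new location under the new keyspace name
for f in "$SRC"/"$OLD_KS"-"$TABLE"-*; do
  ln "$f" "$DST/$(basename "$f" | sed "s/^$OLD_KS-/$NEW_KS-/")"
done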


RE: Restart Cassandra automatically

2016-02-23 Thread SEAN_R_DURITY
What anti-pattern are you mocking me for exactly?


Sean Durity

From: daemeon reiydelle [mailto:daeme...@gmail.com]
Sent: Tuesday, February 23, 2016 11:21 AM
To: user@cassandra.apache.org
Subject: RE: Restart Cassandra automatically


Cassandra nodes do not go down "for no reason". They are not stateless. I would 
like to thank you for this marvelous example of a wonderful antipattern. 
Absolutely fantastic.

Thank you! I am not being a satirical smartass. I sometimes am challenged by 
clients in my presentations about sre best practices around c*, hadoop, and elk 
on the grounds that "noone would ever do this in production". Now I have 
objective proof!

Daemeon

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Feb 23, 2016 7:53 AM, 
> wrote:
Yes, I can see the potential problem in theory. However, we never do your #2. 
Generally, we don’t have unused spare hardware. We just fix the host that is 
down and run repairs. (Side note: while I have seen nodes fight it out over who 
owns a particular token in earlier versions, it seems that 1.2+ doesn’t allow 
that to happen as easily. The second node will just not come up.)

For most of our use cases, I would agree with your Coli Conjecture.


Sean Durity

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Tuesday, February 09, 2016 4:41 PM
To: user@cassandra.apache.org
Subject: Re: Restart Cassandra automatically

On Tue, Feb 9, 2016 at 6:20 AM, 
> wrote:
Call me naïve, but we do use an in-house built program for keeping nodes 
started (based on a flag-check). The program is something that was written for 
all kinds of daemon processes here, not Cassandra specifically. The basic idea 
is that it runs a status check. If that fails, and the flag is set, start 
Cassandra. In my opinion, it has helped more than hurt us – especially with the 
very fragile 1.1 releases that were prone to heap problems.

Ok, you're naïve.. ;P

But seriously, think of this scenario :

1) Node A, responsible for range A-M, goes down due to hardware failure of a 
disk in a RAID
2) Node B is put into service and is made responsible for A-M
3) Months pass
4) Node A comes back up, announces that it is responsible for A-M, and the 
cluster agrees

Consistency is now permanently broken for any involved rows. Why doesn't it 
(usually) matter?

It's not so much that you are naïve but that you are providing still more 
support for the Coli Conjecture : "If you are using a distributed database you 
probably do not care about consistency, even if you think you do." You have 
repeatedly chosen Availability over Consistency and it has never had a negative 
impact on your actual application.

=Rob





Re: „Using Timestamp“ Feature

2016-02-23 Thread Ben Bromhead
When using client supplied timestamps you need to ensure the clock on the
client is in sync with the nodes in the cluster otherwise behaviour will be
unpredictable.
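
A couple of quick ways to sanity-check that client and server clocks are being
kept in sync (which command is available depends on the OS):

ntpq -p             # peer list and offsets on an ntpd-managed host
timedatectl status  # on systemd hosts, shows whether NTP is synchronized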

On Thu, 18 Feb 2016 at 08:50 Tyler Hobbs  wrote:

> 2016-02-18 2:00 GMT-06:00 Matthias Niehoff <
> matthias.nieh...@codecentric.de>:
>
>>
>> * is the 'using timestamp' feature (and providing statement timestamps)
>> sufficiently robust and mature to build an application on?
>>
>
> Yes.  It's been there since the start of CQL3.
>
>
>> * In a BatchedStatement, can different statements have different
>> (explicitly provided) timestamps, or is the BatchedStatement's timestamp
>> used for them all? Is this specified / stable behaviour?
>>
>
> Yes, you can separate timestamps per statement.  And, in fact, if you
> potentially mix inserts and deletes on the same rows, you *should *use
> explicit timestamps with different values.  See the timestamp notes here:
> http://cassandra.apache.org/doc/cql3/CQL.html#batchStmt
>
>
>> * cqhsh reports a syntax error when I use 'using timestamp' with an
>> update statement (works with 'insert'). Is there a good reason for this, or
>> is it a bug?
>>
>
> The "USING TIMESTAMP" goes in a different place in update statements.  It
> should be something like:
>
> UPDATE mytable USING TIMESTAMP ? SET col = ? WHERE key = ?
>
>
> --
> Tyler Hobbs
> DataStax 
>
-- 
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer
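
Putting Tyler's point together, a minimal cqlsh sketch of per-statement
timestamps inside a batch (keyspace, table and values are made up; the batch
itself deliberately does not set its own timestamp):

cqlsh <<'CQL'
BEGIN BATCH
  INSERT INTO myks.mytable (k, v) VALUES (1, 'new') USING TIMESTAMP 1456272000000001;
  DELETE FROM myks.mytable USING TIMESTAMP 1456272000000000 WHERE k = 2;
APPLY BATCH;
CQL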


Re: High Bloom filter false ratio

2016-02-23 Thread Jeff Jirsa
sstablemetadata definitely exists for 2.0 – it may be in a different location, 
but it exists.

If all else fails, it’s a 50 line bash script, grab it from here: 

https://github.com/apache/cassandra/blob/cassandra-2.0/tools/bin/sstablemetadata
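
As a rough illustration of the scripting described further down
(sstablemetadata plus droppable-tombstone estimates feeding
forceUserDefinedCompaction), a sketch only; the data path, threshold, and exact
output labels are assumptions:

DATA_DIR=/var/lib/cassandra/data/myks/mytable
THRESHOLD=0.5

for f in "$DATA_DIR"/*-Data.db; do
  est=$(sstablemetadata "$f" | awk -F': ' '/droppable tombstones/ {print $2}')
  if awk -v e="$est" -v t="$THRESHOLD" 'BEGIN {exit !(e > t)}'; then
    echo "candidate: $f (estimated droppable tombstones: $est)"
    # Candidates can then be passed, comma separated, to the JMX operation
    # forceUserDefinedCompaction on the CompactionManager MBean
    # (e.g. via a small JMX client such as jmxterm).
  fi
done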



From:  Anishek Agarwal
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, February 23, 2016 at 12:37 AM
To:  "user@cassandra.apache.org"
Subject:  Re: High Bloom filter false ratio

Looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you know of 
anything that will work on 2.0.x?

On Tue, Feb 23, 2016 at 1:48 PM, Anishek Agarwal  wrote:
Thanks Jeff. Awesome, I will look at the tools and JMX endpoint. 

our settings are below originated from the jira you posted above as the base. 
we are running on 48 core machines with 2 SSD disks of 800 GB each .

MAX_HEAP_SIZE="6G"

HEAP_NEWSIZE="4G"

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"



On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa  wrote:
There exists a JMX endpoint called forceUserDefinedCompaction that takes a 
comma separated list of sstables to compact together.

There also exists a tool called sstablemetadata (may be in a ‘cassandra-tools’ 
package separate from whatever package you used to install cassandra, or in the 
tools/ directory of your binary package). Using sstablemetadata, you can look 
at the maxTimestamp for each table, and the ‘Estimated droppable tombstones’. 
Using those two fields, you could, very easily, write a script that gives you a 
list of sstables that you could feed to forceUserDefinedCompaction to join 
together to eliminate leftover waste.

Your long ParNew times may be fixable by increasing the new gen size of your 
heap – the general guidance in cassandra-env.sh is out of date, you may want to 
reference CASSANDRA-8150 for “newer” advice ( 
http://issues.apache.org/jira/browse/CASSANDRA-8150 ) 

- Jeff

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Monday, February 22, 2016 at 8:33 PM 

To: "user@cassandra.apache.org"
Subject: Re: High Bloom filter false ratio

Hey Jeff, 

Thanks for the clarification, I did not explain myself clearly: the 
max_sstable_age_days is set to 30 days and the TTL on every insert is set to 30 
days also by default. gc_grace_seconds is 0, so I would think the sstable as a 
whole would be deleted.

Because of the problems mentioned at 1) above, it looks like there might be 
cases where the table just lies around since no compaction is happening on it, 
and even though everything is expired it would still not be deleted?

For 3), the average read is pretty good, though the throughput doesn't seem to 
be that great. When no repair is running we get GCInspector pauses > 200 ms once 
every couple of hours; otherwise it's every 10-20 mins: 
INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line 116) GC 
for ParNew: 205 ms for 1 collections, 1712439128 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line 116) GC 
for ParNew: 242 ms for 1 collections, 1819126928 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line 116) GC 
for ParNew: 374 ms for 1 collections, 1829660304 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line 116) GC 
for ParNew: 419 ms for 1 collections, 2309875224 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line 116) GC 
for ParNew: 231 ms for 1 collections, 2515325328 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line 116) GC 
for ParNew: 252 ms for 1 collections, 1724241952 used; max is 7784628224



Our reading patterns depend on the BF working efficiently, as we do a lot of 
reads for keys that may not exist, because it's time series and we segregate 
data based on hourly boundaries from epoch.



Re: Nodes go down periodically

2016-02-23 Thread daemeon reiydelle
If you can, do a few (short, maybe 10m records, delete the default schema
between executions) run of Cassandra Stress test against your production
cluster (replication=3, force quorum to 3). Look for latency max in the 10s
of SECONDS. If your devops team is running a monitoring tool that looks at
the network, look for timeout/retries/errors/lost packets, etc. during the
run (worst case you need to do netstats runs against the relevant nic e.g.
every 10 seconds on the CassStress node, look for jumps in this count (if
monitoring is enabled, look at the monitor's results for ALL of your nodes.
At least one is having some issues.
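
A hedged sketch of that kind of short run, using the newer cassandra-stress
syntax (the legacy 2.0 tool takes different flags, so adjust; hosts and counts
are placeholders):

# ~10M writes at QUORUM against the cluster; watch the reported latency max
cassandra-stress write n=10000000 cl=QUORUM -node 10.0.0.1,10.0.0.2 -rate threads=100

# In parallel on the stress node, sample NIC error/drop counters every 10 seconds
while true; do date; netstat -i; sleep 10; done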


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky 
wrote:

> The reality of modern distributed systems is that connectivity between
> nodes is never guaranteed and distributed software must be able to cope
> with occasional absence of connectivity. GC and network connectivity are
> the two issues that a lot of us are most familiar with. There may be others
> - but most technical problems on a node would be clearly logged on that
> node. If you see a lapse of connectivity no more than once or twice a day,
> consider yourselves lucky.
>
> Is it only one node at a time that goes down, and at widely dispersed
> times?
>
> How many nodes?
>
> -- Jack Krupansky
>
> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <
> samuelsson.j...@gmail.com> wrote:
>
>> Hi,
>>
>> Version is 2.0.17.
>> Yes, these are VMs in the cloud though I'm fairly certain they are on a
>> LAN rather than WAN. They are both in the same data centre physically. The
>> phi_convict_threshold is set to default. I'd rather find the root cause of
>> the problem than just hiding it by not convicting a node if it isn't
>> responding though. If pings are <2 ms without a single ping missed in
>> several days, I highly doubt that network is the reason for the downtime.
>>
>> Best regards,
>> Joel
>>
>> 2016-02-23 16:39 GMT+01:00 :
>>
>>> You didn’t mention version, but I saw this kind of thing very often in
>>> the 1.1 line. Often this is connected to network flakiness. Are these VMs?
>>> In the cloud? Connected over a WAN? You mention that ping seems fine. Take
>>> a look at the phi_convict_threshold in cassandra.yaml. You may need to
>>> increase it to reduce the UP/DOWN flapping behavior.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
>>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Nodes go down periodically
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> Thanks for your reply.
>>>
>>>
>>>
>>> I have debug logging on and see no GC pauses that are that long. GC
>>> pauses are all well below 1s and 99 times out of 100 below 100ms.
>>>
>>> Do I need to enable GC log options to see the pauses?
>>>
>>> I see plenty of these lines:
>>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
>>> 118) GC for ParNew: 24 ms for 1 collections
>>>
>>> as well as a few CMS GC log lines.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Joel
>>>
>>>
>>>
>>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger :
>>>
>>> Hi,
>>>
>>>
>>>
>>> Those are probably GC pauses. Memory tuning is probably needed. Check
>>> the parameters that you already have customised if they make sense.
>>>
>>>
>>>
>>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>>
>>>
>>>
>>> Hannu
>>>
>>>
>>>
>>>
>>>
>>> On 23 Feb 2016, at 16:08, Joel Samuelsson 
>>> wrote:
>>>
>>>
>>>
>>> Our nodes go down periodically, around 1-2 times each day. Downtime is
>>> from <1 second to 30 or so seconds.
>>>
>>>
>>>
>>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
>>> InetAddress /109.74.13.67 is now DOWN
>>>
>>>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
>>> (line 978) InetAddress /109.74.13.67 is now UP
>>>
>>>
>>>
>>> I find nothing odd in the logs around the same time. I logged a ping
>>> with timestamp and checked during the same time and saw nothing weird (ping
>>> is less than 2ms at all times).
>>>
>>>
>>>
>>> Does anyone have any suggestions as to why this might happen?
>>>
>>>
>>>
>>> Best regards,
>>> Joel
>>>
>>>
>>>
>>>
>>>

Re: Nodes go down periodically

2016-02-23 Thread Jack Krupansky
The reality of modern distributed systems is that connectivity between
nodes is never guaranteed and distributed software must be able to cope
with occasional absence of connectivity. GC and network connectivity are
the two issues that a lot of us are most familiar with. There may be others
- but most technical problems on a node would be clearly logged on that
node. If you see a lapse of connectivity no more than once or twice a day,
consider yourselves lucky.

Is it only one node at a time that goes down, and at widely dispersed times?

How many nodes?

-- Jack Krupansky

On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson  wrote:

> Hi,
>
> Version is 2.0.17.
> Yes, these are VMs in the cloud though I'm fairly certain they are on a
> LAN rather than WAN. They are both in the same data centre physically. The
> phi_convict_threshold is set to default. I'd rather find the root cause of
> the problem than just hiding it by not convicting a node if it isn't
> responding though. If pings are <2 ms without a single ping missed in
> several days, I highly doubt that network is the reason for the downtime.
>
> Best regards,
> Joel
>
> 2016-02-23 16:39 GMT+01:00 :
>
>> You didn’t mention version, but I saw this kind of thing very often in
>> the 1.1 line. Often this is connected to network flakiness. Are these VMs?
>> In the cloud? Connected over a WAN? You mention that ping seems fine. Take
>> a look at the phi_convict_threshold in cassandra.yaml. You may need to
>> increase it to reduce the UP/DOWN flapping behavior.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Nodes go down periodically
>>
>>
>>
>> Hi,
>>
>>
>>
>> Thanks for your reply.
>>
>>
>>
>> I have debug logging on and see no GC pauses that are that long. GC
>> pauses are all well below 1s and 99 times out of 100 below 100ms.
>>
>> Do I need to enable GC log options to see the pauses?
>>
>> I see plenty of these lines:
>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
>> 118) GC for ParNew: 24 ms for 1 collections
>>
>> as well as a few CMS GC log lines.
>>
>>
>>
>> Best regards,
>>
>> Joel
>>
>>
>>
>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger :
>>
>> Hi,
>>
>>
>>
>> Those are probably GC pauses. Memory tuning is probably needed. Check the
>> parameters that you already have customised if they make sense.
>>
>>
>>
>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>
>>
>>
>> Hannu
>>
>>
>>
>>
>>
>> On 23 Feb 2016, at 16:08, Joel Samuelsson 
>> wrote:
>>
>>
>>
>> Our nodes go down periodically, around 1-2 times each day. Downtime is
>> from <1 second to 30 or so seconds.
>>
>>
>>
>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
>> InetAddress /109.74.13.67 is now DOWN
>>
>>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
>> (line 978) InetAddress /109.74.13.67 is now UP
>>
>>
>>
>> I find nothing odd in the logs around the same time. I logged a ping with
>> timestamp and checked during the same time and saw nothing weird (ping is
>> less than 2ms at all times).
>>
>>
>>
>> Does anyone have any suggestions as to why this might happen?
>>
>>
>>
>> Best regards,
>> Joel
>>
>>
>>
>>
>>
>>
>
>


Re: Restart Cassandra automatically

2016-02-23 Thread Anuj Wadehra
Hi Subharaj,
Cassandra is built to be a fault-tolerant distributed db and suitable for 
building HA systems. As Cassandra provides multiple replicas for the same data, 
if a single node goes down in production, it won't bring down the cluster.
In my opinion, if you aim to restart one or more failed Cassandra nodes 
without investigating the issue, you can damage system health rather than 
preserve it.
Please set RF and CL appropriately to ensure that the system can afford node 
failures.
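
For reference, a minimal sketch of what "RF and CL set appropriately" can look
like (keyspace name and DC label are placeholders):

cqlsh <<'CQL'
CREATE KEYSPACE IF NOT EXISTS myks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
-- with RF=3, reads and writes at QUORUM (or LOCAL_QUORUM) keep succeeding
-- while any single replica is down.
CQL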
Thanks,
Anuj

Sent from Yahoo Mail on Android 
 
  On Fri, 5 Feb, 2016 at 9:56 am, Debraj Manna wrote: 
  Hi,


What is the best way to keep Cassandra running? My requirement is that if for 
some reason Cassandra stops, it should get started automatically. 

I tried to achieve this by adding cassandra to supervisord. My supervisor conf 
for cassandra looks like below:-
[program:cassandra]
command=/bin/bash -c 'sleep 10 && bin/cassandra'
directory=/opt/cassandra/
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/cassandra_supervisor.err.log
stdout_logfile=/var/log/cassandra_supervisor.out.log

But it does not seem to work properly. Even if I stop Cassandra from 
supervisor, the Cassandra process still seems to be running when I do 
ps -ef | grep cassandra


I also tried the configuration mentioned in this question but still no luck.
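
One likely culprit, for what it's worth: bin/cassandra backgrounds itself by
default, so the supervisor ends up tracking a launcher that exits almost
immediately. Running Cassandra in the foreground is the usual fix; a minimal
sketch (paths are assumptions):

#!/usr/bin/env bash
# run-cassandra-fg.sh - hypothetical wrapper; point supervisord's command= here.
# -f keeps Cassandra in the foreground so the supervisor tracks the real process.
exec /opt/cassandra/bin/cassandra -f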

Can someone let me know what is the best way to keep Cassandra running in a 
production environment?

Environment:
- Cassandra 2.2.4
- Debian 8

Thanks,



  


RE: Restart Cassandra automatically

2016-02-23 Thread daemeon reiydelle
Cassandra nodes do not go down "for no reason". They are not stateless. I
would like to thank you for this marvelous example of a wonderful
antipattern. Absolutely fantastic.

Thank you! I am not being a satirical smartass. I sometimes am challenged
by clients in my presentations about sre best practices around c*, hadoop,
and elk on the grounds that "noone would ever do this in production". Now I
have objective proof!

Daemeon

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Feb 23, 2016 7:53 AM,  wrote:

> Yes, I can see the potential problem in theory. However, we never do your
> #2. Generally, we don’t have unused spare hardware. We just fix the host
> that is down and run repairs. (Side note: while I have seen nodes fight it
> out over who owns a particular token in earlier versions, it seems that
> 1.2+ doesn’t allow that to happen as easily. The second node will just not
> come up.)
>
>
>
> For most of our use cases, I would agree with your Coli Conjecture.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Robert Coli [mailto:rc...@eventbrite.com]
> *Sent:* Tuesday, February 09, 2016 4:41 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Restart Cassandra automatically
>
>
>
> On Tue, Feb 9, 2016 at 6:20 AM,  wrote:
>
> Call me naïve, but we do use an in-house built program for keeping nodes
> started (based on a flag-check). The program is something that was written
> for all kinds of daemon processes here, not Cassandra specifically. The
> basic idea is that it runs a status check. If that fails, and the flag is
> set, start Cassandra. In my opinion, it has helped more than hurt us –
> especially with the very fragile 1.1 releases that were prone to heap
> problems.
>
>
>
> Ok, you're naïve.. ;P
>
>
>
> But seriously, think of this scenario :
>
>
>
> 1) Node A, responsible for range A-M, goes down due to hardware failure of
> a disk in a RAID
>
> 2) Node B is put into service and is made responsible for A-M
>
> 3) Months pass
>
> 4) Node A comes back up, announces that it is responsible for A-M, and the
> cluster agrees
>
>
>
> Consistency is now permanently broken for any involved rows. Why doesn't
> it (usually) matter?
>
>
>
> It's not so much that you are naïve but that you are providing still more
> support for the Coli Conjecture : "If you are using a distributed database
> you probably do not care about consistency, even if you think you do." You
> have repeatedly chosen Availability over Consistency and it has never had a
> negative impact on your actual application.
>
>
>
> =Rob
>
>
>
>


Re: Nodes go down periodically

2016-02-23 Thread Joel Samuelsson
Hi,

Version is 2.0.17.
Yes, these are VMs in the cloud though I'm fairly certain they are on a LAN
rather than WAN. They are both in the same data centre physically. The
phi_convict_threshold is set to default. I'd rather find the root cause of
the problem than just hiding it by not convicting a node if it isn't
responding though. If pings are <2 ms without a single ping missed in
several days, I highly doubt that network is the reason for the downtime.

Best regards,
Joel

2016-02-23 16:39 GMT+01:00 :

> You didn’t mention version, but I saw this kind of thing very often in the
> 1.1 line. Often this is connected to network flakiness. Are these VMs? In
> the cloud? Connected over a WAN? You mention that ping seems fine. Take a
> look at the phi_convict_threshold in cassandra.yaml. You may need to
> increase it to reduce the UP/DOWN flapping behavior.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
> *Sent:* Tuesday, February 23, 2016 9:41 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Nodes go down periodically
>
>
>
> Hi,
>
>
>
> Thanks for your reply.
>
>
>
> I have debug logging on and see no GC pauses that are that long. GC pauses
> are all well below 1s and 99 times out of 100 below 100ms.
>
> Do I need to enable GC log options to see the pauses?
>
> I see plenty of these lines:
> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
> 118) GC for ParNew: 24 ms for 1 collections
>
> as well as a few CMS GC log lines.
>
>
>
> Best regards,
>
> Joel
>
>
>
> 2016-02-23 15:14 GMT+01:00 Hannu Kröger :
>
> Hi,
>
>
>
> Those are probably GC pauses. Memory tuning is probably needed. Check the
> parameters that you already have customised if they make sense.
>
>
>
> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>
>
>
> Hannu
>
>
>
>
>
> On 23 Feb 2016, at 16:08, Joel Samuelsson 
> wrote:
>
>
>
> Our nodes go down periodically, around 1-2 times each day. Downtime is
> from <1 second to 30 or so seconds.
>
>
>
> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
> InetAddress /109.74.13.67 is now DOWN
>
>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
> (line 978) InetAddress /109.74.13.67 is now UP
>
>
>
> I find nothing odd in the logs around the same time. I logged a ping with
> timestamp and checked during the same time and saw nothing weird (ping is
> less than 2ms at all times).
>
>
>
> Does anyone have any suggestions as to why this might happen?
>
>
>
> Best regards,
> Joel
>
>
>
>
>
>


RE: Restart Cassandra automatically

2016-02-23 Thread SEAN_R_DURITY
Yes, I can see the potential problem in theory. However, we never do your #2. 
Generally, we don’t have unused spare hardware. We just fix the host that is 
down and run repairs. (Side note: while I have seen nodes fight it out over who 
owns a particular token in earlier versions, it seems that 1.2+ doesn’t allow 
that to happen as easily. The second node will just not come up.)

For most of our use cases, I would agree with your Coli Conjecture.


Sean Durity

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Tuesday, February 09, 2016 4:41 PM
To: user@cassandra.apache.org
Subject: Re: Restart Cassandra automatically

On Tue, Feb 9, 2016 at 6:20 AM, 
> wrote:
Call me naïve, but we do use an in-house built program for keeping nodes 
started (based on a flag-check). The program is something that was written for 
all kinds of daemon processes here, not Cassandra specifically. The basic idea 
is that it runs a status check. If that fails, and the flag is set, start 
Cassandra. In my opinion, it has helped more than hurt us – especially with the 
very fragile 1.1 releases that were prone to heap problems.

Ok, you're naïve.. ;P

But seriously, think of this scenario :

1) Node A, responsible for range A-M, goes down due to hardware failure of a 
disk in a RAID
2) Node B is put into service and is made responsible for A-M
3) Months pass
4) Node A comes back up, announces that it is responsible for A-M, and the 
cluster agrees

Consistency is now permanently broken for any involved rows. Why doesn't it 
(usually) matter?

It's not so much that you are naïve but that you are providing still more 
support for the Coli Conjecture : "If you are using a distributed database you 
probably do not care about consistency, even if you think you do." You have 
repeatedly chosen Availability over Consistency and it has never had a negative 
impact on your actual application.

=Rob






RE: Nodes go down periodically

2016-02-23 Thread SEAN_R_DURITY
You didn’t mention version, but I saw this kind of thing very often in the 1.1 
line. Often this is connected to network flakiness. Are these VMs? In the 
cloud? Connected over a WAN? You mention that ping seems fine. Take a look at 
the phi_convict_threshold in cassandra.yaml. You may need to increase it to 
reduce the UP/DOWN flapping behavior.
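
For reference, the setting lives in cassandra.yaml (default 8; values around
10-12 are commonly used for cloud/VM environments); the path below is an
assumption:

grep -n phi_convict_threshold /etc/cassandra/cassandra.yaml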


Sean Durity

From: Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
Sent: Tuesday, February 23, 2016 9:41 AM
To: user@cassandra.apache.org
Subject: Re: Nodes go down periodically

Hi,

Thanks for your reply.

I have debug logging on and see no GC pauses that are that long. GC pauses are 
all well below 1s and 99 times out of 100 below 100ms.
Do I need to enable GC log options to see the pauses?
I see plenty of these lines:
DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line 118) GC 
for ParNew: 24 ms for 1 collections
as well as a few CMS GC log lines.
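
For reference, a minimal sketch of the usual cassandra-env.sh additions that
make the actual stop-the-world pauses visible in a dedicated gc.log (these are
stock HotSpot flags; the log path is an assumption):

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"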

Best regards,
Joel

2016-02-23 15:14 GMT+01:00 Hannu Kröger 
>:
Hi,

Those are probably GC pauses. Memory tuning is probably needed. Check the 
parameters that you already have customised if they make sense.

http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html

Hannu


On 23 Feb 2016, at 16:08, Joel Samuelsson 
> wrote:

Our nodes go down periodically, around 1-2 times each day. Downtime is from <1 
second to 30 or so seconds.

INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) 
InetAddress /109.74.13.67 is now DOWN
 INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 
978) InetAddress /109.74.13.67 is now UP

I find nothing odd in the logs around the same time. I logged a ping with 
timestamp and checked during the same time and saw nothing weird (ping is less 
than 2ms at all times).

Does anyone have any suggestions as to why this might happen?

Best regards,
Joel







RE: High Bloom filter false ratio

2016-02-23 Thread SEAN_R_DURITY
I see the sstablemetadata tool as far back as 1.2.19 (in tools/bin).
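
Invocation is just the Data component of an sstable, something like this (the 
install path, data directory and generation number below are examples):

# run from the Cassandra install directory; paths are examples only
tools/bin/sstablemetadata /var/lib/cassandra/data/my_ks/my_table/my_ks-my_table-jb-42-Data.db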


Sean Durity
From: Anishek Agarwal [mailto:anis...@gmail.com]
Sent: Tuesday, February 23, 2016 3:37 AM
To: user@cassandra.apache.org
Subject: Re: High Bloom filter false ratio

It looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you 
know of anything that will work on 2.0.x?

On Tue, Feb 23, 2016 at 1:48 PM, Anishek Agarwal wrote:
Thanks Jeff, awesome; I will look at the tools and the JMX endpoint.

Our settings are below, using the JIRA you posted above as the base. We are 
running on 48-core machines with 2 SSD disks of 800 GB each.


MAX_HEAP_SIZE="6G"

HEAP_NEWSIZE="4G"

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"


On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa wrote:
There exists a JMX endpoint called forceUserDefinedCompaction that takes a 
comma separated list of sstables to compact together.

There also exists a tool called sstablemetadata (may be in a ‘cassandra-tools’ 
package separate from whatever package you used to install cassandra, or in the 
tools/ directory of your binary package). Using sstablemetadata, you can look 
at the maxTimestamp for each table, and the ‘Estimated droppable tombstones’. 
Using those two fields, you could, very easily, write a script that gives you a 
list of sstables that you could feed to forceUserDefinedCompaction to join 
together to eliminate leftover waste.

Your long ParNew times may be fixable by increasing the new gen size of your 
heap – the general guidance in cassandra-env.sh is out of date, you may want to 
reference CASSANDRA-8150 for “newer” advice ( 
http://issues.apache.org/jira/browse/CASSANDRA-8150 )
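
(For reference, new gen size is set via HEAP_NEWSIZE next to MAX_HEAP_SIZE in 
cassandra-env.sh; the values below are placeholders, not recommendations:)

# cassandra-env.sh -- placeholders only; size per the CASSANDRA-8150 discussion
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"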

- Jeff

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Monday, February 22, 2016 at 8:33 PM

To: "user@cassandra.apache.org"
Subject: Re: High Bloom filter false ratio

Hey Jeff,

Thanks for the clarification; I did not explain myself clearly. 
max_sstable_age_days is set to 30 days, and the TTL on every insert is also set 
to 30 days by default. gc_grace_seconds is 0, so I would think the sstable as a 
whole would be deleted.

Because of the problems mentioned in 1) above, it looks like there might be 
cases where an sstable just lies around since no compaction is happening on it, 
so even though everything in it is expired it would still not be deleted?

For 3), the average read latency is pretty good, though the throughput doesn't 
seem to be that great. When no repair is running we see GCInspector pauses 
> 200ms once every couple of hours; otherwise it's every 10-20 minutes:

INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line 116) GC 
for ParNew: 205 ms for 1 collections, 1712439128 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line 116) GC 
for ParNew: 242 ms for 1 collections, 1819126928 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line 116) GC 
for ParNew: 374 ms for 1 collections, 1829660304 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line 116) GC 
for ParNew: 419 ms for 1 collections, 2309875224 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line 116) GC 
for ParNew: 231 ms for 1 collections, 2515325328 used; max is 7784628224

 INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line 116) GC 
for ParNew: 252 ms for 1 collections, 1724241952 used; max is 7784628224


Our read patterns depend on the Bloom filter working efficiently, as we do a 
lot of reads for keys that may not exist, because this is time series data and 
we segregate data on hourly boundaries from the epoch.
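
(For illustration only, an hourly bucket of that kind is just integer division 
of the epoch timestamp; the variable names below are made up:)

# Illustrative only: derive an hourly bucket from an epoch-millisecond timestamp
ts_millis=$(( $(date +%s) * 1000 ))     # or the event's own timestamp in ms
hour_bucket=$(( ts_millis / 3600000 ))  # whole hours since the epoch
echo "$hour_bucket"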


hey Christopher,

yes every row in the sstable that should have 

Re: Nodes go down periodically

2016-02-23 Thread Joel Samuelsson
Hi,

Thanks for your reply.

I have debug logging on and see no GC pauses that are that long. GC pauses
are all well below 1s and 99 times out of 100 below 100ms.
Do I need to enable GC log options to see the pauses?
I see plenty of these lines:
DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
118) GC for ParNew: 24 ms for 1 collections
as well as a few CMS GC log lines.

Best regards,
Joel

2016-02-23 15:14 GMT+01:00 Hannu Kröger :

> Hi,
>
> Those are probably GC pauses. Memory tuning is probably needed. Check the
> parameters that you already have customised if they make sense.
>
> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>
> Hannu
>
>
> On 23 Feb 2016, at 16:08, Joel Samuelsson 
> wrote:
>
> Our nodes go down periodically, around 1-2 times each day. Downtime is
> from <1 second to 30 or so seconds.
>
> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
> InetAddress /109.74.13.67 is now DOWN
>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
> (line 978) InetAddress /109.74.13.67 is now UP
>
> I find nothing odd in the logs around the same time. I logged a ping with
> timestamp and checked during the same time and saw nothing weird (ping is
> less than 2ms at all times).
>
> Does anyone have any suggestions as to why this might happen?
>
> Best regards,
> Joel
>
>
>


copy and rename sstable files as keyspace migration approach

2016-02-23 Thread Jarod Guertin
Being fairly new to Cassandra, I'd like to run the following by the experts to
make sure it is an OK thing to do.

We have a particular case where we have multiple keyspaces, each with multiple
tables, and we want to migrate them all to a single new keyspace on the same
cluster.

The approach envisioned is:
1. take snapshots on all the nodes
2. create the new keyspace and all the tables with identical schema
settings (just a different name and keyspace location)
3. one node at a time, stop cassandra, copy the db files from the old
keyspace/table locations to the new keyspace/table locations and rename the
db file names to use the new keyspace name (a sketch follows this list); then
restart cassandra
4. verify cassandra is running, then repeat step 3 for each other node
5. once all done, switch our application calls to use the new keyspace/tables
6. run node repair on each node, one node at a time
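
A minimal sketch of step 3 for one table on one node, assuming the pre-3.0 data
layout where the keyspace name is embedded in each sstable file name (all paths
and names below are examples):

#!/bin/bash
# Sketch of step 3: with Cassandra stopped on this node, copy the sstables of
# one table from the old keyspace directory into the new one, renaming them to
# carry the new keyspace name. Keyspace, table and data-directory names are
# examples only.
OLD_KS=old_ks; NEW_KS=new_ks; TABLE=my_table
SRC=/var/lib/cassandra/data/$OLD_KS/$TABLE
DST=/var/lib/cassandra/data/$NEW_KS/$TABLE

mkdir -p "$DST"
for f in "$SRC"/"$OLD_KS-$TABLE"-*; do
  cp "$f" "$DST/$(basename "$f" | sed "s/^$OLD_KS-/$NEW_KS-/")"
done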

It is understood that any changes made between taking the snapshots (1) and
switching to the new keyspace (5) would not be included in the migration; the
work would be done during a maintenance window when only read operations are
permitted. I should also mention that our number of Cassandra nodes is greater
than the replication factor (3).

If you have done something similar and have experience or advice to share, it
would be most welcome.

Regards,
Jarod


Re: Nodes go down periodically

2016-02-23 Thread Hannu Kröger
Hi,

Those are probably GC pauses. Memory tuning is probably needed. Check the 
parameters that you already have customised if they make sense.

http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html 
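
If GC logging is not already enabled, the standard HotSpot flags in 
cassandra-env.sh will record every pause to a file, e.g. (the log path is just 
an example):

JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"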


Hannu


> On 23 Feb 2016, at 16:08, Joel Samuelsson  wrote:
> 
> Our nodes go down periodically, around 1-2 times each day. Downtime is from 
> <1 second to 30 or so seconds.
> 
> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) 
> InetAddress /109.74.13.67  is now DOWN
>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java (line 
> 978) InetAddress /109.74.13.67  is now UP
> 
> I find nothing odd in the logs around the same time. I logged a ping with 
> timestamp and checked during the same time and saw nothing weird (ping is 
> less than 2ms at all times).
> 
> Does anyone have any suggestions as to why this might happen?
> 
> Best regards,
> Joel



Nodes go down periodically

2016-02-23 Thread Joel Samuelsson
Our nodes go down periodically, around 1-2 times each day. Downtime is from
<1 second to 30 or so seconds.

INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
InetAddress /109.74.13.67 is now DOWN
 INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
(line 978) InetAddress /109.74.13.67 is now UP

I find nothing odd in the logs around the same time. I logged a ping with
timestamp and checked during the same time and saw nothing weird (ping is
less than 2ms at all times).

Does anyone have any suggestions as to why this might happen?

Best regards,
Joel


Re: High Bloom filter false ratio

2016-02-23 Thread Anishek Agarwal
It looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you
know of anything that will work on 2.0.x?

On Tue, Feb 23, 2016 at 1:48 PM, Anishek Agarwal  wrote:

> Thanks Jeff, Awesome will look at the tools and JMX endpoint.
>
> our settings are below originated from the jira you posted above as the
> base. we are running on 48 core machines with 2 SSD disks of 800 GB each .
>
> MAX_HEAP_SIZE="6G"
>
> HEAP_NEWSIZE="4G"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>
> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>
> # earlier value 131072
>
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"
>
>
> On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa 
> wrote:
>
>> There exists a JMX endpoint called forceUserDefinedCompaction that takes
>> a comma separated list of sstables to compact together.
>>
>> There also exists a tool called sstablemetadata (may be in a
>> ‘cassandra-tools’ package separate from whatever package you used to
>> install cassandra, or in the tools/ directory of your binary package).
>> Using sstablemetadata, you can look at the maxTimestamp for each table, and
>> the ‘Estimated droppable tombstones’. Using those two fields, you could,
>> very easily, write a script that gives you a list of sstables that you
>> could feed to forceUserDefinedCompaction to join together to eliminate
>> leftover waste.
>>
>> Your long ParNew times may be fixable by increasing the new gen size of
>> your heap – the general guidance in cassandra-env.sh is out of date, you
>> may want to reference CASSANDRA-8150 for “newer” advice (
>> http://issues.apache.org/jira/browse/CASSANDRA-8150 )
>>
>> - Jeff
>>
>> From: Anishek Agarwal
>> Reply-To: "user@cassandra.apache.org"
>> Date: Monday, February 22, 2016 at 8:33 PM
>>
>> To: "user@cassandra.apache.org"
>> Subject: Re: High Bloom filter false ratio
>>
>> Hey Jeff,
>>
>> Thanks for the clarification, I did not explain my self clearly, the 
>> max_stable_age_days
>> is set to 30 days and the ttl on every insert is set to 30 days also
>> by default. gc_grace_seconds is 0, so i would think the sstable as a whole
>> would be deleted.
>>
>> Because of the problems mentioned by at 1) above it looks like, there
>> might be cases where the table just lies around since no compaction is
>> happening on it and even though everything is expired it would still not be
>> deleted?
>>
>> for 3) the average read is pretty good, though the throughput doesn't
>> seem to be that great, when no repair is running we get GCIns > 200ms every
>> couple of hours once, otherwise its every 10-20 mins
>>
>> INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line
>> 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line
>> 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line
>> 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line
>> 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line
>> 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line
>> 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is
>> 7784628224
>>
>>
>> our reading patterns are dependent on BF to work efficiently as we do a
>> lot of reads for keys that may not exists because its time series and
>> we segregate data based on hourly boundary from epoch.
>>
>>
>> hey Christoper,
>>
>> yes every row in the stable that should have been deleted has "d" in that
>> column. Also the key for one of the row is as
>>
>> "key": 

Re: High Bloom filter false ratio

2016-02-23 Thread Anishek Agarwal
Thanks Jeff, awesome; I will look at the tools and the JMX endpoint.

Our settings are below, using the JIRA you posted above as the base. We are
running on 48-core machines with 2 SSD disks of 800 GB each.

MAX_HEAP_SIZE="6G"

HEAP_NEWSIZE="4G"

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"


On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa 
wrote:

> There exists a JMX endpoint called forceUserDefinedCompaction that takes a
> comma separated list of sstables to compact together.
>
> There also exists a tool called sstablemetadata (may be in a
> ‘cassandra-tools’ package separate from whatever package you used to
> install cassandra, or in the tools/ directory of your binary package).
> Using sstablemetadata, you can look at the maxTimestamp for each table, and
> the ‘Estimated droppable tombstones’. Using those two fields, you could,
> very easily, write a script that gives you a list of sstables that you
> could feed to forceUserDefinedCompaction to join together to eliminate
> leftover waste.
>
> Your long ParNew times may be fixable by increasing the new gen size of
> your heap – the general guidance in cassandra-env.sh is out of date, you
> may want to reference CASSANDRA-8150 for “newer” advice (
> http://issues.apache.org/jira/browse/CASSANDRA-8150 )
>
> - Jeff
>
> From: Anishek Agarwal
> Reply-To: "user@cassandra.apache.org"
> Date: Monday, February 22, 2016 at 8:33 PM
>
> To: "user@cassandra.apache.org"
> Subject: Re: High Bloom filter false ratio
>
> Hey Jeff,
>
> Thanks for the clarification, I did not explain my self clearly, the 
> max_stable_age_days
> is set to 30 days and the ttl on every insert is set to 30 days also
> by default. gc_grace_seconds is 0, so i would think the sstable as a whole
> would be deleted.
>
> Because of the problems mentioned by at 1) above it looks like, there
> might be cases where the table just lies around since no compaction is
> happening on it and even though everything is expired it would still not be
> deleted?
>
> for 3) the average read is pretty good, though the throughput doesn't seem
> to be that great, when no repair is running we get GCIns > 200ms every
> couple of hours once, otherwise its every 10-20 mins
>
> INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line
> 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is
> 7784628224
>
>  INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line
> 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is
> 7784628224
>
>  INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line
> 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is
> 7784628224
>
>  INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line
> 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is
> 7784628224
>
>  INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line
> 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is
> 7784628224
>
>  INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line
> 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is
> 7784628224
>
>
> our reading patterns are dependent on BF to work efficiently as we do a
> lot of reads for keys that may not exists because its time series and
> we segregate data based on hourly boundary from epoch.
>
>
> hey Christoper,
>
> yes every row in the stable that should have been deleted has "d" in that
> column. Also the key for one of the row is as
>
> "key": "00080cdd5edd080006251000"
>
>
>
> how do i get it back to normal readable format to get the (long,long) --
> composite partition key back?
>
> Looks like i have to force a major compaction to delete a lot of data ?
> are there any other solutions ?
>
> thanks
> anishek
>
>
>
> On Mon, Feb 22, 2016 at 11:21 PM, Jeff Jirsa 
>