unrecognized column family in logs

2017-11-08 Thread Anubhav Kale
Hello,

We run Cassandra 2.1.13, and for the last few days we have occasionally been seeing 
the error below in the logs. The node then becomes unresponsive to cqlsh.

ERROR [SharedPool-Worker-2] 2017-11-08 17:02:32,362 CommitLogSegment.java:441 - 
Attempted to write commit log entry for unrecognized column family: 
2854d160-3a2f-11e6-925c-b143135bdc80

https://github.com/mariusae/cassandra/blob/master/src/java/org/apache/cassandra/db/commitlog/CommitLogSegment.java#L95

The column family has heavy writes, but it hasn't changed recently, schema- or 
load-wise. How can this be troubleshot / fixed ?

Thanks !





RE: Re: Tuning bootstrap new node

2017-10-31 Thread Anubhav Kale
You can change YAML setting of memtable_cleanup_threshold to 0.7 (from the 
default of 0.3). This will push SSTables to disk less often and will reduce the 
compaction time.

While this won’t change the streaming time, it will reduce the overall time for 
your node to be healthy.
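A sketch of the corresponding cassandra.yaml change (the value is illustrative; larger memtables need heap headroom, so tune with care):

```yaml
# cassandra.yaml
# Fraction of memtable space that triggers a flush. Raising it from the
# 2.1-era default (~0.3) produces fewer, larger SSTables and therefore
# less compaction while a new node bootstraps. 0.7 is an example value.
memtable_cleanup_threshold: 0.7
```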

From: Harikrishnan Pillai [mailto:hpil...@walmartlabs.com]
Sent: Tuesday, October 31, 2017 11:28 AM
To: user@cassandra.apache.org
Subject: Re: Re: Tuning bootstrap new node


There is no magic in speeding up the node addition other than increasing stream 
throughput and compaction throughput.

It has been noticed that with heavy compactions, latency may go up if the 
node also starts serving data.

If you really don't want this node to serve traffic until all compactions 
settle down, you can disable gossip and the binary protocol using the nodetool 
command. This will allow compactions to continue, but it requires a repair later 
to fix the stale data.
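For example, a sketch of that sequence (run on the joining node; these are standard nodetool commands):

```shell
# Take the node out of client traffic while compactions settle.
nodetool disablebinary    # stop accepting CQL client connections
nodetool disablegossip    # stop participating in gossip

# Wait until 'nodetool compactionstats' shows pending tasks near zero,
# then bring the node back:
nodetool enablegossip
nodetool enablebinary

# The node missed writes while gossip was disabled, so repair it:
nodetool repair
```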

Regards

Hari


From: Nitan Kainth
Sent: Tuesday, October 31, 2017 5:47 AM
To: user@cassandra.apache.org
Subject: EXT: Re: Tuning bootstrap new node

Do not stop compaction, you will end up with thousands of sstables.

You can increase stream throughput from the default of 200 to a higher value if 
your network can handle it.
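For example (a sketch; 400 is an illustrative value, and the setting is in megabits per second):

```shell
# Run on the existing nodes that are streaming to the joining node.
nodetool setstreamthroughput 400

# Confirm the new cap (available on recent versions):
nodetool getstreamthroughput
```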
Sent from my iPhone

On Oct 31, 2017, at 6:35 AM, Peng Xiao <2535...@qq.com> wrote:
Can we stop compaction during the new node's bootstrap and enable it after 
the new node has joined?

Thanks
-- Original --
From: "my own mailbox" <2535...@qq.com>
Date: Tue, Oct 31, 2017 07:18 PM
To: "user" <user@cassandra.apache.org>
Subject:  Tuning bootstrap new node

Dear All,

Can we do some tuning to make bootstrapping a new node quicker? We have a three-DC 
cluster (RF=3 in two DCs, RF=1 in another; 48 nodes in the DC with RF=3). As 
the cluster becomes larger and larger, we need more than 24 hours to bootstrap a 
new node.
Could you please advise how to tune this ?

Many Thanks,
Peng Xiao


RE: Cassandra proxy to control read/write throughput

2017-10-31 Thread Anubhav Kale
There are some caveats with coordinator-only nodes. You can read about our 
experience in detail here.

From: Nate McCall [mailto:n...@thelastpickle.com]
Sent: Sunday, October 29, 2017 2:12 PM
To: Cassandra Users 
Subject: Re: Cassandra proxy to control read/write throughput

The following presentation describes in detail a technique for using 
coordinator-only nodes which will give you similar behavior (particularly 
slides 12 to 14):
https://www.slideshare.net/DataStax/optimizing-your-cluster-with-coordinator-nodes-eric-lubow-simplereach-cassandra-summit-2016

On Thu, Oct 26, 2017 at 12:07 PM, AI Rumman wrote:
Hi,

I am using different versions of Cassandra in my environment, where 60 nodes 
are running for different applications. Each application connects to its own 
cluster. I am thinking about abstracting the Cassandra IPs away from the app 
drivers.
The app would communicate with one proxy IP, which would redirect traffic to the 
appropriate Cassandra cluster. The idea behind this is to merge multiple 
clusters and control the read/write throughput from the proxy, per application.
If anyone knows pg_bouncer for PostgreSQL, I am thinking of something 
similar to that.
Has anyone worked on such a project? Can you please share some ideas?

Thanks.



--
-
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com


RE: nodetool repair failure

2017-06-30 Thread Anubhav Kale
If possible, simply read the table in question with consistency=ALL. This 
will trigger a read repair and is far more reliable than the nodetool command.
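For example, in cqlsh (keyspace and table names are placeholders):

```
cqlsh> CONSISTENCY ALL;
cqlsh> SELECT * FROM my_keyspace.my_table;
-- A digest mismatch between replicas at CL=ALL triggers a blocking
-- read repair before the result is returned.
```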

From: Balaji Venkatesan [mailto:venkatesan.bal...@gmail.com]
Sent: Thursday, June 29, 2017 7:26 PM
To: user@cassandra.apache.org
Subject: Re: nodetool repair failure

It did not help much. Another error I saw when I repaired the keyspace was:

"Sync failed between /xx.xx.xx.93 and /xx.xx.xx.94" (this was run from the .91 node).



On Thu, Jun 29, 2017 at 4:44 PM, Akhil Mehra wrote:
Run the following query and see if it gives you more information:

select * from system_distributed.repair_history;

Also, is there any additional logging on the nodes where the error is coming 
from? It seems to be xx.xx.xx.94 for your last run.


On 30/06/2017, at 9:43 AM, Balaji Venkatesan wrote:

The verify and scrub went without any error on the keyspace. I ran the repair 
again with trace mode and still hit the same issue:


[2017-06-29 21:37:45,578] Parsing UPDATE 
system_distributed.parent_repair_history SET finished_at = toTimestamp(now()), 
successful_ranges = {'} WHERE parent_id=f1f10af0-5d12-11e7-8df9-59d19ef3dd23
[2017-06-29 21:37:45,580] Preparing statement
[2017-06-29 21:37:45,580] Determining replicas for mutation
[2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.95
[2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.94
[2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.93
[2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from /xx.xx.xx.93
[2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from /xx.xx.xx.94
[2017-06-29 21:37:45,581] Processing response from /xx.xx.xx.93
[2017-06-29 21:37:45,581] /xx.xx.xx.94: MUTATION message received from 
/xx.xx.xx.91
[2017-06-29 21:37:45,582] Processing response from /xx.xx.xx.94
[2017-06-29 21:37:45,582] /xx.xx.xx.93: MUTATION message received from 
/xx.xx.xx.91
[2017-06-29 21:37:45,582] /xx.xx.xx.95: MUTATION message received from 
/xx.xx.xx.91
[2017-06-29 21:37:45,582] /xx.xx.xx.94: Appending to commitlog
[2017-06-29 21:37:45,582] /xx.xx.xx.94: Adding to parent_repair_history memtable
[2017-06-29 21:37:45,582] Some repair failed
[2017-06-29 21:37:45,582] Repair command #3 finished in 1 minute 44 seconds
error: Repair job has failed with the error message: [2017-06-29 21:37:45,582] 
Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: 
[2017-06-29 21:37:45,582] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at 
org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)



On Thu, Jun 29, 2017 at 1:36 PM, Subroto Barua wrote:
Balaji,

Are you repairing a specific keyspace/table? If the failure is tied to a table, 
try the 'verify' and 'scrub' options on .91 and see if you get any errors.




On Thursday, June 29, 2017, 12:12:14 PM PDT, Balaji Venkatesan wrote:


Thanks. I tried with the trace option and there is not much info. Here are the 
few log lines just before it failed.


[2017-06-29 19:01:54,969] /xx.xx.xx.93: Sending REPAIR_MESSAGE message to 
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] 

Cassandra 2.1.13: Using JOIN_RING=False

2017-05-09 Thread Anubhav Kale
Hello,

With some inspiration from a Cassandra Summit talk from last year, we are 
trying to set up a cluster with coordinator-only nodes. We set join_ring=false 
in env.sh and disabled auth in the YAML, and the nodes start just fine. 
However, we're running into a few problems:

1] The nodes marked with join_ring=false continue to store data. Why ?
2] We tried the Python driver's whitelist policy, but we notice messages like 
the one below, so we are not able to send queries to all the nodes marked as 
coordinators. We also changed the Scala driver to support whitelisting, but see 
the same thing. What are we missing ?
3] Is there any way to concretely tell that only the coordinator nodes are 
getting requests from clients ? We don't have OpsCenter.

Thanks !

2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: [control connection] 
Removing host not found in peers metadata: 
2017-05-09 20:45:25,060 [INFO] cassandra.cluster: Cassandra host 10.80.10.128 
removed
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: Removing host 10.80.10.128
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: [control connection] 
Removing host not found in peers metadata: 
2017-05-09 20:45:25,060 [INFO] cassandra.cluster: Cassandra host 10.80.10.127 
removed
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: Removing host 10.80.10.127
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: [control connection] 
Removing host not found in peers metadata: 
2017-05-09 20:45:25,060 [INFO] cassandra.cluster: Cassandra host 10.80.10.129 
removed
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: Removing host 10.80.10.129
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: [control connection] 
Finished fetching ring info
2017-05-09 20:45:25,060 [DEBUG] cassandra.cluster: [control connection] 
Rebuilding token map due to topology changes
2017-05-09 20:45:25,081 [DEBUG] cassandra.metadata: user functions table not 
found
2017-05-09 20:45:25,081 [DEBUG] cassandra.metadata: user aggregates table not 
found
2017-05-09 20:45:25,098 [DEBUG] cassandra.cluster: Control connection created
2017-05-09 20:45:25,099 [DEBUG] cassandra.pool: Initializing connection for 
host 10.80.10.125
2017-05-09 20:45:25,099 [DEBUG] cassandra.pool: Initializing connection for 
host 10.80.10.126



RE: RemoveNode CPU Spike Question

2017-01-10 Thread Anubhav Kale
Well, looking through the logs I confirmed that my understanding below is 
correct, but it would be good to hear from the experts for sure.

From: Anubhav Kale [mailto:anubhav.k...@microsoft.com]
Sent: Tuesday, January 10, 2017 9:58 AM
To: user@cassandra.apache.org
Cc: Sean Usher <seus...@exchange.microsoft.com>
Subject: RemoveNode CPU Spike Question

Hello,

Recently, I started noticing an interesting pattern. When I execute 
“removenode”, a subset of the nodes that now own the tokens see a CPU spike and 
disk activity, and sometimes the SSTable count on those nodes shoots up.

After looking through the code, it appears to me that the function below forces 
data to be streamed from some of the new nodes to the node from which 
“removenode” was kicked off. Is my understanding correct ?

https://github.com/apache/cassandra/blob/d384e781d6f7c028dbe88cfe9dd3e966e72cd046/src/java/org/apache/cassandra/service/StorageService.java#L2548

Our nodes don’t run very hot, but it appears this streaming causes them to have 
issues. Have other people seen this ?

Thanks !


RemoveNode CPU Spike Question

2017-01-10 Thread Anubhav Kale
Hello,

Recently, I started noticing an interesting pattern. When I execute 
"removenode", a subset of the nodes that now own the tokens see a CPU spike and 
disk activity, and sometimes the SSTable count on those nodes shoots up.

After looking through the code, it appears to me that the function below forces 
data to be streamed from some of the new nodes to the node from which 
"removenode" was kicked off. Is my understanding correct ?

https://github.com/apache/cassandra/blob/d384e781d6f7c028dbe88cfe9dd3e966e72cd046/src/java/org/apache/cassandra/service/StorageService.java#L2548

Our nodes don't run very hot, but it appears this streaming causes them to have 
issues. Have other people seen this ?

Thanks !


RE: High CPU on nodes

2016-12-21 Thread Anubhav Kale
Comments inline below.

From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
Sent: Saturday, December 17, 2016 5:18 AM
To: user@cassandra.apache.org
Subject: Re: High CPU on nodes

Hi,

What does 'nodetool netstats' look like on those nodes?

It's not doing any streaming.

we have 30GB heap

How is the JVM / GC doing? Are you using G1GC or CMS? This setting would be bad 
for CMS.

G1. GC is doing fine. I don’t see any long pauses beyond 200 ms.

You can use this tool to understand where the CPU is being used: 
https://github.com/aragozin/jvm-tools/blob/master/sjk-core/COMMANDS.md#ttop-command

I hope that helps,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2016-12-17 0:10 GMT+01:00 Anubhav Kale <anubhav.k...@microsoft.com>:
Hello,

I am trying to fight a high CPU problem on some of our nodes. Thread dumps show 
that it’s not GC threads (we have 30GB heap), iostat %iowait confirms it’s not 
disk (ranges between 0.3 – 0.9%). One of the ways in which the problem 
manifests is that the nodes can’t compact SSTables and it happens randomly. We 
run Cassandra 2.1.13 on Azure Premium Storage (network attached SSDs).

One of the sample threads that was taking high CPU shows :

"pool-13-thread-1" #3352 prio=5 os_prio=0 tid=0x7f2275340bb0 nid=0x1b0b 
runnable [0x7f33ffaae000]
java.lang.Thread.State: RUNNABLE
at java.util.TimSort.gallopRight(TimSort.java:632)
at java.util.TimSort.mergeLo(TimSort.java:739)
at java.util.TimSort.mergeAt(TimSort.java:514)
at java.util.TimSort.mergeCollapse(TimSort.java:441)
at java.util.TimSort.sort(TimSort.java:245)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:175)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximityWithScore(DynamicEndpointSnitch.java:163)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximityWithBadness(DynamicEndpointSnitch.java:200)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximity(DynamicEndpointSnitch.java:152)
at 
org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1581)
at 
org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1739)

Looking at the code, I can’t figure out why things like this would require high 
CPU, and I don’t find any JIRAs related to this either. So, what can I do next 
to troubleshoot this ?

Thanks !



High CPU on nodes

2016-12-16 Thread Anubhav Kale
Hello,

I am trying to fight a high CPU problem on some of our nodes. Thread dumps show 
that it's not GC threads (we have 30GB heap), iostat %iowait confirms it's not 
disk (ranges between 0.3 - 0.9%). One of the ways in which the problem 
manifests is that the nodes can't compact SSTables and it happens randomly. We 
run Cassandra 2.1.13 on Azure Premium Storage (network attached SSDs).

One of the sample threads that was taking high CPU shows :

"pool-13-thread-1" #3352 prio=5 
os_prio=0 tid=0x7f2275340bb0 nid=0x1b0b runnable [0x7f33ffaae000]
java.lang.Thread.State: RUNNABLE
at java.util.TimSort.gallopRight(TimSort.java:632)
at java.util.TimSort.mergeLo(TimSort.java:739)
at java.util.TimSort.mergeAt(TimSort.java:514)
at java.util.TimSort.mergeCollapse(TimSort.java:441)
at java.util.TimSort.sort(TimSort.java:245)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:175)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximityWithScore(DynamicEndpointSnitch.java:163)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximityWithBadness(DynamicEndpointSnitch.java:200)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximity(DynamicEndpointSnitch.java:152)
at 
org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1581)
at 
org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1739)

Looking at the code, I can't figure out why things like this would require high 
CPU, and I don't find any JIRAs related to this either. So, what can I do next 
to troubleshoot this ?

Thanks !


RE: Question on Read Repair

2016-11-03 Thread Anubhav Kale
Does it work the same way for writes as well ? If “nodetool status” shows that 
a node is DN, would writes fail right away assuming enough nodes are down to 
fail QUORUM ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Tuesday, October 11, 2016 1:13 PM
To: user@cassandra.apache.org
Subject: Re: Question on Read Repair

Yes:

https://github.com/apache/cassandra/blob/81f6c784ce967fadb6ed7f58de1328e713eaf53c/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L286



From: Anubhav Kale <anubhav.k...@microsoft.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, October 11, 2016 at 11:45 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: Question on Read Repair

Thank you.

Interesting detail. Does it work the same way for other consistency levels as 
well ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Tuesday, October 11, 2016 10:29 AM
To: user@cassandra.apache.org
Subject: Re: Question on Read Repair

If the failure detector knows that the node is down, it won’t attempt a read 
because the consistency level can’t be satisfied, so none of the other replicas 
will be repaired.


From: Anubhav Kale <anubhav.k...@microsoft.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, October 11, 2016 at 10:24 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Question on Read Repair

Hello,

This is more of a theory / concept question. I set CL=ALL and do a read. Say 
one replica was down, will the rest of the replicas get repaired as part of 
this ? (I am hoping the answer is yes).

Thanks !

CONFIDENTIALITY NOTE: This e-mail and any attachments are confidential and may 
be legally privileged. If you are not the intended recipient, do not disclose, 
copy, distribute, or use this email or any attachments. If you have received 
this in error please let the sender know and then delete the email and all 
attachments.



RE: Backup restore with a different name

2016-11-02 Thread Anubhav Kale
You would have to build some logic on top of what’s natively supported.

Here is an option: 
https://github.com/anubhavkale/CassandraTools/tree/master/BackupRestore
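If you only need the replica for inspection, here is a minimal sketch of the usual snapshot-restore approach (paths and names below are hypothetical placeholders):

```shell
# 1. Create the replica table with the same schema as "users", e.g. via
#    cqlsh: CREATE TABLE my_ks.users_20161102 (...);   -- same columns

# 2. On each node, copy the snapshot's SSTables into the new table's data
#    directory (each table directory carries its own UUID suffix):
cp /var/lib/cassandra/data/my_ks/users-<uuid>/snapshots/<snapshot_name>/* \
   /var/lib/cassandra/data/my_ks/users_20161102-<uuid>/

# 3. Tell Cassandra to load the newly placed files:
nodetool refresh my_ks users_20161102
```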


From: Jens Rantil [mailto:jens.ran...@tink.se]
Sent: Wednesday, November 2, 2016 2:21 PM
To: Cassandra Group 
Subject: Backup restore with a different name

Hi,

Let's say I am periodically making snapshots of a table, say "users", for 
backup purposes. Let's say a developer makes a mistake and corrupts the table. 
Is there an easy way for me to restore a replica, say "users_20161102", of the 
original table for the developer to look at the old copy?

Cheers,
Jens

--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Anticompaction Question

2016-10-25 Thread Anubhav Kale
Hello,

If incremental repairs are enabled, there is logic in every compaction strategy 
to make sure not to mix repaired and unrepaired SS Tables.

Does this mean that if some SS Table files are repaired and some aren't, and 
incremental repairs don't work reliably, the unrepaired tables will never get 
compacted with repaired ones ?

Thanks !


New node overstreaming data ?

2016-10-13 Thread Anubhav Kale
Hello,

We run 2.1.13 and are seeing an odd issue. A node went down and stayed down for 
a while, so it went out of gossip. When we try to bootstrap it again (as a new 
node), it overstreams from other nodes; eventually the disk becomes full and the 
node crashes. This repeated 3 times.

Does anyone have any insights on what to try next (both in terms of root-causing 
and working around it) ? To work around it, we tried increasing the number of 
compactors and reducing stream throughput so that at least the number of 
incoming SSTables would be controlled.

This has happened to us few times in the past too, so I am wondering if this is 
a known problem (I couldn't find any JIRAs).

Thanks !


RE: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr

2016-10-12 Thread Anubhav Kale
Agree.

However, if we go from a world where repairs don’t run (or run so unreliably 
that C* can’t mark the SSTables as repaired anyway) to a world where repairs run 
more reliably (the Spark / Tickler approach), the impact on tombstone removal 
doesn’t become any worse (because SS Tables aren’t marked either way).

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Wednesday, October 12, 2016 9:25 AM
To: user@cassandra.apache.org
Subject: Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair 
or repair with -pr

Note that the tickle approach doesn’t mark sstables as repaired (it’s a simpler 
version of mutation based repair in a sense), so Cassandra has no way to prove 
that the data has been repaired.

With tickets like https://issues.apache.org/jira/browse/CASSANDRA-6434, this 
has implications on tombstone removal.


From: Anubhav Kale <anubhav.k...@microsoft.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, October 12, 2016 at 9:17 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: Repair in Multi Datacenter - Should you use -dc Datacenter repair 
or repair with -pr

The default repair process doesn’t usually work at scale, unfortunately.

Depending on your data size, you have the following options.


Netflix Tickler: https://github.com/ckalantzis/cassTickler (Read at CL.ALL via 
CQL continuously :: Python)

Spotify Reaper: https://github.com/spotify/cassandra-reaper (Subrange repair, 
provides a REST endpoint and calls APIs through JMX :: Java)

List subranges: https://github.com/pauloricardomg/cassandra-list-subranges 
(Tool to get subranges for a given node. :: Java)

Subrange Repair: https://github.com/BrianGallew/cassandra_range_repair 
(Tool to subrange repair :: Python)

Mutation Based Repair (Not ready yet): 
https://issues.apache.org/jira/browse/CASSANDRA-8911 (C* is thinking of doing 
this - hot off the press)

If you have Spark in your system, you could use it to do what the Netflix 
Tickler does. We’re experimenting with this, and it seems to be the best fit for 
our datasets over all the other options.

From: Leena Ghatpande [mailto:lghatpa...@hotmail.com]
Sent: Wednesday, October 12, 2016 7:16 AM
To: user@cassandra.apache.org
Subject: Repair in Multi Datacenter - Should you use -dc Datacenter repair or 
repair with -pr


Please advise. I cannot find any clear documentation on the best strategy 
for repairing nodes on a regular basis with multiple datacenters involved.



We are running Cassandra 3.7 in a multi-datacenter setup with 4 nodes in each 
data center. We are trying to run repairs every other night to keep the nodes in 
a good state. We currently run repair with the -pr option, but the repair 
process gets hung and does not complete gracefully. We don't see any errors in 
the logs either.



What is the best way to perform repairs on multiple data centers with large 
tables?

1. Can we run Datacenter repair using -dc option for 

RE: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr

2016-10-12 Thread Anubhav Kale
The default repair process doesn't usually work at scale, unfortunately.

Depending on your data size, you have the following options.


Netflix Tickler: https://github.com/ckalantzis/cassTickler (Read at CL.ALL via 
CQL continuously :: Python)

Spotify Reaper: https://github.com/spotify/cassandra-reaper (Subrange repair, 
provides a REST endpoint and calls APIs through JMX :: Java)

List subranges: https://github.com/pauloricardomg/cassandra-list-subranges 
(Tool to get subranges for a given node. :: Java)

Subrange Repair: 
https://github.com/BrianGallew/cassandra_range_repair
 (Tool to subrange repair :: Python)

Mutation Based Repair (Not ready yet): 
https://issues.apache.org/jira/browse/CASSANDRA-8911 (C* is thinking of doing 
this - hot off the press)

If you have Spark in your system, you could use it to do what the Netflix 
Tickler does. We're experimenting with this, and it seems to be the best fit for 
our datasets over all the other options.

From: Leena Ghatpande [mailto:lghatpa...@hotmail.com]
Sent: Wednesday, October 12, 2016 7:16 AM
To: user@cassandra.apache.org
Subject: Repair in Multi Datacenter - Should you use -dc Datacenter repair or 
repair with -pr


Please advise. I cannot find any clear documentation on the best strategy 
for repairing nodes on a regular basis with multiple datacenters involved.



We are running Cassandra 3.7 in a multi-datacenter setup with 4 nodes in each 
data center. We are trying to run repairs every other night to keep the nodes in 
a good state. We currently run repair with the -pr option, but the repair 
process gets hung and does not complete gracefully. We don't see any errors in 
the logs either.



What is the best way to perform repairs on multiple data centers with large 
tables?

1. Can we run a datacenter repair using the -dc option for each data center? Do 
we need to run repair on each node in that case, or will it repair all nodes 
within the datacenter?

2. Is running repair with -pr across all nodes required if we perform step 1 
every night?

3. Is cross-data-center repair required, and if so, what's the best option?



Thanks



Leena






VNode Streaming Math

2016-10-12 Thread Anubhav Kale
Hello,

Suppose I have a 100 node ring, with num_tokens=32 (thus, 32 VNodes per 
physical machine). Assume this cluster has just one keyspace having one table. 
There are 10 SS Tables on each node, and size on disk is 10GB on each node. For 
simplicity, assume each SSTable is 1GB.

Now, a node went down, and I need to rebuild it. Can you please explain to me 
the math around how many SS Table files (and size) each node would stream to 
this node ? How does that math change as #VNodes change ?

I am looking for rough calculations to understand this process better. I am 
guessing I might have missed some variables in here (amount of data per token 
range ?), so please let me know that too !

Thanks much !
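As a starting point, here is a rough back-of-the-envelope model (my own sketch, not authoritative): the replacement node must receive its full share of data, roughly 10GB here, and with vnodes each small range can be sourced from any of its RF replicas, so the streams spread across roughly min(num_tokens x RF, N-1) peers. RF=3 is an assumed value, since the keyspace's replication factor isn't given.

```python
# Back-of-envelope estimate of rebuild streaming with vnodes.
# Assumptions (illustrative): uniform data distribution, every node
# holds data_per_node_gb, and RF=3 (not stated in the original post).

def rebuild_streaming_estimate(nodes, num_tokens, rf, data_per_node_gb):
    # The replacement node must receive its full share of data,
    # regardless of how many vnodes it has.
    total_streamed_gb = data_per_node_gb
    # Each vnode range has rf candidate source replicas; with many small
    # ranges the sources spread over roughly this many distinct peers.
    source_peers = min(num_tokens * rf, nodes - 1)
    per_peer_gb = total_streamed_gb / source_peers
    return total_streamed_gb, source_peers, per_peer_gb

total, peers, per_peer = rebuild_streaming_estimate(
    nodes=100, num_tokens=32, rf=3, data_per_node_gb=10.0)
print(total, peers, round(per_peer, 3))  # ~10 GB total, ~0.1 GB from each of ~96 peers
```

Under this model, raising num_tokens spreads the same 10GB over more source peers (smaller streams from more nodes), while fewer vnodes concentrate the streaming on fewer peers, each sending more.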


RE: Question on Read Repair

2016-10-11 Thread Anubhav Kale
Thank you.

Interesting detail. Does it work the same way for other consistency levels as 
well ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Tuesday, October 11, 2016 10:29 AM
To: user@cassandra.apache.org
Subject: Re: Question on Read Repair

If the failuredetector knows that the node is down, it won’t attempt a read, 
because the consistency level can’t be satisfied – none of the other replicas 
will be repaired.
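The behaviour described above can be illustrated with a toy model. This is a heavily simplified stand-in, not Cassandra's actual read path; replica state, timestamps, and the repair write-back are all reduced to dictionary operations:

```python
# Toy model of a CL=ALL read with blocking read repair, illustrating the
# point above: if the failure detector knows a replica is down, the read
# fails fast and nothing gets repaired.

class Unavailable(Exception):
    pass

def read_cl_all(replicas):
    """replicas: dict name -> (value, timestamp), or None if the node is down."""
    if any(state is None for state in replicas.values()):
        # A replica is known down: fail before reading, so no read
        # repair happens at all.
        raise Unavailable("CL=ALL cannot be satisfied")
    # All replicas answered: reconcile on timestamp (last write wins) ...
    winner = max(replicas.values(), key=lambda vt: vt[1])
    # ... and write the winning value back to any stale replica.
    for name, state in replicas.items():
        if state != winner:
            replicas[name] = winner
    return winner[0]

live = {"r1": ("new", 2), "r2": ("old", 1), "r3": ("new", 2)}
print(read_cl_all(live), live["r2"])  # stale r2 gets repaired

down = {"r1": ("new", 2), "r2": None, "r3": ("new", 2)}
try:
    read_cl_all(down)
except Unavailable as exc:
    print("read failed:", exc)  # nothing is repaired
```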


From: Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Tuesday, October 11, 2016 at 10:24 AM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Question on Read Repair

Hello,

This is more of a theory / concept question. I set CL=ALL and do a read. Say 
one replica was down, will the rest of the replicas get repaired as part of 
this ? (I am hoping the answer is yes).

Thanks !

CONFIDENTIALITY NOTE: This e-mail and any attachments are confidential and may 
be legally privileged. If you are not the intended recipient, do not disclose, 
copy, distribute, or use this email or any attachments. If you have received 
this in error please let the sender know and then delete the email and all 
attachments.


Question on Read Repair

2016-10-11 Thread Anubhav Kale
Hello,

This is more of a theory / concept question. I set CL=ALL and do a read. Say 
one replica was down, will the rest of the replicas get repaired as part of 
this ? (I am hoping the answer is yes).

Thanks !


RE: Nodetool rebuild question

2016-10-06 Thread Anubhav Kale
Sure.

When a read repair happens, does it go via the memtable -> SS Table route OR 
does the source node send SS Table tmp files directly to inconsistent replica ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Wednesday, October 5, 2016 2:20 PM
To: user@cassandra.apache.org
Subject: Re: Nodetool rebuild question

If you set RF to 0, you can ignore my second sentence/paragraph. The third 
still applies.


From: Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Wednesday, October 5, 2016 at 1:56 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: RE: Nodetool rebuild question

Thanks.

We always set RF to 0 and then “removenode” all nodes in the DC that we want to 
decom. So, I highly doubt that is the problem. Plus, #SSTables on a given node 
on average is ~2000 (we have 140 nodes in one ring and two rings overall).

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Wednesday, October 5, 2016 1:44 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Nodetool rebuild question

Both of your statements are true.

During your decom, you likely streamed LOTs of sstables to the remaining nodes 
(especially true if you didn’t drop the replication factor to 0 for the DC you 
decommissioned). Since those tens of thousands of sstables take a while to 
compact, if you then rebuild (or bootstrap) before compaction is done, you’ll 
get a LOT of extra sstables.

This is one of the reasons that people with large clusters don’t use vnodes – 
if you needed to bootstrap ~100 more nodes into a cluster, you’d have to wait 
potentially a day or more per node to compact away the leftovers before 
bootstrapping the next, which is prohibitive at scale.


-  Jeff

From: Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Wednesday, October 5, 2016 at 1:34 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Nodetool rebuild question

Hello,

As part of rebuild, I noticed that the destination node gets -tmp- files from 
other nodes. Are following statements correct ?


1.   The files are written to disk without going through memtables.

2.   Regular compactors eventually compact them to bring down # SSTables to 
a reasonable number.

We have noticed that the destination node has created > 40K *Data* files in 
first hour of streaming itself. We have not seen such pattern before, so trying 
to understand what could have changed. (We do use Vnodes and We haven’t 
increased # nodes recently, but have decomm-ed a DC).

Thanks much !



RE: Nodetool rebuild question

2016-10-05 Thread Anubhav Kale
Thanks.

We always set RF to 0 and then “removenode” all nodes in the DC that we want to 
decom. So, I highly doubt that is the problem. Plus, #SSTables on a given node 
on average is ~2000 (we have 140 nodes in one ring and two rings overall).

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Wednesday, October 5, 2016 1:44 PM
To: user@cassandra.apache.org
Subject: Re: Nodetool rebuild question

Both of your statements are true.

During your decom, you likely streamed LOTs of sstables to the remaining nodes 
(especially true if you didn’t drop the replication factor to 0 for the DC you 
decommissioned). Since those tens of thousands of sstables take a while to 
compact, if you then rebuild (or bootstrap) before compaction is done, you’ll 
get a LOT of extra sstables.

This is one of the reasons that people with large clusters don’t use vnodes – 
if you needed to bootstrap ~100 more nodes into a cluster, you’d have to wait 
potentially a day or more per node to compact away the leftovers before 
bootstrapping the next, which is prohibitive at scale.


-  Jeff

From: Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Wednesday, October 5, 2016 at 1:34 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Nodetool rebuild question

Hello,

As part of rebuild, I noticed that the destination node gets -tmp- files from 
other nodes. Are following statements correct ?


1.   The files are written to disk without going through memtables.

2.   Regular compactors eventually compact them to bring down # SSTables to 
a reasonable number.

We have noticed that the destination node has created > 40K *Data* files in 
first hour of streaming itself. We have not seen such pattern before, so trying 
to understand what could have changed. (We do use Vnodes and We haven’t 
increased # nodes recently, but have decomm-ed a DC).

Thanks much !



Nodetool rebuild question

2016-10-05 Thread Anubhav Kale
Hello,

As part of rebuild, I noticed that the destination node gets -tmp- files from 
other nodes. Are following statements correct ?


1.   The files are written to disk without going through memtables.

2.   Regular compactors eventually compact them to bring down # SSTables to 
a reasonable number.

We have noticed that the destination node has created > 40K *Data* files in 
first hour of streaming itself. We have not seen such pattern before, so trying 
to understand what could have changed. (We do use Vnodes and We haven't 
increased # nodes recently, but have decomm-ed a DC).

Thanks much !


RE: Repairs at scale in Cassandra 2.1.13

2016-09-29 Thread Anubhav Kale
http://www.thelastpickle.com

2016-09-26 23:51 GMT+02:00 Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>>:
Hello,

We run Cassandra 2.1.13 (don’t have plans to upgrade yet). What is the best way 
to run repairs at scale (400 nodes, each holding ~600GB) that actually works ?

I’m considering doing subrange repairs 
(https://github.com/BrianGallew/cassandra_range_repair/blob/master/src/range_repair.py)
 as I’ve heard from folks that incremental repairs simply don’t work even in 
3.x (Yeah, that’s a strong statement but I heard that from multiple folks at 
the Summit).

Any guidance would be greatly appreciated !

Thanks,
Anubhav




Repairs at scale in Cassandra 2.1.13

2016-09-26 Thread Anubhav Kale
Hello,

We run Cassandra 2.1.13 (don't have plans to upgrade yet). What is the best way 
to run repairs at scale (400 nodes, each holding ~600GB) that actually works ?

I'm considering doing subrange repairs 
(https://github.com/BrianGallew/cassandra_range_repair/blob/master/src/range_repair.py)
 as I've heard from folks that incremental repairs simply don't work even in 
3.x (Yeah, that's a strong statement but I heard that from multiple folks at 
the Summit).

Any guidance would be greatly appreciated !

Thanks,
Anubhav
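Subrange repair boils down to splitting the token space into many small (start, end] slices and repairing them one at a time, which is roughly what range_repair.py automates. A minimal sketch of just the splitting math for the Murmur3 partitioner follows; the nodetool invocation in the comment is illustrative, and the real script also handles per-node range discovery and retries:

```python
# Minimal sketch of subrange splitting for the Murmur3 partitioner
# (token space -2**63 .. 2**63 - 1). Only the splitting math is shown.

def split_range(start, end, steps):
    """Split the token range (start, end] into `steps` contiguous subranges."""
    width = (end - start) // steps
    bounds = [start + i * width for i in range(steps)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

subranges = split_range(-2**63, 2**63 - 1, steps=4)
for st, et in subranges:
    # Each pair would be handed to: nodetool repair -st <st> -et <et>
    print(st, et)
```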


RE: Token Ring Question

2016-06-24 Thread Anubhav Kale
So, can someone educate me on how token aware policies in drivers really work ? 
It appears that it’s quite possible that the data may live on nodes that don’t 
own the tokens for it. By “own” I mean the ownership as defined in system.local 
/ peers and is fed back to drivers.

If this statement is correct, then in my view, unless the drivers somehow 
execute the *Topology.GetReplicas logic from Cassandra core (something that 
isn’t available to them), they will never be able to tell the correct node 
that holds data for a given token.

Is my understanding wrong ?

From: Anubhav Kale [mailto:anubhav.k...@microsoft.com]
Sent: Friday, June 3, 2016 3:17 PM
To: user@cassandra.apache.org
Subject: RE: Token Ring Question

Thank you, I was just curious about how this works.

From: Tyler Hobbs [mailto:ty...@datastax.com]
Sent: Friday, June 3, 2016 3:02 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Token Ring Question

There really is only one token ring, but conceptually it's easiest to think of 
it like multiple rings, as OpsCenter shows it.  The only difference is that 
every token has to be unique across the whole cluster.
Now, if the token for a particular write falls in the “primary range” of a node 
living in DC2, does the code check for such conditions and instead put it on 
some node in DC1 ?

Yes.  It will continue searching around the token ring until it hits a token 
that belongs to a node in the correct datacenter.
What is the true meaning of “primary” token range in such scenarios ?

There's not really any such thing as a "primary token range", it's just a 
convenient idea for some tools.  In reality, it's just the replica that owns 
the first (clockwise) token.  I'm not sure what you're really asking, though -- 
what are you concerned about?
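The clockwise search described above can be sketched as follows. This is a deliberately simplified model (one shared ring, one replica per requested DC, no rack awareness); it is not NetworkTopologyStrategy itself:

```python
from bisect import bisect_right

# Simplified clockwise replica selection on a single shared token ring:
# walk from the token's position until a node in the target DC is found.

ring = sorted([
    (0,  "node-a", "DC1"),
    (25, "node-c", "DC2"),
    (50, "node-b", "DC1"),
    (75, "node-d", "DC2"),
])
tokens = [t for t, _, _ in ring]

def replica_for(token, dc):
    start = bisect_right(tokens, token) % len(ring)
    for i in range(len(ring)):
        _, node, node_dc = ring[(start + i) % len(ring)]
        if node_dc == dc:       # skip nodes that live in other DCs
            return node
    raise ValueError("no node in " + dc)

# Token 30 falls in node-b's "primary range", but if the keyspace only
# replicates to DC2 the write lands on node-d instead:
print(replica_for(30, "DC1"))  # node-b
print(replica_for(30, "DC2"))  # node-d
```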


On Wed, Jun 1, 2016 at 2:40 PM, Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>> wrote:
Hello,

I recently learnt that regardless of number of Data Centers, there is really 
only one token ring across all nodes. (I was under the impression that there is 
one per DC like how Datastax Ops Center would show it).

Suppose we have 4 v-nodes, and 2 DCs (2 nodes in each DC) and a key space is 
set to replicate in only one DC – say DC1.

Now, if the token for a particular write falls in the “primary range” of a node 
living in DC2, does the code check for such conditions and instead put it on 
some node in DC1 ? What is the true meaning of “primary” token range in such 
scenarios ?

Is this how things works roughly speaking or am I missing something ?

Thanks !



--
Tyler Hobbs
DataStax<http://datastax.com/>


RE: StreamCoordinator.ConnectionsPerHost set to 1

2016-06-17 Thread Anubhav Kale
Thanks Paulo. I made some changes along those lines, and seeing good 
improvement. I will discuss further (with a possible patch) on 
https://issues.apache.org/jira/browse/CASSANDRA-4663 (this is for bootstrap, so 
maybe we can repurpose it for rebuilds or create a separate one).

From: Paulo Motta [mailto:pauloricard...@gmail.com]
Sent: Thursday, June 16, 2016 3:06 PM
To: user@cassandra.apache.org
Subject: Re: StreamCoordinator.ConnectionsPerHost set to 1

Increasing the number of threads alone won't help, because you need to add 
connectionsPerHost-awareness to StreamPlan.requestRanges (otherwise only a 
single connection per host is created) similar to what was done to 
StreamPlan.transferFiles by CASSANDRA-3668, but maybe bit trickier. There's an 
open ticket to support that on CASSANDRA-4663
There's also another discussion on improving rebuild parallelism on 
CASSANDRA-12015.

2016-06-16 14:43 GMT-03:00 Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>>:
Hello,

I noticed that StreamCoordinator.ConnectionsPerHost is always set to 1 
(Cassandra 2.1.13). If I am reading the code correctly, this means there will 
always be just one socket (well, 2 technically for each direction) between 
nodes when rebuilding thus the data will always be serialized.

Have folks experimented with increasing this ? It appears that some parallelism 
here might help rebuilds in a significant way assuming we aren’t hitting 
bandwidth caps (it’s a pain for us at the moment to rebuild nodes holding 
500GB).

I’m going to try to patch our cluster with a change to test this out, but 
wanted to hear from experts as well.

Thanks !
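A toy model shows why extra connections per host could help: if each connection is throughput-limited, total transfer time is governed by the busiest connection. All numbers here (file sizes, the 50 MB/s per-connection cap) are invented for illustration, and the round-robin assignment is just a stand-in for whatever file distribution a patched StreamPlan would use:

```python
# Toy model: streaming N sstables over k throughput-limited connections.
# Transfer time is roughly the largest per-connection share of bytes.

PER_CONN_MBPS = 50                              # assumed per-connection cap
sstable_sizes_mb = [400, 300, 300, 200, 200, 100]

def transfer_time(files_mb, connections):
    # Round-robin largest-first across connections (illustrative policy).
    shares = [0] * connections
    for i, size in enumerate(sorted(files_mb, reverse=True)):
        shares[i % connections] += size
    return max(shares) / PER_CONN_MBPS          # seconds; slowest wins

for n in (1, 2, 4):
    print(f"{n} connection(s): ~{transfer_time(sstable_sizes_mb, n):.0f}s")
```

With one connection everything is serialized (~30s in this toy example); with four, the wall-clock time drops to the largest single share, which is the kind of win being discussed here, assuming the network isn't already saturated.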



StreamCoordinator.ConnectionsPerHost set to 1

2016-06-16 Thread Anubhav Kale
Hello,

I noticed that StreamCoordinator.ConnectionsPerHost is always set to 1 
(Cassandra 2.1.13). If I am reading the code correctly, this means there will 
always be just one socket (well, 2 technically for each direction) between 
nodes when rebuilding thus the data will always be serialized.

Have folks experimented with increasing this ? It appears that some parallelism 
here might help rebuilds in a significant way assuming we aren't hitting 
bandwidth caps (it's a pain for us at the moment to rebuild nodes holding 
500GB).

I'm going to try to patch our cluster with a change to test this out, but 
wanted to hear from experts as well.

Thanks !


RE: Token Ring Question

2016-06-03 Thread Anubhav Kale
Thank you, I was just curious about how this works.

From: Tyler Hobbs [mailto:ty...@datastax.com]
Sent: Friday, June 3, 2016 3:02 PM
To: user@cassandra.apache.org
Subject: Re: Token Ring Question

There really is only one token ring, but conceptually it's easiest to think of 
it like multiple rings, as OpsCenter shows it.  The only difference is that 
every token has to be unique across the whole cluster.
Now, if the token for a particular write falls in the “primary range” of a node 
living in DC2, does the code check for such conditions and instead put it on 
some node in DC1 ?

Yes.  It will continue searching around the token ring until it hits a token 
that belongs to a node in the correct datacenter.
What is the true meaning of “primary” token range in such scenarios ?

There's not really any such thing as a "primary token range", it's just a 
convenient idea for some tools.  In reality, it's just the replica that owns 
the first (clockwise) token.  I'm not sure what you're really asking, though -- 
what are you concerned about?


On Wed, Jun 1, 2016 at 2:40 PM, Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>> wrote:
Hello,

I recently learnt that regardless of number of Data Centers, there is really 
only one token ring across all nodes. (I was under the impression that there is 
one per DC like how Datastax Ops Center would show it).

Suppose we have 4 v-nodes, and 2 DCs (2 nodes in each DC) and a key space is 
set to replicate in only one DC – say DC1.

Now, if the token for a particular write falls in the “primary range” of a node 
living in DC2, does the code check for such conditions and instead put it on 
some node in DC1 ? What is the true meaning of “primary” token range in such 
scenarios ?

Is this how things works roughly speaking or am I missing something ?

Thanks !



--
Tyler Hobbs
DataStax<http://datastax.com/>


Token Ring Question

2016-06-01 Thread Anubhav Kale
Hello,

I recently learnt that regardless of number of Data Centers, there is really 
only one token ring across all nodes. (I was under the impression that there is 
one per DC like how Datastax Ops Center would show it).

Suppose we have 4 v-nodes, and 2 DCs (2 nodes in each DC) and a key space is 
set to replicate in only one DC - say DC1.

Now, if the token for a particular write falls in the "primary range" of a node 
living in DC2, does the code check for such conditions and instead put it on 
some node in DC1 ? What is the true meaning of "primary" token range in such 
scenarios ?

Is this how things works roughly speaking or am I missing something ?

Thanks !


RE: Removing a datacenter

2016-05-24 Thread Anubhav Kale
Sorry, I should have been more clear. What I meant was doing exactly what you 
wrote, but doing a “removenode” instead of “decommission” to make it even 
faster. Will that have any side-effect (I think it shouldn’t) ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Monday, May 23, 2016 4:43 PM
To: user@cassandra.apache.org
Subject: Re: Removing a datacenter

If you remove a node at a time, you’ll eventually end up with a single node in 
the DC you’re decommissioning which will own all of the data, and you’ll likely 
overwhelm that node.

It’s typically recommended that you ALTER the keyspace, remove the replication 
settings for that DC, and then you can decommission (and they won’t need to 
stream nearly as much, since they no longer own that data – decom will go much 
faster).



From: Anubhav Kale
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
Date: Monday, May 23, 2016 at 4:41 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
Subject: Removing a datacenter

Hello,

Suppose we have 2 DCs and we know that the data is correctly replicated in 
both. In such situation, is it safe to “remove” one of the DCs by simply doing 
a “nodetool remove node” followed by “nodetool removenode force” for each node 
in that DC (instead of doing a “nodetool decommission” and waiting for it to 
finish) ?

Can someone confirm this won’t have any odd side-effects ?

Thanks !


Removing a datacenter

2016-05-23 Thread Anubhav Kale
Hello,

Suppose we have 2 DCs and we know that the data is correctly replicated in 
both. In such situation, is it safe to "remove" one of the DCs by simply doing 
a "nodetool remove node" followed by "nodetool removenode force" for each node 
in that DC (instead of doing a "nodetool decommission" and waiting for it to 
finish) ?

Can someone confirm this won't have any odd side-effects ?

Thanks !


Applying TTL Change quickly

2016-05-17 Thread Anubhav Kale
Hello,

We use STCS and DTCS on our tables and recently made a TTL change (reduced from 
8 days to 2) on a table with large amounts of data. What is the best way to 
quickly purge old data ? I am playing with tombstone_compaction_interval at the 
moment, but would like some suggestions on what else can be done to reclaim the 
space as quick as possible.

Thanks !
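One thing worth keeping in mind when estimating how fast the space can come back: once a cell's TTL expires it becomes a tombstone, and compaction can only drop that tombstone after gc_grace_seconds has also elapsed (and only when the SSTable holding it actually gets compacted). A small timeline calculator, using the default 10-day gc_grace as an assumption:

```python
from datetime import datetime, timedelta

# Earliest point at which TTL'd data becomes purgeable: TTL expiry plus
# gc_grace_seconds. Actual reclamation still waits for a compaction that
# includes the tombstone's SSTable. Values here are illustrative.

def earliest_purge(write_time, ttl_days, gc_grace_seconds=864000):
    expires = write_time + timedelta(days=ttl_days)
    return expires + timedelta(seconds=gc_grace_seconds)

w = datetime(2016, 5, 1)
print(earliest_purge(w, ttl_days=2))   # 2-day TTL + default 10-day gc_grace
```

Lowering gc_grace_seconds on the table (carefully, since it also bounds repair frequency) shortens this window alongside tombstone_compaction_interval tuning.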


Nodetool repair question

2016-05-10 Thread Anubhav Kale
Hello,

Suppose I have 3 nodes, and stop Cassandra on one of them. Then I run a 
repair. Will repair move the token ranges from the down node to the other 
nodes ? In other words, in any situation, does a repair operation ever change 
token ownership ?

Thanks !


RE: SS Tables Files Streaming

2016-05-06 Thread Anubhav Kale
Does repair really send SS Table files as is ? Wouldn’t data for tokens be 
distributed across SS Tables ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Friday, May 6, 2016 2:12 PM
To: user@cassandra.apache.org
Subject: Re: SS Tables Files Streaming

Also probably sstableloader / bulk loading interface




(I don’t think any of these necessarily stream “as-is”, but that’s a different 
conversation I suspect)


From: Jonathan Haddad
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
Date: Friday, May 6, 2016 at 1:52 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
Subject: Re: SS Tables Files Streaming

Repairs, bootstamp, decommission.

On Fri, May 6, 2016 at 1:16 PM Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>> wrote:
Hello,

In what scenarios can SS Table files on disk from Node 1 go to Node 2 as is ?  
I’m aware this happens in nodetool rebuild and I am assuming this does not 
happen in repairs. Can someone confirm ?

The reason I ask is I am working on a solution for backup / restore and I need 
to be sure if I boot a node, start copying over backed up files then those 
files won’t get overwritten by something coming from other nodes.

Thanks !


SS Tables Files Streaming

2016-05-06 Thread Anubhav Kale
Hello,

In what scenarios can SS Table files on disk from Node 1 go to Node 2 as is ?  
I'm aware this happens in nodetool rebuild and I am assuming this does not 
happen in repairs. Can someone confirm ?

The reason I ask is I am working on a solution for backup / restore and I need 
to be sure if I boot a node, start copying over backed up files then those 
files won't get overwritten by something coming from other nodes.

Thanks !


SS Table File Names not containing GUIDs

2016-05-02 Thread Anubhav Kale
Hello,

I am wondering if there is any reason as to why the SS Table format doesn't 
have a GUID. As far as I can tell, the incrementing number isn't really used 
for any special purpose in code, and having a unique name for the file seems to 
be a better thing, in general.

Specifically, this causes some inconvenience when restoring snapshots. Ideally, 
I would like to restore just the system* keyspaces and boot the node. Then, 
once the node is taking live traffic copy the SS Tables over and do a DSE 
restart at the end to load old data.

The problem is it is possible to overwrite new data with old files if the file 
names match. I can't change the file names of snapshot-ed file to a huge 
number, because as soon as that file is copied over, C* will use that number in 
its get-next-number-gen logic potentially causing the same problem for the next 
snapshot-ed file.

How do people usually tackle this ? Is there some easy solution that I am not 
seeing ?

Thanks !


RE: Problem Replacing a Dead Node

2016-04-21 Thread Anubhav Kale
Reusing the bootstrapping node could have caused this, but hard to tell. Since 
you have only 7 nodes, have you tried doing a few rolling restarts of all nodes 
to let gossip settle ? Also, the node is pingable from other nodes even though 
it says Unreachable below. Correct ?

Based on nodetool status, it appears the node has streamed all the data it 
needs, but it doesn’t think it has joined the ring yet. Does cqlsh work on that 
node ?

From: Mir Tanvir Hossain [mailto:mir.tanvir.hoss...@gmail.com]
Sent: Thursday, April 21, 2016 11:51 AM
To: user@cassandra.apache.org
Subject: Re: Problem Replacing a Dead Node

Here is a bit more detail of the whole situation. I am hoping someone can help 
me out here.

We have a seven-node cluster. One of the nodes started to have issues but it was 
running. We decided to add a new node, and remove the problematic node after 
the new node joins. However, the new node did not join the cluster even after 
three days. Hence, we decided to go with the replacement option. We shutdown 
the problematic node. After that, we stopped cassandra on the bootstraping 
node, deleted all the data, and restarted that node as the replacement node for 
the problematic node.

Since, we reused the bootstrapping node as the replacement node, I am wondering 
whether that is causing any issue. Any insights are appreciated.

This is the output of nodetool describecluster from the replacement node, and 
two other nodes.

mhossain@cassandra-24:~$ nodetool describecluster
Cluster Information:
Name: App
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
80649e67-8ed9-38a4-8afa-560be7c694f4: [10.0.7.80, 
10.0.7.4, 10.0.7.190, 10.0.7.100, 10.0.7.195, 10.0.7.160, 10.0.7.176]


mhossain@cassandra-13:~$ nodetool describecluster
Cluster Information:
Name: App
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
80649e67-8ed9-38a4-8afa-560be7c694f4: [10.0.7.80, 
10.0.7.190, 10.0.7.100, 10.0.7.195, 10.0.7.160, 10.0.7.176]

UNREACHABLE: [10.0.7.91, 10.0.7.4]


mhossain@cassandra-09:~$ nodetool describecluster
Cluster Information:
Name: App
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
80649e67-8ed9-38a4-8afa-560be7c694f4: [10.0.7.80, 
10.0.7.190, 10.0.7.100, 10.0.7.195, 10.0.7.160, 10.0.7.176]

UNREACHABLE: [10.0.7.91, 10.0.7.4]


cassandra-24 (10.0.7.4) is the replacement node. 10.0.7.91 is the ip address of 
the dead node.

-Mir

On Thu, Apr 21, 2016 at 10:02 AM, Mir Tanvir Hossain 
> wrote:
Hi, I am trying to replace a dead node by following 
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html.
 It's been 3 full days since the replacement node started, and the node is 
still not showing up as part of the cluster on OpsCenter. I was wondering 
whether the delay is due to the fact that I have a test keyspace with 
replication factor of one? If I delete that keyspace, would the new node 
successfully replace the dead node? Any general insight will be hugely 
appreciated.

Thanks,
Mir





RE: Problem Replacing a Dead Node

2016-04-21 Thread Anubhav Kale
Is the datastax-agent running fine on the node ? What does nodetool status and 
system.log show ?

From: Mir Tanvir Hossain [mailto:mir.tanvir.hoss...@gmail.com]
Sent: Thursday, April 21, 2016 10:02 AM
To: user@cassandra.apache.org
Subject: Problem Replacing a Dead Node

Hi, I am trying to replace a dead node by following 
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html.
 It's been 3 full days since the replacement node started, and the node is 
still not showing up as part of the cluster on OpsCenter. I was wondering 
whether the delay is due to the fact that I have a test keyspace with 
replication factor of one? If I delete that keyspace, would the new node 
successfully replace the dead node? Any general insight will be hugely 
appreciated.

Thanks,
Mir




Changing Racks of Nodes

2016-04-20 Thread Anubhav Kale
Hello,

If a running node moves around and changes its rack in the process, when it's 
back in the cluster (through the ignore-rack property), is it a correct 
statement that queries will not see some data residing on this node until a 
repair is run ?

Or, is it more like the node may get requests for the data it does not own 
(meaning data will never "disappear") ?

I'd appreciate some details on this topic from experts !

Thanks !


RE: Nodetool rebuild and bootstrap

2016-04-14 Thread Anubhav Kale
I confirmed that rebuild doesn’t resume at all. I couldn’t find a JIRA on this. 
Should I open one or can someone explain if there is a design rationale ?

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Thursday, April 14, 2016 4:01 PM
To: user@cassandra.apache.org
Subject: Re: Nodetool rebuild and bootstrap

https://issues.apache.org/jira/browse/CASSANDRA-8838

Bootstrap only resumes on 2.2.0 and newer. I’m unsure of rebuild, but I suspect 
it does not resume at all.


From: Anubhav Kale
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
Date: Thursday, April 14, 2016 at 3:07 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
Subject: Nodetool rebuild and bootstrap

Hello,

Is it a correct statement that both rebuild and bootstrap resume streaming from 
where they were left off (meaning they don’t stream the entire data again) in 
case of node restarting during rebuild / bootstrap process ?

Thanks !


RE: Leak Detected while bootstrap

2016-04-13 Thread Anubhav Kale
Thanks, Updated with logs.

From: Tyler Hobbs [mailto:ty...@datastax.com]
Sent: Wednesday, April 13, 2016 3:36 PM
To: user@cassandra.apache.org
Subject: Re: Leak Detected while bootstrap

This looks like it might be 
https://issues.apache.org/jira/browse/CASSANDRA-11374.  Can you comment on that 
ticket and share your logs leading up to the error?

On Wed, Apr 13, 2016 at 3:37 PM, Anubhav Kale 
<anubhav.k...@microsoft.com<mailto:anubhav.k...@microsoft.com>> wrote:
Hello,

Since we upgraded to Cassandra 2.1.12, we are noticing that below happens when 
we are trying to bootstrap nodes, and the process just gets stuck. Restarting 
the process / VM does not help. Our nodes are around ~300 GB and run on local 
SSDs and we haven’t seen this problem on older versions (specifically 2.1.9).

Is this a known issue / any workarounds ?

ERROR [Reference-Reaper:1] 2016-04-13 20:33:53,394  Ref.java:179 - LEAK 
DETECTED: a reference 
(org.apache.cassandra.utils.concurrent.Ref$State@15e611a3<mailto:org.apache.cassandra.utils.concurrent.Ref$State@15e611a3>)
 to class 
org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@203187780:[[OffHeapBitSet<mailto:org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@203187780:[[OffHeapBitSet>]]
 was not released before the reference was garbage collected

Thanks !



--
Tyler Hobbs
DataStax<http://datastax.com/>


Leak Detected while bootstrap

2016-04-13 Thread Anubhav Kale
Hello,

Since we upgraded to Cassandra 2.1.12, we are noticing that below happens when 
we are trying to bootstrap nodes, and the process just gets stuck. Restarting 
the process / VM does not help. Our nodes are around ~300 GB and run on local 
SSDs and we haven't seen this problem on older versions (specifically 2.1.9).

Is this a known issue / any workarounds ?

ERROR [Reference-Reaper:1] 2016-04-13 20:33:53,394  Ref.java:179 - LEAK 
DETECTED: a reference 
(org.apache.cassandra.utils.concurrent.Ref$State@15e611a3) to class 
org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@203187780:[[OffHeapBitSet]]
 was not released before the reference was garbage collected

Thanks !


Removing a DC

2016-04-07 Thread Anubhav Kale
Hello,

We removed a DC using instructions from 
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_decomission_dc_t.html

After all nodes were gone,


1.   System.peers doesn't have an entry for the nodes that were removed. 
(confirmed via a cqlsh query with consistency all)

2.   Nodetool describecluster doesn't show them

3.   Nodetool gossipinfo shows them as "LEFT".

However, logs continue to spew below and restarting the node doesn't get rid of 
this. I am thinking a rolling restart of all nodes might fix it, but I am 
curious as to where is this information still held ? I don't think this is 
causing any badness to the cluster, but I would like to get rid of this if 
possible.

INFO  [GossipStage:83] 2016-04-07 16:38:07,859  Gossiper.java:998 - InetAddress 
/10.1.200.14 is now DOWN
INFO  [GossipStage:83] 2016-04-07 16:38:07,861  StorageService.java:1914 - 
Removing tokens[BLAH] for /10.1.200.14

Thanks !
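The lingering entries live in Gossiper's endpoint state map until they expire. One quick way to spot them is to scan "nodetool gossipinfo" output for STATUS:LEFT. A hedged sketch (the exact gossipinfo layout varies slightly by version; the helper name and sample output are my own):

```python
def left_endpoints(gossipinfo_output: str) -> list:
    """Return endpoints that gossip still remembers with STATUS LEFT.

    Assumes the usual "nodetool gossipinfo" layout: an endpoint line
    starting with "/", followed by indented KEY:VALUE attribute lines.
    """
    left, current = [], None
    for line in gossipinfo_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("/"):
            current = stripped.lstrip("/")
        elif stripped.startswith("STATUS:") and "LEFT" in stripped and current:
            left.append(current)
    return left
```

Once all nodes agree the endpoint is LEFT, the state should age out on its own after the quarantine delay.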


RE: Speeding up "nodetool rebuild"

2016-03-31 Thread Anubhav Kale
Thanks. Is there any way to determine that a rebuild is complete?

Based on the following lines in StorageService.java, completion is not logged. So, is 
there any other way to check besides watching data size through nodetool status? 

finally
{
// rebuild is done (successfully or not)
isRebuilding.set(false);
}
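Since completion is not logged, one workaround is to poll "nodetool netstats" and treat the absence of active stream sessions as done. A hedged sketch (the exact netstats wording varies across versions; the sample strings in the usage below are illustrative):

```python
def rebuild_in_progress(netstats_output: str) -> bool:
    """Heuristic check against "nodetool netstats" text.

    An idle node prints "Not sending any streams."; during a rebuild the
    output instead lists a stream plan with per-peer Receiving/Sending
    lines. Treat any such line as "still streaming".
    """
    return any(
        keyword in netstats_output
        for keyword in ("Receiving ", "Sending ", "Rebuild ")
    )
```

Polling this in a loop until it returns False approximates the moment `isRebuilding` flips back, without relying on log output.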


-Original Message-
From: Eric Evans [mailto:eev...@wikimedia.org] 
Sent: Thursday, March 31, 2016 9:50 AM
To: user@cassandra.apache.org
Subject: Re: Speeding up "nodetool rebuild"

On Wed, Mar 30, 2016 at 3:44 PM, Anubhav Kale <anubhav.k...@microsoft.com> 
wrote:
> Any other ways to make the “rebuild” faster ?

TL;DR add more nodes

If you're encountering a per-stream bottleneck (easy to do if using 
compression), then having a higher node count will translate to higher stream 
concurrency, and greater throughput.

Another thing to keep in mind, the streamthroughput value is *outbound*, it 
doesn't matter what you have that set to on the rebuilding/bootstrapping node, 
it *does* matter what it is set to on the nodes that are sending to it 
(https://issues.apache.org/jira/browse/CASSANDRA-11303 aims to introduce an 
inbound tunable though).


--
Eric Evans
eev...@wikimedia.org
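The point about the cap being outbound can be made concrete with a little arithmetic: a single receiving node can be sent up to (number of senders × per-sender outbound cap), so adding source nodes raises the ceiling even when the cap itself is unchanged. A sketch (numbers illustrative):

```python
def max_inbound_megabits(source_nodes: int, outbound_cap_mbps: int) -> int:
    """Upper bound on what one receiving node can be sent.

    stream_throughput_outbound_megabits_per_sec caps each *sender*, so
    the aggregate inbound rate scales with the number of senders.
    """
    return source_nodes * outbound_cap_mbps
```

With the default cap of 200 megabits/s and three source replicas, the receiver tops out at 600 megabits/s, before network and disk limits.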


Speeding up "nodetool rebuild"

2016-03-30 Thread Anubhav Kale
Hello,

Will changing compactionthroughput and streamingthroughput help with reducing 
the "rebuild" time on a brand new node ? We will do it both on the new node, 
and the nodes in source DC from where data is streamed.

Any other ways to make the "rebuild" faster ?

Thanks !


RE: Rack aware question.

2016-03-23 Thread Anubhav Kale
The consistency ALL was only for my testing so there could be a logical 
explanation to this. We use LOCAL_QUORUM in prod.

 Original message 
From: Jack Krupansky <jack.krupan...@gmail.com>
Date: 3/23/2016 4:56 PM (GMT-08:00)
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

CL=ALL also means that you won't have HA (High Availability) - if even a single 
node goes down, you're out of business. I mean, HA is the fundamental reason 
for using the rack-aware policy - to assure that each replica is on a separate 
power supply and network connection so that data can be retrieved even when a 
rack-level failure occurs.

In short, if CL=ALL is acceptable, then you might as well dump the rack-aware 
approach, which was how you got into this situation in the first place.

-- Jack Krupansky

On Wed, Mar 23, 2016 at 7:31 PM, Anubhav Kale 
<anubhav.k...@microsoft.com> wrote:
I ran into the following detail from : 
https://wiki.apache.org/cassandra/ReadRepair

“If a lower ConsistencyLevel than ALL was specified, this is done in the 
background after returning the data from the closest replica to the client; 
otherwise, it is done before returning the data.”

I set consistency to ALL, and now I can get data all the time.

From: Anubhav Kale 
[mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 4:14 PM

To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks, Read repair is what I thought must be causing this, so I experimented 
some more with setting read_repair_chance and dc_local_read_repair_chance on 
the table to 0, and then 1.

Unfortunately, the results were somewhat random depending on which node I ran 
the queries from. For example, when chance = 1, running query from 127.0.0.3 
would sometimes return 0 results and sometimes 1. I do see 
digest-mismatch-kicking-off-read-repair in traces in both cases, so running out 
of ideas here.  If you / someone can shed light on why this could be happening, 
that would be great !

That said, is it expected that “read repair” or a regular “nodetool repair” 
will shift the data around based on new replica placement ? And, if so is the 
recommendation to “rebootstrap” to mainly avoid this humongous data movement ?

The rationale behind ignore_rack flag makes sense, thanks. Maybe, we should 
document it better ?

Thanks !

From: Paulo Motta [mailto:pauloricard...@gmail.com]
Sent: Wednesday, March 23, 2016 3:40 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

> How come 127.0.0.1 is shown as an endpoint holding the ID when its token 
> range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a 
> node or just the primary range ? I am thinking its only primary. Can someone 
> confirm ?

The primary replica of id=1 is always 127.0.0.3. What changes when you change 
racks is that the secondary replica will move to the next replica from a 
different rack, either 127.0.0.1 or 127.0.0.2.

> How come queries contact 127.0.0.1 ?

in the last case, 127.0.0.1 is the next node after the primary replica from a 
different rack (R2), so it should be contacted

> Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To 
> prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY 
> ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS 
> Tables also show it). So, does this mean that the data actually gets moved 
> around when racks change ?

probably during some of your queries 127.0.0.3 (the primary replica) replicated 
data to 127.0.0.1 with read repair. There is no automatic data move when rack 
is changed (at least in OSS C*, not sure if DSE has this ability)

> If we don’t want to support this ever, I’d think the ignore_rack flag should 
> just be deprecated.

ignore_rack flag can be useful if you move your data manually, with rsync or 
sstableloader.
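The placement rule described above can be sketched as code. The simplified model below is my own (it ignores vnodes and multiple DCs): walk the ring from the primary replica and prefer nodes on racks not yet represented. Assuming the token order places 127.0.0.3 first, it reproduces the endpoint changes seen in this thread:

```python
def nts_replicas(ring_from_primary, racks, rf):
    """Simplified NetworkTopologyStrategy placement for one token.

    ring_from_primary: node ids in token order, starting at the primary.
    racks: node id -> rack name.  rf: replication factor.
    First pass prefers unseen racks; second pass fills remaining slots.
    """
    replicas, seen_racks = [], set()
    for node in ring_from_primary:
        if len(replicas) == rf:
            break
        if racks[node] not in seen_racks:
            replicas.append(node)
            seen_racks.add(racks[node])
    for node in ring_from_primary:
        if len(replicas) == rf:
            break
        if node not in replicas:
            replicas.append(node)
    return replicas

ring = ["127.0.0.3", "127.0.0.2", "127.0.0.1"]  # assumed token order
# racks R1/R1/R2: replicas are .3 and .2
print(nts_replicas(ring, {"127.0.0.1": "R1", "127.0.0.2": "R1",
                          "127.0.0.3": "R2"}, 2))
# after relabelling to R2/R1/R1 the secondary replica moves to .1
print(nts_replicas(ring, {"127.0.0.1": "R2", "127.0.0.2": "R1",
                          "127.0.0.3": "R1"}, 2))
```

Only the *placement* changes when racks change; as noted above, the data itself is not moved until read repair or nodetool repair populates the new replica.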

2016-03-23 19:09 GMT-03:00 Anubhav Kale 
<anubhav.k...@microsoft.com>:
Thanks for the pointer – appreciate it.

My test is on the latest trunk and slightly different.

I am not exactly sure if the behavior I see is expected (in which case, is the 
recommendation to re-bootstrap just to avoid data movement?) or is the behavior 
not expected and is a bug.

If we don’t want to support this ever, I’d think the ignore_rack flag should 
just be deprecated.

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 

RE: Rack aware question.

2016-03-23 Thread Anubhav Kale
I ran into the following detail from : 
https://wiki.apache.org/cassandra/ReadRepair

“If a lower ConsistencyLevel than ALL was specified, this is done in the 
background after returning the data from the closest replica to the client; 
otherwise, it is done before returning the data.”

I set consistency to ALL, and now I can get data all the time.

From: Anubhav Kale [mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 4:14 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks, Read repair is what I thought must be causing this, so I experimented 
some more with setting read_repair_chance and dc_local_read_repair_chance on 
the table to 0, and then 1.

Unfortunately, the results were somewhat random depending on which node I ran 
the queries from. For example, when chance = 1, running query from 127.0.0.3 
would sometimes return 0 results and sometimes 1. I do see 
digest-mismatch-kicking-off-read-repair in traces in both cases, so running out 
of ideas here.  If you / someone can shed light on why this could be happening, 
that would be great !

That said, is it expected that “read repair” or a regular “nodetool repair” 
will shift the data around based on new replica placement ? And, if so is the 
recommendation to “rebootstrap” to mainly avoid this humongous data movement ?

The rationale behind ignore_rack flag makes sense, thanks. Maybe, we should 
document it better ?

Thanks !

From: Paulo Motta [mailto:pauloricard...@gmail.com]
Sent: Wednesday, March 23, 2016 3:40 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

> How come 127.0.0.1 is shown as an endpoint holding the ID when its token 
> range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a 
> node or just the primary range ? I am thinking its only primary. Can someone 
> confirm ?

The primary replica of id=1 is always 127.0.0.3. What changes when you change 
racks is that the secondary replica will move to the next replica from a 
different rack, either 127.0.0.1 or 127.0.0.2.

> How come queries contact 127.0.0.1 ?

in the last case, 127.0.0.1 is the next node after the primary replica from a 
different rack (R2), so it should be contacted

> Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To 
> prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY 
> ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS 
> Tables also show it). So, does this mean that the data actually gets moved 
> around when racks change ?

probably during some of your queries 127.0.0.3 (the primary replica) replicated 
data to 127.0.0.1 with read repair. There is no automatic data move when rack 
is changed (at least in OSS C*, not sure if DSE has this ability)

> If we don’t want to support this ever, I’d think the ignore_rack flag should 
> just be deprecated.

ignore_rack flag can be useful if you move your data manually, with rsync or 
sstableloader.

2016-03-23 19:09 GMT-03:00 Anubhav Kale 
<anubhav.k...@microsoft.com>:
Thanks for the pointer – appreciate it.

My test is on the latest trunk and slightly different.

I am not exactly sure if the behavior I see is expected (in which case, is the 
recommendation to re-bootstrap just to avoid data movement?) or is the behavior 
not expected and is a bug.

If we don’t want to support this ever, I’d think the ignore_rack flag should 
just be deprecated.

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 2:54 PM

To: user@cassandra.apache.org
Subject: Re: Rack aware question.

Actually, I believe you are seeing the behavior described in the ticket I meant 
to link to, with the detailed exploration :

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob


On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale 
<anubhav.k...@microsoft.com> wrote:
Oh, and the query I ran was “select * from racktest.racktable where id=1”

From: Anubhav Kale 
[mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks.

To test what happens when rack of a node changes in a running cluster without 
doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP 
address hack)

IP      127.0.0.1   127.0.0.2   127.0.0.3
Rack    R1          R1          R2

RE: Rack aware question.

2016-03-23 Thread Anubhav Kale
Thanks, Read repair is what I thought must be causing this, so I experimented 
some more with setting read_repair_chance and dc_local_read_repair_chance on 
the table to 0, and then 1.

Unfortunately, the results were somewhat random depending on which node I ran 
the queries from. For example, when chance = 1, running query from 127.0.0.3 
would sometimes return 0 results and sometimes 1. I do see 
digest-mismatch-kicking-off-read-repair in traces in both cases, so running out 
of ideas here.  If you / someone can shed light on why this could be happening, 
that would be great !

That said, is it expected that “read repair” or a regular “nodetool repair” 
will shift the data around based on new replica placement ? And, if so is the 
recommendation to “rebootstrap” to mainly avoid this humongous data movement ?

The rationale behind ignore_rack flag makes sense, thanks. Maybe, we should 
document it better ?

Thanks !

From: Paulo Motta [mailto:pauloricard...@gmail.com]
Sent: Wednesday, March 23, 2016 3:40 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

> How come 127.0.0.1 is shown as an endpoint holding the ID when its token 
> range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a 
> node or just the primary range ? I am thinking its only primary. Can someone 
> confirm ?

The primary replica of id=1 is always 127.0.0.3. What changes when you change 
racks is that the secondary replica will move to the next replica from a 
different rack, either 127.0.0.1 or 127.0.0.2.

> How come queries contact 127.0.0.1 ?

in the last case, 127.0.0.1 is the next node after the primary replica from a 
different rack (R2), so it should be contacted

> Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? To 
> prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY 
> ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS 
> Tables also show it). So, does this mean that the data actually gets moved 
> around when racks change ?

probably during some of your queries 127.0.0.3 (the primary replica) replicated 
data to 127.0.0.1 with read repair. There is no automatic data move when rack 
is changed (at least in OSS C*, not sure if DSE has this ability)

> If we don’t want to support this ever, I’d think the ignore_rack flag should 
> just be deprecated.

ignore_rack flag can be useful if you move your data manually, with rsync or 
sstableloader.

2016-03-23 19:09 GMT-03:00 Anubhav Kale 
<anubhav.k...@microsoft.com>:
Thanks for the pointer – appreciate it.

My test is on the latest trunk and slightly different.

I am not exactly sure if the behavior I see is expected (in which case, is the 
recommendation to re-bootstrap just to avoid data movement?) or is the behavior 
not expected and is a bug.

If we don’t want to support this ever, I’d think the ignore_rack flag should 
just be deprecated.

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 2:54 PM

To: user@cassandra.apache.org
Subject: Re: Rack aware question.

Actually, I believe you are seeing the behavior described in the ticket I meant 
to link to, with the detailed exploration :

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob


On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale 
<anubhav.k...@microsoft.com> wrote:
Oh, and the query I ran was “select * from racktest.racktable where id=1”

From: Anubhav Kale 
[mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks.

To test what happens when rack of a node changes in a running cluster without 
doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP 
address hack)

IP      127.0.0.1   127.0.0.2   127.0.0.3
Rack    R1          R1          R2


A table was created and a row inserted as follows:

Cqlsh 127.0.0.1
>create keyspace racktest with replication = { 'class' : 
>'NetworkTopologyStrategy', 'datacenter1' : 2 };
>create table racktest.racktable(id int, PRIMARY KEY(id));
>insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

Nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with 
-Dcassandra.ignore_rack=true.
This option from my finding simply avoids the check on startup that compares 
the rack in system.local with the one in rack-dc.properties.

RE: Rack aware question.

2016-03-23 Thread Anubhav Kale
Thanks for the pointer – appreciate it.

My test is on the latest trunk and slightly different.

I am not exactly sure if the behavior I see is expected (in which case, is the 
recommendation to re-bootstrap just to avoid data movement?) or is the behavior 
not expected and is a bug.

If we don’t want to support this ever, I’d think the ignore_rack flag should 
just be deprecated.

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 2:54 PM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

Actually, I believe you are seeing the behavior described in the ticket I meant 
to link to, with the detailed exploration :

https://issues.apache.org/jira/browse/CASSANDRA-10238

=Rob


On Wed, Mar 23, 2016 at 2:06 PM, Anubhav Kale 
<anubhav.k...@microsoft.com> wrote:
Oh, and the query I ran was “select * from racktest.racktable where id=1”

From: Anubhav Kale 
[mailto:anubhav.k...@microsoft.com]
Sent: Wednesday, March 23, 2016 2:04 PM
To: user@cassandra.apache.org
Subject: RE: Rack aware question.

Thanks.

To test what happens when rack of a node changes in a running cluster without 
doing a decommission, I did the following.

The cluster looks like below (this was run through Eclipse, therefore the IP 
address hack)

IP      127.0.0.1   127.0.0.2   127.0.0.3
Rack    R1          R1          R2


A table was created and a row inserted as follows:

Cqlsh 127.0.0.1
>create keyspace racktest with replication = { 'class' : 
>'NetworkTopologyStrategy', 'datacenter1' : 2 };
>create table racktest.racktable(id int, PRIMARY KEY(id));
>insert into racktest.racktable(id) values(1);

nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

Nodetool ring > ring_1.txt (attached)

So far so good.

Then I changed the racks to below and restarted DSE with 
-Dcassandra.ignore_rack=true.
This option from my finding simply avoids the check on startup that compares 
the rack in system.local with the one in rack-dc.properties.

IP      127.0.0.1   127.0.0.2   127.0.0.3
Rack    R1          R2          R1


nodetool getendpoints racktest racktable 1

127.0.0.2
127.0.0.3

So far so good, cqlsh returns the queries fine.

Nodetool ring > ring_2.txt (attached)

Now comes the interesting part.

I changed the racks to below and restarted DSE.

IP      127.0.0.1   127.0.0.2   127.0.0.3
Rack    R2          R1          R1


nodetool getendpoints racktest racktable 1

127.0.0.1
127.0.0.3

This is very interesting, cqlsh returns the queries fine. With tracing on, it’s 
clear that the 127.0.0.1 is being asked for data as well.

Nodetool ring > ring_3.txt (attached)

There is no change in token information in ring_* files. The token under 
question for id=1 (from select token(id) from racktest.racktable) is 
-4069959284402364209.

So, few questions because things don’t add up:


  1.  How come 127.0.0.1 is shown as an endpoint holding the ID when its token 
range doesn’t contain it ? Does “nodetool ring” shows all token-ranges for a 
node or just the primary range ? I am thinking its only primary. Can someone 
confirm ?
  2.  How come queries contact 127.0.0.1 ?
  3.  Is “getendpoints” acting odd here and the data really is on 127.0.0.2 ? 
To prove / disprove that, I stopped 127.0.0.2 and ran a query with CONSISTENCY 
ALL, and it came back just fine meaning 127.0.0.1 indeed hold the data (SS 
Tables also show it).
  4.  So, does this mean that the data actually gets moved around when racks 
change ?

Thanks !


From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, March 23, 2016 11:59 AM
To: user@cassandra.apache.org
Subject: Re: Rack aware question.

On Wed, Mar 23, 2016 at 8:07 AM, Anubhav Kale 
<anubhav.k...@microsoft.com> wrote:
Suppose we change the racks on VMs on a running cluster. (We need to do this 
while running on Azure, because sometimes when the VM gets moved its rack 
changes).

In this situation, new writes will be laid out based on new rack info on 
appropriate replicas. What happens for existing data ? Is that data moved 
around as well and does it happen if we run repair or on its own ?

First, you should understand this ticket if relying on rack awareness :

https://issues.apache.org/jira/browse/CASSANDRA-3810

Second, in general nodes cannot move 

Rack aware question.

2016-03-23 Thread Anubhav Kale
Hello,

Suppose we change the racks on VMs on a running cluster. (We need to do this 
while running on Azure, because sometimes when the VM gets moved its rack 
changes).

In this situation, new writes will be laid out based on new rack info on 
appropriate replicas. What happens for existing data ? Is that data moved 
around as well and does it happen if we run repair or on its own ?


Thanks !


RE: Expiring Columns

2016-03-21 Thread Anubhav Kale
I think the answer is no. There are explicit checks in the read code path to ignore 
anything that's past its TTL (based on the local time of the node under question).

From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in]
Sent: Monday, March 21, 2016 5:19 AM
To: User 
Subject: Expiring Columns

Hi,

I want to understand how Expiring columns work in Cassandra.


Query:
Documentation says that once TTL of a column expires, tombstones are created/ 
marked when the sstable gets compacted. Is there a possibility that a query 
(range scan/ row query) returns expired column data just because the sstable 
never participated in a compaction after TTL of the column expired?

For Example:
  10 AM Data inserted with ttl=60 seconds
  10:05 AM A query is run on inserted data
  10:07 AM sstable is compacted and column is marked tombstone.

Will the query return expired data in above scenario? If yes/no, why?


Thanks
Anuj






Sent from Yahoo Mail on 
Android
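The read-path check described above amounts to comparing the cell's local expiration time (write time + TTL) against the node's clock; a cell past that point is filtered out even if no compaction has yet turned it into a tombstone. A minimal sketch of that comparison (naming is my own, not Cassandra's):

```python
def is_live(write_time_s: int, ttl_s: int, now_s: int) -> bool:
    """A TTL'd cell is readable only while write time + TTL is in the
    future; after that the read path drops it, compacted or not."""
    return write_time_s + ttl_s > now_s

# the thread's example: data written at 10:00 with ttl=60 seconds
t0 = 0
assert is_live(t0, 60, t0 + 30)          # 10:00:30 -> still returned
assert not is_live(t0, 60, t0 + 5 * 60)  # 10:05 -> filtered before compaction
```

So the 10:05 query in the example returns nothing, even though the sstable is not compacted until 10:07.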


DTCS Question

2016-03-19 Thread Anubhav Kale
I am using Cassandra 2.1.13 which has all the latest DTCS fixes (it does STCS 
within the DTCS windows). It also introduced a field called MAX_WINDOW_SIZE 
which defaults to one day.

So in my data folders, I may see SS Tables that span beyond a day (generated 
through old data through repairs or commit logs), but whenever I see a message 
in logs "Compacted Foo" (meaning the SS Table under question was definitely a 
result of compaction), the "Foo" SS Table should never have data beyond a day. 
Is this understanding accurate ?

If we have issues with repairs pulling in old data, should MAX_WINDOW_SIZE 
instead be set to a larger value so that we don't run the risk of too many SS 
Tables lying around and never getting compacted ?


RE: DTCS bucketing Question

2016-03-19 Thread Anubhav Kale
CIL

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Thursday, March 17, 2016 11:01 AM
To: user@cassandra.apache.org
Subject: Re: DTCS bucketing Question

> I am trying to concretely understand how DTCS makes buckets and I am looking 
> at the DateTieredCompactionStrategyTest.testGetBuckets method and played with 
> some of the parameters to GetBuckets method call (Cassandra 2.1.12). I don’t 
> think I fully understand something there.

Don’t feel bad, you’re not alone.

> In this case, the buckets should look like [0-4000] [4000-]. Is this correct 
> ? The buckets that I get back are different (“a” lives in its bucket and 
> everyone else in another). What am I missing here ?

The latest/newest window never gets combined, it’s ALWAYS the base size. Only 
subsequent windows get merged. First window will always be 0-1000. 
https://spotifylabscom.files.wordpress.com/2014/12/dtcs3.png
[Anubhav Kale] This doesn’t seem correct. In the original test (look at 
comments), the first window is pretty big and in many cases, the first window 
is big.

> Note, that if I keep the base to original (100L) or increase it and play with 
> min_threshold the results are exactly what I would expect.

Because the original base is lower than the lowest timestamp, which means 
you’re never looking in the first window (0-base).

> I am afraid that the math in Target class is somewhat hard to follow so I am 
> thinking about it this way.

The Target class is too clever for its own good. I couldn’t follow it. You’re 
having trouble following it.  Other smart people I’ve talked to couldn’t follow 
it. Last June I proposed an alternative (CASSANDRA-9666 / 
https://github.com/jeffjirsa/twcs ). It was never taken upstream, but it does 
get a fair bit of use by people with large time series clusters (we use it on 
one of our petabyte-scale clusters here). Significantly easier to reason about.

  *   Jeff


From: Anubhav Kale
Reply-To: "user@cassandra.apache.org"
Date: Thursday, March 17, 2016 at 10:24 AM
To: "user@cassandra.apache.org"
Subject: DTCS bucketing Question



Hello,

I am trying to concretely understand how DTCS makes buckets and I am looking at 
the DateTieredCompactionStrategyTest.testGetBuckets method and played with some 
of the parameters to GetBuckets method call (Cassandra 2.1.12).

I don’t think I fully understand something there. Let me try to explain.

Consider the second test there. I changed the pairs a bit for easier 
explanation and changed base (initial window size)=1000L and Min_Threshold=2

pairs = Lists.newArrayList(
Pair.create("a", 200L),
Pair.create("b", 2000L),
Pair.create("c", 3600L),
Pair.create("d", 3899L),
Pair.create("e", 3900L),
Pair.create("f", 3950L),
Pair.create("too new", 4125L)
);
buckets = getBuckets(pairs, 1000L, 2, 4050L, Long.MAX_VALUE);

In this case, the buckets should look like [0-4000] [4000-]. Is this correct ? 
The buckets that I get back are different (“a” lives in its bucket and everyone 
else in another). What am I missing here ?

Another case,

pairs = Lists.newArrayList(
Pair.create("a", 200L),
Pair.create("b", 2000L),
Pair.create("c", 3600L),
Pair.create("d", 3899L),
Pair.create("e", 3900L),
Pair.create("f", 3950L),
Pair.create("too new", 4125L)
);
buckets = getBuckets(pairs, 50L, 4, 4050L, Long.MAX_VALUE);

Here, the buckets should be [0-3200] [3200-4000] [4000-4050] [4050-]. Is this 
correct ? Again, the buckets that come back are quite different.

Note, that if I keep the base to original (100L) or increase it and play with 
min_threshold the results are exactly what I would expect.

The way I think about DTCS is, try to make buckets of maximum possible sizes 
from 0, and once you can’t do that, make smaller buckets (similar to what 
the comment suggests). Is this mental model wrong ? I am afraid that the math 
in Target class is somewhat hard to follow so I am thinking about it this way.

Thanks a lot in advance.

-Anubhav
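For contrast, the fixed-window approach Jeff links (TWCS) is simple enough to state in a few lines: every sstable goes into the window its timestamp falls in, with no tiering math at all. A sketch of the idea using this thread's abstract time units (this illustrates the concept, not Jeff's actual implementation):

```python
from collections import defaultdict

def twcs_buckets(pairs, window_size):
    """Group (name, timestamp) pairs into fixed-size time windows."""
    buckets = defaultdict(list)
    for name, timestamp in pairs:
        buckets[timestamp // window_size].append(name)
    return dict(buckets)

pairs = [("a", 200), ("b", 2000), ("c", 3600), ("d", 3899),
         ("e", 3900), ("f", 3950), ("too new", 4125)]
# window 0 holds "a", window 2 holds "b", window 3 holds c-f,
# and "too new" sits alone in window 4
print(twcs_buckets(pairs, 1000))
```

There is nothing to second-guess about which window an sstable lands in, which is why it is easier to reason about than the Target class.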


DTCS bucketing Question

2016-03-19 Thread Anubhav Kale


Hello,

I am trying to concretely understand how DTCS makes buckets and I am looking at 
the DateTieredCompactionStrategyTest.testGetBuckets method and played with some 
of the parameters to GetBuckets method call (Cassandra 2.1.12).

I don't think I fully understand something there. Let me try to explain.

Consider the second test there. I changed the pairs a bit for easier 
explanation and changed base (initial window size)=1000L and Min_Threshold=2

pairs = Lists.newArrayList(
Pair.create("a", 200L),
Pair.create("b", 2000L),
Pair.create("c", 3600L),
Pair.create("d", 3899L),
Pair.create("e", 3900L),
Pair.create("f", 3950L),
Pair.create("too new", 4125L)
);
buckets = getBuckets(pairs, 1000L, 2, 4050L, Long.MAX_VALUE);

In this case, the buckets should look like [0-4000] [4000-]. Is this correct ? 
The buckets that I get back are different ("a" lives in its bucket and everyone 
else in another). What am I missing here ?

Another case,

pairs = Lists.newArrayList(
Pair.create("a", 200L),
Pair.create("b", 2000L),
Pair.create("c", 3600L),
Pair.create("d", 3899L),
Pair.create("e", 3900L),
Pair.create("f", 3950L),
Pair.create("too new", 4125L)
);
buckets = getBuckets(pairs, 50L, 4, 4050L, Long.MAX_VALUE);

Here, the buckets should be [0-3200] [3200-4000] [4000-4050] [4050-]. Is this 
correct ? Again, the buckets that come back are quite different.

Note, that if I keep the base to original (100L) or increase it and play with 
min_threshold the results are exactly what I would expect.

The way I think about DTCS is, try to make buckets of maximum possible sizes 
from 0, and once you can't do that, make smaller buckets (similar to what 
the comment suggests). Is this mental model wrong ? I am afraid that the math 
in Target class is somewhat hard to follow so I am thinking about it this way.

Thanks a lot in advance.

-Anubhav


RE: DTCS Question

2016-03-18 Thread Anubhav Kale
Thanks for the explanation.

From: Marcus Eriksson [mailto:krum...@gmail.com]
Sent: Thursday, March 17, 2016 12:56 AM
To: user@cassandra.apache.org
Subject: Re: DTCS Question



On Wed, Mar 16, 2016 at 6:49 PM, Anubhav Kale 
<anubhav.k...@microsoft.com> wrote:
I am using Cassandra 2.1.13 which has all the latest DTCS fixes (it does STCS 
within the DTCS windows). It also introduced a field called MAX_WINDOW_SIZE 
which defaults to one day.

So in my data folders, I may see SS Tables that span beyond a day (generated 
through old data through repairs or commit logs), but whenever I see a message 
in logs “Compacted Foo” (meaning the SS Table under question was definitely a 
result of compaction), the “Foo” SS Table should never have data beyond a day. 
Is this understanding accurate ?
No - not until https://issues.apache.org/jira/browse/CASSANDRA-10496 (read for 
explanation)


If we have issues with repairs pulling in old data, should MAX_WINDOW_SIZE 
instead be set to a larger value so that we don’t run the risk of too many SS 
Tables lying around and never getting compacted ?
No, with CASSANDRA-10280 that old data will get compacted if needed (assuming 
you have default settings). If the remote node is correctly date tiered, the 
streamed sstable will also be correctly date tiered. Then that streamed sstable 
will be put in a time window and if there are enough sstables in that old 
window, we do a compaction.

/Marcus