ApacheCon Europe 2019
Hi, Do we have any plans for a dedicated Apache Cassandra track or sessions at ApacheCon Berlin in Oct 2019? The CFP closes 26 May 2019. Thanks, Anuj Wadehra
Re: Upgrade to v3.11.3
Hi Shalom, Just a suggestion: before upgrading to 3.11.3, make sure you are not impacted by any open critical defects, especially those related to RT (range tombstones) which may cause data loss, e.g. 14861. Please find my responses below:

Q: The upgrade process that I know of is from 2.0.14 to 2.1.x (higher than 2.1.9, I think) and then from 2.1.x to 3.x. Do I need to upgrade first to 3.0.x, or can I upgrade directly from 2.1.x to 3.11.3?
Response: Yes, you can upgrade from 2.0.14 to the latest stable 2.1.x release (2.1.9+ only) and then upgrade to 3.11.3.

Q: Can I run upgradesstables on several nodes in parallel? Is it crucial to run it one node at a time?
Response: Yes, you can run it in parallel.

Q: When running upgradesstables on a node, does that node still serve writes and reads?
Response: Yes.

Q: Can I use OpenJDK 8 (instead of Oracle JDK) with C* 3.11.3?
Response: We have not tried it, but it should be okay. See https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-13916.

Q: Is there a way to speed up the upgradesstables process (besides compaction_throughput)?
Response: If clearing the pending compactions caused by rewriting SSTables is a concern, you can also try increasing concurrent_compactors.

Disclaimer: The information provided in the above response is my personal opinion based on the best of my knowledge and experience. We do not take any responsibility and are not liable for any damage caused by actions taken based on the above information.

Thanks, Anuj

On Wed, 16 Jan 2019 at 19:15, shalom sagges wrote:
Hi All, I'm about to start a rolling upgrade from version 2.0.14 to version 3.11.3. I have a few small questions:
- The upgrade process that I know of is from 2.0.14 to 2.1.x (higher than 2.1.9, I think) and then from 2.1.x to 3.x. Do I need to upgrade first to 3.0.x, or can I upgrade directly from 2.1.x to 3.11.3?
- Can I run upgradesstables on several nodes in parallel? Is it crucial to run it one node at a time?
- When running upgradesstables on a node, does that node still serve writes and reads?
- Can I use OpenJDK 8 (instead of Oracle JDK) with C* 3.11.3?
- Is there a way to speed up the upgradesstables process (besides compaction_throughput)?
Thanks!
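The per-node sequence discussed in this thread can be sketched as a dry run; package and service names below are assumptions (adapt them to your distro), and each echoed command would be run for real, one node at a time per hop of the upgrade path:

```shell
# Dry-run sketch of one node's upgrade cycle (e.g. 2.1.x -> 3.11.3).
# Package/service names are hypothetical; replace echo with real execution.
upgrade_node() {
  echo "nodetool drain                         # flush memtables, stop accepting writes"
  echo "sudo service cassandra stop"
  echo "sudo yum install -y cassandra-3.11.3   # hypothetical package name"
  echo "sudo service cassandra start"
  echo "nodetool upgradesstables               # rewrite SSTables into the new format"
}
upgrade_node
```

As discussed above, upgradesstables can run on several nodes in parallel once the binaries are upgraded everywhere.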
Cassandra Upgrade with Different Protocol Version
Hi, I would like to know how people do rolling upgrades of Cassandra clusters when there is a change in native protocol version, say from 2.1 to 3.11. During a rolling upgrade, if the client application is restarted on nodes, the client driver may first contact an upgraded Cassandra node with v4 and permanently mark all the old Cassandra nodes on v3 as down. This may lead to request failures. DataStax recommends two ways to deal with this:

1. Before the upgrade, set the protocol version to the lower version, and move to the higher version once the entire cluster is upgraded.
2. Make sure the driver only contacts upgraded Cassandra nodes during the rolling upgrade.

The second workaround will lead to failures, as you may not be able to meet the required consistency for some time. So let's consider the first workaround. Now imagine an application where the protocol version is not configurable and the code uses the default protocol version. You cannot apply the first workaround, because you would first have to upgrade your application on all nodes just to make the protocol version configurable. How would you upgrade such a cluster without downtime? Thoughts?

Thanks, Anuj
Re: [External] Re: Whch version is the best version to run now?
We evaluated both 3.0.x and 3.11.x. +1 for 3.11.2, as we faced major performance issues with 3.0.x. We have NOT evaluated the new features on 3.11.x. Anuj

Sent from Yahoo Mail on Android

On Tue, 6 Mar 2018 at 19:35, Alain RODRIGUEZ wrote:
Hello Tom, It's good to hear this kind of feedback, thanks for sharing.

"3.11.x seems to get more love from the community wrt patches. This is why I'd recommend 3.11.x for new projects."
I also agree with this analysis.

"Stay away from any of the 2.x series, they're going EOL soonish and the newer versions are very stable."
+1 here as well. Maybe add that 3.11.x, described as 'very stable' above, aims at stabilizing Cassandra after the tick-tock releases: it is a 'bug fix' series that also brings the features developed during that period, even though care is needed with some of the new features, even in the latest 3.11.x versions. I have not worked that much with it yet, but I think I would pick 3.11.2 as well for a new cluster at the moment.

C*heers,
---
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain
The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-03-05 12:39 GMT+00:00 Tom van der Woerdt:
We run on the order of a thousand Cassandra nodes in production. Most of that is 3.0.16, but new clusters default to 3.11.2 and some older clusters have been upgraded to it as well. All of the bugs I encountered in 3.11.x were also seen in 3.0.x, but 3.11.x seems to get more love from the community wrt patches. This is why I'd recommend 3.11.x for new projects. Stay away from any of the 2.x series, they're going EOL soonish and the newer versions are very stable.

Tom van der Woerdt
Site Reliability Engineer
Booking.com B.V.

On Sat, Mar 3, 2018 at 12:25 AM, Jeff Jirsa wrote:
I'd personally be willing to run 3.0.16. 3.11.2 or whatever should also be similar, but I haven't personally tested it at any meaningful scale.
--
Jeff Jirsa

On Mar 2, 2018, at 2:37 PM, Kenneth Brotman wrote:
Seems like a lot of people are running old versions of Cassandra. What is the best, most reliable stable version to use now? Kenneth Brotman
Re: LWT and non-LWT mixed
Hi Daniel,
What is the RF and CL for the DELETE? Are you using asynchronous writes? Are you firing both statements from the same node sequentially? Are you firing these queries in a loop, such that more than one DELETE and LWT is fired for the same partition?

I think that if the same client executes both statements sequentially in the same thread, i.e. one after another, and the delete is synchronous, it should work fine. The LWT will be executed after Cassandra has written to a quorum of nodes and will see the data; the Paxos round of the LWT shall only be initiated once the delete completes. I think LWTs should not be mixed with normal writes when such writes are fired from multiple nodes/threads on the same partition.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Tue, 10 Oct 2017 at 14:10, Daniel Woo wrote:
The document explains you cannot mix them: http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/dml/dmlLtwtTransactions.html
But what happens under the hood if I do? E.g.:
DELETE ...
INSERT ... IF NOT EXISTS
The coordinator has 4 steps to execute the second statement (the INSERT):
1. prepare/promise a ballot
2. read the current row from replicas
3. propose the new value along with the ballot to replicas
4. commit and wait for acks from replicas
My question is: once the row is DELETEd, the next INSERT LWT should be able to see that row's tombstone in step 2, then successfully insert the new value. But my tests show that this often fails; does anybody know why?
--
Thanks & Regards, Daniel
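For reference, a sketch of the statement pair under discussion, as it might be fed to cqlsh (the keyspace, table, and column names are hypothetical):

```shell
# The mixed sequence from the question, as cqlsh input (dry run only).
# ks.t, id, and val are hypothetical names.
cql='
DELETE FROM ks.t WHERE id = 1;
INSERT INTO ks.t (id, val) VALUES (1, 42) IF NOT EXISTS;'
# The DELETE takes the normal write path at the statement CL, while the
# conditional INSERT goes through Paxos; the Paxos read (step 2 above) is
# only guaranteed to see the tombstone if the DELETE already committed on
# enough replicas when the ballot was prepared.
echo "$cql"    # pipe into `cqlsh <host>` to execute; not done here
```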
Re: Re: Re: tolerate how many nodes down in the cluster
Hi Peng,
Racks can be logical (defined with the RAC attribute in the Cassandra configuration files) or physical (racks in server rooms). In my view, to leverage racks in your case, it's important to understand the implications of the following decisions:

1. Number of distinct logical RACs defined in Cassandra:
If you want to leverage RACs optimally for operational efficiency (as Brooke explained), make sure that the number of logical RACs is ALWAYS equal to RF, irrespective of whether the number of physical racks is equal to or greater than RF. Keeping logical racks = RF ensures that the nodes allocated to one logical rack hold exactly 1 replica of the entire 100% data set. So, if you have RF=3 and use QUORUM for reads/writes, you can bring down ALL nodes allocated to a logical rack for maintenance and still have 100% availability. This makes operations faster and cuts down the risk involved. For example, imagine restarting the entire Cassandra cluster. If one node takes 3 minutes, a rolling restart of 30 nodes would take 90 minutes. But if you use 3 logical RACs with RF=3 and assign 10 nodes to each logical RAC, you can restart the 10 nodes within a RAC simultaneously (in off-peak hours, of course, so that the remaining 20 nodes can take the load). Restarting Cassandra on the RACs one by one then takes just 9 minutes rather than 90. If there are any issues during the restart/maintenance, you can take all the nodes of one logical RAC down, fix them, and bring them back without affecting availability.

2. Number of physical racks:
Historical data shows instances where more than one node in a physical rack failed together. When you are using VMs, there are three levels instead of two, as VMs on a single physical machine are also likely to fail together due to hardware failure: physical racks > physical machines > VMs. Ensure that all VMs on a physical machine map to a single logical RAC.
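The restart arithmetic in point 1 above reduces to two multiplications (numbers taken from the example):

```shell
# 30-node cluster, 3 minutes per node restart, 3 logical RACs of 10 nodes each.
nodes=30; per_node_min=3; racs=3
serial=$(( nodes * per_node_min ))         # one node at a time
rac_at_a_time=$(( racs * per_node_min ))   # a whole RAC restarts in parallel
echo "serial restart: ${serial} min, RAC-at-a-time: ${rac_at_a_time} min"
```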
If you want to tolerate the failure of physical racks in the server room, you also need to ensure that all physical servers in a physical rack map to just one logical RAC. This way, you can tolerate the failure of ALL VMs on ALL physical machines mapped to a single logical RAC and still be 100% available. For example, with RF=3, 6 physical racks, 2 physical servers per physical rack, and 3 VMs per physical server, the setup would be:

Physical Rack1 = [Physical1 (3 VMs) + Physical2 (3 VMs)] = LogicalRAC1
Physical Rack2 = [Physical3 (3 VMs) + Physical4 (3 VMs)] = LogicalRAC1
Physical Rack3 = [Physical5 (3 VMs) + Physical6 (3 VMs)] = LogicalRAC2
Physical Rack4 = [Physical7 (3 VMs) + Physical8 (3 VMs)] = LogicalRAC2
Physical Rack5 = [Physical9 (3 VMs) + Physical10 (3 VMs)] = LogicalRAC3
Physical Rack6 = [Physical11 (3 VMs) + Physical12 (3 VMs)] = LogicalRAC3

The problem with this approach is scaling. What if you want to add a single physical server? If you do that and allocate it to one existing logical RAC, your cluster won't be balanced properly, because the logical RAC to which the server is added will have additional capacity for the same data as the other two logical RACs. To keep your cluster balanced, you need to add at least 3 physical servers in 3 different physical racks and assign each physical server to a different logical RAC. That wastes resources and is hard to digest.

If you have fewer physical machines than logical RACs, a physical machine may hold more than 1 replica. If an entire physical machine fails, you will NOT have 100% availability, as more than 1 replica may be unavailable. Similarly, if you have fewer physical racks than logical RACs, a physical rack may hold more than 1 replica, and the failure of an entire physical rack breaks 100% availability for the same reason.

Coming back to your example: RF=3 per DC (total RF=6), CL=QUORUM, 2 DCs, 6 physical machines, 8 VMs per physical machine. My recommendation:

1. In each DC, assign the 3 physical machines to 3 logical RACs in the Cassandra configuration. The 2 DCs can have the same RAC names, as RACs are uniquely identified by their DC names. So these are 6 different logical RACs (a multiple of RF), i.e. 1 physical machine per logical RAC.
2. Add 6 physical machines (3 per DC) to scale the cluster, assigning every machine to a different logical RAC within its DC.

This way, even with an Active-Passive DC setup, you can tolerate the failure of any physical machine or physical rack in the active DC and still ensure 100% availability. You would also get the operational benefits explained above. In a multi-DC setup, you can also choose to do away with RACs and achieve the operational benefits by doing maintenance on one entire DC at a time, leveraging the other DC to handle client requests during that time. That will make your life simpler.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, 27 Jul 2017 at 12:03, kurt greaves wrote:
Note that if you use more racks than RF you lose some of the
Re: Re: tolerate how many nodes down in the cluster
Hi Brooke,
Very nice presentation: https://www.youtube.com/watch?v=QrP7G1eeQTI !! Good to know that you are able to leverage racks for gaining operational efficiencies. I think vnodes have made life easier. I still see some concerns with racks:

1. Usually scaling needs are driven by business requirements. Customers want value for every penny they spend. Adding 3 or 5 servers (because you have RF=3 or 5) instead of 1 server costs them dearly. It's difficult to justify the additional cost, as fault tolerance can only be improved, not guaranteed, with racks.
2. You need to maintain mappings of logical racks (=RF) and physical racks (a multiple of RF) for large clusters.
3. Using racks tightly couples your hardware decisions (rack size, rack count) and virtualization decisions (VM size, VM count per physical node) with the application's RF.

Thanks, Anuj

On Tuesday, 25 July 2017 3:56 AM, Brooke Thorley <bro...@instaclustr.com> wrote:
Hello Peng. I think spending the time to set up your nodes into racks is worth it for the benefits that it brings. With RF3 and NTS you can tolerate the loss of a whole rack of nodes without losing QUORUM, as each rack will contain a full set of data. It makes ongoing cluster maintenance easier, as you can perform upgrades, repairs and restarts on a whole rack of nodes at once. Setting up racks or adding nodes is not difficult, particularly if you are using vnodes. You would simply add nodes in multiples of to keep the racks balanced. This is how we run all our managed clusters and it works very well. You may be interested to watch my Cassandra Summit presentation from last year, in which I discussed this very topic: https://www.youtube.com/watch?v=QrP7G1eeQTI (from 4:00). If you were to consider changing your rack topology, I would recommend that you do this by DC migration rather than "in place".
Kind Regards,
Brooke Thorley
VP Technical Operations & Customer Services
supp...@instaclustr.com | support.instaclustr.com

On 25 July 2017 at 03:06, Anuj Wadehra <anujw_2...@yahoo.co.in.invalid> wrote:
Hi Peng, Three things are important when you are evaluating fault tolerance and availability for your cluster:
1. RF
2. CL
3. Topology - how data is replicated across racks.

If you assume that N nodes from ANY rack may fail at the same time, then you can afford the failure of RF-CL nodes and still be 100% available. E.g. if you are reading at QUORUM and RF=3, you can only afford one (3-2) node failure. Thus, even if you have a 30-node cluster, a 10-node failure cannot give you 100% availability. RF impacts availability rather than the total number of nodes in the cluster. If you assume that N nodes failing together will ALWAYS be from the same rack, you can spread your servers across RF physical racks and use NetworkTopologyStrategy. While allocating replicas for any data, Cassandra will ensure that the 3 replicas are placed in 3 different racks. E.g. you can have 10 nodes in each of 3 racks, and then even a 10-node failure within the SAME rack still leaves you 100% available, as two replicas exist for 100% of the data and CL=QUORUM can be met. I have not tested this, but that's how the rack concept is expected to work. I agree, using racks generally makes operations tougher.

Thanks, Anuj

On Mon, 24 Jul 2017 at 20:10, Peng Xiao <2535...@qq.com> wrote:
Hi Bhuvan, from the following link, it doesn't suggest us to use racks, and it looks reasonable: http://www.datastax.com/dev/blog/multi-datacenter-replication
Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:
• Most users tend to ignore or forget rack requirements that state racks should be in an alternating order to allow the data to get distributed safely and appropriately.
• Many users are not using the rack information effectively by using a setup with as many racks as they have nodes, or similar non-beneficial scenarios.
• When using racks correctly, each rack should typically have the same number of nodes.
• In a scenario that requires a cluster expansion while using racks, the expansion procedure can be tedious since it typically involves several node moves and has to ensure that racks will be distributing data correctly and evenly. At times when clusters need immediate expansion, racks should be the last things to worry about.

-- Original Message --
From: "Bhuvan Rawal" <bhu1ra...@gmail.com>
Re: Re: tolerate how many nodes down in the cluster
Hi Peng, Three things are important when you are evaluating fault tolerance and availability for your cluster:
1. RF
2. CL
3. Topology - how data is replicated across racks.

If you assume that N nodes from ANY rack may fail at the same time, then you can afford the failure of RF-CL nodes and still be 100% available. E.g. if you are reading at QUORUM and RF=3, you can only afford one (3-2) node failure. Thus, even if you have a 30-node cluster, a 10-node failure cannot give you 100% availability. RF impacts availability rather than the total number of nodes in the cluster. If you assume that N nodes failing together will ALWAYS be from the same rack, you can spread your servers across RF physical racks and use NetworkTopologyStrategy. While allocating replicas for any data, Cassandra will ensure that the 3 replicas are placed in 3 different racks. E.g. you can have 10 nodes in each of 3 racks, and then even a 10-node failure within the SAME rack still leaves you 100% available, as two replicas exist for 100% of the data and CL=QUORUM can be met. I have not tested this, but that's how the rack concept is expected to work. I agree, using racks generally makes operations tougher.

Thanks, Anuj

On Mon, 24 Jul 2017 at 20:10, Peng Xiao <2535...@qq.com> wrote:
Hi Bhuvan, from the following link, it doesn't suggest us to use racks, and it looks reasonable: http://www.datastax.com/dev/blog/multi-datacenter-replication
Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:
• Most users tend to ignore or forget rack requirements that state racks should be in an alternating order to allow the data to get distributed safely and appropriately.
• Many users are not using the rack information effectively by using a setup with as many racks as they have nodes, or similar non-beneficial scenarios.
• When using racks correctly, each rack should typically have the same number of nodes.
• In a scenario that requires a cluster expansion while using racks, the expansion procedure can be tedious since it typically involves several node moves and has to ensure that racks will be distributing data correctly and evenly. At times when clusters need immediate expansion, racks should be the last things to worry about.

-- Original Message --
From: "Bhuvan Rawal" <bhu1ra...@gmail.com>
Sent: Monday, 24 July 2017, 7:17 PM
To: "user"
Subject: Re: tolerate how many nodes down in the cluster

Hi Peng, This really depends on how you have configured your topology. Say you have segregated your DC into 3 racks with 10 servers each. With an RF of 3, you can safely assume your data to be available if one rack goes down. But if different servers amongst the racks fail, then I guess you are not guaranteeing data integrity with RF of 3; in that case you can at most lose 2 servers and stay available. The best idea would be to plan failover modes appropriately and let Cassandra know of the same.

Regards, Bhuvan

On Mon, Jul 24, 2017 at 3:28 PM, Peng Xiao <2535...@qq.com> wrote:
Hi, Suppose we have a 30-node cluster in one DC with RF=3. How many nodes can be down? Can we tolerate 10 nodes down? It seems that we cannot prevent all 3 replicas of some data from landing within those 10 nodes, so can we only tolerate 1 node down even though we have 30 nodes? Could anyone please advise? Thanks
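The availability rule quoted in this thread (with failures from ANY rack, you can afford RF-CL node failures regardless of cluster size) reduces to a one-liner:

```shell
# Tolerable failures = RF - (replicas required by CL); cluster size is irrelevant.
rf=3
quorum=$(( rf / 2 + 1 ))       # QUORUM needs 2 of 3 replicas
tolerable=$(( rf - quorum ))   # 1 node, even in a 30-node cluster
echo "RF=${rf}, QUORUM=${quorum} -> tolerate ${tolerable} node(s) down"
```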
RE: MUTATION messages were dropped in last 5000 ms for cross node timeout
Hi Asad,
You can increase it by 2 at a time. For example, if it's currently 2, try increasing it to 4 and retest. We flush 5-6 tables at a time and use 3 memtable_flush_writers. It works great!! There were dropped mutations when it was set to one. The idea is to make sure that writes are not blocked.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Fri, 21 Jul 2017 at 20:04, ZAIDI, ASAD A <az1...@att.com> wrote:
Thank you for your reply. I'll increase memtable_flush_writers and report back if it helps. Is there any formula that we can use to arrive at the correct number of memtable_flush_writers? Or would the exercise wind up being "trial and error", taking much time to arrive at some number that may not be optimal? Thank you again.

From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in]
Sent: Thursday, July 20, 2017 12:17 PM
To: ZAIDI, ASAD A <az1...@att.com>; user@cassandra.apache.org
Subject: Re: MUTATION messages were dropped in last 5000 ms for cross node timeout

Hi Asad, You can do the following things:
1. Increase memtable_flush_writers, especially if you have a write-heavy load.
2. Make sure there are no big GC pauses on your nodes. If there are, go for heap tuning.
Please let us know whether the above measures fixed your problem.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, 20 Jul 2017 at 20:57, ZAIDI, ASAD A <az1...@att.com> wrote:
Hello folks - I'm using apache-cassandra 2.2.8. I see many messages like the one below in my system.log file. In cassandra.yaml, cross_node_timeout: true is set, and an NTP server is also running, correcting clock drift on the 16-node cluster. I do not see pending or blocked HintedHandoff in tpstats output, though there are a bunch of dropped MUTATIONs observed.

INFO [ScheduledTasks:1] 2017-07-20 08:02:52,511 MessagingService.java:946 - MUTATION messages were dropped in last 5000 ms: 822 for internal timeout and 2152 for cross node timeout

I'm seeking help here; please let me know what I need to check in order to address these cross node timeouts. Thank you, Asad
Re: MUTATION messages were dropped in last 5000 ms for cross node timeout
Hi Asad, You can do the following things:
1. Increase memtable_flush_writers, especially if you have a write-heavy load.
2. Make sure there are no big GC pauses on your nodes. If there are, go for heap tuning.
Please let us know whether the above measures fixed your problem.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, 20 Jul 2017 at 20:57, ZAIDI, ASAD A wrote:
Hello folks - I'm using apache-cassandra 2.2.8. I see many messages like the one below in my system.log file. In cassandra.yaml, cross_node_timeout: true is set, and an NTP server is also running, correcting clock drift on the 16-node cluster. I do not see pending or blocked HintedHandoff in tpstats output, though there are a bunch of dropped MUTATIONs observed.

INFO [ScheduledTasks:1] 2017-07-20 08:02:52,511 MessagingService.java:946 - MUTATION messages were dropped in last 5000 ms: 822 for internal timeout and 2152 for cross node timeout

I'm seeking help here; please let me know what I need to check in order to address these cross node timeouts. Thank you, Asad
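A sketch of the suggested yaml change, shown against a throwaway copy of the file (the path and the target value are assumptions; tune iteratively as advised above, re-checking `nodetool tpstats` for dropped MUTATIONs after each rolling restart):

```shell
# Bump memtable_flush_writers in a stand-in for cassandra.yaml.
yaml=$(mktemp)
printf 'memtable_flush_writers: 1\n' > "$yaml"   # stand-in for the live file
sed -i 's/^memtable_flush_writers: .*/memtable_flush_writers: 2/' "$yaml"
new_setting=$(cat "$yaml")
echo "$new_setting"
rm -f "$yaml"
```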
Re: "nodetool repair -dc"
Hi,
I have not used DC-local repair specifically, but generally repair syncs all local tokens of the node with the other replicas (full repair), or a subset of the local tokens (-pr and subrange). A full repair with the -dc option should only sync the data, for all the tokens present on the node where the command is run, with the other replicas in the local DC. You should run a full repair on all nodes of the DC, unless the RF of all keyspaces in the local DC = number of nodes in the DC. E.g. if you have 3 nodes in dc1 and the RF is DC1:3, repairing a single node should sync all data within the DC. This doesn't hold true if you have 5 nodes and no node holds 100% of the data. Running a full repair on all nodes in a DC may lead to repairing every piece of data RF times. Inefficient!! And you can't use -pr with the -dc option; even if it were allowed, it wouldn't repair the entire ring, as a DC owns only a subset of the entire token ring.

Thanks, Anuj

On Tue, 11 Jul 2017 at 20:08, vasu gunja wrote:
Hi, my question is specific to the -dc option. Do we need to run this on all nodes that belong to that DC? Or only on one of the nodes that belongs to that DC, and it will then repair all nodes?

On Sat, Jul 8, 2017 at 10:56 PM, Varun Gupta wrote:
I do not see the need to run repair, as long as the cluster was in a healthy state when adding the new nodes.

On Fri, Jul 7, 2017 at 8:37 AM, vasu gunja wrote:
Hi, I have a question regarding the "nodetool repair -dc" option. Recently we added multiple nodes to one DC; we want to perform repair only on the current DC. Here is my question: do we need to perform "nodetool repair -dc" on all nodes belonging to that DC, or only one node of that DC?
Thanks, V
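The conclusion above (run a full, DC-scoped repair on every node of the target DC) can be sketched as a dry run; the host list is hypothetical, and flag spellings vary a little between versions, so check `nodetool repair --help` on your release:

```shell
# Print the repair command for each node in DC1 (dry run only).
dc1_nodes="node1 node2 node3"   # hypothetical host list for the local DC
cmds=$(for host in $dc1_nodes; do
  echo "ssh $host nodetool repair -full -dc DC1"
done)
echo "$cmds"
```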
Re: private interface for interdc messaging
Hi,
I am not sure why you would want to connect clients on the public interface. Are you making DB calls from clients outside the DC? Also, I am not sure why you expect two DCs to communicate on private networks, unless they are two logical DCs within the same physical DC. Generally, you configure a multi-DC setup in the yaml as follows:

- Use GossipingPropertyFileSnitch and set prefer_local to true in cassandra-rackdc.properties. This ensures that node-to-node communication within a DC happens on the private interface.
- Set rpc_address to the private IP, so that clients connect to the private interface.
- Set listen_address to the private IP. Cassandra will communicate with nodes in the local DC using this address.
- Set broadcast_address to the public IP. Cassandra will communicate with nodes in the other DC using this address.
- Set listen_on_broadcast_address to true.

Thanks, Anuj

On Fri, 7 Jul 2017 at 22:58, CPC wrote:
Hi, We are building 2 datacenters, where each machine has one public interface (for native client connections) and one private interface (for internode communication). What we noticed is that nodes in one datacenter try to communicate with nodes in the other DC over their public interfaces. I mean:
DC1 Node1 public interface -> DC2 Node1 private interface
But what we prefer is:
DC1 Node1 private interface -> DC2 Node1 private interface
Is there any configuration so that a node makes inter-DC connections over its private network? Thank you...
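The bullet points above, collected into config fragments (the IPs, DC, and rack names are placeholders):

```shell
# Print the combined settings as they would appear in the two files.
config=$(cat <<'EOF'
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch
listen_address: 10.0.0.5          # private IP: local-DC internode traffic
broadcast_address: 203.0.113.5    # public IP: cross-DC internode traffic
listen_on_broadcast_address: true
rpc_address: 10.0.0.5             # clients connect on the private interface

# cassandra-rackdc.properties
dc=DC1
rack=RAC1
prefer_local=true
EOF
)
echo "$config"
```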
Re: Merkle trees requests hanging
Hi Jean,
Ensure that your firewall is not timing out idle connections. Nodes should time out idle connections first (using TCP keepalive settings) before the firewall does. Please refer to http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Tue, 4 Jul 2017 at 19:41, Jean Carlo wrote:
Hello. What if a node sends a Merkle tree to its replica, but it is never received due to network issues? Will the repair hang eternally? Or should I modify the parameter
# streaming_socket_timeout_in_ms: 0
to avoid this?
Saludos
Jean Carlo
"The best way to predict the future is to invent it" Alan Kay
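The linked page boils down to lowering the kernel's TCP keepalive timers below the firewall's idle timeout, so the OS refreshes idle internode connections before the firewall drops them. A sketch of the commonly cited values (verify against your firewall's actual idle timeout before applying):

```shell
# Candidate /etc/sysctl.d fragment; apply with `sudo sysctl -p <file>`.
keepalive=$(cat <<'EOF'
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10
EOF
)
echo "$keepalive"
```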
Re: Cassandra Cluster Expansion Criteria
Hi Asad,
First, you need to understand the factors impacting cluster capacity. Some of the important factors to consider while doing capacity planning for Cassandra are:
1. Compaction strategy: impacts disk space requirements and the IO/CPU/memory overhead of compactions.
2. Replication factor: impacts disk space.
3. Business SLAs and data access patterns (read/write).
4. Type of storage: SSDs will ensure that IO is rarely a problem; you may become CPU-bound first.

Some trigger points for expanding your cluster:
1. Disk crunch: unable to meet the free-disk requirements of the various compaction strategies.
2. Overloaded nodes: tpstats/logs show frequent dropped mutations; consistently high CPU load.
3. Business SLAs not being met due to an increase in reads/writes per second.

Please note that this is not an exhaustive list.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, Jun 29, 2017 at 7:15 PM, ZAIDI, ASAD A wrote:
Hello folks, I'm on a Cassandra 2.2.8 cluster with 14 nodes, each with around 2TB of data volume. I'm looking for criteria or data points that can help me decide when or if I should add more nodes to the cluster, and by how many nodes. I'll really appreciate it if you guys can share your insights. Thanks/Asad
Re: Restore Snapshot
Thanks Kurt. I think the main scenario which MUST be addressed by snapshots is backup/restore, so that a node can be restored in minimal time and the lengthy procedure of bootstrapping with join_ring=false followed by a full repair can be avoided. The plain restore snapshot + repair scenario seems to be broken. The situation is less critical when you use join_ring=false. Changing the consistency level to ALL is not an optimal solution or workaround, because it may impact performance. Moreover, it is an unreasonable and unstated assumption that Cassandra users can dynamically change the CL and then revert it after the repair. The ideal restore process should be:
1. Restore the snapshot.
2. Start the node with join_ring=false.
3. Cassandra should ACCEPT writes in this phase, just like a bootstrap with join_ring=false.
4. Repair the node.
5. Join the node.

Point 3 seems to be missing in the current implementation of join_ring. With it, when the node joins the ring at step 5, there would be NO inconsistent reads, as all the data updates made after the snapshot was taken and before it was restored would be consistent on all nodes. As it stands, the node misses the updates made while the repair was going on. So the full repair didn't sync the entire data: it fixed inconsistencies and prevented inconsistent reads, but leads to NEW inconsistencies, and you need another full repair on the node :(

I will conduct a test to be 100% sure that join_ring is not accepting writes, and if I get the same results, I will create a JIRA. We are updating the file system on the nodes and doing it one node at a time to avoid downtime. A snapshot cuts down on excessive streaming and the lengthy procedure (bootstrap + repair), so we were evaluating snapshot restore as an option.

Thanks, Anuj

On Wednesday, 28 June 2017 5:56 PM, kurt greaves wrote:
There are many scenarios where it can be useful, but to address what seems to be your main concern: you could simply restore and then only read at ALL until your repair completes. If you use snapshot restore with commitlog archiving, you're in a better state, but granted, the case you described can still occur. To some extent, if you have to restore a snapshot you will have to perform some kind of repair. It's not really possible to restore to an older point and expect strong consistency. Snapshots are also useful for creating a clone of a cluster/node. But really, why are you only restoring a snapshot on one node? If you lost all the data, it would be much simpler to just replace the node.
Re: Linux version update on DSE
Also, if you restore exactly the same data with a different IP, you may need to clear the gossip state on the node. Anuj Sent from Yahoo Mail on Android On Tue, Jun 27, 2017 at 11:56 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Nitan, I asked for adding autobootstrap=false to avoid streaming. Generally, replace_address is used for bootstrapping a new node with a new IP but the same token range. As you already had the data, I asked you to try it with autobootstrap=false and see if that works for you. If you can bring it back without the replace_address option, good to go. Thanks, Anuj Sent from Yahoo Mail on Android On Tue, Jun 27, 2017 at 11:23 PM, Nitan Kainth <ni...@bamlabs.com> wrote: Anuj, We did that in the past; even when data was not removed, replace_node caused data streaming. So changing the IP is the simplest and safest option. On Jun 27, 2017, at 12:43 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: replace_address_first_boot
Re: Linux version update on DSE
Hi Nitan, I think it would be simpler to take one node down at a time and replace it: bring the new node up after the Linux upgrade, do the same Cassandra setup, use the replace_address option, and set autobootstrap=false (as the data is already there). No downtime, as it would be a rolling upgrade. No streaming, as the same tokens would be used. If you have a recent C*, use replace_address_first_boot. If that option is not available, use replace_address and make sure you remove it once the new node is up. Try it and let us know if it works for you. Thanks, Anuj On Tue, Jun 27, 2017 at 4:56 AM, Nitan Kainth wrote: Right, we are just upgrading Linux on AWS. C* will remain at the same version. On Jun 26, 2017, at 6:05 PM, Hannu Kröger wrote: I understood he is updating Linux, not C*. Hannu On 27 June 2017 at 02:04:34, Jonathan Haddad (j...@jonhaddad.com) wrote: It sounds like you're suggesting adding new nodes in to replace existing ones. You can't do that because it requires streaming between versions, which isn't supported. You need to take a node down, upgrade the C* version, then start it back up. Jon On Mon, Jun 26, 2017 at 3:56 PM Nitan Kainth wrote: It's vnodes. We will add to replace new ip in yaml as well. Thank you. Sent from my iPhone > On Jun 26, 2017, at 4:47 PM, Hannu Kröger wrote: > > Looks OK. Step 1.5 would be to stop Cassandra on the existing node, but apart from that it looks fine. Assuming you are using the same configs, and if you have hard-coded the token(s), you use the same. > > Hannu > >> On 26 Jun 2017, at 23.24, Nitan Kainth wrote: >> >> Hi, >> >> We are planning to update Linux for C* nodes, version 3.0. Does anybody have steps who did it in the recent past? >> >> Here are the draft steps we are thinking of: >> 1. Create new node. It might have a different IP address. >> 2. Detach mounts from existing node >> 3. Attach mounts to new node >> 4. Start C* - To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional commands, e-mail: user-h...@cassandra.apache.org
Restore Snapshot
Hi, I am curious to know how people practically use snapshot restore, given that it may lead to inconsistent reads until a full repair is run on the node being restored (if you have dropped mutations in your cluster). Example:
9 am: snapshot taken on all 3 nodes
10 am: mutation dropped on node 3
11 am: snapshot restored on node 1
Now the data is only on node 2; if we are writing at QUORUM, we will observe inconsistent reads until we repair node 1. If you restore the snapshot with join_ring=false, repair the node, and then join it once the repair completes, the node will not serve inconsistent reads, but it will miss the writes made while it was being repaired: simply booting the node with join_ring=false also stops writes from being pushed to it (unlike bootstrap with join_ring=false, where writes are pushed to the node being bootstrapped). So you would need another full repair to bring the data of the node restored via snapshot in sync with the other nodes. It's hard to believe that a simple snapshot-restore scenario is still broken and people are not complaining. So I thought of asking the community members: how do you practically use snapshot restore while addressing the read-inconsistency issue? Thanks, Anuj
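The timeline above can be traced with a toy model (plain Python, no Cassandra involved; the node numbers, key, and values are purely illustrative):

```python
# Toy model of the snapshot-restore inconsistency described above.
# Three replicas, writes and reads at QUORUM (2 of 3).

SNAPSHOT = {"k": "v0"}  # state of every node at 9 am, when snapshots are taken

# 10 am: a QUORUM write of v1 lands on nodes 1 and 2; node 3 drops the mutation.
nodes = {1: {"k": "v1"}, 2: {"k": "v1"}, 3: dict(SNAPSHOT)}

# 11 am: node 1 is restored from the 9 am snapshot, losing v1.
nodes[1] = dict(SNAPSHOT)

def quorum_read(replicas, key):
    """Return the newest value seen on a quorum of replicas."""
    values = [nodes[r][key] for r in replicas]
    return max(values)  # "v1" > "v0" lexically; stands in for timestamp order

# A quorum read that happens to pick nodes 1 and 3 misses the write entirely:
assert quorum_read([1, 3], "k") == "v0"   # stale read
# Only node 2 still holds the 10 am write:
assert quorum_read([1, 2], "k") == "v1"
```

Until node 1 is repaired, whether a quorum read returns v1 depends on which two replicas are consulted, which is exactly the inconsistency described.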
Re: Hints files are not getting truncated
Hi Meg, max_hint_window_in_ms = 3 hrs means that if a node is down/unresponsive for more than 3 hours, hints will no longer be stored for it until it becomes responsive again. It does not mean that already stored hints are truncated after 3 hours. Regarding connection timeouts between DCs, please check your firewall settings and the TCP settings on the nodes. The firewall between the DCs must not kill an idle connection which Cassandra still considers usable. Please see http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html . In a multi-DC setup, the documentation recommends increasing the number of hint delivery threads. You can try increasing it and check whether it improves the situation. Thanks, Anuj On Tue, Jun 27, 2017 at 9:47 PM, Meg Mara wrote: Hello, I am facing an issue with hinted handoff files in Cassandra v3.0.10. A DC1 node is storing a large number of hints for DC2 nodes (we are facing connection timeout issues). The problem is that the hint files which are created on DC1 are not getting deleted after the 3-hour window. Hints are now being stored as flat files in the Cassandra home directory, and I can see that old hints are being deleted, but at a very slow pace. It still contains hints from May. max_hint_window_in_ms: 1080 max_hints_delivery_threads: 2 Why do you suppose this is happening? Any suggestions or recommendations would be much appreciated. Thanks for your time. Meg Mara
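As a sanity check on the units: max_hint_window_in_ms is expressed in milliseconds, so the 1080 shown in the quoted config (which may simply be a truncated paste) would be on the order of a second, not three hours:

```python
# max_hint_window_in_ms is in milliseconds; a 3-hour window is:
THREE_HOURS_MS = 3 * 60 * 60 * 1000
assert THREE_HOURS_MS == 10_800_000

# The value as quoted in the question:
configured = 1080
assert configured / 1000 == 1.08     # roughly one second, not three hours
assert configured < THREE_HOURS_MS
```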
Re: Re-adding Decommissioned node
Hi Mark, Please ensure that the node is not defined as a seed node in the yaml. Seed nodes don't bootstrap. Thanks, Anuj On Tue, Jun 27, 2017 at 9:56 PM, Mark Furlong wrote: I have a node that has been decommissioned and it showed 'UL'; the data volume and the commitlogs have been removed, and I now want to add that node back into my ring. When I add this node (bootstrap=true, start cassandra service), it comes back up in the ring as an existing node and shows as 'UN' instead of 'UJ'. Why is this? It has no data. Mark Furlong Sr. Database Administrator mfurl...@ancestry.com M: 801-859-7427 O: 801-705-7115 1300 W Traverse Pkwy Lehi, UT 84043
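A minimal sketch of the check Anuj suggests, assuming the stock SimpleSeedProvider layout in cassandra.yaml (the yaml snippet and addresses below are made up for illustration):

```python
# Minimal check: is this node's own address listed as a seed in cassandra.yaml?
# Seed nodes do not bootstrap, which would explain a node coming up UN
# instead of UJ. The snippet and addresses are illustrative.

yaml_snippet = """
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"
listen_address: 10.0.0.2
"""

def is_seed(yaml_text):
    """Naive string-based check; a real tool should parse the yaml properly."""
    seeds_line = next(l for l in yaml_text.splitlines() if "seeds:" in l)
    seeds = seeds_line.split('"')[1].split(",")
    listen = next(l for l in yaml_text.splitlines()
                  if l.startswith("listen_address"))
    address = listen.split(":")[1].strip()
    return address in seeds

assert is_seed(yaml_snippet)  # 10.0.0.2 is a seed -> it will not bootstrap
```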
Re: Partition range incremental repairs
Hi Chris, Can you share the following info:
1. The exact repair commands you use for incremental repair and pr repair.
2. Repair time should be measured at the cluster level for incremental repair. So, what's the total time it takes to run repair on all nodes for incremental vs pr repairs?
3. You are repairing one DC, DC3. How many DCs are there in total, and what's the RF for the keyspaces? Running pr on a specific DC would not repair the entire data.
4. 885 ranges? Where did you get this number, the logs? Can you share the number of ranges printed in the logs for both the incremental and pr cases?
Thanks, Anuj Sent from Yahoo Mail on Android On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore <chris.elsm...@demandlogic.co> wrote: Thank you for the excellent and clear description of the different versions of repair Anuj, that has cleared up what I expect to be happening. The problem now is that in our cluster we are running repairs with options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885), and our repairs are taking over a day to complete, whereas previously, when running with the partition range option, they were taking more like 8-9 hours. As I understand it, using incremental repair should have sped this process up, as all three sets of data in each repair job should be marked as repaired; however this does not seem to be the case. Any ideas? Chris On 6 Jun 2017, at 16:08, Anuj Wadehra <anujw_2...@yahoo.co.in.INVALID> wrote: Hi Chris, Using pr with incremental repairs does not make sense. Primary range repair is an optimization over full repair. If you run a full repair on an n-node cluster with RF=3, you would be repairing each piece of data three times. E.g. in a 5-node cluster with RF=3, a range may exist on nodes A, B and C. When full repair is run on node A, the entire data in that range gets synced with the replicas on nodes B and C. 
Now, when you run full repair on nodes B and C, you are wasting resources on repairing data which is already repaired. Primary range repair ensures that when you run repair on a node, it ONLY repairs the data which is owned by that node. Thus, no node repairs data which is not owned by it and which must be repaired by another node; redundant work is eliminated. Even with pr, though, each time you run it on all nodes you repair 100% of the data. Why repair the complete data set in each cycle, even data which has not changed since the last repair cycle? This is where incremental repair comes in as an improvement. Once repaired, data is marked as repaired, so that the next repair cycle can focus on just the delta. Now, let's go back to the example of the 5-node cluster with RF=3. This time we run incremental repair on all nodes. When you repair the entire data on node A, all 3 replicas are marked as repaired. Even if you then run incremental repair on all ranges on the second node, you would not re-repair the already repaired data. Thus, there is no advantage in repairing only the data owned by the node (the node's primary range). You can run incremental repair on all the data present on a node, and Cassandra will make sure that when you repair data on the other nodes, you only repair unrepaired data. Thanks, Anuj Sent from Yahoo Mail on Android On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore <chris.elsm...@demandlogic.co> wrote: Hi all, Wondering if anyone had any thoughts on this? At the moment the long-running repairs cause us to be running them on two nodes at once for a bit of time, which obviously increases the cluster load. 
On 2017-05-25 16:18 (+0100), Chris Stokesmore <c...@demandlogic.co> wrote: > Hi, > > We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running repairs with the -pr option, via a cron job that runs on each node once per week. > > We changed that as some advice on the Cassandra IRC channel said it would cause more anticompaction, and http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html says 'Performing partitioner range repairs by using the -pr option is generally considered a good choice for doing manual repairs. However, this option cannot be used with incremental repairs (default for Cassandra 2.2 and later)' > > Only problem is our -pr repairs were taking about 8 hours, and now the non-pr repairs are taking 24+ - I guess this makes sense, repairing 1/7 of data increased to 3/7, except I was hoping to see a speed up after the first loop through the cluster, as each repair will be marking much more data as repaired, right? > > Is running -pr with incremental repairs really that bad?
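The counting argument in Anuj's reply can be made concrete with a toy ring (illustrative 5 nodes, RF=3; replica placement is simplified to "primary plus the next RF-1 nodes", which is only a sketch of real token ownership):

```python
# Count how often each range is repaired under full repair on every node
# versus primary-range (-pr) repair on every node. Toy ring: 5 nodes, RF=3.

NODES = ["A", "B", "C", "D", "E"]
RF = 3

def replicas(primary_index):
    """Range owned by node i is replicated on i and the next RF-1 nodes."""
    return [NODES[(primary_index + k) % len(NODES)] for k in range(RF)]

# Full repair on a node repairs every range that node holds a replica of.
full_repairs_per_range = {i: 0 for i in range(len(NODES))}
for node in NODES:
    for i in range(len(NODES)):
        if node in replicas(i):
            full_repairs_per_range[i] += 1

# -pr repair on a node repairs only the range it is primary for.
pr_repairs_per_range = {i: 1 for i in range(len(NODES))}

assert all(v == RF for v in full_repairs_per_range.values())  # each range 3x
assert all(v == 1 for v in pr_repairs_per_range.values())     # each range once
```

This is the redundancy that -pr eliminates for full repairs; incremental repair eliminates it differently, by skipping already-repaired sstables, which is why combining -pr with incremental repair buys nothing.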
Re: Weird error: InvalidQueryException: unconfigured table table2
Ensure that all the nodes are on the same schema version, so that the table2 schema is properly replicated to all the nodes. Thanks, Anuj Sent from Yahoo Mail on Android On Sat, Mar 25, 2017 at 3:19 AM, S G wrote: Hi, I have a keyspace with two tables. I run a different query for each table: Table 1: Select * from table1 where id = ? Table 2: Select * from table2 where id1 = ? and id = ? My code using the datastax driver fires the above two queries one after the other. While it never fails for table 1, it never succeeds for table 2, and gives an error: com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table table2 at com.datastax.driver.core.Responses$Error.asException(Responses.java:136) at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:179) at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:177) at com.datastax.driver.core.RequestHandler.access$2500(RequestHandler.java:46) at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:799) at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:633) at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1070) at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:993) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1280) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:890) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391) Any idea what might be wrong? 
I have confirmed that all table names and column names are lowercase. Datastax Java driver versions tried: 3.1.2 and 3.1.4. Cassandra version: 3.10. Thanks, SG
Repair while upgradesstables is running
Hi, What is the implication of running incremental repair when all nodes have been upgraded to the new Cassandra rpm, but parallel upgradesstables is still running on one or more of the nodes? So the upgrade is:
1. Rolling upgrade of all nodes (rpm install).
2. Parallel upgradesstables on all nodes (no issues with IO, we can afford it).
3. Incremental repair while step 2 is still running??
Thanks, Anuj
Incremental Repair
Hi, Our setup is as follows: 2 DCs with N nodes, RF=DC1:3,DC2:3, hinted handoff = 3 hours, incremental repair scheduled once on every node (all DCs) within the gc grace period. I have the following queries regarding incremental repairs:
1. When a node is down for X hours (where X is greater than the hinted handoff window and less than gc grace time), I think incremental repair is sufficient rather than doing a full repair. Is this understanding correct?
2. DataStax recommends "Run incremental repair daily, run full repairs weekly to monthly". Does that mean that I have to run full repairs every week to month EVEN IF I do daily incremental repairs? If yes, what's the reasoning for running full repair when incremental repair is already run? Reference: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesWhen.html
3. We run incremental repair at least once within gc grace, instead of the general recommendation that it be run daily. Do you see any problem with this approach? As per my understanding, if we run incremental repair less frequently, compaction between unrepaired and repaired data won't happen on a node until some node runs incremental repair on the unrepaired data range. Thus, there can be some impact on disk space and read performance, but immediate compaction is never guaranteed by Cassandra anyway. So I see minimal impact on performance, and that too just on reads of the delta data generated between repairs.
Thanks Anuj
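Point 3 reduces to a scheduling constraint: every range must be repaired at least once within gc_grace_seconds, including the time a repair run itself takes. A trivial check, assuming the default gc_grace_seconds of 864000 (10 days) and an illustrative schedule:

```python
GC_GRACE_SECONDS = 864_000            # Cassandra's default: 10 days
DAY = 86_400

repair_interval_days = 7              # illustrative schedule: weekly
longest_repair_run_days = 1           # illustrative worst-case run duration

# Data must be repaired before tombstones older than gc_grace are purged,
# so the interval plus the run duration must fit inside the grace window.
fits = (repair_interval_days + longest_repair_run_days) * DAY <= GC_GRACE_SECONDS
assert fits   # 7 + 1 days <= 10 days: the schedule is safe
```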
Re: Is it possible to recover a deleted-in-future record?
DISCLAIMER: This is only my personal opinion. Evaluate the situation carefully, and if you find the suggestions below useful, follow them at your own risk. If I have understood the problem correctly, malicious deletes would actually lead to deletion of data; I am not sure how everything is normal after the deletes. If the data is critical, you could: 1. Take a database snapshot immediately, so that you don't lose information if the delete entries in sstables are compacted together with the original data. 2. Transfer the snapshot to a suitable place and run a utility such as sstable2json to get the keys impacted by the deletes and the original data for those keys. The data has to be consolidated from all the nodes. 3. Devise a strategy to restore the deleted data. Thanks, Anuj On Tue, Mar 7, 2017 at 8:44 AM, Michael Fong wrote: Hi all, We recently encountered an issue in production where some records were mysteriously deleted with a timestamp 100+ years from now. Everything is normal as of now; how the deletion happened and the accuracy of the system timestamp at that moment are unknown. We were wondering if there is a general way to recover the mysteriously deleted data when the timestamp metadata is screwed up. Thanks in advance, Regards, Michael Fong
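Step 2 of the suggestion above amounts to scanning a dump for tombstones stamped far in the future. A toy sketch of that scan (the record layout below is a simplified stand-in, not an actual sstable dump format; keys and offsets are illustrative):

```python
import time

# Toy scan for deletes stamped far in the future.
NOW_US = int(time.time() * 1_000_000)            # microsecond timestamps
CENTURY_US = 100 * 365 * 86_400 * 1_000_000

records = [
    {"key": "user:1", "tombstone_ts": NOW_US + CENTURY_US},  # 100+ years ahead
    {"key": "user:2", "tombstone_ts": None},                 # live row
]

# Any tombstone with a timestamp later than "now" is suspicious: normal
# deletes carry (approximately) the wall-clock time they were issued.
suspicious = [r["key"] for r in records
              if r["tombstone_ts"] is not None and r["tombstone_ts"] > NOW_US]
assert suspicious == ["user:1"]
```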
Re: Read after Write inconsistent at times
Hi Charulata, Please share details on how the data is being inserted and read. Is the client which is reading the data the same as the one which inserted it? Is the read happening only when the insertion is successful? Are you using client timestamps? How did you verify that NTP is working properly? How is NTP configured in your cluster (a sample NTP conf)? Thanks, Anuj Sent from Yahoo Mail on Android On Sat, 25 Feb, 2017 at 2:02 AM, Charulata Sharma (charshar) wrote: Hi All, Thanks for your replies. I do not see an issue with NTP or with dropped messages. However, the tombstone count on the specific CF shows me this. This essentially indicates that there are as many tombstones as live cells in the CF, isn't it? Now, is that an issue, and can this cause inconsistent reads?
Average live cells per slice (last five minutes): 0.8631938498408056
Maximum live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 1.1477603751799115E-5
Maximum tombstones per slice (last five minutes): 1.0
Thanks, Charu From: Jonathan Haddad Reply-To: "user@cassandra.apache.org" Date: Friday, February 24, 2017 at 9:42 AM To: "user@cassandra.apache.org" Subject: Re: Read after Write inconsistent at times WRT NTP, I first encountered this issue on my first cluster. The problem with NTP isn't just if you're doing inserts; it's if you're doing inserts in combination with deletes, using server timestamps with a greater variance than the period between the delete and the insert. Basically, you end up with a delete in the future and an insert in the past, and the delete timestamp > insert timestamp. +1 to Jan's recommendation on checking for dropped messages. On Fri, Feb 24, 2017 at 9:35 AM Petrus Gomes wrote: Hi, Check the tombstone count; if it is too high, your query will be impacted. If tombstones are a problem, you can try reducing your gc_grace_seconds to lower the tombstone count (carefully, because you use cross-data-center replication). 
Tchau, Petrus Silva On Fri, Feb 24, 2017 at 12:07 AM, Jan Kesten wrote: Hi, are your nodes at high load? Are there any dropped messages (nodetool tpstats) on any node? Also have a look at your system clocks. C* needs them in tight sync - via ntp, for example. Side hint: if you use ntp, use the same set of upstreams on all of your nodes - ideally your own. Using pool.ntp.org might lead to small drifts in time across your cluster. Another thing that could help you out is using client-side timestamps: https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/ (of course only when you are using a single client, or all clients are in sync via ntp). Am 24.02.2017 um 07:29 schrieb Charulata Sharma (charshar): Hi All, In my application sometimes I cannot read data that just got inserted. This happens very intermittently. Both write and read use LOCAL_QUORUM. We have a cluster of 12 nodes which spans 2 data centers, and an RF of 3. Has anyone encountered this problem, and if yes, what steps have you taken to solve it? Thanks, Charu -- Jan Kesten, mailto:j.kes...@enercast.de Tel.: +49 561/4739664-0 FAX: -9 Mobil: +49 160 / 90 98 41 68 enercast GmbH Universitätsplatz 12 D-34127 Kassel HRB15471 http://www.enercast.de Online-Prognosen für erneuerbare Energien Geschäftsführung: Thomas Landgraf (CEO), Bernd Kratz (CTO), Philipp Rinder (CSO)
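Jon's delete-in-the-future scenario can be reproduced with last-write-wins reconciliation in isolation (plain Python; the timestamps model a coordinator clock that is 2 seconds fast and are purely illustrative):

```python
# Last-write-wins reconciliation, as Cassandra applies it per cell:
# the cell with the highest timestamp wins, regardless of arrival order.

def reconcile(cells):
    """cells: list of (timestamp_us, value); value None models a tombstone."""
    return max(cells, key=lambda c: c[0])[1]

# Coordinator A (clock 2 s fast) stamps a DELETE; coordinator B (clock
# correct) stamps the INSERT that happens 1 s of wall time LATER.
delete = (1_000_002_000_000, None)   # tombstone, stamped "in the future"
insert = (1_000_001_000_000, "v1")   # later insert, but lower timestamp

assert reconcile([delete, insert]) is None  # the newer insert is invisible
```

With client-side timestamps from a single (or NTP-synced) source, the insert would carry the higher timestamp and win as expected.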
Re: Cluster scaling
Hi Branislav, I quickly went through the code and noticed that you are updating the RF from code and expecting that Cassandra will automatically redistribute replicas as per the new RF. This is not how it works. After updating the RF, you need to run repair on all the nodes to make sure that the data replicas are as per the new RF. Please refer to https://docs.datastax.com/en/cql/3.1/cql/cql_using/update_ks_rf_t.html . This would give you reliable results. It would be good if you explained the exact purpose of your exercise; the tests seem to be more of academic interest. You are adding several variables to your tests, but each of these params has an entirely different purpose: 1. Batch/no batch depends on business atomicity needs. 2. Read/no read depends on business requirements. 3. RF depends on the fault tolerance needed. Thanks, Anuj On Wed, 8 Feb, 2017 at 9:09 PM, Branislav Janosik -T (bjanosik - AAP3 INC at Cisco) wrote: Hi all, I have a cluster of three nodes and would like to ask some questions about the performance. I wrote a small benchmarking tool in Java that mirrors (read, write) operations that we do in the real project. The problem is that it is not scaling like it should. The program runs two tests: one using a batch statement and one without the batch. The operation sequence is: optional select, insert, update, insert. I run the tool on my server with 128 threads (the # of threads has no influence on the performance), creating usually 100K resources for testing purposes. 
The average results (operations per second) with the use of a batch statement are:

Replication Factor = 1:
                 with reading   without reading
1-node cluster   37K            46K
2-node cluster   37K            47K
3-node cluster   39K            70K

Replication Factor = 2:
                 with reading   without reading
2-node cluster   21K            40K
3-node cluster   30K            48K

The average results (operations per second) without the use of a batch statement are:

Replication Factor = 1:
                 with reading   without reading
1-node cluster   31K            20K
2-node cluster   38K            39K
3-node cluster   45K            87K

Replication Factor = 2:
                 with reading   without reading
2-node cluster   19K            22K
3-node cluster   26K            36K

The Cassandra VM specs are: 16 CPUs, 16GB and two 32GB of RAM, at least 30GB of disk space for each node. Non-SSD; each VM is on a separate physical server. The code is available here: https://github.com/bjanosik/CassandraBenchTool.git . It can be built with Maven, and then you can use the jar in the target directory with java -jar target/cassandra-test-1.0-SNAPSHOT-jar-with-dependencies.jar. Thank you for any help.
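The need to repair after changing RF can be shown with a toy placement model (plain Python; the node names, the hash-based placement, and the key are illustrative, not Cassandra's actual replica placement):

```python
# Toy model: raising RF changes where replicas SHOULD live, but existing
# data does not move until repair streams it to the new replicas.

NODES = ["n1", "n2", "n3"]

def owners(key, rf):
    """Replicas for a key: the 'primary' node plus the next rf-1 nodes."""
    start = hash(key) % len(NODES)
    return {NODES[(start + i) % len(NODES)] for i in range(rf)}

key = "some-partition"
data_on = owners(key, rf=1)      # where the data actually is (written at RF=1)

new_owners = owners(key, rf=2)   # placement after ALTER KEYSPACE ... RF=2
missing = new_owners - data_on   # replicas that own the key but lack the data

assert len(missing) == 1         # exactly one new replica has no data yet
# Only after running repair on the nodes is the gap closed:
data_on |= missing
assert data_on == new_owners
```

Until that repair runs, reads routed to the new replica can miss data, which makes benchmark results after an in-code RF change unreliable.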
Re: Disc size for cluster
Adding to what Benjamin said: it is hard to estimate disk space if you are using STCS for a table where rows are updated frequently, leading to a lot of fragmentation. STCS may also lead to scenarios where tombstones are not evicted for a long time. You may go live and everything goes well for months; then gradually you realize that large sstables are holding on to tombstones because they are not getting compacted. It is not easy to test disk space requirements with precision upfront unless you test your system with real data patterns for some time. Your life can be much easier if you take care of the following points with STCS:
1. If you can afford some extra IO, go for a slightly more aggressive STCS strategy using one or more of the following settings: min_threshold=2, bucket_high=2, unchecked_tombstone_compaction=true. Which of these to use depends on your use case; study these settings.
2. Estimate the free disk required for compactions at any point in time. For example, suppose you have 5 tables with 3 TB of data in total, and you estimate that the data distribution will be as follows: A: 800 GB, B: 700 GB, C: 600 GB, D: 500 GB, E: 400 GB. If you have concurrent_compactors=3 and 90% of the data of your three largest tables is getting compacted simultaneously, you will need 90/100 * (800+700+600) GB ≈ 1.9 TB of free disk space. So you won't need 6 TB of disk for 3 TB of data; 4.9 TB would do.
3. Take a 10-15% buffer for future schema changes and calculation errors. Better safe than sorry :)
Thanks Anuj On Thu, 26 Jan, 2017 at 2:41 PM, Benjamin Roth wrote: Hi! This is basically right, but: 1. How do you know the 3TB storage will be 3TB in Cassandra? This depends on how the data is serialized and compressed, how often it changes, and on your compaction settings. 2. 50% free space with STCS is only required if you do a full compaction of a single CF that takes all the space. Normally you need as much free space as the target SSTable of a compaction will take. 
If you split your data across more CFs, it's unlikely you will really hit this value... probably you should do some tests. But in the end it is always good to have some headroom. I personally would scale out if free space is < 30%, but that always depends on your model. 2017-01-26 9:56 GMT+01:00 Raphael Vogel: Hi, just want to validate my estimation for a C* cluster which should have around 3 TB of usable storage. Assuming an RF of 3 and the SizeTiered compaction strategy, is it correct that SizeTiered compaction needs (in the worst case) 50% free disc space during compaction? So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw storage? Thanks and Regards, Raphael Vogel -- Benjamin Roth Prokurist Jaumo GmbH · www.jaumo.com Wehrstraße 46 · 73035 Göppingen · Germany Phone +49 7161 304880-6 · Fax +49 7161 304880-1 AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
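Anuj's worked estimate in this thread can be written out as a quick calculation (figures copied from his example; this is an estimate under his stated assumptions, not a general sizing rule):

```python
# Free-space headroom for concurrent STCS compactions (per the example above).
table_sizes_gb = {"A": 800, "B": 700, "C": 600, "D": 500, "E": 400}
concurrent_compactors = 3
compacting_fraction = 0.90   # assume 90% of each table compacts at once

# Worst case: the 3 largest tables compact simultaneously.
largest = sorted(table_sizes_gb.values(), reverse=True)[:concurrent_compactors]
headroom_gb = compacting_fraction * sum(largest)
assert round(headroom_gb) == 1890            # ~1.9 TB, matching the estimate

total_data_tb = 3
disk_needed_tb = total_data_tb + headroom_gb / 1000
assert round(disk_needed_tb, 1) == 4.9       # vs. 6 TB under a naive 2x rule
```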
Re: Join_ring=false Use Cases
Thanks all! I think the intent of the JIRA https://issues.apache.org/jira/browse/CASSANDRA-6961 was primarily to deal with stale information after outages and to give an opportunity to repair the data before a node joins the cluster. If a node started with join_ring=false doesn't accept writes while the repair is happening, the purpose of the JIRA is defeated, as it will anyway lead to stale information. Seems to be a defect. Thanks, Anuj On Wednesday, 21 December 2016 2:53 AM, kurt Greaves <k...@instaclustr.com> wrote: It seems that you're correct in saying that writes don't propagate to a node that has join_ring set to false, so I'd say this is a flaw. In reality I can't see many actual use cases in regards to node outages with the current implementation. The main usage, I'd think, would be to have additional coordinators for CPU-heavy workloads. To make it actually useful for repairs/outages, we'd need another option to turn on writes, so that it behaved similarly to write survey mode (but on already-bootstrapped nodes). Is there a reason we don't have this already? Or does it exist somewhere I'm not aware of? On 20 December 2016 at 17:40, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: No responses yet :) Any C* expert who could help on the join_ring use case and the concern raised? Thanks Anuj On Tue, 13 Dec, 2016 at 11:31 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after some considerable outage. The problem I see with join_ring=false is that, unlike autobootstrap, the node will NOT accept writes while you are running repair on it. 
If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
Re: Join_ring=false Use Cases
No responses yet :) Any C* expert who could help on the join_ring use case and the concern raised? Thanks Anuj On Tue, 13 Dec, 2016 at 11:31 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after a considerable outage. The problem I see with join_ring=false is that, unlike auto-bootstrap, the node will NOT accept writes while you are running repair on it. If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
Re: Join_ring=false Use Cases
Can anyone help me with join_ring and address my concerns? Thanks Anuj On Tue, 13 Dec, 2016 at 11:31 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after a considerable outage. The problem I see with join_ring=false is that, unlike auto-bootstrap, the node will NOT accept writes while you are running repair on it. If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
Re: Configure NTP for Cassandra
Thanks Martin. Agreed, setting up our own internal servers will help save some firewall traffic, simplify security management, and reduce load on public servers, which is the ethical thing to do. As the blog recommended setting up your own internal servers for Cassandra, I wanted to make sure that there are no Cassandra-specific benefits, e.g. better relative time synchronization achieved with an internal setup. So, I would conclude it this way: even though it's not good practice to access external NTP servers directly from Cassandra nodes, Cassandra can still achieve tight relative time synchronization using reliable external servers. There is no mandate to set up your own pool of internal NTP servers for BETTER time synchronization. Thanks for your inputs. Anuj On Wed, 14 Dec, 2016 at 3:22 AM, Martin Schröder <mar...@oneiros.de> wrote: 2016-11-26 20:20 GMT+01:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: > 1. If my ISP is providing me a pool of reliable NTP servers, should > I set up my own internal servers anyway or can I sync Cassandra nodes > directly to the ISP-provided servers and select one of the servers as > preferred for relative clock synchronization? Set up three NTP servers which use the provider servers _and_ pool servers, and sync your other machines from these servers (and maybe get GPS receivers for your NTP servers). This reduces NTP traffic at your firewall (your servers act as proxies) and reduces load on public servers. > 2. As per my understanding, peer association is ONLY for a backup scenario. > If a peer loses its time synchronization source, then other peers can be used > for time synchronization, thus providing an HA service. But when everything > is OK (happy path), does defining NTP servers synced from different sources > as peers lead them to converge time as mentioned in some forums? Maybe; but the difference will be negligible (sub-millisecond). I wouldn't worry about that. Best Martin
Re: Configure NTP for Cassandra
Thanks for the NTP link. Most of us are Cassandra users and must be using NTP (or other time synchronization methods) to ensure relative time synchronization in our Cassandra clusters. I hope there are people on the mailing list who can answer these questions with respect to Cassandra. There is just one detailed blog on NTP best practices for Cassandra, and I think answering these questions is important rather than just creating an internal NTP pool with the recommended settings. Thanks Anuj On Wed, 14 Dec, 2016 at 12:07 AM, Jim Witschey <jim.witsc...@datastax.com> wrote: You might find more NTP experts on the NTP questions mailing list: http://lists.ntp.org/listinfo/questions On Tue, Dec 13, 2016 at 1:25 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: > Any NTP experts willing to take up these questions? > > Thanks > Anuj > > On Sun, 27 Nov, 2016 at 12:52 AM, Anuj Wadehra > <anujw_2...@yahoo.co.in> wrote: > Hi, > > One popular NTP setup recommended for Cassandra users is described at > https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/ > . > > Summary of the article: > The setup recommends a dedicated pool of internal NTP servers which are > associated as peers to provide an HA NTP service. Cassandra nodes sync to > this dedicated pool but define one internal NTP server as the preferred server > to ensure relative clock synchronization. The internal NTP servers sync to > external NTP servers. > > My questions: > > 1. If my ISP is providing me a pool of reliable NTP servers, should > I set up my own internal servers anyway or can I sync Cassandra nodes > directly to the ISP-provided servers and select one of the servers as > preferred for relative clock synchronization? > > > I agree. If you have to rely on the public NTP pool, which selects random servers > for sync, having an internal NTP server pool is justified for getting tight > relative sync as described in the blog. > > 2.
As per my understanding, peer association is ONLY for a backup scenario. > If a peer loses its time synchronization source, then other peers can be used > for time synchronization, thus providing an HA service. But when everything > is OK (happy path), does defining NTP servers synced from different sources > as peers lead them to converge time as mentioned in some forums? > > e.g. if A and B are peers and their times are 9:00:00 and 9:00:10 after > syncing with their respective time sources, will they converge their clocks > to 9:00:05? > > I doubt the above claim regarding time convergence. Also, no formal doc says > that. Comments? > > > Thanks > Anuj >
Re: Configure NTP for Cassandra
Any NTP experts willing to take up these questions? Thanks Anuj On Sun, 27 Nov, 2016 at 12:52 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, One popular NTP setup recommended for Cassandra users is described at https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/. Summary of the article: the setup recommends a dedicated pool of internal NTP servers which are associated as peers to provide an HA NTP service. Cassandra nodes sync to this dedicated pool but define one internal NTP server as the preferred server to ensure relative clock synchronization. The internal NTP servers sync to external NTP servers. My questions:
1. If my ISP is providing me a pool of reliable NTP servers, should I set up my own internal servers anyway, or can I sync Cassandra nodes directly to the ISP-provided servers and select one of them as preferred for relative clock synchronization? I agree that if you have to rely on the public NTP pool, which selects random servers for sync, having an internal NTP server pool is justified for getting tight relative sync as described in the blog.
2. As per my understanding, peer association is ONLY for a backup scenario: if a peer loses its time synchronization source, then other peers can be used for time synchronization, thus providing an HA service. But when everything is OK (happy path), does defining NTP servers synced from different sources as peers lead them to converge time, as mentioned in some forums? E.g. if A and B are peers and their times are 9:00:00 and 9:00:10 after syncing with their respective time sources, will they converge their clocks to 9:00:05? I doubt the above claim regarding time convergence. Also, no formal doc says that. Comments? Thanks Anuj
Join_ring=false Use Cases
Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after a considerable outage. The problem I see with join_ring=false is that, unlike auto-bootstrap, the node will NOT accept writes while you are running repair on it. If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
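The timeline reasoning above can be sketched with a small helper (hypothetical illustration, not Cassandra code): a hint is only stored for a mutation if the target node has been down for less than the hint window when the mutation arrives.

```python
from datetime import datetime, timedelta

HINT_WINDOW = timedelta(hours=3)  # analogous to max_hint_window_in_ms

def mutation_hinted(down_at, mutation_at, window=HINT_WINDOW):
    """True if a hint would be stored for the downed replica.

    Hints stop being written once the node has been down longer
    than the hint window.
    """
    return mutation_at - down_at <= window

down = datetime(2016, 12, 13, 11, 0)           # Node B goes down at 11 AM
print(mutation_hinted(down, datetime(2016, 12, 13, 13, 30)))  # 1:30 PM -> True
print(mutation_hinted(down, datetime(2016, 12, 13, 14, 30)))  # 2:30 PM -> False
```

This reproduces the example: mutations from 2-3 PM fall outside the 3-hour window, so node B never receives them via hints.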
Re: Single cluster node restore
Hi Petr, If data corruption means accidental data deletion via Cassandra commands, you have to restore the entire cluster with the latest snapshots. This may lead to data loss, as there may be valid updates after the snapshot was taken but before the data deletion. Restoring a single node with a snapshot won't help, as Cassandra replicated the accidental deletes to all nodes. If data corruption means accidental deletion of some sstable files from the file system of a node, repair would fix it. If data corruption means unreadable data due to hardware issues etc., you will have two options after replacing the disk: bootstrap, or restore a snapshot on the single affected node. If you have huge data per node, e.g. 300 GB, you may want to restore from a snapshot followed by repair. Restoring a snapshot on a single node is faster than streaming all data via bootstrap. If the node is not recoverable and must be replaced, you should be able to auto-bootstrap or restore from a snapshot with auto-bootstrap set to false. I haven't replaced a dead node with a snapshot, but there should not be any issues, as token ranges don't change when you replace a node. Thanks Anuj On Tue, 29 Nov, 2016 at 11:08 PM, Petr Malik wrote: Hi. I have a question about Cassandra backup-restore strategies. As far as I understand, Cassandra has been designed to survive hardware failures by relying on data replication. It seems like people still want backup/restore for the case when somebody accidentally deletes data or the data gets otherwise corrupted. In that case, restoring all keyspace/table snapshots on all nodes should bring it back. I am asking because I often read directions on restoring a single node in a cluster. I am just wondering under what circumstances this could be done safely. Please correct me if I am wrong, but restoring just a single node does not really roll back the data, as the newer (corrupt) data will be served by other replicas and eventually propagated to the restored node. Right?
In fact, by doing so one may end up reintroducing deleted data... Also, since Cassandra distributes the data throughout the cluster, it is not clear on which node any particular (corrupt) data resides, and hence which node to restore. I guess this is a long way of asking whether there is an advantage to trying to restore just a single node in a Cassandra cluster, as opposed to, say, replacing the dead node and letting Cassandra handle the replication. Thanks.
Configure NTP for Cassandra
Hi, One popular NTP setup recommended for Cassandra users is described at https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/. Summary of the article: the setup recommends a dedicated pool of internal NTP servers which are associated as peers to provide an HA NTP service. Cassandra nodes sync to this dedicated pool but define one internal NTP server as the preferred server to ensure relative clock synchronization. The internal NTP servers sync to external NTP servers. My questions:
1. If my ISP is providing me a pool of reliable NTP servers, should I set up my own internal servers anyway, or can I sync Cassandra nodes directly to the ISP-provided servers and select one of them as preferred for relative clock synchronization? I agree that if you have to rely on the public NTP pool, which selects random servers for sync, having an internal NTP server pool is justified for getting tight relative sync as described in the blog.
2. As per my understanding, peer association is ONLY for a backup scenario: if a peer loses its time synchronization source, then other peers can be used for time synchronization, thus providing an HA service. But when everything is OK (happy path), does defining NTP servers synced from different sources as peers lead them to converge time, as mentioned in some forums? E.g. if A and B are peers and their times are 9:00:00 and 9:00:10 after syncing with their respective time sources, will they converge their clocks to 9:00:05? I doubt the above claim regarding time convergence. Also, no formal doc says that. Comments? Thanks Anuj
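The setup described in the blog can be sketched roughly with ntp.conf fragments like the following (illustrative only; hostnames and pool entries are placeholders, and the exact settings should come from the blog and the NTP documentation):

```
# On each internal NTP server: sync upstream, peer with the other internal servers
server 0.pool.ntp.org iburst
peer   ntp1.internal.example
peer   ntp2.internal.example

# On each Cassandra node: sync only to the internal pool, one server marked preferred
server ntp1.internal.example iburst prefer
server ntp2.internal.example iburst
server ntp3.internal.example iburst
```

The `prefer` keyword is what gives the nodes a common reference for tight relative synchronization, while the extra `server` lines provide failover.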
Re: Some questions to updating and tombstone
Hi Boying, I agree with Vladimir. If compaction does not compact the two sstables containing the updates soon, disk space will be wasted. For example, if the updates are not close in time, the first update might be in a big sstable by the time the second update is being written to a new small sstable; STCS won't compact them together soon. Just writing column values with a new timestamp shouldn't create any tombstones, but if data is not merged for long, disk space issues may arise. If you are on STCS, just to get an idea of the extent of the problem you can run a major compaction and see the amount of disk space reclaimed (don't do this in production, as major compaction has its own side effects). Which compaction strategy are you using? Are these updates done with TTL? Thanks Anuj On Mon, 14 Nov, 2016 at 1:54 PM, Vladimir Yudovin wrote: Hi Boying, UPDATE writes a new value with a new timestamp. The old value is not a tombstone, but remains until compaction. gc_grace_period is not related to this. Best regards, Vladimir Yudovin, Winguzone - Hosted Cloud Cassandra. Launch your cluster in minutes. On Mon, 14 Nov 2016 03:02:21 -0500 Lu, Boying wrote: Hi, All, Will Cassandra generate a new tombstone when updating a column by using a CQL update statement? And is there any way to get the number of tombstones of a column family, since we want to avoid generating too many tombstones within gc_grace_period? Thanks Boying
Re: Handle Leap Seconds with Cassandra
Thanks Ben for taking out time for the detailed reply!! We don't need strict ordering for all operations, but we are looking at scenarios where 2 quick updates to the same column of the same row are possible. By quick updates, I mean >300 ms. Configuring NTP properly (as mentioned in some blogs in your link) should give fair relative accuracy between the Cassandra nodes. But a leap second takes the clock back an ENTIRE second (huge), and the probability of an old write overwriting the new one increases drastically. So, we want to be proactive about things. I agree that you should avoid such scenarios with design (if possible). Good to know that you guys have set up your own NTP servers as per the recommendation. Curious: do you also do some monitoring around NTP? Thanks Anuj On Fri, 28 Oct, 2016 at 12:25 AM, Ben Bromhead <b...@instaclustr.com> wrote: If you need guaranteed strict ordering in a distributed system, I would not use Cassandra; Cassandra does not provide this out of the box. I would look to a system that uses Lamport or vector clocks. Based on your description of how your system runs at the moment (and how close your updates are together), you have either already experienced out-of-order updates or there is a real possibility you will in the future. Sorry to be so dire, but if you do require causal consistency / strict ordering, you are not getting it at the moment. Distributed systems theory is really tricky, even for people that are "experts" on distributed systems over unreliable networks (I would certainly not put myself in that category). People have made a very good name for themselves by showing that the vast majority of distributed databases have had bugs when it comes to their various consistency models and the claims these databases make. So make sure you really do need guaranteed causal consistency/strict ordering, or whether you can design around it (e.g. using conflict-free replicated data types) or choose a system that is designed to provide it.
Having said that... here are some hacky things you could do in Cassandra to try and get this behaviour, which I in no way endorse doing :) - Cassandra counters do leverage a logical clock per shard, and you could hack something together with counters and lightweight transactions, but you would want to do your homework on counter accuracy before diving into it, as I don't know if the implementation is safe in the context of your question. Also, this would probably require a significant rework of your application plus a significant performance hit. I would invite a counter guru to jump in here... - You can leverage the fact that timestamps are monotonic if you isolate writes to a single node for a single shard... but you then lose Cassandra's availability guarantees; e.g. a keyspace with an RF of 1 and a CL of > ONE will get monotonic timestamps (if generated on the server side). - Continuing down the path of isolating writes to a single node for a given shard, you could also isolate writes to the primary replica using your client driver during the leap second (make it a minute either side of the leap), but again you lose out on availability, and you are probably already experiencing out-of-order writes given how close your writes and updates are. A note on NTP: NTP is generally fine if you use it to keep the clocks synced between the Cassandra nodes. If you are interested in how we have implemented NTP at Instaclustr, see our blog post on it: https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/. Ben On Thu, 27 Oct 2016 at 10:18 Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Ben, Thanks for your reply. We don't use timestamps in the primary key. We rely on server-side timestamps generated by the coordinator. So, no functions at the client side would help. Yes, drifts can create problems too. But even if you ensure that nodes are perfectly synced with NTP, you will surely mess up the order of updates during the leap second (interleaving).
Some applications update the same column of the same row quickly (within a second), and reversing the order would corrupt the data. I am interested in learning how people relying on a strict order of updates handle the leap second scenario when the clock goes back one second (the same second is repeated). What kind of tricks do people use to ensure that server-side timestamps are monotonic? As per my understanding, NTP slew mode may not be suitable for Cassandra, as it may cause unpredictable drift amongst the Cassandra nodes. Ideas? Thanks Anuj Sent from Yahoo Mail on Android On Thu, 20 Oct, 2016 at 11:25 PM, Ben Bromhead <b...@instaclustr.com> wrote: http://www.datastax.com/dev/blog/preparing-for-the-leap-second gives a pretty good overview. If you are using a timestamp as part of your primary key, this is the situation where you could end up overwriting data. I would suggest using t
Re: Handle Leap Seconds with Cassandra
Hi Ben, Thanks for your reply. We don't use timestamps in the primary key. We rely on server-side timestamps generated by the coordinator. So, no functions at the client side would help. Yes, drifts can create problems too. But even if you ensure that nodes are perfectly synced with NTP, you will surely mess up the order of updates during the leap second (interleaving). Some applications update the same column of the same row quickly (within a second), and reversing the order would corrupt the data. I am interested in learning how people relying on a strict order of updates handle the leap second scenario when the clock goes back one second (the same second is repeated). What kind of tricks do people use to ensure that server-side timestamps are monotonic? As per my understanding, NTP slew mode may not be suitable for Cassandra, as it may cause unpredictable drift amongst the Cassandra nodes. Ideas? Thanks Anuj Sent from Yahoo Mail on Android On Thu, 20 Oct, 2016 at 11:25 PM, Ben Bromhead <b...@instaclustr.com> wrote: http://www.datastax.com/dev/blog/preparing-for-the-leap-second gives a pretty good overview. If you are using a timestamp as part of your primary key, this is the situation where you could end up overwriting data. I would suggest using timeuuid instead, which will ensure that you get different primary keys even for data inserted at the exact same timestamp. The blog post also suggests using certain monotonic timestamp classes in Java; however, these will not help you if you have multiple clients that may overwrite data. As for the interleaving or out-of-order problem, this is hard to address in Cassandra without resorting to external coordination or LWTs. If you are relying on a wall clock to guarantee order in a distributed system, you will get yourself into trouble even without leap seconds (clock drift, NTP inaccuracy etc). On Thu, 20 Oct 2016 at 10:30 Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I would like to know how you guys handle leap seconds with Cassandra.
I am not bothered about the livelock issue, as we are using appropriate versions of Linux and Java. I am more interested in finding an optimum answer to the following question: how do you handle wrong ordering of multiple writes (on the same row and column) during the leap second? You may overwrite the new value with the old one (disaster). And downtime is no option :) I can see that CASSANDRA-9131 is still open. FYI, we are on 2.0.14. Thanks Anuj -- Ben Bromhead, CTO | Instaclustr, +1 650 284 9692, Managed Cassandra / Spark on AWS, Azure and Softlayer
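One of the tricks asked about above (monotonic timestamp generation, which some drivers later adopted client-side) can be sketched as follows. This is a simplified illustration, not actual driver or Cassandra code:

```python
import threading
import time

class MonotonicTimestampGenerator:
    """Microsecond timestamps that never go backwards, even if the wall
    clock does (e.g. when a leap second repeats a second)."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            now = int(time.time() * 1_000_000)
            # If the clock went backwards (or stood still), bump by 1 us
            self._last = max(now, self._last + 1)
            return self._last

gen = MonotonicTimestampGenerator()
a, b, c = gen.next(), gen.next(), gen.next()
assert a < b < c  # strictly increasing even within the same microsecond
```

The trade-off: during the repeated second the generator drifts ahead of the wall clock by at most the number of writes issued, but write ordering is preserved on that one source. It does not help when multiple independent clients write to the same cell.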
Handle Leap Seconds with Cassandra
Hi, I would like to know how you guys handle leap seconds with Cassandra. I am not bothered about the livelock issue, as we are using appropriate versions of Linux and Java. I am more interested in finding an optimum answer to the following question: how do you handle wrong ordering of multiple writes (on the same row and column) during the leap second? You may overwrite the new value with the old one (disaster). And downtime is no option :) I can see that CASSANDRA-9131 is still open. FYI, we are on 2.0.14. Thanks Anuj
Re: Cassandra installation best practices
Hi Mehdi, You can refer to https://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettings.html. Thanks Anuj On Mon, 17 Oct, 2016 at 10:20 PM, Mehdi Bada wrote: Hi all, Do any best practices exist for installing Cassandra in a production environment? Some standard to follow? For instance, the file system type etc.
Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Hi Leena, Do you have a firewall between the two DCs? If yes, connection resets can be caused by Cassandra trying to use a TCP connection which has already been closed by the firewall. Please make sure that you set a high connection timeout at the firewall. Also, make sure your servers are not overloaded. Please see https://developer.ibm.com/answers/questions/231996/why-do-we-get-the-error-connection-reset-by-peer-d.html for general causes of connection resets. Also, as I said earlier, the Cassandra troubleshooting docs explain it well: https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html. Make sure the firewall and node TCP settings are in sync, such that nodes close a TCP connection before the firewall does. With firewall timeouts, we generally see Merkle tree requests/responses failing between nodes in the two DCs, and then repair hangs forever. Not sure how Merkle tree creation, which is node specific, would be impacted by a multi-DC setup. Are repairs with the -local option completing without problems? Thanks Anuj
Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Hi Leena, The first thing you should be concerned about is: why does the repair -pr operation not complete? Second comes the question: which repair option is best? One probable cause of stuck repairs is a firewall between the DCs closing TCP connections; if Cassandra tries to use such connections, repairs will hang. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html. We faced that. Also make sure you comply with the basic bandwidth requirement between DCs: recommended is 1000 Mb/s (1 gigabit) or greater. Answers to your specific questions: 1. As per my understanding, not all replicas will participate in DC-local repairs, and thus the repair would be ineffective. You need to make sure that all replicas of the data in all DCs are in sync. 2. Every DC is not a ring; all DCs together form a token ring. So, I think yes, you should run repair -pr on all nodes. 3. Yes. I don't have experience with incremental repairs, but you can run repair -pr on all nodes of all DCs. Regarding the best approach to repair, you should see some repair presentations from Cassandra Summit 2016; all are online now. I attended the summit, and people using large clusters generally use sub-range repairs to repair their clusters. But such large deployments are on older Cassandra versions, and these deployments generally don't use vnodes, so people know easily which nodes hold which token range. Thanks Anuj
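Sub-range repair, mentioned above, works by repairing small token ranges one at a time (`nodetool repair -st <start> -et <end>`). A sketch of splitting the full Murmur3 token ring into equal sub-ranges, to show the idea (illustrative only; real tools split the ranges a node actually owns, not the whole ring):

```python
# Murmur3Partitioner token range boundaries
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_ring(num_splits):
    """Divide the full Murmur3 token range into contiguous sub-ranges."""
    step = (MAX_TOKEN - MIN_TOKEN) // num_splits
    ranges, start = [], MIN_TOKEN
    for i in range(num_splits):
        # Last range absorbs the rounding remainder
        end = MAX_TOKEN if i == num_splits - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges

for st, et in split_ring(4):
    print(f"nodetool repair -st {st} -et {et}")
```

Repairing many small ranges keeps each Merkle tree exchange short, which makes a hang from a firewall-dropped connection both less likely and cheaper to retry.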
Re: Multiple Network Interfaces in non-EC2
Hi Amir, I would like to understand your requirement first. Why do you need the multi-interface configuration mentioned at http://docs.datastax.com/en/cassandra/3.x/cassandra/configuration/configMultiNetworks.html with a single-DC setup? As per my understanding, you could simply set listen_address to the private IP and not set the broadcast_address and listen_on_broadcast_address properties at all. You could use your private IP everywhere, because you don't have any other DC which would connect using the public IP. With multiple DCs, you need a public IP for communicating with nodes in other DCs, and that's where you need a private IP for internal communication and a public IP for cross-DC communication. Let me know if using the private IP solves your problem. Also, if you have a specific use case for the multi-interface configuration, you could add a NAT rule to route traffic on the public IP to your private IP (route traffic on the Cassandra port only). This could act as a workaround until the JIRA is fixed. Let me know if you see any issues with the workaround. Thanks Anuj
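For a single-DC cluster, the suggestion above amounts to a cassandra.yaml fragment like this (the address is a placeholder for the node's private IP):

```yaml
# Single DC: use the private IP everywhere
listen_address: 10.0.0.5
rpc_address: 10.0.0.5
# broadcast_address and listen_on_broadcast_address deliberately left unset;
# they are only needed when other DCs must reach this node on a public IP
```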
Re: Open File Handles for Deleted sstables
Restarting may be a temporary workaround but can't be a permanent solution; after some days the problem will come back. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 29 Sep, 2016 at 12:54 AM, sai krishnam raju potturi <pskraj...@gmail.com> wrote: restarting the cassandra service helped get rid of those files in our situation. thanks Sai On Wed, Sep 28, 2016 at 3:15 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We are facing an issue where Cassandra holds open file handles for deleted sstable files. These open file handles keep increasing with time and eventually lead to a disk-space crisis. This is visible via the lsof command. There are no exceptions in the logs. We suspect a race condition where compactions/repairs and reads act on the same sstable. I have gone through a few JIRAs but have not been able to correlate the issue with those tickets. We are using 2.0.14. OS is Red Hat Linux. Any suggestions? Thanks Anuj
Open File Handles for Deleted sstables
Hi, We are facing an issue where Cassandra holds open file handles for deleted sstable files. These open file handles keep increasing with time and eventually lead to a disk-space crisis. This is visible via the lsof command. There are no exceptions in the logs. We suspect a race condition where compactions/repairs and reads act on the same sstable. I have gone through a few JIRAs but have not been able to correlate the issue with those tickets. We are using 2.0.14. OS is Red Hat Linux. Any suggestions? Thanks Anuj
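The symptom described above can be spotted with lsof roughly like this. A sketch only: the way the Cassandra process is located (matching on "CassandraDaemon") and the Data.db filter are example choices, not part of the thread:

```shell
# Count handles Cassandra still holds on sstable data files whose
# directory entries have already been unlinked from disk.
lsof -p "$(pgrep -f CassandraDaemon)" | grep '(deleted)' | grep -c 'Data.db'
```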
Re: Preferred IP is NULL
Finally, the multiple-DC setup is working as expected on 2.0. So the recipe for configuring multiple interfaces on 2.0 is as follows: use GossipingPropertyFileSnitch so that the preferred IP (private IP) is used for intra-DC communication. Moreover, you need a NAT rule to route traffic on the public interface to the private interface. The NAT rule is needed due to CASSANDRA-9748 (no process listens on the broadcast address). Thanks Anuj Sent from Yahoo Mail on Android On Mon, 22 Aug, 2016 at 11:55 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: We are using PropertyFileSnitch. Thanks Anuj Sent from Yahoo Mail on Android On Mon, 22 Aug, 2016 at 7:09 PM, Paulo Motta <pauloricard...@gmail.com> wrote: What snitch are you using? If GPFS you need to enable the prefer_local=true flag (this is automatic on EC2MRS). 2016-08-21 22:24 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi Paulo, I am aware of CASSANDRA-9748. It says that Cassandra only listens on listen_address and not broadcast_address. To overcome that, I can add a NAT rule to route all traffic on the public IP to the private IP. But why is the preferred IP set to null in the peers table? What's the expected value? Even if I add a NAT rule as a workaround for CASSANDRA-9748, what if the public interface is down on a node? My traffic would still fail. I want at least the nodes in my local DC to contact each other on the private IP. I thought the preferred IP was for that purpose, so I am focusing on fixing the null value of the preferred IPs. Thanks Anuj On Sun, 21 Aug, 2016 at 7:10 PM, Paulo Motta <pauloricard...@gmail.com> wrote: See CASSANDRA-9748, I think it might be related. 2016-08-20 15:20 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP.
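The prefer_local flag Paulo mentions is configured in cassandra-rackdc.properties when GossipingPropertyFileSnitch is in use. A sketch; the DC and rack names are examples:

```properties
# cassandra-rackdc.properties
dc=DC1
rack=RAC1
# Use the node's private (listen) address for intra-DC traffic
# instead of the broadcast (public) address.
prefer_local=true
```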
According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Re: Preferred IP is NULL
We are using PropertyFileSnitch. Thanks Anuj Sent from Yahoo Mail on Android On Mon, 22 Aug, 2016 at 7:09 PM, Paulo Motta <pauloricard...@gmail.com> wrote: What snitch are you using? If GPFS you need to enable the prefer_local=true flag (this is automatic on EC2MRS). 2016-08-21 22:24 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi Paulo, I am aware of CASSANDRA-9748. It says that Cassandra only listens on listen_address and not broadcast_address. To overcome that, I can add a NAT rule to route all traffic on the public IP to the private IP. But why is the preferred IP set to null in the peers table? What's the expected value? Even if I add a NAT rule as a workaround for CASSANDRA-9748, what if the public interface is down on a node? My traffic would still fail. I want at least the nodes in my local DC to contact each other on the private IP. I thought the preferred IP was for that purpose, so I am focusing on fixing the null value of the preferred IPs. Thanks Anuj On Sun, 21 Aug, 2016 at 7:10 PM, Paulo Motta <pauloricard...@gmail.com> wrote: See CASSANDRA-9748, I think it might be related. 2016-08-20 15:20 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP. According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Re: Preferred IP is NULL
Hi Paulo, I am aware of CASSANDRA-9748. It says that Cassandra only listens on listen_address and not broadcast_address. To overcome that, I can add a NAT rule to route all traffic on the public IP to the private IP. But why is the preferred IP set to null in the peers table? What's the expected value? Even if I add a NAT rule as a workaround for CASSANDRA-9748, what if the public interface is down on a node? My traffic would still fail. I want at least the nodes in my local DC to contact each other on the private IP. I thought the preferred IP was for that purpose, so I am focusing on fixing the null value of the preferred IPs. Thanks Anuj On Sun, 21 Aug, 2016 at 7:10 PM, Paulo Motta <pauloricard...@gmail.com> wrote: See CASSANDRA-9748, I think it might be related. 2016-08-20 15:20 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP. According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Preferred IP is NULL
Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP. According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Re: Public Interface Failure in Multiple DC setup
Hi, Can someone take these questions? Thanks Anuj On Thu, 11 Aug, 2016 at 8:30 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, Setup: Cassandra 2.0.14 with PropertyFileSnitch, 2 data centers. Every node has broadcast_address = public IP (bond0) and listen_address = private IP (bond1). As per the DataStax docs (https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html), "For intra-network or region traffic, Cassandra switches to the private IP after establishing a connection". This means that even for traffic within a DC, Cassandra would contact a node on the broadcast address, i.e. the public IP, and then switch to the private IP. Query: if we shut down bond0 (the public interface) on a node: 1. Will read/write requests from coordinators in the local DC be routed to the node on the listen address (private IP, bond1), or will the node be treated as DOWN? 2. Will gossip be able to discover the node? If the node uses its private interface (bond1) to send gossip messages to other nodes on their public/broadcast addresses, will other nodes in the local and remote DCs see the node (with bond0 down) as UP? I am aware that https://issues.apache.org/jira/browse/CASSANDRA-9748 is an open issue in 2.0.14. But even for later releases, I am interested in the behavior when the public interface is down and PropertyFileSnitch is used. Thanks Anuj
Public Interface Failure in Multiple DC setup
Hi, Setup: Cassandra 2.0.14 with PropertyFileSnitch, 2 data centers. Every node has broadcast_address = public IP (bond0) and listen_address = private IP (bond1). As per the DataStax docs (https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html), "For intra-network or region traffic, Cassandra switches to the private IP after establishing a connection". This means that even for traffic within a DC, Cassandra would contact a node on the broadcast address, i.e. the public IP, and then switch to the private IP. Query: if we shut down bond0 (the public interface) on a node: 1. Will read/write requests from coordinators in the local DC be routed to the node on the listen address (private IP, bond1), or will the node be treated as DOWN? 2. Will gossip be able to discover the node? If the node uses its private interface (bond1) to send gossip messages to other nodes on their public/broadcast addresses, will other nodes in the local and remote DCs see the node (with bond0 down) as UP? I am aware that https://issues.apache.org/jira/browse/CASSANDRA-9748 is an open issue in 2.0.14. But even for later releases, I am interested in the behavior when the public interface is down and PropertyFileSnitch is used. Thanks Anuj
Re: (C)* stable version after 3.5
Hi Alain, This caught my attention: "Also I am not sure if the 2.2 major version is something you can skip while upgrading through a rolling restart. I believe you can, but it is not what is recommended." Why do you think that skipping 2.2 is not recommended when NEWS.txt suggests otherwise? Can you elaborate? Thanks Anuj On Tue, 12 Jul, 2016 at 7:31 PM, Alain RODRIGUEZ wrote: Hi, The only "fix" release after 3.5 is 3.7. Yet it is hard to say if it is more stable; we can hope so. For tick-tock releases (3.X), odd numbers are fix releases and even numbers are feature releases. Not sure why you want something above 3.5, but take care: those versions are really recent and less tested, so maybe not that "stable". If you want something more stable, I believe you can go with 3.0.8. Yet I am not telling you not to do that; some people need to start testing new things, right? So if you choose 3.7 because you want some feature from there, it is perfectly OK, just move carefully: maybe read some open tickets and previous experiences from the community, and test the upgrade process first on a dev cluster. Also I am not sure if the 2.2 major version is something you can skip while upgrading through a rolling restart. I believe you can, but it is not what is recommended. Testing will let you know anyway. Good luck and tell us how it went :-). C*heers, --- Alain Rodriguez - alain@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com 2016-07-12 11:05 GMT+02:00 Varun Barala: Hi all users, Currently we are using cassandra-2.1.13 but we want to upgrade 2.1.13 to 3.x in production. Could anyone please tell me which is the most stable Cassandra version after 3.5? Thanking You!! Regards, Varun Barala
Evict Tombstones with STCS
Hi, We are using C* 2.0.x. What options are available if disk space is too full to compact the huge sstables produced by STCS (created long ago but not getting compacted because min_compaction_threshold is 4)? We suspect that a huge amount of space will be released when the 2 largest sstables are compacted together such that tombstone eviction is possible. But there is not enough space to compact them together, assuming that the compaction would need free disk of at least size of sstable1 + size of sstable2. I read the STCS code, and if no sets of sstables are available for compaction, it should pick an individual sstable for a single-sstable compaction. But somehow the huge sstables are not participating in single-sstable compactions. Is it due to the default 20% tombstone threshold? And if so, forceUserDefinedCompaction or setting unchecked_tombstone_compaction to true won't help either, as tombstones are less than 20% and not much disk would be recovered. It is not possible to add additional disks either. We see a huge difference in disk utilization across nodes; maybe some nodes were able to get away with tombstones while others didn't manage to evict them. Would be good to know more alternatives from the community. Thanks Anuj Sent from Yahoo Mail on Android
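For reference, the two compaction knobs mentioned above are per-table subproperties. A sketch, assuming a hypothetical table my_ks.my_table (the threshold value is just an illustration):

```sql
-- Lower the ratio of droppable tombstones that makes a single
-- sstable eligible for a tombstone compaction (default 0.2),
-- and allow such compactions even when the sstable overlaps others.
ALTER TABLE my_ks.my_table WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'tombstone_threshold': '0.05',
  'unchecked_tombstone_compaction': 'true'
};
```

Note that, as the post says, if the actual droppable-tombstone ratio in the big sstables is low, rewriting them this way reclaims little space.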
Re: Production Ready/Stable DataStax Java Driver
Thanks Alex! We are starting to use CQL for the first time (we have been on Thrift till now), so I think it makes sense to use Java driver 3.0.1 directly instead of 2.1.10. As the 3.x driver supports all 1.2+ Cassandra versions, I would also like to better understand the motivation for having 2.1 releases simultaneously with 3.x releases of the Java driver. One obvious reason would be the "breaking changes" in 3.x: 2.1.x bug-fix releases give existing 2.1 users some breathing room to accommodate those breaking changes in their code, instead of forcing them to make the changes at short notice and upgrade to 3.x immediately. Is that understanding correct? Thanks Anuj Sent from Yahoo Mail on Android On Sun, 8 May, 2016 at 9:01 PM, Alex Popescu <al...@datastax.com> wrote: Hi Anuj, All released versions of the DataStax Java driver are production ready: 1. they all go through the complete QA cycle; 2. we merge all bug fixes and improvements upstream. Now, if you are asking which is currently the most deployed version, that's 2.1 (latest version 2.1.10.1 [1]). If you want to be ready for future Cassandra upgrades and benefit from the latest features of the Java driver, then that's the 3.0 branch (latest version 3.0.1 [2]). Last but not least, when making the decision you should also consider that our current focus and main development go into the 3.x branch, and that 2.1 is in maintenance mode (meaning that no new features will be added and it will only see critical bug fixes). Bottom line: if your application is not already developed against the 2.1 version of the Java driver, you should use the latest 3.0 release.
[1]: https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/bYQSUvKQm5k/JduPTt7cGAAJ [2]: https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/tOWZm4RVbm4/5E_aDAc8IAAJ On Sun, May 8, 2016 at 7:39 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, Which DataStax Java Driver release is most stable (production ready) for Cassandra 2.1? Thanks Anuj -- Bests, Alex Popescu | @al3xandru Sen. Product Manager @ DataStax » DataStax Enterprise - the database for cloud applications. «
RE: Inconsistent Reads after Restoring Snapshot
Sean, I meant that commit-log archival was never part of the "restoring a snapshot" DataStax documentation. How is commitlog archival related to my concern? Please elaborate. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 28 Apr, 2016 at 9:24 PM, sean_r_dur...@homedepot.com <sean_r_dur...@homedepot.com> wrote: https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configLogArchive_t.html Sean Durity From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] Sent: Wednesday, April 27, 2016 10:44 PM To: user@cassandra.apache.org Subject: RE: Inconsistent Reads after Restoring Snapshot No, we are not saving them. I have never read that in the DataStax documentation. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 28 Apr, 2016 at 12:45 AM, sean_r_dur...@homedepot.com <sean_r_dur...@homedepot.com> wrote: What about the commitlogs? Are you saving those off anywhere in between the snapshot and the crash? Sean Durity From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] Sent: Monday, April 25, 2016 10:26 PM To: User Subject: Inconsistent Reads after Restoring Snapshot Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record.
Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.
RE: Inconsistent Reads after Restoring Snapshot
No, we are not saving them. I have never read that in the DataStax documentation. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 28 Apr, 2016 at 12:45 AM, sean_r_dur...@homedepot.com <sean_r_dur...@homedepot.com> wrote: What about the commitlogs? Are you saving those off anywhere in between the snapshot and the crash? Sean Durity From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] Sent: Monday, April 25, 2016 10:26 PM To: User Subject: Inconsistent Reads after Restoring Snapshot Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record. Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap.
Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
Re: Inconsistent Reads after Restoring Snapshot
Thanks Romain! So just to clarify, you are suggesting the following steps: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes. 1 PM: Follow these steps to restore the 10 AM snapshot on node A: 1. Restore the data as mentioned in https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html with ONE EXCEPTION >> start node A with -Dcassandra.join_ring=false. 2. Run repair. 3. Restart node A with -Dcassandra.join_ring=true. Please confirm. I was not aware that join_ring can also be used during a normal restart; I thought it was only an option during auto-bootstrap :) Thanks Anuj On Tue, 26/4/16, Romain Hardouin <romainh...@yahoo.fr> wrote: Subject: Re: Inconsistent Reads after Restoring Snapshot To: "user@cassandra.apache.org" <user@cassandra.apache.org> Date: Tuesday, 26 April, 2016, 12:47 PM You can make a restore on the new node A (don't forget to set the token(s) in cassandra.yaml), start the node with -Dcassandra.join_ring=false and then run a repair on it. Have a look at https://issues.apache.org/jira/browse/CASSANDRA-6961 Best, Romain On Tuesday, 26 April 2016 at 4:26 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation, if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record.
Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
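The restore-then-repair recipe discussed in this thread can be sketched as follows. The paths and service commands are examples; the -Dcassandra.join_ring flag itself is the one from CASSANDRA-6961:

```shell
# On the replaced node A, after copying the snapshot into the data
# directory and setting the node's token(s) in cassandra.yaml:

# 1. Start Cassandra WITHOUT joining the ring, so coordinators cannot
#    route QUORUM reads to this node while its data is still stale.
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false" cassandra

# 2. Repair the node while it is invisible to client reads.
nodetool repair

# 3. Restart normally (join_ring defaults to true) to serve traffic.
nodetool stopdaemon
cassandra
```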
Re: Upgrading to SSD
Thanks All! Anuj On Mon, 25/4/16, Alain RODRIGUEZ <arodr...@gmail.com> wrote: Subject: Re: Upgrading to SSD To: user@cassandra.apache.org Date: Monday, 25 April, 2016, 2:45 PM Hi Anuj, "You could do the following instead to minimize server downtime: 1. rsync while the server is running 2. rsync again to get any new files 3. shut server down 4. rsync for the 3rd time 5. change directory in yaml and start back up" +1 Here are some more details about that process and a script doing most of the job: thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html Hope it will be useful to you. C*heers, --- Alain Rodriguez - alain@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com 2016-04-23 21:47 GMT+02:00 Jonathan Haddad <j...@jonhaddad.com>: You could do the following instead to minimize server downtime: 1. rsync while the server is running 2. rsync again to get any new files 3. shut server down 4. rsync for the 3rd time 5. change directory in yaml and start back up On Sat, Apr 23, 2016 at 12:23 PM Clint Martin <clintlmar...@coolfiretechnologies.com> wrote: As long as you shut down the node before you start copying and moving stuff around, it shouldn't matter whether you take backups or snapshots or whatever. When you add the filesystem for the SSD, will you be removing the existing filesystem? Or will you be able to keep both filesystems mounted at the same time for the migration? If you can keep them both mounted at the same time, then an off-system backup isn't strictly necessary. Just change your data dir config in your yaml, copy the data and commitlog from the old dir to the new SSD, and restart the node. If you can't keep both filesystems mounted concurrently, then a remote system is necessary to copy the data to, but the steps and procedure are the same. Running repair before you do the migration isn't strictly necessary. Not a bad idea if you don't mind spending the time.
Definitely run repair after you restart the node, especially if the work takes longer than the hint window. As for your filesystems, there is really nothing special to worry about. Depending on which filesystem you use, there are recommendations for tuning and configuration that you should probably follow. (DataStax's recommendations as well as Al Tobey's tuning guide are great resources: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html ) Clint On Apr 23, 2016 3:05 PM, "Anuj Wadehra" <anujw_2...@yahoo.co.in> wrote: Hi, We have a 3-node cluster on 2.0.14. We use read/write QUORUM and RF is 3. We want to move the data and commitlog directories from a SATA HDD to an SSD. We have planned a rolling upgrade. We plan to run repair -pr on all nodes to sync data upfront and then execute the following steps on each server, one by one: 1. Take a backup of the data/commitlog directories to an external server. 2. Change mount points so that the Cassandra data/commitlog directories now point to the SSD. 3. Copy files from the external backup to the SSD. 4. Start Cassandra. 5. Run full repair on the node before starting step 1 on the next node. Questions: 1. Is copying the commitlog and data directories good enough, or should we take a snapshot of each node and restore data from that snapshot? 2. Any precautions we need to take while moving data to the new SSD? Filesystem format of the two disks, etc.? 3. Should we drain data before taking the backup? We are also restoring the commitlog directory from backup. 4. I have added a repair to sync full data upfront and a repair after restoring data on each node. Sounds safe and logical? 5. Any problems you see with the mentioned approach? Any better approach? Thanks Anuj Sent from Yahoo Mail on Android
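The three-pass rsync migration recommended in this thread can be sketched like this. The source and destination mount points are examples:

```shell
# Pass 1: bulk copy while Cassandra is still running (bulk of the data).
rsync -a /var/lib/cassandra/ /mnt/ssd/cassandra/

# Pass 2: pick up files written since pass 1 (node still running).
rsync -a /var/lib/cassandra/ /mnt/ssd/cassandra/

# Flush memtables, stop the node, then do a final exact pass.
nodetool drain && service cassandra stop
rsync -a --delete /var/lib/cassandra/ /mnt/ssd/cassandra/

# Point data_file_directories and commitlog_directory in cassandra.yaml
# at the SSD paths, then start the node back up.
service cassandra start
```

The first two passes keep the final downtime short: only the small delta written between pass 2 and the shutdown has to be copied while the node is offline.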
Inconsistent Reads after Restoring Snapshot
Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record. Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
Re: Unable to reliably count keys on a thrift CF
Hi Carlos, Please check if the JIRA : https://issues.apache.org/jira/browse/CASSANDRA-11467 fixes your problem. We had been facing row count issue with thrift cf / compact storage and this fixed it. Above is fixed in latest 2.1.14. Its a two line fix. So, you can also prepare a custom jar and check if that works. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 21 Apr, 2016 at 9:29 PM, Carlos Alonsowrote: Hi guys. I've been struggling for the last days to find a reliable and stable way to count keys in a thrift column family. My idea is to basically iterate the whole ring using the token function, as documented here: https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in batches of 1 records The only corner case is that if there were more than 1 records in a single partition (not the case, but the program should still handle it) it explores the partition in depth by getting all records for that particular token (see below). In the end, all keys are saved into a hash to guarantee uniqueness. The count of unique keys is always different (and random, sometimes more keys, sometimes less are retrieved) and, of course, I'm sure no activity is going on in that cf. I'm running Cassandra 2.1.11 with MurMur3 partitioner. RF=3 and CL=QUORUM the column family structure is CREATE TABLE tbl ( key blob, column1 ascii, value blob, PRIMARY KEY(key, column1)) and I'm running the following script connection = open_cql_connectionresults = connection.execute("SELECT token(key), key FROM tbl LIMIT 1") keys_hash = {} // Hash to save the keys to guarantee uniquenesslast_token = niltoken = nil while results != nil results.each do |row| keys_hash[row['key']] = true token = row['token(key)'] end if token == last_token results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) = #{token}") else results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) >= #{token} LIMIT 1") end last_token = tokenend puts keys.keys.count What am I missing? 
Thanks! Carlos Alonso | Software Engineer | @calonso
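Below is a minimal in-memory model (in Python, for testability; tokens and the page size are toy values, and this is an assumed fix rather than the original author's code) of one way to make the ring walk in the script above deterministic: whenever a token lands on a page boundary, drain that token completely and then advance strictly past it, so a partition can never straddle two pages and no key is counted twice or skipped.

```python
def count_keys(ring, page=5):
    """Count unique keys by walking (token, key) pairs in ring order.

    A page is every row with token strictly greater than the last
    boundary; the boundary token is then drained in full before moving
    on, which handles partitions wider than one page.
    """
    ring = sorted(ring)                     # (token, key) pairs, ring order
    seen, last = set(), None
    while True:
        rows = [(t, k) for t, k in ring if last is None or t > last][:page]
        if not rows:
            break
        boundary = rows[-1][0]
        # take the page plus every remaining row sharing the boundary
        # token, then advance strictly past that token
        for t, k in ring:
            if (last is None or t > last) and t <= boundary:
                seen.add(k)
        last = boundary
    return len(seen)

# 30 keys hashed onto only 7 distinct tokens, so several "partitions"
# are wider than one page
ring = [(i % 7, f"k{i}") for i in range(30)]
assert count_keys(ring) == 30
```

The key difference from the original script is that the restart condition never uses `>=`, which re-reads the boundary row and is one way a loop like this can return a different count on every run.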
Re: StatusLogger is logging too many information
Hi, You can set the property gc_warn_threshold_in_ms in the yaml. For example, if your application is ok with a 2000ms pause, you can set the value to 2000 so that only GC pauses greater than 2000ms lead to a GC warning and status log. Please refer to https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-8907 for details. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 25 Apr, 2016 at 3:20 PM, jason zhao yang wrote: Hi, Currently StatusLogger will log info when there are dropped messages or a GC of more than 200 ms. In my use case, there are about 1000 tables. StatusLogger is logging too much information for each table. I wonder if there is a way to reduce this log? For example, only print the thread pool information. Thanks.
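For reference, the setting discussed above lives in cassandra.yaml; a hypothetical excerpt using the 2000 ms example from the reply:

```yaml
# cassandra.yaml -- only GC pauses longer than this threshold (in ms)
# trigger the GCInspector warning and the accompanying StatusLogger dump
gc_warn_threshold_in_ms: 2000
```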
Upgrading to SSD
Hi, We have a 3-node cluster on 2.0.14. We use Read/Write QUORUM and RF is 3. We want to move the data and commitlog directories from a SATA HDD to an SSD. We have planned a rolling upgrade. We plan to run repair -pr on all nodes to sync data upfront and then execute the following steps on each server, one by one:
1. Take a backup of the data/commitlog directories to an external server.
2. Change mount points so that the Cassandra data/commitlog directories now point to the SSD.
3. Copy files from the external backup to the SSD.
4. Start Cassandra.
5. Run a full repair on the node before starting step 1 on the next node.
Questions:
1. Is copying the commitlog and data directories good enough, or should we take a snapshot of each node and restore data from that snapshot?
2. Any precautions we need to take while moving data to the new SSD? File system format of the two disks, etc.
3. Should we drain before taking the backup? We are also restoring the commitlog directory from backup.
4. I have added a repair to sync full data upfront and a repair after restoring data on each node. Sounds safe and logical?
5. Any problems you see with the mentioned approach? Any better approach?
Thanks, Anuj Sent from Yahoo Mail on Android
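The per-node steps above might be sketched as a shell script like the following (the paths, device name, and service commands are assumptions for illustration, not the poster's actual layout; with DRY_RUN=1, the default here, each command is only echoed so the sequence can be reviewed before anything runs):

```shell
#!/bin/sh
# Per-node SSD migration sketch; all paths and devices are hypothetical.
DRY_RUN=${DRY_RUN:-1}
run() {
  # echo the command in dry-run mode, execute it otherwise
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run nodetool drain                                    # flush memtables before stopping
run service cassandra stop
run rsync -a /var/lib/cassandra/ /backup/cassandra/   # step 1: backup to external storage
run mount /dev/ssd1 /var/lib/cassandra                # step 2: remount data dir on the SSD
run rsync -a /backup/cassandra/ /var/lib/cassandra/   # step 3: restore onto the SSD
run service cassandra start                           # step 4
run nodetool repair                                   # step 5: full repair before the next node
```

Draining before the backup (step 3 of the questions) would make the commitlog copy largely moot, since a drained node has flushed its memtables.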
Re: Efficient Paging Option in Wide Rows
Hi, Can anyone take this question? Thanks, Anuj Sent from Yahoo Mail on Android On Sat, 23 Apr, 2016 at 2:30 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: I think I complicated the question, so I am trying to put it crisply. We have a table defined with a clustering key/column. We have 50,000 different clustering key values. If we want to fetch all 50,000 rows, which query option would be faster, and why? 1. A single primary key/partition key with 50,000 clustering keys: we will page through the single partition 500 records at a time. Thus, we will do 50,000/500 = 100 db hits, but for the same partition key. 2. 100 different primary keys, with each primary key having just 500 clustering key columns. Here also we will need 100 db hits, but for different partitions. Basically, I want to understand any optimizations built into CQL/Cassandra which make paging through a single partition more efficient than querying data from different partitions. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Apr, 2016 at 8:27 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I have a wide row index table so that I can fetch all row keys corresponding to a column value. A row of index_table will look like:
ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
..
ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
We will have buckets to avoid hotspots. Row keys of the main table are random numbers, and we will never do a column slice like:
Select * from index_table where key=xxx and Col > rowkey1 and col < rowkey10
Also, we will ALWAYS fetch all data for a given value of the index column, thus all buckets have to be read. Each index column value can map to thousands to millions of row keys in the main table. Based on our use case, there are two design choices in front of me: 1. Have a large number of buckets/rows for an index column value and have less data (around a few thousand) in each row. Thus, every time we want to fetch all row keys for an index col value, we will query more rows, and for each row we will have to page through data 500 records at a time. 2. Have fewer buckets/rows for an index column value. Every time we want to fetch all row keys for an index col value, we will query a smaller number of wider rows and then page through each wide row, reading 500 columns at a time. Which approach is more efficient? Approach 1: more rows with less data in each row. OR Approach 2: fewer rows with more data in each row. Either way, we are fetching only 500 records at a time in a query. Even in approach 2 (wider rows), we can query only small chunks of 500 at a time. Thanks, Anuj
Re: Efficient Paging Option in Wide Rows
I think I complicated the question, so I am trying to put it crisply. We have a table defined with a clustering key/column. We have 50,000 different clustering key values. If we want to fetch all 50,000 rows, which query option would be faster, and why? 1. A single primary key/partition key with 50,000 clustering keys: we will page through the single partition 500 records at a time. Thus, we will do 50,000/500 = 100 db hits, but for the same partition key. 2. 100 different primary keys, with each primary key having just 500 clustering key columns. Here also we will need 100 db hits, but for different partitions. Basically, I want to understand any optimizations built into CQL/Cassandra which make paging through a single partition more efficient than querying data from different partitions. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Apr, 2016 at 8:27 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I have a wide row index table so that I can fetch all row keys corresponding to a column value. A row of index_table will look like:
ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
..
ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
We will have buckets to avoid hotspots. Row keys of the main table are random numbers, and we will never do a column slice like:
Select * from index_table where key=xxx and Col > rowkey1 and col < rowkey10
Also, we will ALWAYS fetch all data for a given value of the index column, thus all buckets have to be read. Each index column value can map to thousands to millions of row keys in the main table. Based on our use case, there are two design choices in front of me: 1. Have a large number of buckets/rows for an index column value and have less data (around a few thousand) in each row. Thus, every time we want to fetch all row keys for an index col value, we will query more rows, and for each row we will have to page through data 500 records at a time. 2. Have fewer buckets/rows for an index column value. Every time we want to fetch all row keys for an index col value, we will query a smaller number of wider rows and then page through each wide row, reading 500 columns at a time. Which approach is more efficient? Approach 1: more rows with less data in each row. OR Approach 2: fewer rows with more data in each row. Either way, we are fetching only 500 records at a time in a query. Even in approach 2 (wider rows), we can query only small chunks of 500 at a time. Thanks, Anuj
Efficient Paging Option in Wide Rows
Hi, I have a wide row index table so that I can fetch all row keys corresponding to a column value. A row of index_table will look like:
ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
..
ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
We will have buckets to avoid hotspots. Row keys of the main table are random numbers, and we will never do a column slice like:
Select * from index_table where key=xxx and Col > rowkey1 and col < rowkey10
Also, we will ALWAYS fetch all data for a given value of the index column, thus all buckets have to be read. Each index column value can map to thousands to millions of row keys in the main table. Based on our use case, there are two design choices in front of me: 1. Have a large number of buckets/rows for an index column value and have less data (around a few thousand) in each row. Thus, every time we want to fetch all row keys for an index col value, we will query more rows, and for each row we will have to page through data 500 records at a time. 2. Have fewer buckets/rows for an index column value. Every time we want to fetch all row keys for an index col value, we will query a smaller number of wider rows and then page through each wide row, reading 500 columns at a time. Which approach is more efficient? Approach 1: more rows with less data in each row. OR Approach 2: fewer rows with more data in each row. Either way, we are fetching only 500 records at a time in a query. Even in approach 2 (wider rows), we can query only small chunks of 500 at a time. Thanks, Anuj
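As a toy illustration of the trade-off described above (the bucket count, hash choice, and key format are assumptions, not the actual design): every read fans out to all buckets either way, and the total number of 500-record pages fetched comes out roughly the same, so the choice mostly shifts work between "more partitions" and "more pages per partition".

```python
import hashlib

N_BUCKETS = 16          # approach 1 = a large value, approach 2 = a small one
PAGE_SIZE = 500

def bucket_for(row_key: str) -> int:
    # deterministic spread of row keys across buckets to avoid hotspots
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % N_BUCKETS

def index_partition(col_value: str, row_key: str) -> str:
    # partition key of index_table, e.g. "ColValue1:bucket7"
    return f"{col_value}:bucket{bucket_for(row_key)}"

def pages_to_read(total_keys: int, n_buckets: int) -> int:
    # every bucket is read, page by page; the total page count is what
    # the question above is really comparing
    per_bucket = -(-total_keys // n_buckets)        # ceiling division
    return n_buckets * -(-per_bucket // PAGE_SIZE)

keys = [f"rk{i}" for i in range(50_000)]
partitions = {index_partition("ColValue1", k) for k in keys}
assert len(partitions) <= N_BUCKETS
# few wide buckets vs many narrow buckets: similar total page reads
assert pages_to_read(50_000, 16) == 112    # 3125 keys/bucket -> 7 pages each
assert pages_to_read(50_000, 100) == 100   # 500 keys/bucket  -> 1 page each
```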
Re: Migrating to CQL and Non Compact Storage
Thanks Jim. I think you understand the pain of migrating TBs of data to new tables. There is no command to change from compact to non-compact storage, and the fastest solution, migrating data using Spark, is too slow for production systems. And the pain gets bigger when your performance dips after moving to a non-compact storage table. That's because non-compact storage is quite an inefficient storage format until 3.x, and it incurs a heavy penalty on row scan performance in analytics workloads. Please go through the link to understand how the old compact storage gives much better performance than non-compact storage as far as row scans are concerned: https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis The flexibility of CQL comes at a heavy cost until 3.x. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 11 Apr, 2016 at 10:35 PM, Jim Ancona <j...@anconafamily.com> wrote: Jack, the Datastax link he posted (http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column families with mixed dynamic and static columns: "The only solution to be able to access the column family fully is to remove the declared columns from the thrift schema altogether..." I think that page describes the problem and the potential solutions well. I haven't seen an answer to Anuj's question about why the native CQL solution using collections doesn't perform as well. Keep in mind that some of us understand CQL just fine but have working pre-CQL Thrift-based systems storing hundreds of terabytes of data, with requirements that mean that saying "bite the bullet and re-model your data" is not really helpful. Another quote from that Datastax link: "Thrift isn't going anywhere." Granted, that link is three-plus years old, but Thrift *is* now going away, so it's not unexpected that people will be trying to figure out how to deal with that. It's bad enough that we need to rewrite our clients to use CQL instead of Thrift.
It's not helpful to say that we should also re-model and migrate all our data. Jim On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote: Sorry, but your message is too confusing - you say "reading dynamic columns in CQL" and "make the table schema-less", but neither has any relevance to CQL! 1. CQL tables always have schemas. 2. All columns in CQL are statically declared (even maps/collections are statically declared columns.) Granted, it is a challenge for Thrift users to get used to the terminology of CQL, but it is required. If necessary, review some of the free online training videos for data modeling. Unless your data model is very simple and directly translates into CQL, you probably do need to bite the bullet and re-model your data to exploit the features of CQL rather than fight CQL trying to mimic Thrift per se. In any case, take another shot at framing the problem and then maybe people here can help you out. -- Jack Krupansky On Mon, Apr 11, 2016 at 10:39 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Any comments or suggestions on this one? Thanks, Anuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but are facing some challenges. We have a cf with a mix of statically defined columns and dynamic columns (created at run time). For reading dynamic columns in CQL, we have two options: 1. Drop all columns and make the table schema-less. This way, we will get a CQL row for each column defined for a row key, as mentioned here: http://www.datastax.com/dev/blog/thrift-to-cql3 2. Migrate the entire data to a new non-compact storage table and create collections for the dynamic columns in the new table. In our case, we have observed that approach 2 causes 3 times slower performance in the range scan queries used by Spark. This is not acceptable. Cassandra 3 has an optimized storage engine, but we are not comfortable moving to 3.x in production. Moreover, data migration to the new table using Spark takes hours. Any suggestions for the two issues? Thanks, Anuj Sent from Yahoo Mail on Android
Re: Migrating to CQL and Non Compact Storage
Any comments or suggestions on this one? Thanks, Anuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but are facing some challenges. We have a cf with a mix of statically defined columns and dynamic columns (created at run time). For reading dynamic columns in CQL, we have two options: 1. Drop all columns and make the table schema-less. This way, we will get a CQL row for each column defined for a row key, as mentioned here: http://www.datastax.com/dev/blog/thrift-to-cql3 2. Migrate the entire data to a new non-compact storage table and create collections for the dynamic columns in the new table. In our case, we have observed that approach 2 causes 3 times slower performance in the range scan queries used by Spark. This is not acceptable. Cassandra 3 has an optimized storage engine, but we are not comfortable moving to 3.x in production. Moreover, data migration to the new table using Spark takes hours. Any suggestions for the two issues? Thanks, Anuj Sent from Yahoo Mail on Android
Migrating to CQL and Non Compact Storage
Hi, We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but are facing some challenges. We have a cf with a mix of statically defined columns and dynamic columns (created at run time). For reading dynamic columns in CQL, we have two options: 1. Drop all columns and make the table schema-less. This way, we will get a CQL row for each column defined for a row key, as mentioned here: http://www.datastax.com/dev/blog/thrift-to-cql3 2. Migrate the entire data to a new non-compact storage table and create collections for the dynamic columns in the new table. In our case, we have observed that approach 2 causes 3 times slower performance in the range scan queries used by Spark. This is not acceptable. Cassandra 3 has an optimized storage engine, but we are not comfortable moving to 3.x in production. Moreover, data migration to the new table using Spark takes hours. Any suggestions for the two issues? Thanks, Anuj Sent from Yahoo Mail on Android
Re: DataStax OpsCenter with Apache Cassandra
Thanks Jeff. If one needs to use OpsCenter with 2.2 or earlier versions of Apache Cassandra, is one required to buy a license for it separately? What are the options if someone wants to use OpsCenter with Apache Cassandra 3.x (commercial use)? Thanks, Anuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 10:42 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote: It is possible to use OpsCenter for open source / community versions up to 2.2.x. It will not be possible in 3.0+ From: Anuj Wadehra Reply-To: "user@cassandra.apache.org" Date: Sunday, April 10, 2016 at 9:28 AM To: User Subject: DataStax OpsCenter with Apache Cassandra
DataStax OpsCenter with Apache Cassandra
Hi, Is it possible to use DataStax OpsCenter for monitoring Apache-distributed Cassandra in production? Or is it possible to use DataStax OpsCenter if you are not using DataStax Enterprise in production? Thanks, Anuj
Re: Does saveToCassandra work with Cassandra Lucene plugin ?
I used it with Java, and there every field of the POJO must map to the column names of the table. I think someone with Scala syntax knowledge can help you better. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 28 Mar, 2016 at 11:47 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: With my limited experience with Spark, I can tell you that you need to make sure that all columns mentioned in SomeColumns are part of the CQL schema of the table. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 28 Mar, 2016 at 11:38 pm, Cleosson José Pirani de Souza <cso...@daitangroup.com> wrote: Hello, I am implementing the example on GitHub (https://github.com/Stratio/cassandra-lucene-index), and when I try to save the data using saveToCassandra I get the exception NoSuchElementException. If I use CassandraConnector.withSessionDo I am able to add elements into Cassandra and no exception is raised. The code:

import org.apache.spark.{SparkConf, SparkContext, Logging}
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._

object App extends Logging {
  def main(args: Array[String]) {
    // Get the cassandra IP and create the spark context
    val cassandraIP = System.getenv("CASSANDRA_IP")
    val sparkConf = new SparkConf(true)
      .set("spark.cassandra.connection.host", cassandraIP)
      .set("spark.cleaner.ttl", "3600")
      .setAppName("Simple Spark Cassandra Example")
    val sc = new SparkContext(sparkConf)

    // Works
    CassandraConnector(sparkConf).withSessionDo { session =>
      session.execute("INSERT INTO demo.tweets(id, user, body, time, latitude, longitude) VALUES (19, 'Name', 'Body', '2016-03-19 09:00:00-0300', 39, 39)")
    }

    // Does not work
    val demo = sc.parallelize(Seq((9, "Name", "Body", "2016-03-29 19:00:00-0300", 29, 29)))
    // Raises the exception
    demo.saveToCassandra("demo", "tweets", SomeColumns("id", "user", "body", "time", "latitude", "longitude"))
  }
}

The exception:

16/03/28 14:15:41 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
Exception in thread "main" java.util.NoSuchElementException: Column not found in demo.tweets
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at scala.collection.Map$WithDefault.default(Map.scala:52)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:153)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:152)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.TableDef.<init>(Schema.scala:152)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:283)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:271)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Set$Set4.foreach(Set.scala:137)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:271)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:295)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:294)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:294)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:307)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:304)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.
Re: Does saveToCassandra work with Cassandra Lucene plugin ?
With my limited experience with Spark, I can tell you that you need to make sure that all columns mentioned in SomeColumns are part of the CQL schema of the table. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 28 Mar, 2016 at 11:38 pm, Cleosson José Pirani de Souza wrote: Hello, I am implementing the example on GitHub (https://github.com/Stratio/cassandra-lucene-index), and when I try to save the data using saveToCassandra I get the exception NoSuchElementException. If I use CassandraConnector.withSessionDo I am able to add elements into Cassandra and no exception is raised. The code:

import org.apache.spark.{SparkConf, SparkContext, Logging}
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._

object App extends Logging {
  def main(args: Array[String]) {
    // Get the cassandra IP and create the spark context
    val cassandraIP = System.getenv("CASSANDRA_IP")
    val sparkConf = new SparkConf(true)
      .set("spark.cassandra.connection.host", cassandraIP)
      .set("spark.cleaner.ttl", "3600")
      .setAppName("Simple Spark Cassandra Example")
    val sc = new SparkContext(sparkConf)

    // Works
    CassandraConnector(sparkConf).withSessionDo { session =>
      session.execute("INSERT INTO demo.tweets(id, user, body, time, latitude, longitude) VALUES (19, 'Name', 'Body', '2016-03-19 09:00:00-0300', 39, 39)")
    }

    // Does not work
    val demo = sc.parallelize(Seq((9, "Name", "Body", "2016-03-29 19:00:00-0300", 29, 29)))
    // Raises the exception
    demo.saveToCassandra("demo", "tweets", SomeColumns("id", "user", "body", "time", "latitude", "longitude"))
  }
}

The exception:

16/03/28 14:15:41 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
Exception in thread "main" java.util.NoSuchElementException: Column not found in demo.tweets
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at
scala.collection.Map$WithDefault.default(Map.scala:52)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:153)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:152)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.TableDef.<init>(Schema.scala:152)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:283)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:271)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Set$Set4.foreach(Set.scala:137)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:271)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:295)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:294)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:294)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:307)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:304)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:120)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at
Expiring Columns
Hi, I want to understand how expiring columns work in Cassandra. Query: The documentation says that once the TTL of a column expires, tombstones are created/marked when the sstable gets compacted. Is there a possibility that a query (range scan/row query) returns expired column data just because the sstable never participated in a compaction after the TTL of the column expired? For example:
10 AM - Data inserted with ttl=60 seconds
10:05 AM - A query is run on the inserted data
10:07 AM - The sstable is compacted and the column is marked as a tombstone
Will the query return expired data in the above scenario? If yes/no, why? Thanks, Anuj Sent from Yahoo Mail on Android
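A toy model of the scenario in question, under the assumption that expiration is also enforced on the read path (i.e., a cell whose TTL has elapsed is treated as dead when read, even before any compaction has written a tombstone for it):

```python
class Cell:
    """A stored value with an absolute expiry time (seconds since insert)."""
    def __init__(self, value, ttl_seconds, written_at):
        self.value = value
        self.expires_at = written_at + ttl_seconds

def read(cell, now):
    # read-time filter: an expired cell is invisible to queries
    # regardless of whether compaction has run yet
    return cell.value if now < cell.expires_at else None

c = Cell("data", ttl_seconds=60, written_at=0)   # inserted at 10:00, ttl=60s
assert read(c, now=30) == "data"                 # within the TTL: visible
assert read(c, now=300) is None                  # the 10:05 query: not returned
```

Under this model, compaction at 10:07 only reclaims the space; it is not what hides the data from the 10:05 query.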
IOException: MkDirs Failed to Create in Spark
Hi, We are using Spark with Cassandra. While using rdd.saveAsTextFile("/tmp/dr"), we get the following error when we run the application with root access. Spark is able to create two levels of directories but fails after that with the exception:

16/03/01 22:59:48 WARN TaskSetManager: Lost task 73.3 in stage 0.0 (TID 144, host1): java.io.IOException: Mkdirs failed to create file:/tmp/dr/_temporary/0/_temporary/attempt_201603012259__m_73_144
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Permissions on /tmp: chmod -R 777 /tmp has been executed, and the permissions look like: drwxrwxrwx. 31 root root 1.2K Mar 1 22:54 tmp. Forgive me for raising this question on the Cassandra mailing list. I think the Spark and Cassandra user bases overlap, so I expected help here. I am not yet part of the Spark mailing list. Thanks, Anuj
Re: how to read parent_repair_history table?
Hi Jimmy, We are on 2.0.x. We are planning to use JMX notifications for getting repair status. To repair the database, we call the forceTableRepairPrimaryRange JMX operation from our Java client application on each node. You can call other, more recent JMX methods for repair. I would be keen to know the pros/cons of handling repair status via JMX notifications vs. via database tables. We are planning to implement it as follows:
1. Before repairing each keyspace via JMX, register two listeners: one for listening to StorageService MBean notifications about repair status, and the other a connection listener for detecting connection failures and lost JMX notifications.
2. We ensure that if 256 success session notifications are received, the keyspace repair is successful. We have 256 ranges on each node.
3. If there are connection-closed notifications, we re-register the MBean listener and retry the repair once.
4. If there are lost notifications, we retry the repair once before failing it.
Thanks, Anuj Sent from Yahoo Mail on Android On Thu, 25 Feb, 2016 at 7:18 pm, Paulo Motta wrote: Hello Jimmy, The parent_repair_history table keeps track of start and finish information of a repair session. The other table, repair_history, keeps track of repair status as it progresses. So, you must first query the parent_repair_history table to check whether a repair started and finished, as well as its duration, and inspect the repair_history table to troubleshoot more specific details of a given repair session. Answering your questions below: > Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF, regardless of whether it is across DCs, a local node repair, or other options? Actually two entries, one for start and one for finish. > A repair job is done only if the "finished" column contains a value? and a repair job is successfully done only if there is no value in exception_message or exception_stacktrace? correct > What is the purpose of the successful_ranges column?
Do I have to check that they all match requested_ranges to ensure a successful run? correct > Ultimately, how to find out the overall repair health/status in a given cluster? Check whether repair is being executed on all nodes within gc_grace_seconds, and tune that value or troubleshoot problems otherwise. > Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days? Sounds good. You can check https://issues.apache.org/jira/browse/CASSANDRA-5839 for more information. 2016-02-25 3:13 GMT-03:00 Jimmy Lin: hi all, a few questions regarding how to read or digest the system_distributed.parent_repair_history CF, which I am very interested in using to find out our repair status...
- Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF, regardless of whether it is across DCs, a local node repair, or other options?
- A repair job is done only if the "finished" column contains a value? And a repair job is successfully done only if there is no value in exception_message or exception_stacktrace? What is the purpose of the successful_ranges column? Do I have to check they all match requested_ranges to ensure a successful run?
- Ultimately, how to find out the overall repair health/status in a given cluster? Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?
---
CREATE TABLE system_distributed.parent_repair_history (
  parent_id timeuuid PRIMARY KEY,
  columnfamily_names set<text>,
  exception_message text,
  exception_stacktrace text,
  finished_at timestamp,
  keyspace_name text,
  requested_ranges set<text>,
  started_at timestamp,
  successful_ranges set<text>
)
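The status-tracking plan in steps 1-4 above could be modeled as follows (the range count and retry-once policy mirror the message; the actual JMX wiring is omitted, and the notification values are hypothetical stand-ins):

```python
EXPECTED_RANGES = 256   # one success notification expected per range

def repair_status(notifications, retried=False):
    """Decide the outcome of one keyspace repair from its notifications.

    notifications: sequence of "success" / "lost" / "connection_closed"
    retried: whether this run is already the one retry allowed.
    """
    successes = sum(1 for n in notifications if n == "success")
    disrupted = any(n in ("lost", "connection_closed") for n in notifications)
    if successes >= EXPECTED_RANGES:
        return "success"            # all 256 ranges reported success
    if disrupted and not retried:
        return "retry"              # re-register listener / re-run repair once
    return "failed"                 # incomplete even after the single retry

assert repair_status(["success"] * 256) == "success"
assert repair_status(["success"] * 10 + ["lost"]) == "retry"
assert repair_status(["success"] * 10 + ["lost"], retried=True) == "failed"
```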
Re: Restart Cassandra automatically
Hi Subharaj, Cassandra is built to be a fault-tolerant distributed db, suitable for building HA systems. As Cassandra provides multiple replicas for the same data, if a single node goes down in production, it won't bring down the cluster. In my opinion, if you aim to restart one or more failed Cassandra nodes without investigating the issue, you can damage system health rather than preserve it. Please set RF and CL appropriately to ensure that the system can tolerate node failures. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 5 Feb, 2016 at 9:56 am, Debraj Manna wrote: Hi, What is the best way to keep cassandra running? My requirement is that if for some reason cassandra stops, it should get started automatically. I tried to achieve this by adding cassandra to supervisord. My supervisor conf for cassandra looks like below:

[program:cassandra]
command=/bin/bash -c 'sleep 10 && bin/cassandra'
directory=/opt/cassandra/
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/cassandra_supervisor.err.log
stdout_logfile=/var/log/cassandra_supervisor.out.log

But it does not seem to work properly. Even if I stop cassandra from supervisor, the cassandra process still seems to be running if I do ps -ef | grep cassandra. I also tried the configuration mentioned in this question but still no luck. Can someone let me know the best way to keep cassandra running in a production environment? Environment - Cassandra 2.2.4 - Debian 8 Thanks,
Re: Debugging write timeouts on Cassandra 2.2.5
Hi Mike, Using batches with many rows puts a heavy load on the coordinator and is generally not considered good practice. With 1500 rows in a batch spanning different partition keys, even on a large cluster you will eventually end up waiting on every node in the cluster. This increases the likelihood of timeouts. As per my understanding of batches, I think you should revisit the batch size. Maybe people with expertise on batches can comment here. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 19 Feb, 2016 at 8:18 pm, Mike Heffner <m...@librato.com> wrote: Anuj, So we originally started testing with Java 8 + G1, however we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses during the runs. The query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, between 6-1500 rows per batch. Mike On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: What's the GC overhead? Can you share your GC collector and settings? What's your query pattern? Do you use secondary indexes, batches, IN clause etc? Anuj Sent from Yahoo Mail on Android On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner <m...@librato.com> wrote: Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts.
This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: Hi Mike, What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here? Might this be due to some network hiccup accessing the disks, as they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses (use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)? Something else you could check is local write stats, to see if only one table is affected or this is keyspace / cluster wide. You can use metrics exposed by Cassandra, or if you have no dashboards I believe a 'nodetool cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea of local latencies. Those are just things I would check; I don't have a clue what is happening here, hope this helps. C*heers, Alain Rodriguez, France The Last Pickle http://www.thelastpickle.com 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>: Jaydeep, No, we don't use any lightweight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote: Are you guys using lightweight transactions in your write path? On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <fabrice.faco...@gmail.com> wrote: Are your commitlog and data on the same disk? If yes, you should put commitlogs on a separate disk which doesn't have a lot of IO. Other IO may have a great impact on your commitlog writing and may even block it.
An example of the impact IO may have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>: > Jeff, > > We have both commitlog and data on a 4TB EBS with 10k IOPS. > > Mike > > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: >> >> What disk size are you using? >> >> >> >> From: Mike Heffner >> Reply-To: "user@cassandra.apache.org" >> Date: Wednesday, February 10, 2016 at 2:24 PM >> To: "user@cassandra.apache.org" >> Cc: Peter Norton >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> Paulo, >> >> Thanks for the suggestion, we ran some tests against CMS and saw the same >> timeouts. On that note though, we are going to try doubling the instance >> sizes and testing with double the heap (even though current usage is low).
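The batch-size advice in this thread can be applied mechanically: instead of one 1500-row batch spanning many partitions, split the rows into one batch per partition so each batch is served by the replicas that own that partition. A minimal sketch, with rows as plain dicts and a hypothetical "pk" partition-key field (the helper name is illustrative, not a driver API):

```python
from collections import defaultdict

def split_by_partition(rows, key="pk"):
    """Group rows by partition key so each batch touches a single
    partition, keeping the coordinator from fanning out cluster-wide.
    Hypothetical helper: rows are dicts, `key` names the partition key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return list(groups.values())

# 9 rows spread over 3 partitions -> 3 single-partition batches
rows = [{"pk": i % 3, "v": i} for i in range(9)]
batches = split_by_partition(rows)
print(len(batches))  # 3
```

Each resulting group would then be sent as its own (unlogged) batch, or simply as individual async writes, rather than as one giant multi-partition batch.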
Re: Debugging write timeouts on Cassandra 2.2.5
What's the GC overhead? Can you share your GC collector and settings? What's your query pattern? Do you use secondary indexes, batches, IN clause etc? Anuj Sent from Yahoo Mail on Android On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner wrote: Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ wrote: Hi Mike, What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here? Might this be due to some network hiccup accessing the disks, as they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses (use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)? Something else you could check is local write stats, to see if only one table is affected or this is keyspace / cluster wide.
You can use metrics exposed by Cassandra, or if you have no dashboards I believe a 'nodetool cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea of local latencies. Those are just things I would check; I don't have a clue what is happening here, hope this helps. C*heers, Alain Rodriguez, France The Last Pickle http://www.thelastpickle.com 2016-02-18 5:13 GMT+01:00 Mike Heffner : Jaydeep, No, we don't use any lightweight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia wrote: Are you guys using lightweight transactions in your write path? On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat wrote: Are your commitlog and data on the same disk? If yes, you should put commitlogs on a separate disk which doesn't have a lot of IO. Other IO may have a great impact on your commitlog writing and may even block it. An example of the impact IO may have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic 2016-02-11 0:31 GMT+01:00 Mike Heffner : > Jeff, > > We have both commitlog and data on a 4TB EBS with 10k IOPS. > > Mike > > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa > wrote: >> >> What disk size are you using? >> >> >> >> From: Mike Heffner >> Reply-To: "user@cassandra.apache.org" >> Date: Wednesday, February 10, 2016 at 2:24 PM >> To: "user@cassandra.apache.org" >> Cc: Peter Norton >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> Paulo, >> >> Thanks for the suggestion, we ran some tests against CMS and saw the same >> timeouts. On that note though, we are going to try doubling the instance >> sizes and testing with double the heap (even though current usage is low). >> >> Mike >> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta >> wrote: >>> >>> Are you using the same GC settings as the staging 2.0 cluster? If not, >>> could you try using the default GC settings (CMS) and see if that changes >>> anything?
This is just a wild guess, but there were reports before of >>> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 >>> for more context). Please ignore if you already tried reverting back to CMS. >>> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner : Hi all, We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (mirror of our staging load). We are writing at QUORUM and while p95's look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100
Re: Scenarios which need Repair
Hi, Can someone take this? Thanks, Anuj On Mon, 8 Feb, 2016 at 11:44 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, Setup: We are on 2.0.14. We have some deployments with just one DC (RF:3) while others have two DCs (RF:3, RF:3). We ALWAYS use LOCAL_QUORUM for both reads and writes. Scenario: We used to run nodetool repair on all tables every gc_grace_seconds. Recently, we decided to identify tables which really need repair and only run repair on those. We have identified basically two kinds of tables which don't need repair: TYPE-1: tables which have only inserts, no upserts, and deletion by TTL. TYPE-2: tables with a counter column. I don't have much experience with counters, but I can explain the use case. We use a counter column to keep a check on our traffic rate. Values are usually updated numerous times a minute and we need not be ACCURATE with the value; a few values here and there are OK. Questions: Can we COMPLETELY avoid maintenance repair on TYPE-1 and TYPE-2 tables? If yes, will there be any side effect of not repairing such data often in case of dropped mutations, failure scenarios etc.? What would be the scenarios when repair is needed on such tables? Thanks Anuj
Re: Moving Away from Compact Storage
Will it be possible to read the dynamic-column data from compact storage and transform it into a collection, e.g. a map, in the new table? Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:28 am, DuyHai Doan <doanduy...@gmail.com> wrote: So there is no "static" (in the sense of CQL static) column in your legacy table. Just define a Scala case class to match this table and use Spark to dump the content to a new non-compact CQL table On Tue, Feb 2, 2016 at 7:55 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Our old table looks like this from cqlsh: CREATE TABLE table1 ( key text, "Col1" blob, "Col2" text, "Col3" text, "Col4" text, PRIMARY KEY (key)) WITH COMPACT STORAGE AND … And it will have some dynamic text data which we are planning to add to collections. Please let me know if you need more details. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:14 am, DuyHai Doan <doanduy...@gmail.com> wrote: Can you give the CREATE TABLE script for your old compact storage table? Or at least the cassandra-cli creation script On Tue, Feb 2, 2016 at 3:48 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Thanks DuyHai !! We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration.
Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Moving Away from Compact Storage
Our old table looks like this from cqlsh: CREATE TABLE table1 ( key text, "Col1" blob, "Col2" text, "Col3" text, "Col4" text, PRIMARY KEY (key)) WITH COMPACT STORAGE AND … And it will have some dynamic text data which we are planning to add to collections. Please let me know if you need more details. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:14 am, DuyHai Doan <doanduy...@gmail.com> wrote: Can you give the CREATE TABLE script for your old compact storage table? Or at least the cassandra-cli creation script On Tue, Feb 2, 2016 at 3:48 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Thanks DuyHai !! We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration. Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Moving Away from Compact Storage
Thanks DuyHai !! We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration. Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Cassandra's log is full of messages reset by peers even without traffic
Hi Jean, As mentioned in the DataStax link, your TCP connections will be marked dead after 300 + 75*9 = 975 seconds. Make sure that your firewall idle timeout is more than 975 seconds; otherwise the firewall will drop connections and you may face issues. You can also try setting all three values the same as mentioned in the link to see whether the problem gets resolved. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 1 Feb, 2016 at 9:18 pm, Jean Carlo <jean.jeancar...@gmail.com> wrote: Hello Anuj, I checked my settings and this is what I got: root@node001[SPH][BENCH][PnS3]:~$ sysctl -A | grep net.ipv4 | grep net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_probes = 9 root@node001[SPH][BENCH][PnS3]:~$ sysctl -A | grep net.ipv4 | grep net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_intvl = 75 root@node001[SPH][BENCH][PnS3]:~$ sysctl -A | grep net.ipv4 | grep net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_time = 300 The tcp_keepalive_time is quite high in comparison with what is written in the doc https://docs.datastax.com/en/cassandra/2.1/cassandra/troubleshooting/trblshootIdleFirewall.html Do you think that is ok? Best regards Jean Carlo "The best way to predict the future is to invent it" Alan Kay On Fri, Jan 29, 2016 at 11:02 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Jean, Please make sure that your firewall is not dropping TCP connections which are in use. The TCP keepalive on all nodes must be less than the firewall setting. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html for details on TCP settings. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 29 Jan, 2016 at 3:21 pm, Jean Carlo <jean.jeancar...@gmail.com> wrote: Hello guys, I have a Cassandra 2.1.12 cluster with 6 nodes.
All the logs on my nodes contain these messages, marked as INFO INFO [SharedPool-Worker-1] 2016-01-29 10:40:57,745 Message.java:532 - Unexpected exception during request; channel = [id: 0xff15eb8c, /172.16.162.4:9042] java.io.IOException: Error while read(...): Connection reset by peer at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60] This happens whether the cluster is stressed or not. Btw it is not production. The IP shown there (172.16.162.4) belongs to a node of the cluster; this is not the only node that appears, actually all the nodes' IPs are showing that reset-by-peer problem. Our cluster has more reads than writes, around 50 reads per second. Anyone got the same problem? Best regards Jean Carlo "The best way to predict the future is to invent it" Alan Kay
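The 975-second figure above comes straight from the three kernel settings: the idle time before the first probe, plus one interval per unacknowledged probe. As a quick sketch of the arithmetic (the helper name is illustrative):

```python
def dead_connection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Worst-case time before the kernel declares an idle TCP connection
    dead: the idle timer plus one probe interval per failed probe
    (net.ipv4.tcp_keepalive_time/_intvl/_probes)."""
    return keepalive_time + keepalive_intvl * keepalive_probes

# Values from Jean's nodes
print(dead_connection_seconds(300, 75, 9))  # 975
```

Any firewall that drops idle connections sooner than this will kill sockets the kernel still considers alive, which matches the "Connection reset by peer" pattern in the logs.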
Cqlsh hangs & closes automatically
My cqlsh prompt hangs and closes if I try to fetch just 100 rows using a select * query. cassandra-cli does the job. Any solution? Thanks, Anuj
Re: Moving Away from Compact Storage
By dynamic columns, I mean columns not defined in the schema. In the current scenario, every row has some data in columns which are defined in the schema, while the rest of the data is in columns which are not. We used Thrift for inserting data. In the new schema, we want to create a collection column and put all the data which was in columns NOT defined in the schema into the collection. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:36 am, DuyHai Doan <doanduy...@gmail.com> wrote: You'll need to do the transformation in Spark, although I don't understand what you mean by "dynamic columns". Given the CREATE TABLE script you gave earlier, there is nothing such as dynamic columns On Tue, Feb 2, 2016 at 8:01 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Will it be possible to read the dynamic-column data from compact storage and transform it into a collection, e.g. a map, in the new table? Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:28 am, DuyHai Doan <doanduy...@gmail.com> wrote: So there is no "static" (in the sense of CQL static) column in your legacy table. Just define a Scala case class to match this table and use Spark to dump the content to a new non-compact CQL table On Tue, Feb 2, 2016 at 7:55 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Our old table looks like this from cqlsh: CREATE TABLE table1 ( key text, "Col1" blob, "Col2" text, "Col3" text, "Col4" text, PRIMARY KEY (key)) WITH COMPACT STORAGE AND … And it will have some dynamic text data which we are planning to add to collections. Please let me know if you need more details. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:14 am, DuyHai Doan <doanduy...@gmail.com> wrote: Can you give the CREATE TABLE script for your old compact storage table? Or at least the cassandra-cli creation script On Tue, Feb 2, 2016 at 3:48 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Thanks DuyHai !!
We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration. Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
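The transformation discussed in this thread (Thrift-era dynamic columns folded into a CQL map) can be sketched as a plain per-row function, independent of whether Spark or a hand-rolled migrator runs it: keep the columns that exist in the new schema and put everything else into a map. The helper and column names below are illustrative only:

```python
# Columns statically defined in the old schema (from Anuj's CREATE TABLE)
STATIC_COLUMNS = {"key", "Col1", "Col2", "Col3", "Col4"}

def to_new_row(old_row):
    """Split a wide Thrift row into the statically defined columns plus a
    map collection ('dynamic' is a hypothetical map<text,text> column in
    the new non-compact table) holding everything else."""
    new_row = {k: v for k, v in old_row.items() if k in STATIC_COLUMNS}
    new_row["dynamic"] = {k: v for k, v in old_row.items()
                          if k not in STATIC_COLUMNS}
    return new_row

old = {"key": "u1", "Col1": b"\x00",
       "2016-01-01": "login", "2016-01-02": "logout"}
print(to_new_row(old)["dynamic"])  # {'2016-01-01': 'login', '2016-01-02': 'logout'}
```

In a Spark job this function would sit in the map step between reading the compact-storage table and writing the new one.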
Moving Away from Compact Storage
Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Cassandra's log is full of messages reset by peers even without traffic
Hi Jean, Please make sure that your firewall is not dropping TCP connections which are in use. The TCP keepalive on all nodes must be less than the firewall setting. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html for details on TCP settings. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 29 Jan, 2016 at 3:21 pm, Jean Carlo wrote: Hello guys, I have a Cassandra 2.1.12 cluster with 6 nodes. All the logs on my nodes contain these messages, marked as INFO INFO [SharedPool-Worker-1] 2016-01-29 10:40:57,745 Message.java:532 - Unexpected exception during request; channel = [id: 0xff15eb8c, /172.16.162.4:9042] java.io.IOException: Error while read(...): Connection reset by peer at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60] This happens whether the cluster is stressed or not. Btw it is not production. The IP shown there (172.16.162.4) belongs to a node of the cluster; this is not the only node that appears, actually all the nodes' IPs are showing that reset-by-peer problem.
Our cluster has more reads than writes, around 50 reads per second. Anyone got the same problem? Best regards Jean Carlo "The best way to predict the future is to invent it" Alan Kay
Re: Read operations freeze for a few second while adding a new node
Hi Lorand, Do you see any different GC pattern during these 20 seconds? In 2.0.x, memtables create a lot of heap pressure, so in a way reads are not isolated from writes. Frankly speaking, I would have accepted a 20 second slowness, as scaling is a one-time activity. But maybe your business case doesn't make that acceptable. Such tough requirements often drive improvements. Thanks, Anuj Sent from Yahoo Mail on Android On Thu, 28 Jan, 2016 at 9:41 pm, Lorand Kasler wrote: Hi, We are struggling with a problem where, while adding nodes, around 5% of read operations freeze (i.e. time out after 1 second) for a few seconds (10-20 seconds). It might not seem much, but at the order of 200k requests per second that's quite a big disruption. It is well documented and known that adding nodes *has* an impact on the latency or the completion of requests, but is there a way to lessen it? It is completely okay for write operations to fail or get blocked while adding nodes, but having the read path impacted by this much (going from 30 millisecond 99th percentile latency to above 1 second) is what puzzles us. We have a 36 node cluster, every node owning ~120 GB of data. We are using Cassandra version 2.0.14 with vnodes and we are in the process of increasing the capacity of the cluster by roughly doubling the nodes. They have SSDs and have peak IO usage of ~30%. Apart from the latency metrics, only FlushWrites are blocked 18% of the time (based on the tpstats counters), but that can only lead to blocking writes and not reads? Thank you
Re: Using TTL for data purge
On second thought, if you are anyway reading the user table on each website access and can afford the extra IO, the first option looks more appropriate, as it will ease the pain of manual purging maintenance and won't need full table scans. Thanks, Anuj Sent from Yahoo Mail on Android On Sat, 23 Jan, 2016 at 12:16 am, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Give a deep thought to your use case. Different user tables/types may have different purge strategies based on how frequently a user account type is usually accessed, what the user count is for each user type, and so on. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Jan, 2016 at 11:37 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Joseph, I am personally in favour of the second approach because I don't want to do a lot of IO just because a user is accessing the site several times a day. Options I see: 1. If you are on SSDs, test LCS and update the TTL of all columns at each access. This will make sure that the system can tolerate the extra IO. Advantage: No scheduling job needed. Deletion is seamless. Improved read performance compared to STCS. Disadvantage: To reinsert records with a new TTL you would do a read before write, which is an anti-pattern and slow. Active users will cause unnecessary IO just for updating the TTL. High IO due to LCS too. 2. Create a new table with the user id as key and last access time, instead of relying on inbuilt secondary indexes. Overwrite the last access time at each access. Schedule a job to read this table at regular intervals, maybe once a week, and manually delete users from the main table based on the last access time. You can test using LCS with the new table. Advantage: Lightweight writes for updating access time. Flexibility to update the deletion logic. Disadvantage: A manual scheduling job and code need to be implemented. The scheduler would need a slow full table scan of users to know the last access time. Full table scans could be done via token-based parallel CQL queries for better performance.
Using an Apache Spark job to find users to be purged would do that at tremendous speed. Secondary indexes are not suitable and don't scale well; I would suggest dropping them. Thanks, Anuj On Tue, 22 Dec, 2015 at 3:06 pm, jaalex.tech <jaalex.t...@gmail.com> wrote: Hi, I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job. We have a few tables that hold user information - these could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when running queries using updated_date when the number of records is high, so I don't think this would be a reliable option in the long term when we need to purge records that have not been used for the last X days. In this scenario, is it advisable to include a high enough TTL (i.e. the amount of time we want these to last, could be 3 to 6 months) when inserting/updating records? There could be cases where the TTL may get reset after a couple of days/weeks, when the user visits the site again. The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key. I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact it could have on performance/storage. If this is not a recommended approach, what would be an alternate design which could be used for a manual purge job, without using secondary indices? We are using Cassandra 2.0.x. Thanks, Joseph
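The token-based parallel full table scan mentioned above works by carving the full Murmur3 token range into contiguous subranges and issuing one range query per worker (e.g. SELECT ... WHERE token(id) > ? AND token(id) <= ?). A sketch of just the splitting arithmetic, assuming the default Murmur3Partitioner bounds (the helper is illustrative, not a driver API):

```python
# Murmur3Partitioner token bounds
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def token_subranges(splits):
    """Carve the full token range into `splits` contiguous (lo, hi]
    subranges so a purge job can scan them in parallel."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // splits for i in range(splits + 1)]
    bounds[-1] = MAX_TOKEN          # guard against integer-division drift
    return list(zip(bounds[:-1], bounds[1:]))

ranges = token_subranges(8)
print(len(ranges))  # 8 subranges covering the whole ring
```

Each subrange becomes one independent query, so eight workers (or Spark tasks) can sweep the whole table without any two scanning the same rows.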
Re: Using TTL for data purge
Give a deep thought to your use case. Different user tables/types may have different purge strategies based on how frequently a user account type is usually accessed, what the user count is for each user type, and so on. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Jan, 2016 at 11:37 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Joseph, I am personally in favour of the second approach because I don't want to do a lot of IO just because a user is accessing the site several times a day. Options I see: 1. If you are on SSDs, test LCS and update the TTL of all columns at each access. This will make sure that the system can tolerate the extra IO. Advantage: No scheduling job needed. Deletion is seamless. Improved read performance compared to STCS. Disadvantage: To reinsert records with a new TTL you would do a read before write, which is an anti-pattern and slow. Active users will cause unnecessary IO just for updating the TTL. High IO due to LCS too. 2. Create a new table with the user id as key and last access time, instead of relying on inbuilt secondary indexes. Overwrite the last access time at each access. Schedule a job to read this table at regular intervals, maybe once a week, and manually delete users from the main table based on the last access time. You can test using LCS with the new table. Advantage: Lightweight writes for updating access time. Flexibility to update the deletion logic. Disadvantage: A manual scheduling job and code need to be implemented. The scheduler would need a slow full table scan of users to know the last access time. Full table scans could be done via token-based parallel CQL queries for better performance. Using an Apache Spark job to find users to be purged would do that at tremendous speed. Secondary indexes are not suitable and don't scale well; I would suggest dropping them. Thanks, Anuj On Tue, 22 Dec, 2015 at 3:06 pm, jaalex.tech <jaalex.t...@gmail.com> wrote: Hi, I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job.
We have a few tables that hold user information - these could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when running queries on updated_date when the number of records is high, so I don't think this would be a reliable option in the long term when we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to set a high enough TTL (i.e. the amount of time we want these records to last, could be 3 to 6 months) when inserting/updating records? There could be cases where the TTL gets reset after a couple of days/weeks, when the user visits the site again. The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key. I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact this could have on performance/storage. If this is not a recommended approach, what would be an alternate design for a manual purge job without using secondary indices? We are using Cassandra 2.0.x. Thanks, Joseph
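The token-based parallel full scan suggested in the reply above can be sketched as follows: split the Murmur3 token ring into N contiguous ranges and build one SELECT per range, which a scheduler (or a Spark job) could then run in parallel. The table and column names here are hypothetical, not from the thread.

```python
# Murmur3Partitioner token bounds (the default partitioner's full ring).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_range_queries(table, pk, splits):
    """Build `splits` non-overlapping token-range SELECTs covering the ring."""
    step = (MAX_TOKEN - MIN_TOKEN) // splits
    queries = []
    start = MIN_TOKEN
    for i in range(splits):
        # Last range is stretched to MAX_TOKEN to cover rounding remainder.
        end = MAX_TOKEN if i == splits - 1 else start + step
        queries.append(
            f"SELECT {pk}, last_access FROM {table} "
            f"WHERE token({pk}) >= {start} AND token({pk}) <= {end}"
        )
        start = end + 1
    return queries
```

Each query can then be executed against a different coordinator; together they cover every partition exactly once.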
Re: Production with Single Node
And I think in a 3-node cluster, RAID 0 would do the job instead of RAID 5, so you would need fewer disks to get the same usable space, while Cassandra's replication protects you against disk failures and in fact entire node failures. Anuj Sent from Yahoo Mail on Android

On Sat, 23 Jan, 2016 at 10:30 am, Anuj Wadehra<anujw_2...@yahoo.co.in> wrote: I think Jonathan said it earlier. You may be happy with the performance for now because you are using the same commitlog settings that you use in large clusters. Test the recommended new setting so that you know the real picture, or be prepared to lose some data in case of failure. Other than durability, your single-node cluster would be a single point of failure for your site. RAID 5 will only protect you against a disk failure, but a server may be down for other reasons too. The question is: are you OK with the site going down? I would suggest using hardware with a smaller configuration to save cost on smaller sites and going ahead with a 3-node minimum. That way you provide all the good features of your design irrespective of the site; Cassandra is known to work on commodity servers too. Thanks, Anuj Sent from Yahoo Mail on Android

On Sat, 23 Jan, 2016 at 4:23 am, Jack Krupansky<jack.krupan...@gmail.com> wrote: You do of course have the simple technical matters, most of which need to be addressed with a proof-of-concept implementation, related to memory, storage, latency, and throughput. With a scaled cluster you can always add nodes to increase capacity and throughput and reduce latency, but with a single node you have limited flexibility. Just to be clear, Cassandra is still not recommended for "fat nodes" - even if you can fit tons of data on the node, you may not have the compute to satisfy throughput and latency requirements. And if you don't have enough system memory, the amount of storage is irrelevant.
Back to my original question: how much data (rows, columns), what kind of load pattern (heavy write, heavy update, heavy query), and what types of queries (primary key-only, slices, filtering, secondary indexes, etc.)? I do recall a customer who ran into problems because they had SSDs but only a very limited amount, so they were running out of storage. Having enough system memory for file system caching and off-heap data is important as well. -- Jack Krupansky

On Fri, Jan 22, 2016 at 5:07 PM, John Lammers <john.lamm...@karoshealth.com> wrote: Thanks for your response Jack. We are already sold on distributed databases, HA and scaling. We just have some small deployments coming up where there's no money for servers to run multiple Cassandra nodes. So, aside from the lack of HA, I'm asking whether a single Cassandra node would be viable in a production environment. (There would be RAID 5, and the RAID controller cache is backed by flash memory.) I'm asking because I'm concerned about using Cassandra in a way it's not designed for. That to me is the unsettling aspect. If this is a bad idea, give me the ammo I need to shoot it down - I need specific technical reasons. Thanks! --John

On Fri, Jan 22, 2016 at 4:47 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote: If single-node Cassandra has the performance (and capacity) you need, the NoSQL data model and API are sufficient for your app, your dev, ops, and support teams are already familiar with and committed to Cassandra, and you don't need HA or scaling, then it sounds like you are set. You asked about risks, and normally lack of HA and scaling are unacceptable risks when people are looking at distributed databases. Most people on this list are dedicated to and passionate about distributed databases, HA, and scaling, so it is distinctly unsettling when somebody comes along who isn't interested in or committed to those same three qualities. But if single-node happens to work for you, then that's great.
-- Jack Krupansky
Re: Run Repairs when a Node is Down
Thanks Paulo for sharing the JIRA! I have added my comments there. "It is not advisable to remain with a down node for a long time without replacing it (with the risk of not being able to achieve consistency if another node goes down)." I am referring to a generic scenario where a cluster may afford 2+ node failures based on RF, but due to a single node failure the entire system health is in question, as the gc grace period is approaching and nodes are not getting repaired. You and others who are interested can join the discussion on the JIRA page: https://issues.apache.org/jira/browse/CASSANDRA-10446 Thanks, Anuj

On Tue, 19 Jan, 2016 at 1:21 am, Paulo Motta<pauloricard...@gmail.com> wrote: Hello Anuj, Repairing a range with down replicas may be valid if there are still QUORUM up replicas and writes use at least QUORUM. My understanding is that it was disabled as the default behavior in CASSANDRA-2290 to avoid misuse/confusion, and it's not advisable to remain with a down node for a long time without replacing it (with the risk of not being able to achieve consistency if another node goes down). Issue https://issues.apache.org/jira/browse/CASSANDRA-10446 was created to allow repairing ranges with down replicas with a special flag (--force). If you're interested, please add comments there and/or propose a patch. Thanks, Paulo

2016-01-17 1:33 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We are on 2.0.14, RF=3 in a 3-node cluster. We use repair -pr. Recently, we observed that repair -pr fails for all nodes if a node is down. Then I found the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290 where an intentional decision was taken to abort the repair if a replica is down. I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas.
I have the following concerns with this approach. We say that we have a fault-tolerant Cassandra system that can afford a single node failure because RF=3 and we read/write at QUORUM. But when a node goes down and we are not sure how much time will be needed to restore it, the entire system health is in question, as gc_grace_seconds is approaching and we are not able to run repair -pr on any of the nodes. Then there is a dilemma:

Remove the faulty node well before the gc grace period expires, so that we get enough time to save data by repairing the other two nodes? This may cause massive streaming, which may be unnecessary if we manage to bring the faulty node back up before the gc grace period. OR Wait and hope that the issue will be resolved before gc grace time, leaving some buffer to run repair -pr on all nodes. OR Increase the gc grace period temporarily. Then capacity planning should accommodate the extra storage needed for the extra gc grace that may be required in node failure scenarios.

Besides the reasoning behind the decision taken in CASSANDRA-2290, I need to understand the recommended approach for maintaining a fault-tolerant system which can handle node failures such that repair can run smoothly and system health is maintained at all times. Thanks, Anuj Sent from Yahoo Mail on Android
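The timing dilemma above can be made concrete with a small sketch. The helper name and the one-day safety margin are my own assumptions, not anything from Cassandra; the point is just the deadline arithmetic between node downtime and gc_grace_seconds.

```python
import datetime as dt

# Rough sketch of the dilemma above: given when a node went down and the
# table's gc_grace_seconds, estimate how many seconds of budget remain to
# restore or replace the node and still finish repair everywhere before
# tombstones become eligible for purge. `margin_seconds` is an assumed
# safety buffer for the repair itself.
def repair_budget_seconds(node_down_since, gc_grace_seconds, now,
                          margin_seconds=86400):
    deadline = node_down_since + dt.timedelta(seconds=gc_grace_seconds)
    return (deadline - now).total_seconds() - margin_seconds
```

A non-positive result would suggest replacing the node (or temporarily raising gc_grace_seconds) rather than waiting any longer.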
Impact of Changing Compaction Strategy
Hi, I need to understand whether all existing SSTables are recreated/updated when we change the compaction strategy from STCS to DTCS. SSTables are immutable by design, but do we make an exception for such cases and update the same files when an ALTER statement is fired to change the compaction strategy? Thanks, Anuj Sent from Yahoo Mail on Android
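For reference, the kind of ALTER statement the question refers to looks like the sketch below (the keyspace and table names are hypothetical). To the best of my knowledge it only updates table metadata: existing SSTable files on disk are left untouched and are reorganized gradually as future compactions pick them up.

```python
# The ALTER statement in question, as a CQL string (keyspace/table names
# are hypothetical). Changing the strategy is a metadata-only change;
# existing immutable SSTables are not rewritten at ALTER time.
alter_cql = (
    "ALTER TABLE my_ks.my_table WITH compaction = "
    "{'class': 'DateTieredCompactionStrategy'}"
)
print(alter_cql)
```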